How to Extract Plain Text from HTML in Java with Free Spire.Doc

This article explains how to quickly convert HTML files to plain text with Free Spire.Doc for Java. It is well suited for text cleaning, content extraction, and batch parsing workflows. The main advantages are minimal code, fast integration, and no need to hand-write tag parsing logic.

This approach provides a low-cost implementation path for Java text extraction scenarios

In tasks such as log archiving, web content ingestion, and knowledge base cleanup, developers often need to strip HTML tags, styles, and scripts while keeping only readable text.

Compared with manually processing the DOM, removing tags with regular expressions, or maintaining custom parsing rules, using a document parsing library can speed up delivery. This is especially practical for small and medium-sized projects and utility-style programs.

Technical specification snapshot

Parameter          Description
Language           Java
Input Format       HTML
Output Format      TXT / String
Core Integration   Maven repository / JAR import
Core Dependency    e-iceblue:spire.doc.free:14.3.1
Core APIs          loadFromFile(), getText()

The core idea is to parse HTML into a document object model and then extract text

Free Spire.Doc for Java is primarily designed for document processing, but when it loads HTML, it maps elements such as tags, paragraphs, and tables into its own document object model.

After you call loadFromFile(path, FileFormat.Html), the library parses the HTML as rich text. Then, when you execute getText(), it traverses text nodes in the document tree and outputs the plain text result.

In essence, this is not a character-by-character tag removal approach. It is a structure-aware extraction process: understand the document first, then extract the text. That is why it is usually more stable than simple regex-based stripping and better suited for reusable engineering workflows.

import com.spire.doc.Document;
import com.spire.doc.FileFormat;

public class PreviewOnly {
    public static void main(String[] args) {
        Document doc = new Document();
        // Load the file as HTML and let the library parse the structure
        doc.loadFromFile("Sample.html", FileFormat.Html);
        // Retrieve the plain text content directly from the document
        String text = doc.getText();
        System.out.println(text);
    }
}

The example above shows the minimum working flow: load HTML, extract text, and print the result to the console.

You need to complete dependency setup before calling the API in your project

If your project uses Maven, add the repository and dependency to pom.xml. The library is hosted in the e-iceblue repository rather than on Maven Central, so the custom repository declaration is required.


<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>14.3.1</version>
    </dependency>
</dependencies>

This configuration allows Maven to download Free Spire.Doc for Java and its dependencies correctly.

If you use Gradle, you can integrate the library with a shorter dependency declaration.

repositories {
    maven { url 'https://repo.e-iceblue.com/nexus/content/groups/public/' }
}
dependencies {
    implementation 'e-iceblue:spire.doc.free:14.3.1@jar'
}

This setup works well for quickly integrating the library into a Gradle project.

The complete example can extract HTML text and save it as a TXT file

The following code covers the full workflow from loading HTML and extracting plain text to writing the result to a file. It works well as a prototype utility class or a command-line entry point.

import com.spire.doc.Document;
import com.spire.doc.FileFormat;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTextFromHTML {
    public static void main(String[] args) {
        // 1. Create a document object as the HTML parsing container
        Document doc = new Document();

        // 2. Load the source file in HTML format
        doc.loadFromFile("Sample.html", FileFormat.Html);

        // 3. Extract plain text; tags will not appear in the result
        String text = doc.getText();

        // 4. Write the extracted result to a TXT file for archiving or later processing
        try (FileWriter fileWriter = new FileWriter("HTMLText.txt")) {
            fileWriter.write(text);
            System.out.println("Text extraction completed and saved to HTMLText.txt");
        } catch (IOException e) {
            // Catch file write exceptions to avoid silent failures
            System.err.println("Failed to write file: " + e.getMessage());
        }
    }
}

This code extracts visible text from Sample.html and saves it to HTMLText.txt.

This implementation is especially useful for lightweight cleaning and offline processing tasks

If your goal is to extract web page body text and feed it into a search index, vector database, review workflow, or NLP pipeline, this approach is direct and practical.

Its main advantage is that you do not need to maintain complex parsing logic. Your business code only needs to focus on the input path and output destination, which keeps integration costs low.
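To keep that separation explicit, the extraction call can sit behind a small seam so business code really does touch only an input path and an output destination. The sketch below is illustrative: the HtmlTextExtractor interface and ExtractionFacade names are hypothetical, and in practice the lambda you pass in would delegate to loadFromFile() and getText().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical seam: business code depends only on this interface,
// so a Spire.Doc-backed extractor (or any replacement) can be plugged in.
interface HtmlTextExtractor {
    String extract(Path htmlFile) throws IOException;
}

class ExtractionFacade {
    private final HtmlTextExtractor extractor;

    ExtractionFacade(HtmlTextExtractor extractor) {
        this.extractor = extractor;
    }

    // Callers supply only an input path and an output destination.
    void run(Path input, Path output) throws IOException {
        String text = extractor.extract(input);
        Files.writeString(output, text);
    }
}
```

This keeps the library dependency in one place, which also makes the extraction step easy to stub out in tests.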

This solution has natural limits in layout fidelity and complex semantic scenarios

First, getText() focuses on the text result rather than the page’s visual layout. Tables, indentation, and list hierarchy in the original HTML may be flattened into ordinary line breaks.

If your goal is plain text extraction, that behavior is usually acceptable. If your goal is structure preservation, you should consider DOM parsing, XPath, JSoup, or a specialized content extraction solution instead.

Scripts and styles usually do not appear in the final text output

The original article notes that the library typically ignores content inside <script> and <style> when loading HTML, so the final output mainly consists of visible text nodes.

That makes it useful for removing page noise, but it also means it is not suitable for collecting script templates, style rules, event attributes, or other development-oriented analysis data.
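As a rough illustration of what "visible text only" means, the stdlib sketch below drops whole script and style blocks before removing the remaining tags. This is an approximation for explanation purposes only, not the library's actual parsing mechanism.

```java
class StripSketch {
    // Approximate "visible text": remove whole script/style blocks first,
    // then strip any remaining tags. A real parser is far more robust.
    static String visibleText(String html) {
        String noBlocks = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", "");
        return noBlocks.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<style>p{color:red}</style><p>Hello</p><script>var x=1;</script>";
        System.out.println(visibleText(html)); // prints "Hello"
    }
}
```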

The free edition is better suited for small and medium-scale validation and utility use cases

The free edition can handle lightweight HTML text extraction needs, but the original source explicitly notes that it is more appropriate for individual developers and small to medium-sized projects.

If you need to process very large documents, restore complex layout, or support enterprise-scale batch jobs, evaluate throughput, licensing, and output quality before adopting it broadly.

In production, you should add batch-processing and exception-handling capabilities

In real-world projects, HTML sources may include garbled text, corrupted files, missing paths, or empty content. It is a good practice to wrap extraction logic with consistent exception handling and logging.

import java.io.File;

public class BatchCheck {
    public static boolean validate(File file) {
        // Skip the file immediately if it does not exist or is not a regular file
        if (file == null || !file.exists() || !file.isFile()) {
            return false;
        }
        // Files with zero length usually have no extraction value
        return file.length() > 0;
    }
}

This code performs basic validation before batch processing to reduce unnecessary parsing and exception noise.
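Building on that validation step, a batch driver can process each file independently so one corrupt input does not abort the whole run. The sketch below is hypothetical: the extractor is passed in as a function (in practice it would wrap loadFromFile() and getText()), which also lets the control flow be shown without the library on the classpath.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class BatchRunner {
    // Validate, extract, and log failures per file instead of failing the batch.
    static List<String> runBatch(List<File> files, Function<File, String> extractor) {
        List<String> results = new ArrayList<>();
        for (File f : files) {
            // Skip missing, non-regular, or empty files up front
            if (f == null || !f.isFile() || f.length() == 0) {
                continue;
            }
            try {
                results.add(extractor.apply(f));
            } catch (RuntimeException e) {
                // Log and move on rather than aborting the whole run
                System.err.println("Skipping " + f.getName() + ": " + e.getMessage());
            }
        }
        return results;
    }
}
```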

The conclusion is that this is a practical HTML-to-text extraction solution for fast delivery

If you want low code volume, fast integration, and stable text output, Free Spire.Doc for Java offers a highly practical path.

It does not emphasize preserving page structure. Instead, it prioritizes one goal: turning HTML into consumable text. That makes it especially useful for data cleaning, knowledge organization, search ingestion, and content archiving.


FAQ

1. Why not remove HTML tags with regular expressions directly?

Regular expressions work for extremely simple cases, but they become unreliable when dealing with nested tags, HTML entities, malformed closing tags, and complex structures. This approach parses the document model first and then extracts text, which provides better fault tolerance in engineering scenarios.
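A tiny stdlib example makes the fragility concrete: a bare "<" inside a script derails the naive pattern, and HTML entities pass through undecoded.

```java
class RegexPitfall {
    // Naive tag stripping: delete anything between angle brackets
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Price &amp; tax</p><script>if (a < b) alert(1)</script>";
        // The unescaped "<" swallows part of the script text,
        // and "&amp;" is never decoded to "&"
        System.out.println(naiveStrip(html)); // prints "Price &amp; taxif (a "
    }
}
```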

2. Can this method preserve table and list structure?

Not completely. getText() is designed for plain text extraction, so tables, indentation, and hierarchy may be flattened. If you need structured output, use DOM parsing or a specialized conversion approach.

3. Is this approach suitable for batch processing large numbers of HTML files?

Yes, for lightweight batch processing and utility workflows. However, you should add file validation, exception logging, and performance evaluation. For large-scale production workloads, you should also review the free edition’s limitations, throughput, and licensing model.

Core Summary: This article reconstructs a lightweight HTML-to-plain-text extraction solution based on Free Spire.Doc for Java. It covers the implementation model, Maven and Gradle dependencies, complete sample code, practical limitations, and common questions, making it a strong fit for Java text cleaning, content parsing, and batch-processing scenarios.