How to Easily Extract Plain Text from an HTML Website in Java

I was looking for a way to crawl websites and extract only their text, in order to build a text corpus for natural language processing of the Nepali language. There are several solutions on the internet, but none that I found was as simple as this one, which uses the Jsoup library. In the example below, I extract the text from the entire body, but you can just as easily extract the text of a specific node (and its children).

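/**
 * @author Kushal Paudyal
 * Created on: 3/9/2017
 * Last Modified on: 3/9/2017
 */
package com.icodejava.research.nlp;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlTextExtractor {
	public static void main(String[] args) throws IOException {
		// The URL is blank in the source; supply the page you want to crawl.
		Document doc = Jsoup.connect("").get();

		// The listing is cut off here in the source. Per the description above,
		// print the text of the entire body with all tags stripped.
		System.out.println(doc.body().text());
	}
}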
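
The node-level extraction mentioned in the introduction could look roughly like the sketch below. It is an illustration rather than code from the original post: the class name NodeTextExtractor, the helper extractText, the sample HTML, and the CSS selector are all assumptions, and the HTML is parsed from a string so the example runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original post.
class NodeTextExtractor {

    // Returns the text of the first element matching cssQuery, including the
    // text of its children, or an empty string when nothing matches.
    static String extractText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element node = doc.select(cssQuery).first(); // null if there is no match
        return node == null ? "" : node.text();
    }

    public static void main(String[] args) {
        // Made-up sample page: a menu we want to skip and a content node we want.
        String html = "<html><body>"
                + "<div id=\"menu\">Home</div>"
                + "<div id=\"content\"><p>Hello <b>world</b></p></div>"
                + "</body></html>";

        // Only the text under div#content is returned; the menu text is left out.
        System.out.println(extractText(html, "div#content"));
    }
}
```

For a live page, the same select(...).first().text() chain can be applied to the Document returned by Jsoup.connect(...).get().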



3 Responses to Calculating Folder Size In Java

  1. Pingback: » Latest Updates

  2. Jamie says:

    This approach uses less memory:

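    // Some lines of this listing were lost; the size accumulation and the
    // listFiles() calls below are a reconstruction.
    public static class SizeCounter implements FileFilter {
        private long total = 0;

        public SizeCounter() {}

        public boolean accept(File pathname) {
            if (pathname.isFile()) {
                total += pathname.length(); // add this file's size
            } else {
                pathname.listFiles(this); // recurse into the subdirectory
            }
            return false; // reject everything so listFiles() never builds a large array
        }

        public long getTotal() {
            return total;
        }
    }

    private static long getFileOrDirectorySize(File file) {
        SizeCounter counter = new SizeCounter();
        file.listFiles(counter); // walks the tree; sizes accumulate in the counter
        return counter.getTotal();
    }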

  3. kushalzone says:

    Thank you Jamie for your optimized solution.
