How to Extract Plain Text from an HTML Website Easily in Java

I was looking for a way to crawl websites and extract only the text. I needed the text from various websites to prepare a text corpus for Natural Language Processing work on the Nepali language. There were several solutions on the internet, but none was as simple as this one. I wrote it using the Jsoup library. In the example below I extract the text of the entire body, but you can just as easily extract the text of a desired node (and its children), as shown in the second snippet further down.

package com.icodejava.research.nlp;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * Extracts the plain text of a web page using Jsoup.
 *
 * @author Kushal Paudyal
 * Created on: 3/9/2017
 * Last Modified on: 3/9/2017
 */
public class HtmlTextExtractor {

    public static void main(String[] args) throws IOException {
        // Fetch the page and parse it into a DOM document.
        Document doc = Jsoup.connect("http://swasthyakhabar.com/news-details/3356/2017-03-09").get();

        // Print the combined visible text of the entire <body>.
        System.out.println(doc.body().text());
    }
}
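
If you only need the text of a particular part of the page, Jsoup's CSS selectors make that just as easy. Here is a minimal sketch of that idea; the selector "div.article-content p" and the class name HtmlNodeTextExtractor are only placeholders I made up, so replace them with whatever matches the markup of the site you are crawling.

package com.icodejava.research.nlp;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlNodeTextExtractor {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the page exactly as in the first example.
        Document doc = Jsoup.connect("http://swasthyakhabar.com/news-details/3356/2017-03-09").get();

        // Select only the nodes you care about; this selector is a placeholder
        // and should be adapted to the structure of the target page.
        Elements paragraphs = doc.select("div.article-content p");

        // text() returns the plain text of each selected node and its children.
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}

The same text() call works on any Element, so you can point it at the page title, a single div, or the whole document, depending on how much of the markup you want to keep out of your corpus.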


