How to Extract Plain Text from an HTML Website Easily in Java

I was looking for a way to crawl websites and extract only their text. My goal was to collect text from various websites to build a text corpus for natural language processing of the Nepali language. There are several solutions on the internet, but none I found was as simple as this one, which uses the jsoup library. The example below extracts text from the entire body, but you can just as easily extract the text of a specific node (and its children).

/**
 * 
 * @author Kushal Paudyal
 * Created on: 3/9/2017
 * Last Modified on: 3/9/2017
 *
 */
package com.icodejava.research.nlp;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlTextExtractor {
	
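	public static void main(String[] args) throws IOException {
		// Fetch the page over HTTP and parse it into a DOM tree.
		Document doc = Jsoup.connect("http://swasthyakhabar.com/news-details/3356/2017-03-09").get();

		// text() returns the text of <body> and all of its descendants, with the tags stripped.
		System.out.println(doc.body().text());
	}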

}
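To show what extracting text for a single node might look like, here is a minimal sketch. It assumes jsoup is on the classpath, and the class name NodeTextExtractor, the helper extractText, the sample markup and the "#content" selector are all made up for illustration; a real page will need its own selector. It parses an in-memory string instead of fetching a URL so it can be run offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NodeTextExtractor {

    /**
     * Returns the text of the first element matching cssQuery (including its
     * children), or an empty string when nothing matches.
     */
    public static String extractText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        // select() returns every match; first() is null when there is none.
        Element node = doc.select(cssQuery).first();
        return node == null ? "" : node.text();
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        String html = "<html><body>"
                + "<div id=\"content\"><h1>Title</h1><p>Hello <b>world</b></p></div>"
                + "<div id=\"footer\">Footer</div>"
                + "</body></html>";
        // Prints only the text inside the "content" div, not the footer.
        System.out.println(extractText(html, "#content"));
    }
}
```

For a corpus, this is handy for skipping menus, footers and other page furniture: pick the selector that wraps the article body and ignore the rest.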

How To Find What Java Version You Are Using?

One of my blog's visitors asked me, “Hey Kushal, what Java version are you using to compile your files?” I had several Java versions installed on my computer for various reasons, and honestly I wasn’t sure exactly which one I was using. So I wrote this little utility that reports the version of the Java runtime you are currently using and its vendor. Remember that Sun Microsystems is not the sole Java vendor; there are many others.

package com.kushal.utils;
/**
 * @author Kushal Paudyal
 * www.sanjaal.com/java
 * Last Modified On 05-20-2009
 */

/**
 * Demonstrates a simple way of getting
 * --Java Version
 * --Java Vendor
 */
public class GetJavaVersionAndVendor {

	public static void main(String args [])
	{
		String version=System.getProperty("java.version");
		String vendor=System.getProperty("java.vendor");

		System.out.println("Java Version Is: "+version);
		System.out.println("Java Vendor Is: "+vendor);
	}

}

Here is the sample output of this program:

Java Version Is: 1.6.0_11
Java Vendor Is: Sun Microsystems Inc.
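If you only care about the major release, the version string can be reduced to a single number. The sketch below is one way to do that; the class name JavaMajorVersion and the helper majorVersion are names I made up for illustration. It assumes the string starts with a number: older releases report "1.x" (as in the output above), while newer ones start with the major number directly, for example "17.0.2".

```java
public class JavaMajorVersion {

    /** Returns the major release number for a java.version style string. */
    public static int majorVersion(String version) {
        // Split on '.', '_', '+' and '-' so "1.6.0_11", "17.0.2" and "21-ea" all work.
        String[] parts = version.split("[._+-]");
        int first = Integer.parseInt(parts[0]);
        // Releases up to Java 8 report "1.x", so the second number is the major version.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("Java Version Is: " + version);
        System.out.println("Major Version Is: " + majorVersion(version));
    }
}
```

With the sample output above, majorVersion("1.6.0_11") gives 6.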

Computing the total, free and usable disk space easily using JDK 1.6

Note: This tool requires JDK 1.6 or later, because the File methods it relies on (getTotalSpace(), getFreeSpace() and getUsableSpace()) were added in that release. You can run it to find the total, free and usable disk space. In several of my tests, usable space and free space returned the same value; getUsableSpace() additionally checks write permissions and other operating system restrictions when it can, so the two may differ on some systems.

package com.kushal.tools;
/**
 * @author Kushal Paudyal
 * This tool uses JDK 1.6. It does not work with lower versions.
 *
 * You can run this tool to find the total, free and usable disk space,
 * although in several of my tests usable space and free space returned the same
 * value. This capability was added in JDK 1.6, hence it will not work
 * with lower versions.
 */
import java.io.File;

public class DiskSpaceJavaV6 {

	public static void main(String[] args) {
		File file = new File("c:");
		String unit="GB";
		DiskSpaceJavaV6 dspace=new DiskSpaceJavaV6();
		double totalDiskSpace = dspace.getTotalDiskSpace(file,unit );
		double usableSpace = dspace.getUsableSpace(file, unit);
		double freeSpace = dspace.getFreeSpace(file, unit);

		System.out.println("Total Disk Space: " +totalDiskSpace +" "+unit);
		System.out.println("Total Usable Space : " + usableSpace +" "+unit);
		System.out.println("Free Disk Space : " + freeSpace +" "+unit);
	}
	/**
	 * @param file - normally the top level drive e.g. c:
	 * @param unit - target unit for disk space. Allowed values KB, MB, GB.
	 * @return total disk space
	 */
	public double getTotalDiskSpace(File file, String unit) {
		return processUnit(file.getTotalSpace(), unit);
	}
	/**
	 * @param file - normally the top level drive e.g. c:
	 * @param unit - target unit for disk space. Allowed values KB, MB, GB.
	 * @return usable disk space
	 */
	public double getUsableSpace(File file, String unit) {
		return processUnit(file.getUsableSpace(), unit);
	}

	/**
	 * @param file - normally the top level drive e.g. c:
	 * @param unit - target unit for disk space. Allowed values KB, MB, GB.
	 * @return free space
	 */
	public double getFreeSpace(File file, String unit) {
		return processUnit(file.getFreeSpace(), unit);
	}

	/**
	 * @param space - disk space in bytes
	 * @param unit - the target unit. Allowed values: KB, MB, GB.
	 * @return processed value
	 */
	private double processUnit(long space, String unit) {
		String nonNullUnit=makeNonNullUnit(unit);
		// Divide by double constants: long division would drop the fractional part.
		if("KB".equalsIgnoreCase(nonNullUnit)) {
			return space/1024.0;
		} else if("MB".equalsIgnoreCase(nonNullUnit)) {
			return space/(1024.0*1024);
		} else if("GB".equalsIgnoreCase(nonNullUnit)) {
			return space/(1024.0*1024*1024);
		} else {
			// Unknown or missing unit: return the raw byte count.
			return space;
		}
	}
	/**
	 * @param anyString
	 * @return non null value of the string
	 */
	private static String makeNonNullUnit(String anyString) {
		if(anyString==null) {
			return "";
		} else {
			return anyString;
		}
	}
}
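The tool above looks at a single hard-coded drive. As a rough sketch of how the same File methods could be applied to every drive or mount point, the variation below loops over File.listRoots(). The class name DiskSpaceAllRoots and the helper toGigabytes are my own names for illustration, and the conversion divides by a floating-point constant so that fractions of a gigabyte are kept.

```java
import java.io.File;
import java.util.Locale;

public class DiskSpaceAllRoots {

    /** Converts a byte count to gigabytes (1 GB = 1024^3 bytes), keeping the fraction. */
    public static double toGigabytes(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // listRoots() may return null if the roots cannot be determined.
        File[] roots = File.listRoots();
        if (roots == null) {
            System.out.println("Could not determine the filesystem roots.");
            return;
        }
        for (File root : roots) {
            System.out.println(String.format(Locale.ROOT,
                    "%s total=%.2f GB, usable=%.2f GB, free=%.2f GB",
                    root.getPath(),
                    toGigabytes(root.getTotalSpace()),
                    toGigabytes(root.getUsableSpace()),
                    toGigabytes(root.getFreeSpace())));
        }
    }
}
```

On Windows this typically lists each drive letter; on Linux or macOS there is usually a single root, "/".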