JavaHtmlTidy (JTidy) is a Java port of the WorldWideWebConsortium's Tidy program, which checks HTML and cleans up malformed markup. Because it builds a document tree while doing so and exposes that tree through a DOM interface, it can also be used as a DOM parser for HTML. See http://sourceforge.net/projects/jtidy/ for more information. Other HTML parsers are listed at http://java-source.net/open-source/html-parsers
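
A minimal sketch of the DOM-parser use mentioned above, assuming the JTidy jar is on the classpath. The class name JTidyDomSketch and the helper linkTargets are made up for illustration; the Tidy calls (setQuiet, setShowWarnings, parseDOM) come from JTidy's org.w3c.tidy.Tidy class, and the input is assumed to be plain ASCII so no encoding option is set.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

class JTidyDomSketch {

    // Hypothetical helper: returns the href value of every <a> element
    // found in the given (possibly malformed) HTML text.
    static List<String> linkTargets(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // suppress per-problem warnings

        // parseDOM cleans the input and returns a standard W3C DOM Document.
        // The second argument is an optional stream for the tidied markup;
        // null means the cleaned-up HTML is not written anywhere.
        Document doc = tidy.parseDOM(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII)),
                null);

        List<String> targets = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            String href = anchor.getAttribute("href");
            if (href != null && !href.isEmpty()) {
                targets.add(href);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Unclosed tags, unquoted attributes and mixed case are deliberate.
        String messy = "<p>See <a href=page.html>this page<p>and "
                + "<A HREF='other.html'>that one";
        for (String target : linkTargets(messy)) {
            System.out.println(target);
        }
    }
}
```

The sketch sticks to basic DOM calls (getElementsByTagName, getAttribute) because JTidy's DOM implementation may not support newer DOM methods such as getTextContent.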