Blogging Roller: Java HTML parsers.

The LinkbackExtractor that I posted yesterday uses the Swing HTML parser, which is built into Java, but there are other Java-based HTML parsers available. Erik Hatcher suggested the JTidy HTML parser and there is also the HTMLParser project on SourceForge. Know of any others?
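For anyone who hasn't used the Swing parser directly, here is a minimal sketch of the callback style it exposes. This is not the LinkbackExtractor code; the class name SwingLinkLister and the extractLinks method are made up for illustration, and the example assumes you only want the href of each anchor tag. The calls themselves (ParserDelegator.parse and HTMLEditorKit.ParserCallback) are the standard JDK ones.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here only for illustration.
public class SwingLinkLister {

    /** Collects the href value of every <a> start tag found in the given HTML. */
    public static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();

        // The Swing parser is event-driven: it calls back for each tag it sees.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>See <a href=\"http://example.com/a\">A</a> and "
                + "<a href=\"http://example.com/b\">B</a>.</p></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

The other parsers mentioned here have their own APIs, so treat this only as a picture of what "built into Java" gives you without extra jars.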

Dave Johnson in Java • 🕒 11:22AM Jan 11, 2003

Tags: Java

Comments:

I currently plan to use this one: http://www.quiotix.com/downloads/html-parser/

Posted by Damien Bonvillain on January 11, 2003 at 09:20 PM EST #

I've used both the Swing HTML parser and the JTidy parser, and I've replaced them with the quiotix HTML parser. It is by far superior, because it doesn't try to fix any HTML, and the source code is generated from JavaCC.

Posted by Will Sargent on January 12, 2003 at 07:11 AM EST #

On the other hand, I don't like seeing something like an HTML parser inside a graphics toolkit like Swing. I have bad memories of implementations that required running an X server just because some class somewhere was initializing an AWT peer and never using it.

Posted by Damien Bonvillain on January 12, 2003 at 02:41 PM EST #

Jelly uses the NekoHTML parser, which turns HTML into SAX events... http://www.apache.org/~andyc/neko/doc/html/

Posted by James Strachan on January 13, 2003 at 06:51 AM EST #

I'm currently working on a kind of Zope Page Templates implementation in Java. I've tested different HTML parsers and, IMHO, NekoHTML is the best one.

Posted by Vincent Faidherbe on January 13, 2003 at 07:19 AM EST #

I've not used it yet (though I'm planning to try it out), but there's also John Cowan's recently released TagSoup, a parser for "nasty and brutish" HTML.

Posted by Leigh Dodds on January 13, 2003 at 11:13 AM EST #