Blogging Roller

Dave Johnson on social software, open source and Java

Java HTML parsers.

The LinkbackExtractor that I posted yesterday uses the Swing HTML parser, which is built into Java, but there are other Java-based HTML parsers available. Erik Hatcher suggested the JTidy HTML parser and there is also the HTMLParser project on SourceForge. Know of any others?
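For readers who have not seen the Swing parser used outside a GUI, here is a minimal sketch of pulling link targets out of a page with it. This is not the LinkbackExtractor code itself; the class name `SwingLinkSketch`, the method `extractLinks`, and the example.com URLs are made up for illustration. Only the `javax.swing.text.html` classes are the real JDK API.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Roller or the LinkbackExtractor.
class SwingLinkSketch {

    // Collects the href value of every <a> start tag, in document order.
    public static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();

        // The parser reports each tag it encounters through this callback.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset directive in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page for the sketch.
        String page = "<html><body><p>See <a href=\"http://example.com/a\">A</a> and "
                + "<a href=\"http://example.com/b\">B</a>.</p></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

The callback style is similar in spirit to SAX: the parser pushes events at you rather than building a tree, so a link extractor only has to watch for the one tag it cares about.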

Comments:

I currently plan to use this one: http://www.quiotix.com/downloads/html-parser/

Posted by Damien Bonvillain on January 11, 2003 at 06:20 PM EST #

I've used both the Swing HTML parser and the JTidy parser, and I've replaced them with the quiotix HTML parser. It is by far superior, because it doesn't try to fix any HTML, and the source code is generated from JavaCC.

Posted by Will Sargent on January 12, 2003 at 04:11 AM EST #

On another note, I don't like seeing something like an HTML parser in a graphical toolkit like Swing. I have bad memories of implementations that required a running X server just because some class somewhere was initializing an AWT peer and never using it.

Posted by Damien Bonvillain on January 12, 2003 at 11:41 AM EST #
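On the X server concern above: since J2SE 1.4 the JDK has had a headless mode, switched on with the `java.awt.headless` system property. The sketch below only shows how that property is set and checked; the class name `HeadlessSketch` is hypothetical, and whether a particular library avoids touching AWT peers under headless mode is an assumption you would need to verify yourself.

```java
import java.awt.GraphicsEnvironment;

// Hypothetical helper showing the headless switch; not taken from any comment above.
class HeadlessSketch {

    // Requests headless mode, then reports what AWT believes.
    // The property must be set before AWT is first initialized,
    // so in practice it is usually passed as -Djava.awt.headless=true.
    public static boolean runHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println("headless: " + runHeadless());
    }
}
```

Setting the property on the command line is the safer choice on a server, because it takes effect no matter which class happens to load AWT first.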

Jelly uses the NekoHTML parser, which turns HTML into SAX events... http://www.apache.org/~andyc/neko/doc/html/

Posted by James Strachan on January 13, 2003 at 03:51 AM EST #

I'm currently working on a kind of Zope Page Template implementation in Java. I've tested different HTML parsers and, IMHO, NekoHTML is the best one.

Posted by Vincent Faidherbe on January 13, 2003 at 04:19 AM EST #

I haven't used it yet (though I'm planning to try it out), but there's also John Cowan's recently released <a href="http://mercury.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>, a parser for "nasty and brutish" HTML.

Posted by Leigh Dodds on January 13, 2003 at 08:13 AM EST #
