Dave Johnson on open web technologies, social software and software development
How do you do it? I need to provide some examples to show how to parse RSS with Java and C#. I have written simple parsers using the common XML parsing techniques such as DOM, SAX, and Pull. I have also written some examples that use parser libraries, but I have yet to find a good and free RSS parser library for .Net. Lazy-web, please help me out here.
When you assume...
If you assume that RSS is XML and you are just interested in getting
titles, descriptions, links, and dates, then it is pretty easy to write a
simple parser that can handle most forms of RSS including RSS 1.0, RSS
2.0, and some forms of funky RSS. If you need to handle more than those
basic elements, then I recommend that you use a parser library.
Parser libraries
Python programmers are blessed with a great newsfeed parser library: Pilgrim's regex-based Universal Feed Parser, which can parse any feed, even if it is not valid XML. I don't think Pilgrim's parser will port easily to Jython, the Java version of Python, because Jython is missing some important Python libraries and uses Java's regex engine, which differs from Python's built-in one. The same probably goes for IronPython, the .Net version of Python. By the way, Lazy-web, would you please port Pilgrim's parser to Jython?
So, Java developers don't have the Universal Feed Parser, but we do have two active projects that are developing full-featured RSS (and Atom) parsers: Informa (used by Javablogs.com) and Rome. .Net developers have RSS.Net, but it is incomplete and development seems to have completely stagnated back in November of 2003.
So how do you parse RSS with .Net? I started looking around and digging into source code. I found that Dare built his C#-based RSS parser for RssBandit on top of an SGML parser. Joe built his C#-based RSS parser for Aggie using good old System.Xml. I guess you just have to do it by hand, so here goes...
My examples
Now it's time for the lazy web to point and laugh at my feeble efforts to build simple RSS parsers in C#. I have two examples for your ridicule. After you are done laughing, please, .Net heads, help me out and tell me what I am doing wrong and where I can make improvements.
First, here is a simple C# RSS parser method that uses a DOM based approach. It extracts the basic elements of title, description, link, and pubDate from the channel and item levels and it puts them into a dictionary (just like Pilgrim's parser does). It can handle RSS 1.0, RSS 2.0, and some forms of funky RSS. Have a look:
using System;
using System.Collections;
using System.Xml;

public IDictionary ParseFeed(String fileName) {
    XmlDocument feedDoc = new XmlDocument();
    feedDoc.Load(fileName);
    XmlElement root = feedDoc.DocumentElement;

    string contentNS = "http://purl.org/rss/1.0/modules/content/";
    string dcNS = "http://purl.org/dc/elements/1.1/";
    string xhtmlNS = "http://www.w3.org/1999/xhtml";

    // An <rss> root means RSS 0.9x/2.0, whose elements have no namespace.
    // Anything else is treated as RSS 1.0, whose elements live in the
    // RSS 1.0 namespace.
    string defaultNS = null;
    if (!root.Name.Equals("rss")) {
        defaultNS = "http://purl.org/rss/1.0/";
    }

    XmlElement channel = (XmlElement)root.GetElementsByTagName("channel").Item(0);

    // Channel-level values
    IDictionary feedMap = new Hashtable();
    feedMap.Add("title", GetChildText(channel,"title",defaultNS));
    feedMap.Add("pubDate", GetChildText(channel,"pubDate",defaultNS));
    feedMap.Add("dc:date", GetChildText(channel,"date",dcNS));
    feedMap.Add("description", GetChildText(channel,"description",defaultNS));
    feedMap.Add("link", GetChildText(channel,"link",defaultNS));

    // RSS 2.0 puts items inside the channel; RSS 1.0 puts them beside it.
    XmlNodeList items = null;
    if (root.Name.Equals("rss")) {
        items = channel.GetElementsByTagName("item");
    }
    else {
        items = root.GetElementsByTagName("item");
    }

    // Item-level values, one dictionary per item
    IList itemList = new ArrayList();
    feedMap.Add("items", itemList);
    for (int i=0; i<items.Count; i++) {
        IDictionary itemMap = new Hashtable();
        itemList.Add(itemMap);
        XmlElement item = (XmlElement)items.Item(i);
        itemMap.Add("title", GetChildText(item,"title",defaultNS));
        itemMap.Add("link", GetChildText(item,"link",defaultNS));
        itemMap.Add("guid", GetChildText(item,"guid",defaultNS));
        itemMap.Add("pubDate", GetChildText(item,"pubDate",defaultNS));
        itemMap.Add("dc:date", GetChildText(item,"date",dcNS));
        itemMap.Add("description", GetChildText(item,"description",defaultNS));
        itemMap.Add("content:encoded", GetChildText(item,"encoded",contentNS));
        itemMap.Add("body", GetChildText(item,"body",xhtmlNS));
    }
    return feedMap;
}

// Returns the value of the first matching element's first child node:
// null if there is no such element, "" if the element is empty.
// Note that GetElementsByTagName searches all descendants, not just
// direct children.
private string GetChildText(XmlElement element, string childName, string namespaceURI) {
    string text = null;
    XmlNodeList nodeList = null;
    if (namespaceURI != null) {
        nodeList = element.GetElementsByTagName(childName, namespaceURI);
    } else {
        nodeList = element.GetElementsByTagName(childName);
    }
    if (nodeList!=null && nodeList.Item(0)!=null) {
        if (nodeList.Item(0).FirstChild!=null) {
            text = nodeList.Item(0).FirstChild.Value;
        } else {
            text = "";
        }
    }
    return text;
}
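A Java aside before the next C# example: since the goal is examples in both Java and C#, here is a rough sketch of the same DOM idea using the JDK's built-in javax.xml.parsers API. The class name DomFeedSketch, the parseFeed signature, and the inline sample feed are all made up for illustration. Unlike the C# version, this sketch reads only direct children (so a title nested inside an image element is never mistaken for the channel title) and it assumes the feed is well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical class name; a sketch, not a full-featured feed parser.
final class DomFeedSketch {
    private static final String RSS10_NS = "http://purl.org/rss/1.0/";
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    private static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    static Map<String, Object> parseFeed(InputStream in) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookups below
            doc = factory.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed is not well-formed XML", e);
        }
        Element root = doc.getDocumentElement();
        // <rss> root: elements have no namespace. Otherwise assume RSS 1.0.
        String ns = "rss".equals(root.getLocalName()) ? null : RSS10_NS;

        Element channel = (Element) root.getElementsByTagNameNS(ns, "channel").item(0);
        if (channel == null) {
            throw new IllegalArgumentException("No channel element found");
        }
        Map<String, Object> feed = new HashMap<>(basics(channel, ns));

        // Searching from the root finds items both inside the channel
        // (RSS 2.0) and beside it (RSS 1.0).
        List<Map<String, String>> items = new ArrayList<>();
        NodeList nodes = root.getElementsByTagNameNS(ns, "item");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            Map<String, String> map = basics(item, ns);
            map.put("guid", childText(item, ns, "guid"));
            map.put("content:encoded", childText(item, CONTENT_NS, "encoded"));
            items.add(map);
        }
        feed.put("items", items);
        return feed;
    }

    // Elements shared by the channel and item levels.
    private static Map<String, String> basics(Element parent, String ns) {
        Map<String, String> map = new HashMap<>();
        map.put("title", childText(parent, ns, "title"));
        map.put("link", childText(parent, ns, "link"));
        map.put("description", childText(parent, ns, "description"));
        map.put("pubDate", childText(parent, ns, "pubDate"));
        map.put("dc:date", childText(parent, DC_NS, "date"));
        return map;
    }

    // Text of the first direct child with this namespace and local name, or null.
    private static String childText(Element parent, String ns, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && localName.equals(n.getLocalName())
                    && Objects.equals(ns, n.getNamespaceURI())) {
                return n.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\"><channel><title>Sample</title>"
            + "<item><title>First</title><link>http://example.com/1</link></item>"
            + "</channel></rss>";
        Map<String, Object> feed = parseFeed(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(feed.get("title") + ": " + feed.get("items"));
    }
}
```

As in the C# version, missing elements come back as null, so callers should check before using a value.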
And here is the same thing, but using a pull-parser based XmlTextReader approach:
// Requires: using System; using System.Collections; using System.Xml;
public IDictionary ParseFeed(String fileName) {
    XmlTextReader reader = new XmlTextReader(fileName);
    reader.WhitespaceHandling = WhitespaceHandling.None;
    IDictionary feedMap = new Hashtable();
    IList items = new ArrayList();
    IDictionary itemMap = null;   // non-null only while inside an <item>
    feedMap.Add("items", items);
    try {
        while (reader.Read()) {
            bool isStart = reader.NodeType.Equals(XmlNodeType.Element);
            bool isEnd = reader.NodeType.Equals(XmlNodeType.EndElement);
            if (isEnd && reader.Name.Equals("item")) {
                itemMap = null;
            }
            else if (isStart && reader.Name.Equals("item")) {
                itemMap = new Hashtable();
                items.Add(itemMap);
            }
            // Item-level elements: move to the text node and store its value
            else if (isStart && itemMap!=null
                && reader.Name.Equals("title")) {
                reader.Read();
                itemMap.Add("title", reader.Value);
            }
            else if (isStart && itemMap!=null
                && reader.Name.Equals("link")) {
                reader.Read();
                itemMap.Add("link", reader.Value);
            }
            else if (isStart && itemMap!=null
                && reader.Name.Equals("description")) {
                reader.Read();
                itemMap.Add("description", reader.Value);
            }
            else if (isStart && itemMap!=null
                && reader.Name.Equals("content:encoded")) {
                reader.Read();
                itemMap.Add("content:encoded", reader.Value);
            }
            // isStart is needed here too; without it the end tag also
            // matches and the second Add throws on the duplicate key
            else if (isStart && itemMap!=null
                && reader.Name.Equals("body")) {
                reader.Read();
                itemMap.Add("body", reader.Value);
            }
            else if (isStart && itemMap!=null
                && reader.Name.Equals("pubDate")) {
                reader.Read();
                itemMap.Add("pubDate", reader.Value);
            }
            else if (isStart && itemMap!=null
                && reader.Name.Equals("dc:date")) {
                reader.Read();
                itemMap.Add("dc:date", reader.Value);
            }
            // Channel-level elements (we are not inside an item)
            else if (isStart && reader.Name.Equals("title")) {
                reader.Read();
                feedMap.Add("title", reader.Value);
            }
            else if (isStart && reader.Name.Equals("description")) {
                reader.Read();
                feedMap.Add("description", reader.Value);
            }
            else if (isStart && reader.Name.Equals("link")) {
                reader.Read();
                feedMap.Add("link", reader.Value);
            }
            else if (isStart && reader.Name.Equals("pubDate")) {
                reader.Read();
                feedMap.Add("pubDate", reader.Value);
            }
            else if (isStart && reader.Name.Equals("dc:date")) {
                reader.Read();
                feedMap.Add("dc:date", reader.Value);
            }
            else if (isStart && reader.Name.Equals("image")) {
                // skip images; an empty <image .../> (as in an RSS 1.0
                // channel) has no end tag, so there is nothing to skip
                if (!reader.IsEmptyElement) {
                    while (reader.Read()) {
                        if (reader.Name.Equals("image")
                            && reader.NodeType.Equals(XmlNodeType.EndElement)) {
                            break;
                        }
                    }
                }
            }
        }
    }
    finally {
        reader.Close();   // release the underlying file or stream
    }
    return feedMap;
}
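For the Java side of the comparison, the pull approach can be sketched with the JDK's StAX API (javax.xml.stream). This is a minimal sketch under stated assumptions, not a finished parser: the class name StaxFeedSketch, the parseFeed signature, and the sample feed are hypothetical. It matches elements by local name, so dc:date is stored under the key "date", and it assumes the wanted elements contain only text (getElementText fails on nested markup, so xhtml body is left out).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; a sketch of the pull-parser approach in Java.
final class StaxFeedSketch {
    // Local names of the text-only elements we keep (illustrative choice).
    private static final Set<String> WANTED = new HashSet<>(
        Arrays.asList("title", "link", "description", "pubDate", "date", "guid"));

    static Map<String, Object> parseFeed(InputStream in) {
        Map<String, Object> feed = new HashMap<>();
        List<Map<String, String>> items = new ArrayList<>();
        feed.put("items", items);
        Map<String, String> item = null; // non-null only while inside an <item>
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item")) {
                        item = new HashMap<>();
                        items.add(item);
                    } else if (name.equals("image")
                            || name.equalsIgnoreCase("textinput")) {
                        // These carry their own title/link; skip them whole.
                        skipElement(reader);
                    } else if (WANTED.contains(name)) {
                        String text = reader.getElementText();
                        if (item != null) {
                            item.put(name, text);
                        } else {
                            feed.putIfAbsent(name, text); // first value wins
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("item")) {
                    item = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
        return feed;
    }

    // Advances past the end tag of the element whose start tag was just read.
    private static void skipElement(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<rss version=\"2.0\"><channel><title>Sample</title>"
            + "<item><title>First</title><link>http://example.com/1</link></item>"
            + "</channel></rss>";
        Map<String, Object> feed = parseFeed(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(feed.get("title") + ": " + feed.get("items"));
    }
}
```

Counting depth in skipElement, rather than waiting for the next end tag with the same name, keeps the skip correct for both empty and nested elements.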
Have some better examples of parsing RSS with .Net? Please point me to them.
Dave Johnson in Microsoft
04:55AM Sep 01, 2004
Comments [10]
Tags:
microsoft
I haven't had this much fun watching the hits roll in since Weblogger.com threatened to sue me. Yesterday was much, much more fun, of course. Thanks to all who commented, linked, welcomed and trackbacked me. One thing is for sure, you made my mom and dad feel a whole lot better about my leaving the seemingly safe sanctuary of SAS.
I'm venturing into new territory as a blogger. I have always kept my employer a secret. I never wanted anybody to google for HAHT or SAS and end up on my blog. I was a little worried about getting fired for blogging. It still happens even to those who try to be careful. Now, everybody knows who I work for and that changes things for me. On the positive side, blogging about my work with Roller, blogging technologies, Sun, and Java will give me lots of interesting material to work with - and then there's that evangelism thing. On the negative side, there are probably some topics that I had better avoid. Even with a company with a clueful policy on public discourse, you can still screw up and do damage to your career.
I'm confident that I'll do just fine in this new territory. I tend to be conservative in my output, perhaps too conservative. I'm also biased in favor of Sun and always have been. I'm a shareholder too. There's my full disclosure for you. I've been working with Sun hardware and software since the Sun3 timeframe. In fact, I proposed to the woman who became my wife as a direct result of a SPARCstation sale. I went down to Jamaica in '91 to install a SPARCstation-based system and to do a training workshop on the open source GRASS GIS software, got a great job offer at the Univ. of the West Indies, came home and asked Andi to marry me. We had a great honeymoon in Jamaica that lasted about a year and a half. I hope my honeymoon at Sun will last a lot longer than that.
Dave Johnson in Blogging
04:11PM Aug 31, 2004
Comments [1]
Tags:
Blogging
It's official. Roller is now my full time job. I just accepted a job with Sun Microsystems to "design, develop, and deploy the primary blogging system for Sun in conjunction with other engineers" and to evangelize blogging both inside and outside of Sun. Needless to say, I'm thrilled. I'm honored to be working for Sun and with great folks like Will Snow, John Hoffman, Tim Bray, Patrick Chanezon, and Danese Cooper. I'm excited to be working for a company that feels the same way as I do about the value of blogs and wikis, open source software, and encouraging employees to speak with an honest and authentic voice to customers, to partners, and to each other.
What does this mean to Roller? Only good things. Sun wants many of the same things for Roller that other Roller users want including high performance, high availability, great user interface, support for standards, and better support for large communities of bloggers. Thanks to Sun I'll be working full time to help make these things happen. Since Roller will continue on as an open source project, you can help too (and I hope you will).
Dave Johnson in Sun
05:09PM Aug 29, 2004
Comments [29]
Tags:
Sun
Over the past couple of years, I've been scanning my photo collection using a HP slide/negative scanner. My dad, who is an excellent photographer, has been scanning his collection as well. So, to add a little life to this tired old blog, I'm going to start taking advantage of my .Mac account (no longer active) and posting each week a photo from my collection or my dad's collection. Here is the first one:
Jamaican carwash - My old VW Golf in the carwash close to Ocho Rios, Jamaica
Dave Johnson in General
07:00PM Aug 27, 2004
Comments [0]
Tags:
family
photo
I've been researching newsfeed formats for various reasons. I've been using Rome to convert to and from various formats and that revealed a problem with Roller's RSS feed. After re-reading the loosey goosey RSS specs, I'm thinking that I got it wrong in the Roller RSS feeds. What do you think? Currently, Roller uses the following elements for links:
Dave Johnson in Roller
03:37AM Aug 27, 2004
Comments [3]
Tags:
Roller
Le Danois: A wiki and weblog placed on a USB key, is that possible? The answer seems to be yes. I have put a bundle of Roller weblogger, JSPWiki, HSQLDB (file based database) and Tomcat on the USB key and I am currently testing it.
Dave Johnson in Blogging
10:43AM Aug 21, 2004
Comments [0]
Tags:
Blogging