Parsing RSS with .Net

How do you do it? I need to provide some examples to show how to parse RSS with Java and C#. I have written simple parsers using the common XML parsing techniques such as DOM, SAX, and Pull. I have also written some examples that use parser libraries, but I have yet to find a good and free RSS parser library for .Net. Lazy-web, please help me out here.

When you assume...

If you assume that RSS is XML and you are just interested in getting titles, descriptions, links, and dates, then it is pretty easy to write a simple parser that can handle most forms of RSS including RSS 1.0, RSS 2.0, and some forms of funky RSS. If you need to handle more than those basic elements, then I recommend that you use a parser library.

Parser libraries

Python programmers are blessed with a great newsfeed parser library: Pilgrim's regex-based Universal Feed Parser, which can parse any feed, even if it is not valid XML. I don't think Pilgrim's parser will port easily to Jython, the Java version of Python, because Jython is missing some important Python libraries and uses a Java regex engine that differs from Python's built-in one. The same probably goes for IronPython, the .Net version of Python. By the way, Lazy-web, would you please port Pilgrim's parser to Jython?

So, Java developers don't have the Universal Feed Parser, but we do have two active projects that are developing full-featured RSS (and Atom) parsers: Informa (used by Javablogs.com) and Rome. .Net developers have RSS.Net, but it is incomplete and development seems to have completely stagnated back in November of 2003.

So how do you parse RSS with .Net? I started looking around and digging into source code. I found that Dare built his C# based RSS parser for RssBandit on top of an SGML parser. Joe built his C# based RSS parser for Aggie using good old System.Xml. I guess you just have to do it by hand, so here goes...

My examples

Now it's time for the lazy web to point and laugh at my feeble efforts to build simple RSS parsers in C#. I have two examples for your ridicule. After you are done laughing, please, .Net heads, help me out and tell me what I am doing wrong and where I can make improvements.

First, here is a simple C# RSS parser method that uses a DOM based approach. It extracts the basic elements of title, description, link, and pubDate from the channel and item levels and it puts them into a dictionary (just like Pilgrim's parser does). It can handle RSS 1.0, RSS 2.0, and some forms of funky RSS. Have a look:

// Namespaces needed by both methods (which are meant to live inside a class).
using System;
using System.Collections;
using System.Xml;

public IDictionary ParseFeed(String fileName) {
    XmlDocument feedDoc = new XmlDocument();
    feedDoc.Load(fileName);
    XmlElement root = feedDoc.DocumentElement;
    string contentNS = "http://purl.org/rss/1.0/modules/content/";
    string dcNS = "http://purl.org/dc/elements/1.1/";
    string xhtmlNS = "http://www.w3.org/1999/xhtml";

    // Core elements have no namespace under an <rss> root; otherwise assume RSS 1.0.
    bool isRss = root.Name.Equals("rss");
    string defaultNS = isRss ? null : "http://purl.org/rss/1.0/";

    XmlElement channel = (XmlElement)root.GetElementsByTagName("channel").Item(0);
    IDictionary feedMap = new Hashtable();
    feedMap.Add("title", GetChildText(channel, "title", defaultNS));
    feedMap.Add("pubDate", GetChildText(channel, "pubDate", defaultNS));
    feedMap.Add("dc:date", GetChildText(channel, "date", dcNS));
    feedMap.Add("description", GetChildText(channel, "description", defaultNS));
    feedMap.Add("link", GetChildText(channel, "link", defaultNS));

    // Under <rss> the items are nested in the channel; in RSS 1.0 they are its siblings.
    XmlNodeList items = isRss
        ? channel.GetElementsByTagName("item")
        : root.GetElementsByTagName("item");
    IList itemList = new ArrayList();
    feedMap.Add("items", itemList);
    for (int i = 0; i < items.Count; i++) {
        IDictionary itemMap = new Hashtable();
        itemList.Add(itemMap);
        XmlElement item = (XmlElement)items.Item(i);
        itemMap.Add("title", GetChildText(item, "title", defaultNS));
        itemMap.Add("link", GetChildText(item, "link", defaultNS));
        itemMap.Add("guid", GetChildText(item, "guid", defaultNS));
        itemMap.Add("pubDate", GetChildText(item, "pubDate", defaultNS));
        itemMap.Add("dc:date", GetChildText(item, "date", dcNS));
        itemMap.Add("description", GetChildText(item, "description", defaultNS));
        itemMap.Add("content:encoded", GetChildText(item, "encoded", contentNS));
        itemMap.Add("body", GetChildText(item, "body", xhtmlNS));
    }
    return feedMap;
}

// Returns the text of the first direct child with the given name, "" if that
// child is empty, or null if there is no such child. Only direct children are
// checked, so a channel-level lookup cannot pick up a value from an item.
private string GetChildText(XmlElement element, string childName, string namespaceURI) {
    foreach (XmlNode child in element.ChildNodes) {
        if (child.NodeType != XmlNodeType.Element) {
            continue;
        }
        bool matches = (namespaceURI != null)
            ? child.LocalName.Equals(childName) && child.NamespaceURI.Equals(namespaceURI)
            : child.Name.Equals(childName);
        if (matches) {
            // InnerText also copes with CDATA sections and nested markup.
            return child.InnerText;
        }
    }
    return null;
}

And here is the same thing, but using a pull-parser based XmlTextReader approach:

// Namespaces needed by these methods (which are meant to live inside a class).
using System;
using System.Collections;
using System.Xml;

// Elements captured at the channel level and at the item level.
private static readonly string[] feedFields =
    { "title", "description", "link", "pubDate", "dc:date" };
private static readonly string[] itemFields =
    { "title", "link", "description", "content:encoded", "body", "pubDate", "dc:date" };

public IDictionary ParseFeed(String fileName) {
    XmlTextReader reader = new XmlTextReader(fileName);
    reader.WhitespaceHandling = WhitespaceHandling.None;
    IDictionary feedMap = new Hashtable();
    IList items = new ArrayList();
    IDictionary itemMap = null; // non-null only while inside an <item>
    feedMap.Add("items", items);
    try {
        while (reader.Read()) {
            bool isStart = reader.NodeType.Equals(XmlNodeType.Element);
            bool isEnd = reader.NodeType.Equals(XmlNodeType.EndElement);
            if (isEnd && reader.Name.Equals("item")) {
                itemMap = null;
            }
            else if (isStart && reader.Name.Equals("item")) {
                itemMap = new Hashtable();
                items.Add(itemMap);
            }
            else if (isStart && reader.Name.Equals("image")) {
                SkipElement(reader, "image"); // skip images
            }
            else if (isStart) {
                string name = reader.Name;
                IDictionary target = (itemMap != null) ? itemMap : feedMap;
                string[] wanted = (itemMap != null) ? itemFields : feedFields;
                // Keep the first occurrence only: a later <title> (for example
                // inside <textInput>) would otherwise cause a duplicate-key error.
                if (Array.IndexOf(wanted, name) >= 0 && !target.Contains(name)) {
                    target.Add(name, ReadText(reader));
                }
            }
        }
    }
    finally {
        reader.Close();
    }
    return feedMap;
}

// Returns the text (or CDATA) that immediately follows the current start tag.
// For elements that begin with markup, such as an XHTML body, this is "".
private string ReadText(XmlTextReader reader) {
    if (reader.IsEmptyElement) {
        return ""; // <title/> has no content and no end tag to consume
    }
    reader.Read();
    return reader.Value;
}

// Reads forward to the end tag that matches the current start tag.
private void SkipElement(XmlTextReader reader, string name) {
    if (reader.IsEmptyElement) {
        return; // e.g. RSS 1.0 <image rdf:resource="..."/> inside the channel
    }
    while (reader.Read()) {
        if (reader.Name.Equals(name)
                && reader.NodeType.Equals(XmlNodeType.EndElement)) {
            break;
        }
    }
}
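
One weakness of matching on names like "dc:date" is that the prefix is chosen by the feed author, so a feed that binds the Dublin Core namespace to some other prefix would slip through. Here is a minimal sketch of matching on local name plus namespace URI instead. The class name, method name, and sample feed are made up for illustration; only the XmlTextReader members are real.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml;

public class DcDateSketch {
    const string DcNS = "http://purl.org/dc/elements/1.1/";

    // Collects the text of every Dublin Core date element,
    // whatever prefix the feed happens to use for that namespace.
    public static IList ReadDcDates(string xml) {
        IList dates = new ArrayList();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try {
            while (reader.Read()) {
                if (reader.NodeType == XmlNodeType.Element
                        && reader.LocalName == "date"
                        && reader.NamespaceURI == DcNS
                        && !reader.IsEmptyElement) {
                    reader.Read(); // move from the start tag to its content
                    dates.Add(reader.Value);
                }
            }
        }
        finally {
            reader.Close();
        }
        return dates;
    }

    public static void Main() {
        // Hypothetical feed that uses the prefix "d" rather than "dc".
        string xml =
            "<rss version=\"2.0\" xmlns:d=\"http://purl.org/dc/elements/1.1/\">" +
            "<channel><title>Example</title>" +
            "<item><title>First</title><d:date>2004-09-01T12:00:00Z</d:date></item>" +
            "</channel></rss>";
        foreach (string date in ReadDcDates(xml)) {
            Console.WriteLine(date); // prints 2004-09-01T12:00:00Z
        }
    }
}
```

The same comparison could be applied to content:encoded and the XHTML body, at the cost of carrying a namespace URI alongside each element name.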

Have some better examples of parsing RSS with .Net? Please point me to them.


That was fun.

I haven't had as much fun watching the hits roll in since Weblogger.com threatened to sue me. Yesterday was much, much more fun, of course. Thanks to all who commented, linked, welcomed and trackbacked me. One thing is for sure, you made my mom and dad feel a whole lot better about my leaving the seemingly safe sanctuary of SAS.

I'm venturing into new territory as a blogger. I have always kept my employer a secret. I never wanted anybody to google for HAHT or SAS and end up on my blog. I was a little worried about getting fired for blogging. It still happens, even to those who try to be careful. Now, everybody knows who I work for and that changes things for me. On the positive side, blogging about my work with Roller, blogging technologies, Sun, and Java will give me lots of interesting material to work with - and then there's that evangelism thing. On the negative side, there are probably some topics that I had better avoid. Even at a company with a clueful policy on public discourse, you can still screw up and do damage to your career.

I'm confident that I'll do just fine in this new territory. I tend to be conservative in my output, perhaps too conservative. I'm also biased in favor of Sun and always have been. I'm a shareholder too. There's my full disclosure for you. I've been working with Sun hardware and software since the Sun3 timeframe. In fact, I proposed to the woman who became my wife as a direct result of a SPARCstation sale. I went down to Jamaica in '91 to install a SPARCstation-based system and to do a training workshop on the open source GRASS GIS software, got a great job offer at the Univ. of the West Indies, came home and asked Andi to marry me. We had a great honeymoon in Jamaica that lasted about a year and a half. I hope my honeymoon at Sun will last a lot longer than that.


Full time Roller!

It's official. Roller is now my full time job. I just accepted a job with Sun Microsystems to "design, develop, and deploy the primary blogging system for Sun in conjunction with other engineers" and to evangelize blogging both inside and outside of Sun. Needless to say, I'm thrilled. I'm honored to be working for Sun and with great folks like Will Snow, John Hoffman, Tim Bray, Patrick Chanezon, and Danese Cooper. I'm excited to be working for a company that feels the same way as I do about the value of blogs and wikis, open source software, and encouraging employees to speak with an honest and authentic voice to customers, to partners, and to each other.

What does this mean to Roller? Only good things. Sun wants many of the same things for Roller that other Roller users want including high performance, high availability, great user interface, support for standards, and better support for large communities of bloggers. Thanks to Sun I'll be working full time to help make these things happen. Since Roller will continue on as an open source project, you can help too (and I hope you will).


Those pesky Autoruns

I've been using SysInternals' freeware Windows Process Explorer and other SysInternals utilities for years now, but I never noticed this one. AutoRuns "shows you what programs are configured to run during system bootup or login" and allows you to delete or disable any of them. Via Jonathan Hardwick.


Friday photo

Over the past couple of years, I've been scanning my photo collection using an HP slide/negative scanner. My dad, who is an excellent photographer, has been scanning his collection as well. So, to add a little life to this tired old blog, I'm going to start taking advantage of my .Mac account (no longer active) and post a photo each week from my collection or my dad's collection. Here is the first one:

Jamaican carwash - My old VW Golf in the carwash close to Ocho Rios, Jamaica

RSS link vs. guid vs. source elements

I've been researching newsfeed formats for various reasons. I've been using Rome to convert to and from various formats and that revealed a problem with Roller's RSS feed. After re-reading the loosey goosey RSS specs, I'm thinking that I got it wrong in the Roller RSS feeds. What do you think? Currently, Roller uses the following elements for links:

  • <guid isPermaLink="true"> - the permalink
  • <link> - the (optional) source link, i.e. the one link that the blog entry is about
After looking at some Radio generated RSS 2.0 feeds, it seems like Roller should do this:
  • <link> - the permalink
  • <guid isPermaLink="true"> - the permalink, same as <link>
  • <source url="[url of source]"> - the (optional) source link
Having the permalink in two places (link and guid) seems silly. Should we drop guid entirely? Putting the source link in the source element seems like the right thing to do, but the spec says the source url "links to the XMLization of the source" - that is, the source url should point to an RSS feed. Is that the common usage of the source element?
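
To make the second layout concrete, here is a rough sketch that writes one item with the permalink in both link and guid and, following the spec's wording, a source element whose url points at a feed. The class name, method name, and example.com URLs are placeholders for illustration and are not Roller code.

```csharp
using System;
using System.IO;
using System.Xml;

public class ItemLinkSketch {
    // Writes a single RSS 2.0 item. sourceFeedUrl may be null, in which
    // case no source element is written. All values are placeholders.
    public static string WriteItem(string title, string permalink,
                                   string sourceFeedUrl, string sourceName) {
        StringWriter buffer = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(buffer);
        writer.WriteStartElement("item");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", permalink);   // the permalink

        writer.WriteStartElement("guid");                // same permalink again
        writer.WriteAttributeString("isPermaLink", "true");
        writer.WriteString(permalink);
        writer.WriteEndElement();

        if (sourceFeedUrl != null) {
            writer.WriteStartElement("source");          // optional source feed
            writer.WriteAttributeString("url", sourceFeedUrl);
            writer.WriteString(sourceName);
            writer.WriteEndElement();
        }
        writer.WriteEndElement();
        writer.Close();
        return buffer.ToString();
    }

    public static void Main() {
        Console.WriteLine(WriteItem(
            "Hello",
            "http://example.com/entry/1",
            "http://example.com/other/rss.xml",
            "Other blog"));
    }
}
```

Whether the url attribute should carry a feed URL, as here, or the plain source link is exactly the open question above.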

Can't tell you yet

I'm sitting on some very big news for Roller and for me, but I have to tell some other folks about it before I can tell you.


Socialtext Closes Series A Financing

Enterprise wiki-blogging meets venture capitalists

RSS in Thunderbird

"receiving and reading RSS feeds" in Thunderbird

atomflow

"atom storage/query core"

The Balkanization of the Internet

"how often do you actually visit sites in other countries?"

IBM Touts New Eclipse Package for Linux

Eclipse 3.0 and the IBM JRE

Red Hat launches employee blogs.

Jesus points out that Red Hat has launched employee blogs, or perhaps I should say employee blog. Unlike Sun's employee blog system, where each employee has a personal blog with its own personal theme, Red Hat's Movable Type based blog system appears to be configured for two group blogs. One group blog, Red Hat People, is for regular employees and one, Red Hat Executives, is for executives. Interesting approach. Hey, aren't executives people too?


Knowledge management on your keychain.

Thank goodness for referrers. They bring in the porn spam, sure, but they also bring in wonderful news of the world. How could I have missed this incredible technological achievement:

Le Danois: A wiki and weblog placed on a USB key, is that possible? The answer seems to be yes. I have put a bundle of Roller weblogger, JSPWiki, HSQLDB (file based database) and Tomcat on the USB key and I am currently testing it.

BlogWave

BlogWave is an interesting .Net based blog app that supports the scheduled generation and publishing of RSS feeds from a variety of sources, including blogs and NNTP, to a variety of destinations, including blogs, FTP, and plain old directories. It supports plugins so you can implement your own source and destination adapters. Sounds a lot like eSyndication, but for newsfeeds only.

Why I hate Wikis

Jimmo explains why he hates Wikis

Identifying Atom

Mark Pilgrim explains Atom identity issues

SonicBreakdown.com

The two concerts that I attended were arranged by a friend of mine who was working in Armenia. He keeps up with new releases and concerts by watching music news sites and Pollstar. From half a world away, he had more awareness of local music happenings than I do. I'm just not that hip to the scene.

What I need is something like SonicBreakdown.com. SonicBreakdown.com is a site that is designed to keep you in touch with your favorite music - sort of a personal music portal/aggregator. The spyware-free SonicBreakdown client scans your music collection, uploads that information to the site, and then keeps you informed of nearby concerts, upcoming television appearances, new releases, and music news about the artists in your collection. It does this by using RSS from a variety of music news sites and Web services provided by Amazon, Pollstar, and others. I wish I had signed up earlier, then I would have known that Sonic Youth is playing the Cradle tonight. Now, if only I could subscribe to SonicBreakdown feeds in BlogLines...

The Dead @ Raleigh

The Dead show was a lot of fun. With the addition of guitarists Warren Haynes and Jimmy Herring, new keyboard player Jeff Chimenti, and Branford Marsalis on sax and clarinet, the Dead has a very different sound than the Grateful Dead had in the 90's. The band seems more together and more "tight" but it still has the ability to veer off into wildly improvisational trippy jam territory and then bring it all back home again. Overall the music was good and the Dead songs they played were generally excellent, but there were times when the new songs ventured a little too far into the southern rock sound for my tastes. I'm not that big a fan of southern rock, unless you count REM, the B-52s, and The Connells. Overall review: two thumbs up. (see also: photos and set-list).


Masters of slow motion

No blogging or any other productive work for me tonight - I'm going to see 'the Dead.' That's something I haven't done since June 18, 1995. Ok, it is not really the Dead without Jerry, but it is as close as you can get these days (Dark Star Orchestra notwithstanding).
