
Parsing RSS with .Net

How do you do it? I need to provide some examples to show how to parse RSS with Java and C#. I have written simple parsers using the common XML parsing techniques such as DOM, SAX, and Pull. I have also written some examples that use parser libraries, but I have yet to find a good and free RSS parser library for .Net. Lazy-web, please help me out here.

When you assume...

If you assume that RSS is XML and you are just interested in getting titles, descriptions, links, and dates, then it is pretty easy to write a simple parser that can handle most forms of RSS, including RSS 1.0, RSS 2.0, and some forms of funky RSS. If you need to handle more than those basic elements, then I recommend that you use a parser library.

Parser libraries

Python programmers are blessed with a great newsfeed parser library: Pilgrim's regex-based Universal Feed Parser, which can parse any feed, even if it is not valid XML. I don't think Pilgrim's parser will port easily to Jython, the Java version of Python, because Jython is missing some important Python libraries and uses a Java regex engine that differs from Python's built-in one. The same probably goes for IronPython, the .Net version of Python. By the way, Lazy-web, would you please port Pilgrim's parser to Jython?

So, Java developers don't have the Universal Feed Parser, but we do have two active projects that are developing full-featured RSS (and Atom) parsers: Informa (used by Javablogs.com) and Rome. .Net developers have RSS.Net, but it is incomplete and development seems to have completely stagnated back in November of 2003.

So how do you parse RSS with .Net? I started looking around and digging into source code. I found that Dare built his C# based RSS parser for RssBandit on top of an SGML parser. Joe built his C# based RSS parser for Aggie using good old System.Xml. I guess you just have to do it by hand, so here goes...

My examples

Now it's time for the lazy web to point and laugh at my feeble efforts to build simple RSS parsers in C#. I have two examples for your ridicule. After you are done laughing, please, .Net heads, help me out and tell me what I am doing wrong and where I can make improvements.

First, here is a simple C# RSS parser method that uses a DOM based approach. It extracts the basic elements of title, description, link, and pubDate from the channel and item levels and it puts them into a dictionary (just like Pilgrim's parser does). It can handle RSS 1.0, RSS 2.0, and some forms of funky RSS. Have a look:

using System;
using System.Collections;
using System.Xml;

// Parses an RSS 1.0 or RSS 2.0 file into a dictionary of feed-level values
// plus an "items" list that holds one dictionary per item.
public IDictionary ParseFeed(String fileName) {
    XmlDocument feedDoc = new XmlDocument();
    feedDoc.Load(fileName);
    XmlElement root = feedDoc.DocumentElement;
    string contentNS = "http://purl.org/rss/1.0/modules/content/";
    string dcNS = "http://purl.org/dc/elements/1.1/";
    string xhtmlNS = "http://www.w3.org/1999/xhtml";
    // RSS 2.0 core elements have no namespace; anything else is treated as RSS 1.0.
    string defaultNS = null;
    if (!root.Name.Equals("rss")) {
        defaultNS = "http://purl.org/rss/1.0/";
    }
    XmlElement channel = (XmlElement)root.GetElementsByTagName("channel").Item(0);
    IDictionary feedMap = new Hashtable();
    feedMap.Add("title", GetChildText(channel, "title", defaultNS));
    feedMap.Add("pubDate", GetChildText(channel, "pubDate", defaultNS));
    feedMap.Add("dc:date", GetChildText(channel, "date", dcNS));
    feedMap.Add("description", GetChildText(channel, "description", defaultNS));
    feedMap.Add("link", GetChildText(channel, "link", defaultNS));

    // RSS 2.0 nests items inside the channel; RSS 1.0 puts them beside it.
    XmlNodeList items = null;
    if (root.Name.Equals("rss")) {
        items = channel.GetElementsByTagName("item");
    }
    else {
        items = root.GetElementsByTagName("item");
    }
    IList itemList = new ArrayList();
    feedMap.Add("items", itemList);
    for (int i = 0; i < items.Count; i++) {
        IDictionary itemMap = new Hashtable();
        itemList.Add(itemMap);
        XmlElement item = (XmlElement)items.Item(i);
        itemMap.Add("title", GetChildText(item, "title", defaultNS));
        itemMap.Add("link", GetChildText(item, "link", defaultNS));
        itemMap.Add("guid", GetChildText(item, "guid", defaultNS));
        itemMap.Add("pubDate", GetChildText(item, "pubDate", defaultNS));
        itemMap.Add("dc:date", GetChildText(item, "date", dcNS));
        itemMap.Add("description", GetChildText(item, "description", defaultNS));
        itemMap.Add("content:encoded", GetChildText(item, "encoded", contentNS));
        itemMap.Add("body", GetChildText(item, "body", xhtmlNS));
    }
    return feedMap;
}

// Returns the value of the first child node of the first matching descendant
// element: null if there is no such element, "" if the element is empty.
private string GetChildText(XmlElement element, string childName, string namespaceURI) {
    string text = null;
    XmlNodeList nodeList = null;
    if (namespaceURI != null) {
        nodeList = element.GetElementsByTagName(childName, namespaceURI);
    } else {
        nodeList = element.GetElementsByTagName(childName);
    }
    if (nodeList != null && nodeList.Item(0) != null) {
        if (nodeList.Item(0).FirstChild != null) {
            text = nodeList.Item(0).FirstChild.Value;
        } else {
            text = "";
        }
    }
    return text;
}

And here is the same thing, but using a pull-parser based XmlTextReader approach:

public IDictionary ParseFeed(String fileName) {
    XmlTextReader reader = new XmlTextReader(fileName);
    reader.WhitespaceHandling = WhitespaceHandling.None;
    IDictionary feedMap = new Hashtable();
    IList items = new ArrayList();
    IDictionary itemMap = null;  // non-null only while inside an <item>
    feedMap.Add("items", items);
    while (reader.Read()) {
        bool isStart = reader.NodeType.Equals(XmlNodeType.Element);
        bool isEnd = reader.NodeType.Equals(XmlNodeType.EndElement);
        if (isEnd && reader.Name.Equals("item")) {
            itemMap = null;
        }
        else if (isStart && reader.Name.Equals("item")) {
            itemMap = new Hashtable();
            items.Add(itemMap);
        }
        // In each branch below, reader.Read() moves from the start tag
        // to the text node, whose value is then stored.
        else if (isStart && itemMap != null
                && reader.Name.Equals("title")) {
            reader.Read();
            itemMap.Add("title", reader.Value);
        }
        else if (isStart && itemMap != null
                && reader.Name.Equals("link")) {
            reader.Read();
            itemMap.Add("link", reader.Value);
        }
        else if (isStart && itemMap != null
                && reader.Name.Equals("description")) {
            reader.Read();
            itemMap.Add("description", reader.Value);
        }
        else if (isStart && itemMap != null
                && reader.Name.Equals("content:encoded")) {
            reader.Read();
            itemMap.Add("content:encoded", reader.Value);
        }
        // isStart is required here: without it the </body> end tag also
        // matches and the second Add("body", ...) throws on the duplicate key.
        else if (isStart && itemMap != null
                && reader.Name.Equals("body")) {
            reader.Read();
            itemMap.Add("body", reader.Value);
        }
        else if (isStart && itemMap != null
                && reader.Name.Equals("pubDate")) {
            reader.Read();
            itemMap.Add("pubDate", reader.Value);
        }
        else if (isStart && itemMap != null
                && reader.Name.Equals("dc:date")) {
            reader.Read();
            itemMap.Add("dc:date", reader.Value);
        }
        else if (isStart && reader.Name.Equals("title")) {
            reader.Read();
            feedMap.Add("title", reader.Value);
        }
        else if (isStart && reader.Name.Equals("description")) {
            reader.Read();
            feedMap.Add("description", reader.Value);
        }
        else if (isStart && reader.Name.Equals("link")) {
            reader.Read();
            feedMap.Add("link", reader.Value);
        }
        else if (isStart && reader.Name.Equals("pubDate")) {
            reader.Read();
            feedMap.Add("pubDate", reader.Value);
        }
        else if (isStart && reader.Name.Equals("dc:date")) {
            reader.Read();
            feedMap.Add("dc:date", reader.Value);
        }
        else if (isStart && reader.Name.Equals("image")) {
            // skip images, whose title and link would clash with the feed's own
            while (reader.Read()) {
                if (reader.Name.Equals("image")
                        && reader.NodeType.Equals(XmlNodeType.EndElement)) {
                    break;
                }
            }
        }
    }
    reader.Close();
    return feedMap;
}
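
One way the long if/else chain above might be tamed is to drive the parser from a set of element names. The following is only a sketch under some assumptions of mine, not a tested replacement: it uses XmlReader.Create and generics, which need .Net 2.0 or later, and the names PullParserSketch, Wanted, ReadText, and SkipElement, along with the sample feed in Main, are made up for illustration. Unlike the version above, it keeps the first value it sees for a name instead of throwing on a duplicate.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PullParserSketch
{
    // Element names (as written in the feed, prefix included) worth keeping.
    // Hypothetical list; extend as needed.
    static readonly HashSet<string> Wanted = new HashSet<string> {
        "title", "link", "description", "pubDate", "dc:date", "content:encoded"
    };

    public static Dictionary<string, object> ParseFeed(TextReader input)
    {
        var feedMap = new Dictionary<string, object>();
        var items = new List<Dictionary<string, object>>();
        feedMap["items"] = items;
        Dictionary<string, object> itemMap = null;  // non-null only inside an <item>

        var settings = new XmlReaderSettings();
        settings.IgnoreWhitespace = true;
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.EndElement)
                {
                    if (reader.Name == "item") itemMap = null;
                    continue;
                }
                if (reader.NodeType != XmlNodeType.Element) continue;

                string name = reader.Name;  // capture before moving to the text node
                if (name == "item")
                {
                    itemMap = new Dictionary<string, object>();
                    items.Add(itemMap);
                    if (reader.IsEmptyElement) itemMap = null;  // <item/> has no end tag
                }
                else if (name == "image")
                {
                    SkipElement(reader, name);
                }
                else if (Wanted.Contains(name))
                {
                    var target = itemMap ?? feedMap;
                    string text = ReadText(reader);
                    // First value wins, so a later duplicate cannot throw.
                    if (!target.ContainsKey(name)) target[name] = text;
                }
            }
        }
        return feedMap;
    }

    // Returns the text or CDATA directly inside the current element, else "".
    static string ReadText(XmlReader reader)
    {
        if (reader.IsEmptyElement) return "";
        reader.Read();
        bool isText = reader.NodeType == XmlNodeType.Text
                   || reader.NodeType == XmlNodeType.CDATA;
        return isText ? reader.Value : "";
    }

    // Advances the reader to the end tag that matches the current element.
    static void SkipElement(XmlReader reader, string name)
    {
        if (reader.IsEmptyElement) return;
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.EndElement && reader.Name == name) return;
        }
    }

    public static void Main()
    {
        // Made-up sample feed, just enough to exercise each branch.
        string xml =
            "<rss version=\"2.0\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">" +
            "<channel><title>Example feed</title><link>http://example.com/</link>" +
            "<image><title>Logo</title><url>http://example.com/logo.png</url></image>" +
            "<item><title>First post</title><dc:date>2004-09-01T12:00:00Z</dc:date></item>" +
            "<item><title>Second post</title><link>http://example.com/2</link></item>" +
            "</channel></rss>";
        var feed = ParseFeed(new StringReader(xml));
        var items = (List<Dictionary<string, object>>)feed["items"];
        Console.WriteLine(feed["title"]);
        Console.WriteLine(items.Count);
        Console.WriteLine(items[0]["dc:date"]);
    }
}
```

The element name is captured before the reader moves on to the text node, because a text node has an empty Name. Like the examples above, this matches on prefixed names such as dc:date, so a feed that binds the Dublin Core namespace to a different prefix would slip through.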

Have some better examples of parsing RSS with .Net? Please point me to them.

Comments:

  1. Get IronPython
  2. Get the Feed Parser
  3. Profit!!!11!!!

Posted by John Beimler on September 01, 2004 at 02:58 PM EDT #

Are you saying that the Feed Parser will work with IronPython? Have you tried it?

Posted by Dave Johnson on September 01, 2004 at 03:30 PM EDT #

I just tried it and got a stack trace in the interactive interpreter. :( I wouldn't be surprised if it works in the near future, or for someone better at .Net than I.

Posted by John Beimler on September 02, 2004 at 01:57 AM EDT #

Instead of the big long if/else statement

else if (isStart && reader.Name.Equals("link")) {
    reader.Read();
    feedMap.Add("link", reader.Value);
}

how about

if (isStart) {
    if (reader.Name.Equals("image")) {
        // skip images
        while (reader.Read()) {
            if (reader.Name.Equals("image")
                    && reader.NodeType.Equals(XmlNodeType.EndElement)) {
                break;
            }
        }
    } else {
        // capture the name first: after Read() the reader is on the text node
        string name = reader.Name;
        reader.Read();
        feedMap.Add(name, reader.Value);
    }
}

Testing isStart in each if statement is wasting processing power...

Posted by Ski on September 06, 2004 at 05:50 AM EDT #

Thanks a lot buddy... My search ends here..

Posted by Navab on March 26, 2008 at 12:46 PM EDT #

Thanks for the ideas. One question: is the code above free to use? Thanks a lot!

Posted by Rachid B. on April 18, 2008 at 04:38 PM EDT #

For the comments from 2008: you can now do better than this. As .Net 2.0 and above have generics and other niceties, it is probably best to use something like this, which returns a list of items (each one a dictionary that maps the element names inside an "item" to their values):
public static List<Dictionary<string, string>> ReadRssItems(string url)
{
 List<Dictionary<string, string>> items = new List<Dictionary<string, string>>();
 Dictionary<string, string> currentItem = null;

 XmlTextReader reader = new XmlTextReader(url);
 while (reader.Read())
 {
  if (reader.NodeType == XmlNodeType.Element)
  {
   string name = reader.Name;
   if (name.ToLowerInvariant() == "item")
   {
    // Save the previous item
    if (currentItem != null)
     items.Add(currentItem);

    // Create a new item
    currentItem = new Dictionary<string, string>();
   }
   else if (currentItem != null)
   {
    reader.Read();
    currentItem.Add(name, reader.Value);
   }
  }
 }

 return items;
}

Posted by George Helyar on June 09, 2008 at 04:16 PM EDT #

Or, don't write the parser yourself. Instead use the Feeds API that is part of the Windows RSS platform. It parses all forms of RSS and Atom with caching and other nice features. I cover it in my book RSS and Atom in Action, chapter 6.

- Dave

Posted by Dave Johnson on June 09, 2008 at 06:52 PM EDT #

To Dave: please add the code below before the return statement:

// last one
if (currentItem != null)
 items.Add(currentItem);

Otherwise, you lose the last item.

Posted by Owen on December 23, 2009 at 12:20 AM EST #

Sorry Dave, the previous post should have been addressed to George Helyar.

Posted by owen on December 23, 2009 at 12:22 AM EST #
