Parsing RSS with .Net
How do you do it? I need to provide some examples to show how to parse RSS with Java and C#. I have written simple parsers using the common XML parsing techniques such as DOM, SAX, and Pull. I have also written some examples that use parser libraries, but I have yet to find a good and free RSS parser library for .Net. Lazy-web, please help me out here.
When you assume...
If you assume that RSS is XML and you are just interested in getting titles, descriptions, links, and dates, then it is pretty easy to write a simple parser that can handle most forms of RSS, including RSS 1.0, RSS 2.0, and some forms of funky RSS. If you need to handle more than those basic elements, then I recommend that you use a parser library.
Parser libraries
Python programmers are blessed with a great newsfeed parser library: Pilgrim's regex-based Universal Feed Parser, which can parse any feed, even if it is not valid XML. I don't think Pilgrim's parser will port easily to Jython, the Java version of Python, because Jython is missing some important Python libraries and uses Java's regex engine, which differs from Python's built-in one. The same probably goes for IronPython, the .Net version of Python. By the way, Lazy-web, would you please port Pilgrim's parser to Jython?
So, Java developers don't have the Universal Feed Parser, but we do have two active projects that are developing full-featured RSS (and Atom) parsers: Informa (used by Javablogs.com) and Rome. .Net developers have RSS.Net, but it is incomplete and development seems to have completely stagnated back in November of 2003.
So how do you parse RSS with .Net? I started looking around and digging into source code. I found that Dare built his C#-based RSS parser for RssBandit on top of an SGML parser. Joe built his C#-based RSS parser for Aggie using good old System.Xml. I guess you just have to do it by hand, so here goes...
My examples
Now it's time for the lazy web to point and laugh at my feeble efforts to build simple RSS parsers in C#. I have two examples for your ridicule. After you are done laughing, please, .Net heads, help me out and tell me what I am doing wrong and where I can make improvements.
First, here is a simple C# RSS parser method that uses a DOM-based approach. It extracts the basic elements of title, description, link, and pubDate at the channel and item levels and puts them into a dictionary (just like Pilgrim's parser does). It can handle RSS 1.0, RSS 2.0, and some forms of funky RSS. Have a look:
// Requires: using System.Collections; using System.Xml;
public IDictionary ParseFeed(string fileName) {
    XmlDocument feedDoc = new XmlDocument();
    feedDoc.Load(fileName);
    XmlElement root = feedDoc.DocumentElement;
    string contentNS = "http://purl.org/rss/1.0/modules/content/";
    string dcNS = "http://purl.org/dc/elements/1.1/";
    string xhtmlNS = "http://www.w3.org/1999/xhtml";
    // An <rss> root means the core elements have no namespace;
    // otherwise assume RSS 1.0 (rdf:RDF root) and its namespace.
    string defaultNS = null;
    if (!root.Name.Equals("rss")) {
        defaultNS = "http://purl.org/rss/1.0/";
    }
    XmlElement channel = (XmlElement)root.GetElementsByTagName("channel").Item(0);
    // Channel-level fields; a missing element is stored as null.
    IDictionary feedMap = new Hashtable();
    feedMap.Add("title", GetChildText(channel,"title",defaultNS));
    feedMap.Add("pubDate", GetChildText(channel,"pubDate",defaultNS));
    feedMap.Add("dc:date", GetChildText(channel,"date",dcNS));
    feedMap.Add("description", GetChildText(channel,"description",defaultNS));
    feedMap.Add("link", GetChildText(channel,"link",defaultNS));
    // RSS 2.0 nests items inside the channel; RSS 1.0 puts them next to it.
    XmlNodeList items = null;
    if (root.Name.Equals("rss")) {
        items = channel.GetElementsByTagName("item");
    }
    else {
        items = root.GetElementsByTagName("item");
    }
    IList itemList = new ArrayList();
    feedMap.Add("items", itemList);
    for (int i=0; i<items.Count; i++) {
        IDictionary itemMap = new Hashtable();
        itemList.Add(itemMap);
        XmlElement item = (XmlElement)items.Item(i);
        itemMap.Add("title", GetChildText(item,"title",defaultNS));
        itemMap.Add("link", GetChildText(item,"link",defaultNS));
        itemMap.Add("guid", GetChildText(item,"guid",defaultNS));
        itemMap.Add("pubDate", GetChildText(item,"pubDate",defaultNS));
        itemMap.Add("dc:date", GetChildText(item,"date",dcNS));
        itemMap.Add("description", GetChildText(item,"description",defaultNS));
        itemMap.Add("content:encoded", GetChildText(item,"encoded",contentNS));
        itemMap.Add("body", GetChildText(item,"body",xhtmlNS));
    }
    return feedMap;
}
// Returns the text of the first descendant with the given name:
// null if there is no such element, "" if it is present but empty.
private string GetChildText(XmlElement element, string childName, string namespaceURI) {
    XmlNodeList nodeList = null;
    if (namespaceURI != null) {
        nodeList = element.GetElementsByTagName(childName, namespaceURI);
    } else {
        nodeList = element.GetElementsByTagName(childName);
    }
    if (nodeList.Count == 0) {
        return null;
    }
    // InnerText gathers all text and CDATA inside the element, so it also
    // works when the first child is markup (as in an XHTML body).
    return nodeList.Item(0).InnerText;
}
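To make the shape of that dictionary a little more concrete, here is a rough sketch of how a caller might walk it. The class name FeedMapDemo, the methods SampleFeedMap and Summarize, and all of the sample values are made up for illustration; the map is built by hand so the snippet stands on its own, but it uses the same keys that ParseFeed fills in.

```csharp
using System;
using System.Collections;
using System.Text;

public static class FeedMapDemo
{
    // Hand-built stand-in for the map ParseFeed returns (invented sample data).
    public static IDictionary SampleFeedMap()
    {
        IDictionary feedMap = new Hashtable();
        feedMap.Add("title", "Example feed");
        IList items = new ArrayList();
        feedMap.Add("items", items);

        IDictionary first = new Hashtable();
        first.Add("title", "First post");
        first.Add("pubDate", "Tue, 01 Jun 2004 10:00:00 GMT");
        first.Add("dc:date", null);          // element absent in this item
        items.Add(first);

        IDictionary second = new Hashtable();
        second.Add("title", "Second post");
        second.Add("pubDate", null);         // element absent in this item
        second.Add("dc:date", "2004-06-02");
        items.Add(second);
        return feedMap;
    }

    // Builds the feed title followed by one "- title (date)" line per item.
    public static string Summarize(IDictionary feedMap)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append(feedMap["title"]).Append('\n');
        foreach (IDictionary item in (IList)feedMap["items"])
        {
            // Missing elements come back as null, so fall back to dc:date.
            string date = (string)item["pubDate"];
            if (date == null)
            {
                date = (string)item["dc:date"];
            }
            sb.Append("- ").Append(item["title"]);
            sb.Append(" (").Append(date).Append(")\n");
        }
        return sb.ToString();
    }
}
```

Calling FeedMapDemo.Summarize(FeedMapDemo.SampleFeedMap()) yields the feed title on the first line and one line per item, with the second item falling back to its dc:date value. Checking for null before using a field matters because an absent element is stored as null rather than left out of the map.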
And here is the same ParseFeed method, rewritten to use a pull-parser approach based on XmlTextReader:
// Requires: using System.Collections; using System.Xml;
public IDictionary ParseFeed(string fileName) {
    XmlTextReader reader = new XmlTextReader(fileName);
    reader.WhitespaceHandling = WhitespaceHandling.None;
    IDictionary feedMap = new Hashtable();
    IList items = new ArrayList();
    // itemMap is non-null only while the reader is inside an <item>.
    IDictionary itemMap = null;
    feedMap.Add("items", items);
    while (reader.Read()) {
        bool isStart = reader.NodeType.Equals(XmlNodeType.Element);
        bool isEnd = reader.NodeType.Equals(XmlNodeType.EndElement);
        if (isEnd && reader.Name.Equals("item")) {
            itemMap = null;
        }
        else if (isStart && reader.Name.Equals("item")) {
            itemMap = new Hashtable();
            items.Add(itemMap);
        }
        // Item-level elements: the extra Read() moves from the start tag
        // to the text node that holds the value.
        else if (isStart && itemMap!=null
                && reader.Name.Equals("title")) {
            reader.Read();
            itemMap.Add("title", reader.Value);
        }
        else if (isStart && itemMap!=null
                && reader.Name.Equals("link")) {
            reader.Read();
            itemMap.Add("link", reader.Value);
        }
        else if (isStart && itemMap!=null
                && reader.Name.Equals("description")) {
            reader.Read();
            itemMap.Add("description", reader.Value);
        }
        else if (isStart && itemMap!=null
                && reader.Name.Equals("content:encoded")) {
            reader.Read();
            itemMap.Add("content:encoded", reader.Value);
        }
        // isStart is required here too; without it the </body> end tag
        // matches as well and the second Add throws on the duplicate key.
        else if (isStart && itemMap!=null
                && reader.Name.Equals("body")) {
            reader.Read();
            itemMap.Add("body", reader.Value);
        }
        else if (isStart && itemMap!=null
                && reader.Name.Equals("pubDate")) {
            reader.Read();
            itemMap.Add("pubDate", reader.Value);
        }
        else if (isStart && itemMap!=null
                && reader.Name.Equals("dc:date")) {
            reader.Read();
            itemMap.Add("dc:date", reader.Value);
        }
        // Channel-level elements (we are not inside an item here).
        else if (isStart && reader.Name.Equals("title")) {
            reader.Read();
            feedMap.Add("title", reader.Value);
        }
        else if (isStart && reader.Name.Equals("description")) {
            reader.Read();
            feedMap.Add("description", reader.Value);
        }
        else if (isStart && reader.Name.Equals("link")) {
            reader.Read();
            feedMap.Add("link", reader.Value);
        }
        else if (isStart && reader.Name.Equals("pubDate")) {
            reader.Read();
            feedMap.Add("pubDate", reader.Value);
        }
        else if (isStart && reader.Name.Equals("dc:date")) {
            reader.Read();
            feedMap.Add("dc:date", reader.Value);
        }
        else if (isStart && reader.Name.Equals("image")) {
            // skip images, whose title and link would collide with the channel's
            while (reader.Read()) {
                if (reader.Name.Equals("image")
                        && reader.NodeType.Equals(XmlNodeType.EndElement)) {
                    break;
                }
            }
        }
    }
    reader.Close();
    return feedMap;
}
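The reader.Read() followed by reader.Value pattern in that listing quietly assumes that the node right after a start tag is a text node. Here is a minimal sketch of the same step with that assumption made explicit, so that an empty element or one whose first child is markup gives "" instead of whatever node happens to come next. The names PullDemo and FirstTitle are hypothetical, and it reads from a string rather than a file just to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PullDemo
{
    // Returns the text of the first <title> element in the XML,
    // "" if that element is empty, or null if there is no <title>.
    public static string FirstTitle(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.WhitespaceHandling = WhitespaceHandling.None;
        try
        {
            while (reader.Read())
            {
                bool isStart = reader.NodeType == XmlNodeType.Element;
                if (!isStart || reader.Name != "title")
                {
                    continue;
                }
                // <title/> has no content and no end tag to read past.
                if (reader.IsEmptyElement)
                {
                    return "";
                }
                // Step from the start tag to its first child node.
                reader.Read();
                if (reader.NodeType == XmlNodeType.Text
                        || reader.NodeType == XmlNodeType.CDATA)
                {
                    return reader.Value;
                }
                // First child is an end tag or nested markup: no plain text.
                return "";
            }
            return null;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

For example, PullDemo.FirstTitle("<rss><channel><title>Hello</title></channel></rss>") returns "Hello". The same node-type check could be folded into each branch of the parser above, at the cost of a few more lines per element.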
Have some better examples of parsing RSS with .Net? Please point me to them.