Thursday, February 19, 2009

Reading Feeds using the ROME API

A recent task had me doing something pretty simple - reading some RSS feeds for display on pages inside our application.  I know that it's easy enough to just write some custom code that parses an RSS feed - after all it's just XML, right?  I didn't want to do that, so I did some digging, and found two real options - ROME, and Commons FeedParser.

It quickly became obvious that ROME was the better choice, and that's when the fun started. ROME is a Sun project that seemed to offer the most flexibility in reading different syndication formats, along with the clearest API docs. Following the tutorial on the ROME site, getting the feed and parsing it was really easy: just pass a feed URL to the feed reader and get a list of entries back, then loop through them and display the content you want.


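import java.net.URL;
import java.util.List;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

// 'feed' is the feed address, passed in as a String
URL feedUrl = new URL( feed );
SyndFeedInput input = new SyndFeedInput();
SyndFeed syndFeed = input.build( new XmlReader( feedUrl ) );

List feedEntries = syndFeed.getEntries();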

There are some subtleties, however, that seemed to merit a post, as I didn't really find any clear explanations for these things in one place.

Issue 1:  Content Encoding

We are parsing a feed from a WordPress blog, and content from some posters always shows up with the weird characters that signal an encoding problem: the diamonds with question marks in them (or just empty boxes in Opera), inserted wherever there is a sort of 'half-space' on the actual blog. I determined that the blog was using UTF-8 (this seems to be the default encoding for a WordPress instance). After much searching, I came across a post that seemed to contain about a million suggestions for how to handle the error. What worked for some didn't seem to work for others, and certainly didn't work for me! I tried reading the URL as a stream, to no avail.
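To illustrate how characters like these can appear, here is a small sketch. It rests on an assumption I never confirmed for this particular feed: that the 'half-space' is a non-breaking space (U+00A0) and that somewhere the bytes are written in one charset and read in another. The class and method names (EncodingMismatchDemo, recode) are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Encodes the text with one charset, then decodes the bytes with another. */
    static String recode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // Assumption: the 'half-space' is a non-breaking space.
        String nbsp = "\u00A0";

        // Written as ISO-8859-1 (one byte, 0xA0) but read as UTF-8:
        // the lone byte is not valid UTF-8, so it decodes to U+FFFD,
        // which browsers draw as a diamond with a question mark.
        String diamond = recode(nbsp, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("U+" + Integer.toHexString(diamond.charAt(0)).toUpperCase());

        // Written as UTF-8 (two bytes, 0xC2 0xA0) but read as ISO-8859-1:
        // each byte becomes its own character, so an extra one appears.
        String extra = recode(nbsp, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(extra.length() + " characters instead of 1");

        // Same charset on both sides: the text survives intact.
        String intact = recode(nbsp, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(nbsp));
    }
}
```

Either mismatch is enough to break the display, which is why the fix can end up living on whichever side is doing the reading or the writing with the wrong charset.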

In the end, I settled on setting the character encoding on the HttpServletResponse object, which seems to take care of things. Presumably the feed was being decoded correctly all along and the page was simply being sent to the browser in a different encoding. It seems a little weird to me, but that's okay as long as it works and I don't have to write custom parsers. After updating things, here's how my code looked:


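// 'feed' and 'encoding' are Strings passed in; 'response' is the HttpServletResponse
URL feedUrl = new URL( feed );
String respEncoding = "";
if ( encoding == null )
{
    encoding = "UTF-8";
    respEncoding = "UTF8";
}
else
{
    respEncoding = encoding.replaceAll( "-", "" );
}

// encoding XmlReader falls back to when it cannot detect one from the feed
XmlReader.setDefaultEncoding( encoding );

SyndFeedInput input = new SyndFeedInput();
// second argument 'true' turns on lenient encoding detection
SyndFeed syndFeed = input.build( new XmlReader( feedUrl.openStream(), true ) );

List feedEntries = syndFeed.getEntries();

// must be called before response.getWriter(), or it has no effect
response.setCharacterEncoding( respEncoding );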
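A note on the hyphen-stripping: Java's own Charset class already treats UTF8 as an alias for UTF-8, so one alternative is to let Charset.forName sort out whatever name comes in and fall back to UTF-8 when the name isn't recognized. This is only a sketch of that idea, not what I ran at the time; the class and method names (CharsetNames, canonicalCharset) are hypothetical and not part of ROME or the servlet API.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetNames {

    /**
     * Returns the canonical name of the requested charset, or "UTF-8"
     * when the name is missing or not recognized by this JVM.
     */
    static String canonicalCharset(String requested) {
        if (requested == null || requested.trim().isEmpty()) {
            return StandardCharsets.UTF_8.name();
        }
        try {
            // forName accepts aliases and is case-insensitive
            return Charset.forName(requested.trim()).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // malformed or unknown name: fall back to the default
            return StandardCharsets.UTF_8.name();
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalCharset("UTF8"));   // UTF-8
        System.out.println(canonicalCharset("latin1")); // ISO-8859-1
        System.out.println(canonicalCharset(null));     // UTF-8
    }
}
```

The returned name could then be handed to both XmlReader.setDefaultEncoding and response.setCharacterEncoding, so the two always agree.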

Issue 2: Where's My Content?

The first blog I tested was easy to parse once I had that list of entries. I just needed to display the description field on the SyndEntry object, and it gave me a nicely formatted (accounting for HTML inside the post) abridged posting. Then I tried to display a blog hosted on the Blogger platform (the very blog you are reading now). Would you believe that the description property was unset? No content. That left me having to pull the content straight out of the entry's list of raw contents. I didn't really want an if-else for this in the display, and the way SyndEntry was made, it wasn't super simple to subclass, so I went ahead and created my own class that takes the SyndEntryImpl object and decorates it with a few simple convenience methods:



/**
 * Returns the entry's summary: the description field when the feed supplies one
 * (WordPress is kind and does this), otherwise the start of the raw content.
 * @return the abridged contents
 */
public String getAbridgedContents()
{
    // first try the description
    if ( myEntry.getDescription() != null && myEntry.getDescription().getValue() != null )
    {
        return myEntry.getDescription().getValue();
    }

    // if that's not working, concatenate the raw contents
    StringBuilder sb = new StringBuilder();
    SyndContent sc = null;

    for ( int i = 0; i < myEntry.getContents().size(); i++ )
    {
        sc = (SyndContent) myEntry.getContents().get( i );
        sb.append( sc.getValue() );
    }

    // short posts are returned whole; substring( 0, 255 ) would throw on them
    if ( sb.length() <= 255 )
    {
        return sb.toString();
    }

    return sb.substring( 0, 255 ) + " [...]";
}

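public String getDateString()
{
    // not every feed supplies a published date; format() would throw on null
    if ( myEntry.getPublishedDate() == null )
    {
        return null;
    }

    SimpleDateFormat dateFormat = new SimpleDateFormat( "EEE, MMM d" );

    return dateFormat.format( myEntry.getPublishedDate() );
}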

public String getCatName()
{
    if ( myEntry.getCategories() != null && myEntry.getCategories().size() > 0 )
    {
        SyndCategory sc = (SyndCategory) myEntry.getCategories().get( 0 );

        return sc.getName();
    }
    else
    {
        return null;
    }
}

public String getCatUri()
{
    if ( myEntry.getCategories() != null && myEntry.getCategories().size() > 0 )
    {
        SyndCategory sc = (SyndCategory) myEntry.getCategories().get( 0 );

        return sc.getTaxonomyUri();
    }
    else
    {
        return null;
    }
}
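One thing to watch with the raw-content fallback: the raw content is the full HTML of the post, so cutting at a fixed character count can land in the middle of a tag. Below is a rough sketch of one way around that, which strips the markup before cutting. It is not what the decorator above does, the names (ContentAbridger, abridge) are hypothetical, and the regex is a crude approximation rather than a real HTML parser (it leaves entities like &amp; untouched and can split a word that has a tag in the middle).

```java
public class ContentAbridger {

    /**
     * Strips anything that looks like an HTML tag, collapses whitespace,
     * and cuts the result to at most maxChars characters. The marker
     * " [...]" is appended only when text was actually cut.
     */
    static String abridge(String html, int maxChars) {
        if (html == null) {
            return "";
        }
        // Replace tags with a space so adjacent blocks don't run together.
        String text = html.replaceAll("<[^>]*>", " ")
                          .replaceAll("\\s+", " ")
                          .trim();
        if (text.length() <= maxChars) {
            return text;
        }
        return text.substring(0, maxChars).trim() + " [...]";
    }

    public static void main(String[] args) {
        String post = "<p>Hello <b>feed</b> readers</p>";
        System.out.println(abridge(post, 255)); // Hello feed readers
        System.out.println(abridge(post, 5));   // Hello [...]
    }
}
```

The trade-off is that the fallback summary loses its formatting, whereas the description field from WordPress keeps it.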



Hopefully this will help people get their feed readers working quickly.
