Use of Lucene to store data from RSS feeds - Stack Overflow
http://stackoverflow.com/feeds/question/3933189

Updated: 2018-04-24T23:33:55Z

 



Use of Lucene to store data from RSS feeds

2010-10-15T15:50:41Z

I would like to store data retrieved hourly from RSS feeds in a database or in Lucene so that the text can be easily indexed for word counts.

I need to get the text from the title and description elements of RSS items.

Ideally, for each hourly retrieval from a given feed, I would add a row to a table in a dataset made up of the following columns:

feed_url, title_element_text, description_element_text, polling_date_time

From this, I can look up any element in a feed and calculate keyword counts over whatever period of time is required.

This can be done with a database table, using hashmaps to calculate the counts. But can I do this in Lucene at this level of granularity at all? If so, would each feed form a Lucene document, or would each 'row' from the database table form one?

Can anyone advise?

Thanks

Martin O'Shea.




Answer by Xodarap for Use of Lucene to store data from RSS feeds

2010-10-15T15:50:41Z

My parsing of your question is:

for each item in feed:
    calculate term frequency of item, then add to feed's frequency list

This is not something that Lucene excels at, so CouchDB or another database might be as good a choice, if not a better one (as larsmans suggests). However, it can be done, in a way that is probably slightly easier than with other databases:

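import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.TermEnum;
// Lucene 3.x-era Java API (TermEnum was removed in 4.0); indexReader is an already-open IndexReader.
Map<String, Integer> terms = new HashMap<String, Integer>();
TermEnum tEnum = indexReader.terms();
while (tEnum.next())
{
    // docFreq() is the number of documents containing the term, not its total number of occurrences
    terms.put(tEnum.term().text(), tEnum.docFreq());
}
tEnum.close();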

All Lucene is saving you is the work of calculating the document frequency (docFreq) of each term, and it will probably be a bit faster than looping through all the rows yourself. But I'd be surprised if the performance difference were noticeable for reasonably small data sets.
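
For comparison, here is a rough sketch of the "loop through the rows yourself" approach in plain Java, with no Lucene involved. It assumes each hourly poll is held in memory as a row with the four columns from the question. The class name FeedWordCounts, the Row type, the example feed URL and the very crude tokenizer are all made up for illustration. It computes both the total number of occurrences of each term and the number of rows containing it (the analogue of docFreq if each row were a Lucene document), restricted to one feed and one time window.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class FeedWordCounts {

    /** One hourly poll of one RSS item, mirroring the columns in the question (hypothetical type). */
    static final class Row {
        final String feedUrl;
        final String title;
        final String description;
        final Instant polledAt;

        Row(String feedUrl, String title, String description, Instant polledAt) {
            this.feedUrl = feedUrl;
            this.title = title;
            this.description = description;
            this.polledAt = polledAt;
        }
    }

    /** Crude tokenizer (an assumption): lower-case, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if the row belongs to feedUrl and was polled in the half-open window [from, to). */
    private static boolean inWindow(Row r, String feedUrl, Instant from, Instant to) {
        return r.feedUrl.equals(feedUrl) && !r.polledAt.isBefore(from) && r.polledAt.isBefore(to);
    }

    /** Total occurrences of each term across the matching rows. */
    static Map<String, Integer> termFrequency(List<Row> rows, String feedUrl, Instant from, Instant to) {
        Map<String, Integer> counts = new HashMap<>();
        for (Row r : rows) {
            if (!inWindow(r, feedUrl, from, to)) {
                continue;
            }
            for (String t : tokenize(r.title + " " + r.description)) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Number of matching rows that contain each term at least once. */
    static Map<String, Integer> docFrequency(List<Row> rows, String feedUrl, Instant from, Instant to) {
        Map<String, Integer> counts = new HashMap<>();
        for (Row r : rows) {
            if (!inWindow(r, feedUrl, from, to)) {
                continue;
            }
            // De-duplicate within the row so each row contributes at most 1 per term.
            for (String t : new HashSet<>(tokenize(r.title + " " + r.description))) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String feed = "http://example.com/feed"; // placeholder URL
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(feed, "Lucene tips", "Lucene indexing and Lucene search",
                Instant.parse("2010-10-15T10:00:00Z")));
        rows.add(new Row(feed, "Search news", "Faster search",
                Instant.parse("2010-10-15T11:00:00Z")));

        Instant from = Instant.parse("2010-10-15T00:00:00Z");
        Instant to = Instant.parse("2010-10-16T00:00:00Z");
        System.out.println("term frequency: " + termFrequency(rows, feed, from, to));
        System.out.println("doc frequency:  " + docFrequency(rows, feed, from, to));
    }
}
```

With data this small the two maps are cheap to rebuild on every query. If each row were indexed as its own Lucene document, the second map is roughly what the TermEnum loop above gives you, although Lucene's own analyzer would tokenize differently from this sketch.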