Comments on: Live or recorded data feeds

Comments on Ask MetaFilter post Live or recorded data feeds

Published: Sun, 12 Jun 2005 20:30:50 -0800

Last Build Date: Sun, 12 Jun 2005 20:30:50 -0800


Question: Live or recorded data feeds

Sun, 12 Jun 2005 20:20:07 -0800

I'm looking for datasets (or ideas for possible datasets), preferably live feeds of data, with which to stress test some analysis algorithms I've written.

The dataset needs to:

* Be either live (as in a source I can harvest data from as it's created) or pre-recorded.
* Cover a stream of changes over time about a large number of discrete objects.
* Be accessible in as raw a form as possible, preferably XML -- scraping inconsistently tagged HTML files is a recipe for wasted time.
* Free or fairly cheap -- a few hundred $ is okay, thousands of $ is not.

The algorithms use the stream of changes as input, and produce various statistics as the output. Each conceptual change needs to relate to an object's creation, modification or eventual deletion. An "object" may be a thermometer, or a sensor, or a person, or a car -- anything that can exist.

Each object must be expressible as a set of metadata covering as many of the following classifications as possible:

* Text: Free text, such as titles and descriptions.
* Hierarchical: For example, a "location" may be broken into a hierarchy of continents, countries, states, counties, cities, etc.
* Geographical: One or more geographical locations, expressed in either longitude/latitude coordinates or nautical coordinates.
* Scalar: For example, "size", "age", "weight", "length", "temperature" are scalar values.
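To make the shape of the input concrete, here is a minimal Python sketch of one change record carrying the four kinds of metadata listed above. The class and field names (`Change`, `ObjectMetadata`, `ChangeKind`, and so on) are illustrative assumptions, not the format the actual algorithms use.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class ChangeKind(Enum):
    """The three kinds of change an object can undergo."""
    CREATE = "create"
    MODIFY = "modify"
    DELETE = "delete"


@dataclass
class ObjectMetadata:
    # Free text, such as titles and descriptions.
    text: Dict[str, str] = field(default_factory=dict)
    # Hierarchies as root-to-leaf paths, e.g. {"location": ["Europe", "Norway"]}.
    hierarchy: Dict[str, List[str]] = field(default_factory=dict)
    # One or more (latitude, longitude) pairs in decimal degrees (assumed convention).
    locations: List[Tuple[float, float]] = field(default_factory=list)
    # Scalar values such as size, age, weight, or temperature.
    scalars: Dict[str, float] = field(default_factory=dict)


@dataclass
class Change:
    timestamp: float          # seconds since some epoch (assumed unit)
    object_id: str            # hypothetical stable identifier for the object
    kind: ChangeKind
    metadata: ObjectMetadata = field(default_factory=ObjectMetadata)


# A made-up example: a thermometer object being created.
example = Change(
    timestamp=0.0,
    object_id="sensor-1",
    kind=ChangeKind.CREATE,
    metadata=ObjectMetadata(
        text={"title": "Thermometer"},
        hierarchy={"location": ["Europe", "Norway"]},
        locations=[(60.0, 2.0)],
        scalars={"temperature": 4.5},
    ),
)
```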

Having some geographical component is relatively important, as the algorithms are geographical by nature; but if need be I can derive or randomize geographical locations from non-geographical data.

There are many geographically-related datasets out there -- air temperatures, ocean buoys, population sizes, and so on. However, the ones I have found are always small and unchanging, and the amount of metadata is limited. For example, a bit of rainfall statistics from the 1800s will not suffice. Neither will a static database of US city populations.

To give you an idea of what I'm looking for: at one point I worked on a project where we had access to the live stream of data coming from an oil field in the North Sea. It was a large, continuous mass of heterogeneous data covering everything from drilling points to oil and gas extraction statistics, with lots of deep and rapidly changing scientific information. Millions of changes. That's the kind of dataset I want. Unfortunately, I no longer have access to this feed.

I'm toying with different possibilities. For example, one idea I have is to monitor an IRC network, treating each user as an "object" and recording their names, login/logout times, and so on, and artificially mapping their IP address to a geographical location. Another is to collect a huge number of RSS feeds and track them, although again the geographical component becomes artificial. And RSS feeds deal mostly with new articles, not changing ones. Yet another idea is to track stock markets, but I don't know where I can get raw stock feeds.
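For sources like these, which can only be polled rather than subscribed to, one way to get a stream of changes is to diff successive snapshots. The following is a rough Python sketch under that assumption; `diff_snapshots` and the snapshot layout (object id mapped to a dict of properties) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator, Tuple

# A snapshot maps an object id (e.g. an IRC nick) to its current properties.
Snapshot = Dict[str, dict]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Iterator[Tuple[str, str, dict]]:
    """Yield (kind, object_id, properties) for every difference between two polls."""
    for object_id, props in new.items():
        if object_id not in old:
            yield ("create", object_id, props)
        elif old[object_id] != props:
            yield ("modify", object_id, props)
    for object_id, props in old.items():
        if object_id not in new:
            # Report the last known properties of the vanished object.
            yield ("delete", object_id, props)


# Made-up data: alice changes host, bob logs out, carol logs in.
before = {"alice": {"host": "a.example"}, "bob": {"host": "b.example"}}
after = {"alice": {"host": "c.example"}, "carol": {"host": "d.example"}}
events = list(diff_snapshots(before, after))
```

Changes that happen and revert between two polls are invisible to this approach, so the polling interval bounds how fine-grained the resulting stream can be.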

By: scalespace

Sun, 12 Jun 2005 20:30:50 -0800

You don't mention it in your description, but have you tried generating a large body of synthetic data and using that to test your algorithm?

Or, using the sea/climatic data, you could interpolate and resample to get data points in between the sample points to fill out your data set.
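As a rough sketch of the resampling idea, assuming plain linear interpolation between neighbouring samples of a single series (the function name `resample_linear` and the sample values are made up for illustration):

```python
from typing import List, Tuple


def resample_linear(samples: List[Tuple[float, float]], step: float) -> List[Tuple[float, float]]:
    """Resample (time, value) pairs onto a regular grid by linear interpolation.

    Assumes `samples` is sorted by strictly increasing time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if step <= 0:
        raise ValueError("step must be positive")

    start, end = samples[0][0], samples[-1][0]
    out = []
    i = 0  # index of the segment [samples[i], samples[i + 1]] containing t
    k = 0  # grid index; t is computed as start + k * step to avoid drift
    while True:
        t = start + k * step
        if t > end:
            break
        # Advance to the segment whose right edge is at or beyond t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = samples[i], samples[i + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, v0 + frac * (v1 - v0)))
        k += 1
    return out


# Made-up readings taken every 2 time units, resampled to every 1 unit.
readings = [(0.0, 10.0), (2.0, 20.0), (4.0, 10.0)]
dense = resample_linear(readings, 1.0)
```

The in-between points are derived from their neighbours, so this increases volume without adding any new information to the data.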

By: devilsbrigade

Sun, 12 Jun 2005 20:51:24 -0800

WebTrends has about 60 megs of sample web stats. That might be useful.

By: null terminated

Sun, 12 Jun 2005 21:34:11 -0800

Any way you could convert an online radio station to usable data?

By: gentle

Mon, 13 Jun 2005 07:30:35 -0800

scalespace, I use synthetic data for some tests. But for this particular test, the idea is to simulate real-world behaviour as accurately as possible without actually exposing the code to the real world. Even the most well-designed synthetic data will not provide the true randomness and coverage that I'm looking for -- the data combinations that will trip up this code will be things I didn't consider beforehand.

I don't see what an online radio station can give me. I'm not looking for a continuous stream of raw bytes; if I were, I could scrape anything, including my own hard drive.

By: chota

Mon, 13 Jun 2005 08:52:19 -0800

The NASDAQ apparently has some sort of web service whereby you can grab 10 quotes at a time, on a 15-minute delay. Maybe just cycle through sets of 10 quotes repeatedly, treating each ticker symbol as an object, and the price as a property?
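A minimal Python sketch of the cycling idea, assuming the quote service is wrapped in a function that takes a list of up to 10 symbols and returns a symbol-to-price mapping. That function (`fetch_quotes` below) is a placeholder and not a real NASDAQ API; the example uses a fake one.

```python
from itertools import cycle, islice
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def batches(symbols: Sequence[str], size: int = 10) -> List[List[str]]:
    """Split the symbol list into consecutive groups of at most `size`."""
    return [list(symbols[i:i + size]) for i in range(0, len(symbols), size)]


def poll_round_robin(
    symbols: Sequence[str],
    fetch_quotes: Callable[[List[str]], Dict[str, float]],
    size: int = 10,
) -> Iterator[Tuple[str, float]]:
    """Cycle through the batches forever, yielding (symbol, price) observations."""
    for batch in cycle(batches(symbols, size)):
        for symbol, price in fetch_quotes(batch).items():
            yield symbol, price


def fake_fetch(batch: List[str]) -> Dict[str, float]:
    # Stand-in for the real service: returns a dummy price per symbol.
    return {symbol: float(len(symbol)) for symbol in batch}


symbols = [f"SYM{i}" for i in range(25)]
first_30 = list(islice(poll_round_robin(symbols, fake_fetch), 30))
```

A real harvester would also need to pause between requests and respect whatever limits the service imposes; that is left out here.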

Also, maybe you could use weather data, by treating each ZIP code as an object, and use historical, current, and future low/high temperature data? I would recommend using the NOAA's 108-year historical data.

Just some thoughts, as I'm not sure if I fully grok your request. Good luck though!

By: gentle

Mon, 13 Jun 2005 09:03:26 -0800

Good ideas, chota, though by cycling through quotes at 15-minute intervals it would take forever to generate any useful volume of data for stress testing. The weather data might be useful, though it looks rather small.

What do you not grok about my request? I would be happy to amplify.