Subscribe: Comments on Geeking with Greg: Using IMDb data for Netflix Prize
Added By: Feedage Forager Feedage Grade B rated
Language: English
bin  data  imdb data  imdb  information  metadata  movie  movies  netflix  open question  qualifying set  question  rating  set  user  yehuda 
Rate this Feed
Rate this feedRate this feedRate this feedRate this feedRate this feed
Rate this feed 1 starRate this feed 2 starRate this feed 3 starRate this feed 4 starRate this feed 5 star

Comments (0)

Feed Details and Statistics Feed Statistics
Preview: Comments on Geeking with Greg: Using IMDb data for Netflix Prize

Comments on Geeking with Greg: Using IMDb data for Netflix Prize

Updated: 2017-12-06T07:37:01.524-08:00


I think Yehuda is right. Any of the latent semant...


I think Yehuda is right. Any of the latent semantic indexing approaches, SVD, etc. all factor the ratings into features, which represent abstract concepts which are typically represented by metadata.

I ended up gathering metdata from the Netflix BOB data URLs. For example, this is the kind of data for "Dinosaur Planet"

"title":"Dinosaur Planet",
"synopsis":"Explore the past and relive the sights, sounds and sensations of the dinosaurs' world, as cutting-edge filming techniques and hard science combine to create four unique narratives that illustrate the struggle to survive in the violent Cretaceous period -- all from the dinosaur's point of view. The dinosaurs learn to hunt, dodge natural disasters and simply live to see another day amid predators. Actor Christian Slater narrates each adventure.",
"starring":[{"id":20055255,"name":"Scott Sampson"},{"id":86462,"name":"Christian Slater"}],

I used a naive bayesian approach, dumping all metadata from all user movies rated 1, 2, 3, 4, or 5 into 5 'bins'. Then, to make a new prediction, I took the metadata for the movie, and used bayesian matching to see which of the 5 groups of metadata tags was the most likely. A weighted average was used for all 5 groups to produce a rating for that movie.

What I found was that it only achieved a 1.12 RMSE on the qualifying set, although it had only 0.5 RMSE on the probe data set. Presumably this was due to the probe movies being included in the 5 bayesian 'bins', so that predictions were almost perfect.

In order to try and get the predictions closer, I used results from a prior SVD run, based on the Timely Development code, which had 0.9261 RMSE to help 'guess' what bin the qualifying set movies would be included in. If the movie was spot on (within 0.1 point of an integer rating), I dumped the metadata for the movie into that bin. Otherwise, I tried putting the metadata into the Left or Right bin (if the SVD value was predicted rating of 3.4, I would put it into the Left side [bin 3], compute a prediction, then the Right side [bin 4], make another prediction, then combine the two). This approach only produced an RMSE of 0.9459 on the qualifying set.

So, barring bugs in my implementation, I suspect that there is little to gain from using more metadata, since any 'feature'-based approach should in theory be achieving the same factorizations.

Maybe someone can figure out how to turn this additional metadata into a lower rating?

A simple question: how would using imdb data be co...


A simple question: how would using imdb data be considered possible, unless a data fee were paid; and imdb already has a direct relationship with Amazon?

Just wondering aloud.

Something that still seems to be an open question ...


Something that still seems to be an open question is whether one approach is better than another in a general sense. For the Netflix test set, Yehuda implies that Netflix has (probably intentionally) biased it to cases where we don't know much about the user. They've made it a bit easier on us by focusing on movies with high support. In this case, and with the extensive data conditioning BellKor is doing, additional movie information may be just gravy that doesn't add much information (or even detracts from the less subjective information derived from the historical data).

But there's still an open question of the general case, where the movie may not have good support.

Either way, the Stanford example is too limited to draw a firm conclusion. Who is to say if the "more complex" algorithm was a good comparison to the IMDB algorithm? Were they doing PCA, which in at least some sense should make up for any advantage of the IMDB information, or was the more complex algorithm purely CF-based?

Anybody that has a good cite on the relative merits of well-conditioned machine-learned recommendations vs. external data augmented recommendations, I'd love to see it, though.


I'm very new to CF but from what you said yehuda t...


I'm very new to CF but from what you said yehuda the netflix CF is user to user. Like movie choices group similar user right? I wish Netflix would add some actor, director, screenwriter rating to the mix as some movies are good but John Travolta is always bad.

But the bigger point is, adding more, independent ...


But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.

How do both approaches compare to more data with ever-better algorithms? Combine better algo intelligence with more data? Does that beat more data with weak intelligence?

Our experience with the Netflix data is different....


Our experience with the Netflix data is different.
While extra information could be very useful, it must be orthogonal to what we already have. It is important realize that IMDB data and the likes gives more information only about the movies, not about the users.
However, the test set is dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra information about the movies.

Of course, weak collaborative-filtering (CF) models can benefit from such IMDB information, but our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.

Yehuda Koren
The BellKor team