Well, I’m finally getting somewhere, let’s put it that way. My initial efforts were piss poor, with only slightly interesting improvements from some ad hoc messing with means and brute-force approaches. Nothing I did would get me onto the leaderboard.

Of course, I could just take what all the leaders have done and build on that. Some teams have done that. I think there’s something simpler to do, and finding that simpler thing is hard when you take others’ complicated models as your starting point (you pick up all their shortcomings too!). I’m starting with the most basic systems I can and searching through various simple tweaks. Also, I want a system that will still make sense to me if I have to put this on the shelf for a week or two to get other work done.

**What I’ve done:**

I use the pyflix package to handle the basic dataset and algorithm framework.

I borrowed the basic weighted KNN algorithm from Toby Segaran’s O’Reilly book “Programming Collective Intelligence”.

I’ve connected the Pythonika package to Mathematica 6 so that I can use the Mathematica front end to work out algorithm variations and visualizations. Mathematica 6 has very good clustering algorithms and features that are far easier to play with than building your own or using some Python, C, or Java library. And once speed becomes the key, it’s possible to do compiles and speed-ups in Mathematica, but I digress. Pythonika lets me run Python code from inside Mathematica, for example to retrieve the data from the pyflix data store.

I used the Python pickles from Ilya G. He did the hard work of creating feature vectors (coded descriptions) from IMDb data for things like release date, genre, cast and so forth. This makes the KNN algorithms more interesting than just going off the movie title, rating and year.
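To make the weighted KNN idea concrete, here is a minimal sketch of predicting a rating from the nearest feature vectors. It is not the book’s code and not the pyflix API: the function names, the Gaussian weighting, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gaussian_weight(dist, sigma=1.0):
    """Closer neighbors get weights near 1, distant ones fall toward 0."""
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


def weighted_knn(rows, query, k=3, weight=gaussian_weight):
    """Predict a rating for `query` from its k nearest rows.

    rows: list of (feature_vector, rating) pairs (hypothetical layout).
    Returns None when there is nothing to average.
    """
    neighbors = sorted(
        (euclidean(vec, query), rating) for vec, rating in rows
    )[:k]
    total = 0.0
    weight_sum = 0.0
    for dist, rating in neighbors:
        w = weight(dist)
        total += w * rating
        weight_sum += w
    if weight_sum == 0:
        return None
    return total / weight_sum


# Toy data: two similar movies rated highly, one very different movie rated low.
rows = [([0.0, 0.0], 5), ([0.0, 1.0], 4), ([10.0, 10.0], 1)]
prediction = weighted_knn(rows, [0.0, 0.0], k=2)
```

With `k=2` the far-away movie is ignored, and the prediction lands between 4 and 5, pulled toward the rating of the closest neighbor.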

**My basic outline so far of the algorithm system:**

Create clusters of the data based on genre, release year, number of ratings, and director (as a proxy for many other factors). Further refine the clusters by user.

Add in some meta features of users, like frequent rater, avid rater and so on, for clustering the users.

Do multiple runs of rating predictions based on just movie clusters, just user clusters, and combinations of the two, with varying neighborhood sizes.
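The last step, repeating prediction runs over different neighborhood sizes, can be sketched roughly like this. It assumes a held-out set of known ratings and scores each run by RMSE (the metric the Netflix Prize uses); the plain unweighted neighbor average and all function names here are placeholders for illustration, not the actual system.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Unweighted average rating of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: squared_distance(row[0], query))[:k]
    return sum(rating for _, rating in nearest) / len(nearest)


def sweep_neighborhood_sizes(train, holdout, sizes):
    """Run one prediction pass per neighborhood size; return {k: rmse}."""
    actual = [rating for _, rating in holdout]
    results = {}
    for k in sizes:
        predicted = [knn_predict(train, vec, k) for vec, _ in holdout]
        results[k] = rmse(predicted, actual)
    return results


# Toy data: (feature_vector, rating) pairs, hypothetical layout.
train = [([0.0], 1), ([1.0], 2), ([10.0], 5)]
holdout = [([0.5], 1.5)]
results = sweep_neighborhood_sizes(train, holdout, sizes=[1, 2, 3])
best_k = min(results, key=results.get)
```

Keeping the sweep in one function that returns a dictionary makes it easy to compare runs side by side, and the same loop would work with movie clusters, user clusters, or a combination swapped in for `knn_predict`.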

**My particular challenges are:**

Reducing the memory required to do trial runs on the data with the algorithms.

Reducing the code required to try new algorithms.

Keeping track of all these different parts.
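On the memory challenge, one option (a sketch only, not what pyflix does internally) is to store ratings as parallel typed columns from Python’s standard `array` module rather than as lists of tuples. Star ratings from 1 to 5 fit in one unsigned byte each. The column layout and function names below are assumptions for illustration.

```python
from array import array


def pack_ratings(triples):
    """Pack (user_id, movie_id, stars) triples into compact typed columns.

    "I" is an unsigned int (typically 4 bytes), "H" an unsigned short
    (2 bytes, enough for movie ids up to 65535), "B" an unsigned byte.
    """
    users = array("I")
    movies = array("H")
    stars = array("B")
    for user_id, movie_id, rating in triples:
        users.append(user_id)
        movies.append(movie_id)
        stars.append(rating)
    return users, movies, stars


def bytes_used(*columns):
    """Raw payload size of the given array columns, in bytes."""
    return sum(col.itemsize * len(col) for col in columns)


# Made-up example triples.
users, movies, stars = pack_ratings([(1488844, 1, 3), (822109, 1, 5)])
payload = bytes_used(users, movies, stars)
```

A Python tuple of three ints costs far more than the handful of bytes per rating used here, so the saving adds up quickly across a full trial run.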

I hope this is helpful or instructive. It’s fun for me at the very least.