And they'll be right 67% of the time. It started as a typical battle-of-wits game on Kaggle.com but quickly turned into a duel between two amazing machine-learning algorithms: Nolearn and Overfeat. In case you don't know, Kaggle.com is a popular platform for data science competitions. This one was about developing the most accurate model to differentiate between dogs and cats on a set of 12,500 unlabeled and shuffled pictures.
I was blown away by how well the algorithms performed. I scored above 96% accuracy differentiating between dog and cat pictures with Nolearn, and the leader achieved an incredible 98%. He was rumored to be using the Overfeat algorithm.
Tagging pets is cute and may have genuine scientific purpose for someone, somewhere (the competition was sponsored by Microsoft and Asirra - Animal Species Image Recognition for Restricting Access), but I wanted to apply this to something more interesting, maybe even controversial. The subject of political allegiance sprang to mind. How correlated are one's political beliefs with one's appearance?
In a nutshell, I wanted to find out if somebody's face could be fed into one of these models and accurately predict if that person was a Democrat or a Republican.
One thing all these data science experiments have in common is a good training data set. These algorithms are based on supervised machine learning. This means that we teach the machine what a cat and a dog look like, or a Republican and a Democrat, by feeding it lots of pictures of each subject and explicitly telling the machine what each picture represents. After this training phase, we can feed it unlabeled pictures and ask for the machine's best guess.
I turned to Wikipedia to find pictures of members and associates of the Senate and Congress. I created two folders, one named 'democrat' and the other 'republican', and sorted the pictures into them according to party. I ended up with 539 pictures. All the wiki pictures were similarly framed, most likely taken by the same agency.
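To make the folder-as-label idea concrete, here is a minimal sketch of how such a layout could be turned into labeled samples. The function name `list_labeled_images`, the accepted file suffixes, and the exact directory layout are assumptions for illustration, not details taken from the original code.

```python
from pathlib import Path

# Assumed layout: <root>/democrat/... and <root>/republican/...
LABELS = ("democrat", "republican")
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: common image types


def list_labeled_images(root):
    """Return (path, label) pairs, grouped by label and sorted by file name."""
    samples = []
    for label in LABELS:
        folder = Path(root) / label
        if not folder.is_dir():
            continue  # skip a missing party folder instead of failing
        for path in sorted(folder.iterdir()):
            # The folder name is the label; non-image files are ignored.
            if path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((path, label))
    return samples
```

Calling `list_labeled_images("pictures")` on a hypothetical `pictures` directory would give one pair per photo, which is all the "explicit telling" that supervised learning needs.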
Here are samples of the Democratic and Republican Party data sets:
I used logistic regression to model the data. The algorithm extracts the RGB value of each pixel and compacts it into a one-dimensional array. This enables the regression to easily compare and classify all the pictures. The algorithm also emphasizes the center of the picture to reduce peripheral noise.
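The flattening step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the center emphasis is approximated here with a plain center crop, and `image_to_features` and `center_fraction` are hypothetical names; the original code may weight the center differently.

```python
import numpy as np


def image_to_features(image, center_fraction=0.5):
    """Flatten the central region of an RGB image into a 1-D feature vector.

    image: array of shape (height, width, 3) holding 0-255 RGB values.
    center_fraction: hypothetical knob; share of each side kept by the crop.
    """
    image = np.asarray(image)
    height, width, _ = image.shape
    crop_h = max(1, int(round(height * center_fraction)))
    crop_w = max(1, int(round(width * center_fraction)))
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    center = image[top:top + crop_h, left:left + crop_w, :]
    # Scale to [0, 1] and unroll rows, columns, and channels into one vector.
    return center.astype(np.float64).ravel() / 255.0


# Stand-in for a real photo: an 8x8 RGB image of random pixel values.
rng = np.random.default_rng(0)
fake_photo = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
features = image_to_features(fake_photo)
print(features.shape)  # (48,): a 4x4 crop times 3 colour channels
```

Stacking one such vector per picture gives a matrix with one row per photo, which is the shape a logistic regression implementation typically expects as input. All pictures would need to be resized to the same dimensions first so that every vector has the same length.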
To score this model, I went with a k-fold cross-validation approach. This splits the data into k chunks; in each round, one chunk is held out to measure the model's accuracy and the remaining chunks are used to build the model. This is repeated until every chunk has served as the test set once, so each picture is used for both training and testing, but never both in the same round. This allows us to test all the data without cheating and without wasting a single picture.
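The splitting logic can be sketched in plain Python. This is a minimal illustration rather than the code used for the experiment; `k_fold_indices` and its `seed` parameter are hypothetical names.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds mix both parties
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 539 pictures and 5 folds, matching the numbers in this post.
splits = list(k_fold_indices(539, k=5))
for train, test in splits:
    print(len(train), len(test))
```

In each round the model would be fitted on `train` and its accuracy measured on `test`; the mean of the per-fold accuracies is the final score. scikit-learn provides the same idea ready-made through `KFold` and `cross_val_score` in `sklearn.model_selection`.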
I set the k-fold to run 5 sets and these were the results:
This averages to 0.6678, meaning the machine correctly predicted which party a picture belonged to 67% of the time. In all honesty, I was surprised at the accuracy of this model. I am not sure how I feel about my beliefs being so vulnerably displayed on my face. Nevertheless, every day I am taken aback at how smart and far-reaching machine learning can be.
The Nolearn and Overfeat algorithms use pre-trained libraries. This means that they will be much better at extracting features from certain types of subjects than generic algorithms. This gives them an advantage in terms of precision and speed. Please follow the above links for more information on these libraries.
Now, about that score: many factors could be at play here. Is the photographer different for each party? What about the mean age of party members? How about the balance of sex or race, the share wearing glasses, Republican versus Democratic fashions, etc.? This isn't meant to be a scientific study but a reminder, or warning, of how powerful machine learning can be. The programming of a machine learning system can be disarmingly simple, yet its discoveries dramatic!
The code and data set to reproduce the above experiment can be found on GitHub at https://github.com/amunategui/pol…