Friday 21 November 2014

A Picture is Worth A Thousand Words (Literally)

I have a healthy skepticism of the part of the job interview process where applicants are asked to write a super-efficient piece of code to create what amounts to a cog in a much larger machine. In my experience as a researcher and developer, most of my time is spent finding the cogs that are bringing the machine to a grinding halt, also known as debugging, also known as staring at sections of code for minutes on end while your monitor goes to sleep.

One of my favorite methods of debugging is to visualize what I am dealing with wherever possible. Indeed, one of the reasons I got into the field of image processing and computer vision was because generally, when you don't know what's going on, you can look at an image output and get a pretty good idea of what's happening.

Recently, I have been evaluating a classification technique I have been working with on a text categorization problem. I am familiar with the fundamentals of text classification, since visual classification has long drawn inspiration from it, but I have never worked in the domain before. To make what follows a bit easier for non-CS people to follow, I'll try to explain the classification problem in simpler, qualitative terms. Feel free to skim through it if you're comfortable with the field.

Who's who?

So here I am, looking at a collection of about 5,000 webpages crawled from the websites of multiple universities. The collection is split into a training set and a testing set. As the names imply, the former is used for teaching our classification algorithm what it is supposed to recognize, and the latter is used for testing its performance. The actual categorization problem is a bit more complex, but for the purpose of explanation I'll say we want to distinguish webpages belonging to faculty (the positive class) from webpages belonging to non-faculty (the negative class, comprising students, projects, clubs, etc.). Each document is labelled with its class; the labels are supplied to the classifier during training, but during testing they are only used to check its outputs.

How do we go about this? We need to structure each document into a form from which its meaning can be easily extracted. One of the most popular techniques from years of research in text analysis is to simply take all the words appearing in a document and count their occurrences. The order of the words is ignored, and the output we get is basically a count of how many times each word in a fixed vocabulary or dictionary appears in the document. This is termed a document vector. There may be additional preprocessing and normalization steps involved, but I'll skip those for now. As an example, a document like:
apple banana apple papaya
will result in a vector:
apple=2 banana=1 orange=0 papaya=1 watermelon=0 ...
for a fruit-based dictionary. Each vector thus tells the classifier which words occur in the document and how frequently. The classifier then looks for words that are shared by documents belonging to the same class, and uses them to try to assign new, unseen documents to the correct classes.
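
To make the counting concrete, here is a minimal sketch of that step in Python, using the toy fruit vocabulary above; the function name is mine, not part of any particular library.

from collections import Counter

# Toy fruit vocabulary; a real vocabulary would be built from the training corpus.
vocabulary = ["apple", "banana", "orange", "papaya", "watermelon"]

def document_vector(text, vocabulary):
    """Count how many times each vocabulary word appears in the document."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

print(document_vector("apple banana apple papaya", vocabulary))
# prints [2, 1, 0, 1, 0], i.e. apple=2 banana=1 orange=0 papaya=1 watermelon=0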

So what's this got to do with visualization?

Now, the problem I was running into was that my algorithm kept giving me excellent performance on the training set, but not on the testing set. This is very bad for any classifier, because it means it cannot handle anything it hasn't seen before, making it useless. It is not an uncommon issue, though. The classifier is overfitting the data: it is placing so much emphasis on the data it has been given that it leaves itself little wiggle room to deal with the ambiguity of unseen data. This is like the anecdote of the kid who complains to his mother that he could not solve an addition problem in class because he had been taught to do it with apples and the problem asked him to add oranges. Similarly, what we want the classifier to realize is that only certain words matter, not all of them.

So, I go about trying to fix the apparent overfitting. This is done by something called regularization, which is a fancy word for penalizing your algorithm when it tries to improve its performance on the training data at the cost of too much of its flexibility. I tried different values for the regularization parameters, different regularization techniques, different parameters for other components of my algorithm. Nothing worked.
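
To give a sense of the kind of knob-turning involved, here is a minimal sketch of a regularization sweep. It uses synthetic data and an off-the-shelf scikit-learn logistic regression purely as stand-ins; my actual classifier and features were different, but the idea is the same: vary the strength of the penalty and watch how training and testing performance move.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features would be the document vectors.
X, y = make_classification(n_samples=2000, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Smaller C means stronger regularization in scikit-learn's convention.
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(f"C={C:>6}  train={clf.score(X_train, y_train):.3f}  "
          f"test={clf.score(X_test, y_test):.3f}")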

Puzzled, but quite sure that something should have worked by now, I thought of checking the data. But 5,000 documents spanning a vocabulary of some 7,500 words are not easy to wade through. That's when my instincts kicked in, and I went to the comforting familiarity of images.

Visualization of the positive training set

I visualized the entire positive class of document vectors in the training set as the image above. The vertical dimension consists of the words in the vocabulary, the horizontal dimension consists of the documents, and the brightness of each pixel tells us how frequently a particular word occurs in the corresponding document. There is some normalization done as well, which is responsible for the relatively uniform brightness you see in the horizontal dimension.
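
For the curious, a picture like this takes only a few lines to produce. Here is a minimal sketch, assuming the document vectors have been stacked into a NumPy matrix with one row per document; the matrix below is a random placeholder just so the snippet runs on its own.

import numpy as np
import matplotlib.pyplot as plt

# doc_vectors: one row per document, one column per vocabulary word.
# Random placeholder data here; in my case it held the actual
# positive-class training vectors.
rng = np.random.default_rng(0)
doc_vectors = rng.random((500, 7500))

# Normalize each document so its length doesn't dominate the picture,
# then transpose so words run vertically and documents horizontally.
normalized = doc_vectors / doc_vectors.sum(axis=1, keepdims=True)

plt.imshow(normalized.T, aspect="auto", cmap="gray", interpolation="nearest")
plt.xlabel("documents")
plt.ylabel("vocabulary words")
plt.title("Positive training set")
plt.show()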

Notice there are some bright rows in the picture. I looked back into the vocabulary for the words these rows correspond to, and they turned out to be words such as "professor", "faculty", "department", "research", etc. - dead giveaways for the positive class, especially when more than one of them occurs on a webpage.

Now, let's take a look at the testing set of the positive class.

Visualization of the positive testing set

Notice the lack of bright horizontal lines or any significant structure for that matter. This immediately violates one of the preconditions for successful classification - the distribution of words in the training and testing sets should be approximately the same (the other condition is that there should be some distinction between the positive and negative classes). No classifier can do well on such a problem, because there is no information about the testing set that can be learned from the training set.
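
One quick way to put a number on this kind of mismatch is to compare the average frequency of each word in the training set against the testing set. The sketch below assumes the two sets are available as matrices of document vectors; the random placeholders are only there to make it self-contained.

import numpy as np

# train_vectors / test_vectors: rows = documents, columns = vocabulary words.
# Random placeholders here; in practice these would be the positive-class
# document vectors from each set.
rng = np.random.default_rng(0)
train_vectors = rng.random((400, 7500))
test_vectors = rng.random((100, 7500))

# Average frequency of each word across the documents of each set.
train_profile = train_vectors.mean(axis=0)
test_profile = test_vectors.mean(axis=0)

# If the two sets come from roughly the same distribution, the per-word
# averages should be strongly correlated; a low value is a red flag.
correlation = np.corrcoef(train_profile, test_profile)[0, 1]
print(f"train/test word-profile correlation: {correlation:.3f}")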

That long-winded explanation is how I realized that there was something wrong with the data. I checked and yup, there was some corrupt data that had caused the document vectors to be improperly calculated. Once that was fixed, the correlation between training and testing performance immediately manifested itself.

Moral of the story? Don't trust the data, especially if you're in unfamiliar territory.

Monday 10 November 2014

Greetings, planet!


What? Did you really think I wasn't going to open with some variation on "Hello, world!"?

Anyway, I just thought of starting a blog, because I certainly don't tweet or share as much as I used to. Ironic, considering that I am working on social media analysis these days. Moreover, I read this wonderful article on Lifehacker the other day, which says that no matter how little you think you know about something, you are still an expert to someone. Therefore, I decided to start getting some of my thoughts out there, and hopefully generate some new ideas.

I really don't know what this blog is going to be about. Maybe some personal stuff, maybe some Formula 1, maybe some computer vision stuff. Of course, having no idea what I want this to be, but not wanting to add yet another blog to the catch-all of "Random Thoughts", I decided to go a bit technical with the title.

I'll try to post regularly and see how I take to blogging, so...