Friday, 20 April 2018

Redshift - Going from Indian to Singaporean

More than 2 years after my last post, I decided to write up my experience of making a major life decision: changing my citizenship from Indian to Singaporean. It can still be a fairly opaque affair where you don't know the status of your application at any given time. Here's hoping that you find some information that may be useful to you.

My wife and I decided to apply for Singaporean citizenship in 2016 and scheduled an appointment at the ICA to submit our application on 09-Jan-2017. At that point, I had been in Singapore for 7.5 years (~3 years as a PR) and my wife had been here for a bit more than 3 years (~2 years as a PR). 

We collected all the necessary documents and went to the ICA. We were asked whether we wanted to submit two separate applications, or if my wife wanted to submit hers as a dependent of mine. We had originally wanted to submit our application as a family so we were not sure which was the closest approximation (Spoiler alert: It's the latter). We decided to submit separate applications as my wife, having her own income, was not really dependent on me, and the applications did include spouse details so they would get linked at some point.

For the next 11 months, there was complete radio silence as the application made its way through the system and iEnquiry constantly showed the status as a terse "Pending". Then in mid-December 2017, my wife got an email from an officer at the ICA saying that they had reviewed her application and were recommending that she apply as a dependent to my application. They asked both of us to bring some additional documents (marriage certificate, my payslips for the past 6 months, etc.) and sign and acknowledge amendments to her application. While at the ICA, I noticed that her application had been examined in some detail with various annotations along the margins and we took that to be a good sign. After all, why go through such effort to end up rejecting the application? We also asked the officer how long it would take for a decision to reach us and he gave us a reply of 4-5 months (Spoiler alert 2: It didn't; I guess he was just trying to manage expectations by giving a worst-case timeline).

Finally in early February 2018, we got letters dated 01-Feb-2018 that our applications had been approved in principle and we were required to go through some assimilation programs as part of the Singapore Citizenship Journey before final approval. Interestingly, iEnquiry still showed the status as "Pending" at this point; I guess it is more binary than I thought.

We were going on holiday two days after we received the letter so we scheduled program appointments (where needed) for early March. Towards the end of February, we completed the e-Journey, which is the online learning session covering Singapore's history, civil systems and cultural norms. We attended the Community Sharing Session at our local community centre on 09-Mar-2018, which was a great (and at times, emotional) experience getting to know other new citizens and hearing about their experiences living in Singapore and adopting it as their home country. Finally we went for the Singapore Experiential Tour the next day on 10-Mar-2018, where we covered part of the Jubilee Walk starting at the National Museum and ending at the National Gallery. All in all, very engaging experiences and I did learn a few things about Singapore's history that I did not know previously.

All our statuses for the programs were marked as completed in a few days by 14-Mar-2018 and iEnquiry showed my application as "Approved" and my wife's application as "Approved (In-Principle)" by 22-Mar-2018. The reason for the latter was likely that as a dependent her citizenship was contingent on me accepting the citizenship in the first place. I got the physical approval letter the next day (my wife's approval came attached with mine), instructing us to go to the High Commission of India (HCI) to renounce Indian citizenship, and scheduling our appointment for registration of citizenship at ICA for 11-Apr-2018.

We went to the HCI's authorized agent in Singapore, BLS International, to initiate the process of renunciation on 26-Mar-2018. After completing the form and submitting the documents, we were given a date of 04-Apr-2018 to collect the renunciation letter and cancelled passport from HCI itself. Hence, we decided to bring forward our appointment at ICA to 05-Apr-2018, and fortunately ICA still had a couple of open time slots for that day.

On the day of registration, we took our renunciation letters, photographs and current ICs to ICA, where the documents were verified and our current ICs were collected and punched through, making them invalid. Then, we provided our fingerprints and photographs for our new ICs and were given an acknowledgment slip that is meant to be our temporary IC until we get the actual ones at a constituency-level citizenship ceremony in a few months' time. Finally, we were directed to the Commissioner of Oaths before whom we verbally pledged our renunciation of Indian citizenship and allegiance to Singapore, before signing the same pledge. She congratulated us and was the first to welcome us as Singaporeans. The entire process took about 30-40 minutes including getting our photographs done.

We could have gone to apply for the Singapore passport in the ICA the same day, but we would have had to get slightly different photographs done as well as pay a bit more to do it in-person. Hence, we decided to do it online via APPLES since we already had our digital photographs. The system however did not allow us to submit our online application on the same day that we registered as citizens (someone somewhere must have used '>' instead of '>='), so we applied for our passports on 06-Apr-2018, and got the notification that they were approved on 11-Apr-2018.

We arranged an appointment for collection on 18-Apr-2018, went down to ICA and picked them up after waiting for about ten minutes. And that was how we changed our passports from blue to red!

Hope this is of some help to you, and do feel free to comment below if you have any questions or observations.

Thursday, 7 January 2016

Deep Learning: MatConvNet on a 32-bit system

Note: This post is going to be very different from the other ones on my blog, because I want there to be a reference out there in case someone else runs into the same issue I did. So, no detailed background, diving right in.


For whatever reason, you have a 32-bit system with a decent enough graphics card that can handle a deep learning library. In today's day and age, you are a rarity. Most systems are implicitly 64-bit and existing libraries (rightly) do not support 32-bit systems. Welcome to hell.


I have spent considerable time going down dead-ends (problematic installations, unsupported versions, etc.) and backtracking to get GPU-based deep learning to work on my system. Unfortunately, I didn't document my steps in a detailed manner, so I am relying on some rudimentary notes, diff and my memory to write the following; there might be oversights or errors.


MatConvNet is a fairly simple deep learning framework, and it is easy to get started doing things right away. Personally, I prefer Caffe, but I inherited some code based around MatConvNet, so went ahead with using it. I used MatConvNet beta v1.17 with Matlab R2014a on a 32-bit Windows 7 system with an NVIDIA NVS4200M graphics card and driver version 354.56 with CUDA v6.5. Compilation was done using the Windows SDK 7.1 (which installs MSVC++ 2010) and NVCC compilers.

The first step is to go ahead and edit the vl_compilenn.m file and add in support for a 32-bit architecture. MatConvNet does not natively support 32-bit systems so this needs to be done by hand. A simple rule-of-thumb is to search for occurrences of win64 and add in clauses for win32, usually just replicating the parameter values from the former.

As an example, opts.imageLibrary selection should look like this:

if isempty(opts.imageLibrary)
    switch arch
        ...
        case 'win64', opts.imageLibrary = 'gdiplus' ;
        case 'win32', opts.imageLibrary = 'gdiplus' ;
    end
end

Similarly, go ahead and add in library path info for your CUDA installation (CUDAROOT/lib/Win32 in my case). I also found that adding the include path of the Windows SDK (SDKROOT\Include in my case) to flags.nvcc and disabling check_clpath() (instead adding --cl-version 2010 -ccbin "MSVCROOT\VC\bin" to nvcc_opts in nvcc_compile()) helps. Finish duplicating parameter values from win64 for win32 architectures wherever necessary.

If you try to compile at this point, you might be lucky and not encounter any issues except a linking error at the tail end. This is because the linker is looking for a static gpu.lib file from Matlab's installation, but cannot find it. For some reason, only the dynamic version gpu.dll seems to be present in MATLABROOT\bin\win32. Use dumpbin and lib from the Windows SDK 7.1 command prompt (in Program Files\Windows SDK 7.1) and follow the instructions here to generate the static library.
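The linked instructions are not reproduced in this post, so what follows is only a sketch of the commonly used approach and not necessarily the exact steps I followed. The idea is to dump the DLL's exports with dumpbin /exports gpu.dll > exports.txt, turn the listed names into a module-definition (.def) file, and then run lib /def:gpu.def /out:gpu.lib /machine:x86. The middle step is tedious by hand; the helper names below (export_names, make_def) are hypothetical, and the parser assumes the usual "ordinal hint RVA name" table layout that dumpbin prints.

```python
def export_names(dumpbin_text):
    """Collect exported symbol names from the text of `dumpbin /exports`."""
    names = []
    in_table = False
    for line in dumpbin_text.splitlines():
        fields = line.split()
        if not in_table:
            # The export table begins after the "ordinal hint RVA name" header.
            in_table = fields[:2] == ["ordinal", "hint"]
            continue
        if fields and fields[0] == "Summary":
            break  # end of the export table
        # Assumed row layout: <ordinal> <hint> <RVA> <name>
        if len(fields) >= 4 and fields[0].isdigit():
            names.append(fields[3])
    return names


def make_def(library_name, names):
    """Build the contents of a .def file listing the given exports."""
    lines = ["LIBRARY " + library_name, "EXPORTS"]
    lines += ["    " + name for name in names]
    return "\n".join(lines) + "\n"


# Example use (file names are placeholders):
#   text = open("exports.txt").read()
#   open("gpu.def", "w").write(make_def("gpu", export_names(text)))
```

Check the generated gpu.def by eye before passing it to lib, since the export listing can contain rows that this simple parser does not anticipate.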

Cross your fingers, and hopefully everything should go off smoothly. Test that everything is working fine using vl_testnn('gpu', 1).

Hope this helps! If you run into issues, please leave a comment below and I'll try to help out if I can.

Monday, 27 April 2015

Let My EPUBs Go

Lately, I have been finding myself getting increasingly frustrated with the restrictions that are being placed on the content I choose to consume. In the so-called war on piracy, user experience is being sacrificed and I have no choice but to compromise or find workarounds to enjoy my content the way I want to. I will offer e-books as an example here. Music, movies and TV are still problematic, but thanks to streaming services (Spotify, iTunes, Play Music, Netflix, HBO Go) these are not as bad as e-books.

So channeling my inner Lewis Black, here goes:

E-books are easy to carry around to read - I already carry my phone and I can access thousands of them in my cloud library which takes up a tiny amount of physical space in some server farm somewhere. E-books can be synced across devices, so that I don't need to bother with bookmarks to hold my position. This is a god-send when I am feeling especially lazy and will just pick up the nearest device to read - my phone, my tablet or my laptop. That's it! These are the only two things I care about in a reading setup - ease of access and cross-device sync. If the setup has a well-designed display and smooth transitions between pages, that's an added bonus.

I am a member of the Singapore National Library Board, and in addition to the eight items I can check out at a time, I am also entitled to six e-resources. NLB, like tens of thousands of other libraries globally, uses Overdrive, which is the largest collection of e-resources in the world and unfortunately, provides a very poor experience considering my two concerns above.

First, ease of access. I have used Google Play Books for a couple of years now, and I am very happy with the experience. Books are a click away to read on all my devices and I can choose which books to keep locally on my device. Everything works from my Google ID, with virtually no hiccups. 

On the other hand, Overdrive tends to be unnecessarily problematic. First, I need to create an Overdrive ID. Then, I need to add my library into my Overdrive profile and log in using my library ID. On every device, I am logged out of the library system after a certain amount of time, so browsing the catalog requires me to log in again. But wait. Overdrive allows me to use Adobe Digital Editions (ADE: the go-to DRM solution for almost every major e-book publisher) to authorize the books on my computer to read offline. Unfortunately, I use Linux and there isn't any version of ADE that I can use. The only version that I can run under Wine is 1.7, which compared to the current 4.0 version, is positively prehistoric. Okay, so I get ADE installed (creating an Adobe ID) and then try to download a book to my computer that I checked out from the library on my phone.




Computer says no. Apparently because I am authorized in my Overdrive phone app with the Overdrive ID and ADE on my computer with the Adobe ID, the book cannot be read in both places. Oh no no, Overdrive helpfully allows me to authorize ADE with my Overdrive ID, but


Note: These instructions are for ADE 2.0 or newer. We recommend always using the latest version of ADE. You can learn how to install Adobe Digital Editions on your Windows or Mac computer here.

I would if I could! Remember how I cannot use anything beyond ADE 1.7 on Linux? This option goes out of the window. OK, so how about I sign in to the Overdrive app with my Adobe ID? No go.


Note: You may also authorize newer versions (3.2+) of the OverDrive app with an Adobe ID, but you don't need to. When you first launch the OverDrive app, you'll be required to log in with an OverDrive account, which authorizes the app automatically.

In other words, I will be forced to create an Overdrive ID. Worse, because the initial authorization is automatic, I have to deauthorize the app first, then reauthorize with the Adobe ID. Does it work? 



Oh no, I borrowed the book initially when I was authorized with Overdrive, so changing to Adobe voids the authorization, instead of actually reauthorizing it with the new ID like any sane piece of software would do.


By Tanya Little (Flickr: 9 of 365 ~ Frustration) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons


Three IDs and I am still nowhere close to reading the book that I want to. OK, so how about I return the book and check it out again? (a.k.a. have you tried turning it off and on again?) Hopefully, this should result in a single authorization and I can finally get access where I want. But! I cannot return a book I checked out with a different authorization if I cannot download it. Why would Overdrive do that? If I am logged into my Overdrive account and my library account, I should be free to check out and return books. I can almost understand DRM restricting what I read, but being locked out of the system for all book transactions is crazy. At this point, my only option is to wait for the book to expire and be automatically returned. 

Of course, if you are familiar with Overdrive, you must be asking why I don't read directly in my browser. After all, that's the only option that Google Play Books offers, so it's a direct equivalent. Well, there are two minor reasons: First, it's a matter of principle. If Overdrive offers a feature, I expect it to work. If I have the option for reading offline via ADE, then I want to be able to use what is advertised. Second, Google Play Books does allow you to get your files offline, either via Takeout for uploaded files or directly for purchased books.

However, the main reason for wanting a different solution for reading e-books is the abysmal cross-device syncing in Overdrive. I have had about a 10% success rate getting a book to correctly sync. This was before the ADE fiasco, so I was moving between phone, tablet and browser.

First, if I check out a book from the library using my Overdrive account, it is not automatically available in my Overdrive account on all my other devices. I am required to log in to the library account to be able to see what books I have checked out. Contrast this with Google Play Books, which helpfully sends a notification to all my devices when I upload or purchase a new book. Tap on the notification and I can start reading immediately.

Second, there seem to be some major bugs in the syncing algorithm. For some reason, it is unable to identify which device I last read on. In a few cases, if I started reading on my phone with my last session also on the phone, it offered to put me back to where I was at the start of my last session! At other times, it managed to sync the chapter correctly, but could not get me to the correct page in the chapter. I have never had such an issue with Google Play Books, and the lack of this ability has forced me to look at other alternatives.

Unfortunately, as the publishing industry is mired in such a deep love affair with DRM, there is no easy, viable alternative. I can use a pipeline like Wine+ADE+Calibre+DeDRM plugin, but that goes against the usage rights of the e-book, and I don't want to do that because it will provide short-term justification to the publishers to further clamp down on e-books and charge readers and libraries even more for them.

If you think about it, the concept of e-books being handled as paper books is all wrong. The marginal cost of another e-book is minuscule compared to that of another paper book. Within this premise, the concepts of hold lists, limited copies, delayed online publishing windows can only be justified by publishers trying to squeeze as much money as they can out of this new medium. In fact, according to this article from GigaOm,

...“friction” may decline in the ebook lending transaction as compared to lending print books. From the publisher viewpoint, this friction provides some measure of security. Borrowing a print book from a library involves a nontrivial amount of personal work that often involves two trips-one to pick up the book and one to return it. The online availability of e-books alters this friction calculation, and publishers are concerned that the ready download-ability of library ebooks could have an adverse effect on sales.

In other words, making e-books difficult to access, especially from libraries, is in the interest of the publishing industry. It will probably be some time before there is any meaningful change that comes about; until then I'm going to try to stick to paper books from NLB.

Monday, 23 March 2015

RIP, Mr. Lee Kuan Yew

I finished reading the fantastic book Trespassing on Einstein's Lawn by Amanda Gefter yesterday. In the book, she presents the arguments that lead to the conclusion that in physics, there is nothing that is invariant in all reference frames. In other words, there is no ultimate reality; all reality is observer-dependent. This point of view particularly resonates with me because I believe that every person has a perspective on things that to him or her seems correct. Of course, in our cultural and social norms, we do have a pseudo-ultimate reality enshrined in our traditions, laws and etiquette, but these are man-made constraints and not inviolable otherwise.

What is the point of this? The point is that while many eulogies and obituaries will be written for Mr. Lee Kuan Yew, Singapore's first prime minister who passed away this morning, my perspective of living in Singapore while not being born and brought up here helps me understand the enormity of what he managed to accomplish. My point of view of what Singapore is today is influenced by the fact that having grown up in Mumbai, I can see the different directions the story could have gone if not for this man of iron will.

Let's start with the similarities. Mumbai and Singapore are both islands of roughly the same area (600 vs 700 sq. km.). Both were ruled by the British before independence, both have the advantage of being natural ports that helped foster initial settlements and trade. Both attained independence within two decades of each other. Emerging from the convoluted process of gaining independence, both were tried by communal tensions (Mumbai indirectly due to Partition, Singapore between the Malay and Chinese communities).

Of course, this isn't to say that Mumbai and Singapore are identical. Mumbai is a part of a much larger country and policies that influence the nation may not have a directly beneficial effect on the city. Combine this with the tri-level governance of nation, state and city and the sheer force of population, and Mumbai becomes a wondrous story in its own right.

On the other hand, the lack of being part of a larger country was also a big handicap for Singapore. With no natural resources, no agriculture, no oil, no minerals, no drinking water, there was virtually no hope for this tiny city-state to survive. The only three things the new nation possessed were its port, its existing British infrastructure in the downtown area and its people - a largely uneducated and illiterate populace of immigrants.

As a commenter on Reddit said, given these initial conditions, I would have restarted my game of Civ4. Enter one of the greatest gamers ever seen - Lee Kuan Yew. After crying on television about Singapore being thrown out of the Malaysian Federation, he set about the task of building Singapore up. He was a visionary like no other.

While most emerging nations are hesitant about foreign investment, fearing foreign influence, he encouraged it because he saw that Singapore had nothing it could export or manufacture on its own. Backed into a corner, he encouraged foreign companies to set up base in Singapore, offering them tax incentives and ease of doing business. The experience gained by the population from working in these companies helped local banks and technology companies climb to be shoulder-to-shoulder with some of the best in the world. Think about that: the man saw that the only option was the one that most other countries in its place feared, and he took it and played it so that it benefited the country.

Faced with a severe housing shortage in the 50s and 60s, he set up the Housing and Development Board (HDB) to make sure that most people could own their houses and feel they own a part of the nation. It is thanks to his vision that fifty years later, Singapore ranks as one of the highest in the world in terms of home ownership rate (>90%). While the technicality of the term "home ownership" may be debated - all HDB flats are technically on a 99-year lease from the Board - it is nevertheless a long-term security that citizens possess. Contrast this with MHADA, where the mere name conjures up pictures of run-down, cramped buildings.

Faced with a lack of drinking water and importing water from Malaysia, he invested in creating water catchment areas on the island, recycling water as well as desalination technologies (all of which were years away from being commercially viable) as a result of which Singapore aims to be completely self-sufficient water-wise by 2061. Think about that: An island with barely any drinking water and limited land aims to have enough water to support a population of millions.

Facing a local populace not tied together by language, he made education in English compulsory (along with a mother tongue language) as that is the language of science, technology and business worldwide. Contrast that with the language politics that we see in India with English and Western culture looked upon as destroying all that is good and holy.

To attract top talent to government and keep it corruption-free, he pegged cabinet members' salaries to those of top-earning executives in private companies. Look at the beauty of that idea: if you can earn that much, there is less need for you to be corrupt when you wield power and earning that much can be an incentive for you to get into government, not just the need to do good for the people.

I could go on with other examples of vision and foresight that LKY had, but there was also the other less positive side of things. He ruled with an iron fist, bankrupting opponents (thereby making them ineligible to run for office) via costly lawsuits, controlling the media and imprisoning whoever was considered to be an internal threat to Singapore. We still see the lingering effects of these even though some of the policies have loosened thanks to the democratization of speech via the Internet. Nevertheless, the way I look at it, his view was that the ends justify the means. We try something and if it works, fine; if not, throw it away, try something else.

Yes, in terms of conventionally defined Western freedoms, Singapore lags far behind. This however raises the question: are such freedoms good for their own sake? I will leave the philosophers and the policy-pundits to debate this, but I believe that given the survivalist tendencies from which Singapore emerged, this was the best way forward and we are gradually seeing a change in the idea of freedoms here. Freedom of expression and speech engender creativity, and the incipient arts and culture scene and the increasingly vocal political space are examples of areas where boundaries are being pushed and modified. It will be a test of the current generation of leaders if they can maintain the trajectory that Singapore was launched on, while still adapting to the times.

As more or less an internal observer with an external background, I see LKY's policies' influence wherever I go. Whether it is taking a flight out of Changi (he pushed for an airport at the edge of the island, when multiple studies advocated the expansion of an existing inland airport like Sahar), taking the MRT to work (the largest public works project of its time, pushed by LKY because he believed that land was too precious for an all-bus system), heading into the research cluster of Singapore (born out of the Industrial Research Unit under the Economic Development Board set up by him), having lunch at a food court or hawker centre (LKY's government organized the food services sectors by setting up markets and formal hawker centres in housing and commercial estates) or going for a walk along the park connector (LKY pushed for greening Singapore to such a degree, that the motto for the National Parks Board today is not Garden City, but City in a Garden), I see the gigantic influence he has had on the place I live. More importantly, I am also able to see how easily things could have gone south. Giving in to special interests, pandering to communities on the basis of race or religion, copying principles from other countries without pausing to consider local applicability or rejecting viable ideas because they didn't emerge internally would all have sabotaged one or more of these.

And that prescient understanding of Singapore and its position in the world along with an iron will while being open to trying new things is what made him such an effective leader. Singapore is truly poorer for having lost its greatest caretaker and his loss will resonate for years as people stop to ask, "What would LKY have done?"

RIP, Lee Kuan Yew.

Wednesday, 4 February 2015

Headspace

I got my hands on a Bluetooth headset (the very comfortable LG HBS-750) a few weeks ago for my commute. I always prefer wireless headsets in the train as there are no wires to be pulled or tangled in the crush. Another reason I enjoy my personal audio in the train is that it gives me much needed headspace, especially when decompressing from work.

I have started listening to some podcasts so I'll chronicle them below. Any recommendations would definitely be welcome. As an aside, I use Player FM as my podcast app, and it is brilliant with a well-designed UI, an extensive catalogue and good offline storage settings.

So, here are some of my favorite podcasts currently:

1. Serial: Well, of course! How can I not be listening to Serial given the hype surrounding it? For those who don't know, it is a long-form weekly podcast spanning almost three months that investigates the alleged murder of Hae Min Lee by her ex-boyfriend Adnan Syed in 1999. The podcast takes us through mountains of unreliable data and dozens of twists and turns with no clear end in sight. It is probably what a lot of criminal investigations are like, and to witness the uncertainty and inconsistencies in the case presented by either side is very edifying.

2. TWiT: This was an unusual one for me. I didn't really get into it at the beginning, because it seemed too slow for what I considered to be a tech news program. After giving it some time however, I realized it is more of a group discussion and has tons of banter that really livens up a two-hour long show. It has become one of the must-hear podcasts for me, and I look forward to it every Tuesday (Singapore time).

3. The Bugle Podcast: From the titans of comedy John Oliver (of Last Week Tonight fame) and Andy Zaltzman (of Cricinfo fame), this self-described "audio newspaper for a visual world" is irreverent and pulls no punches in taking everything and everyone to task. I am guaranteed a good chuckle during my commute, much to the bemusement of my co-passengers.

4. No Such Thing As A Fish: Another British podcast, this time from the writers of the comedy panel show QI, it covers interesting facts every week. Loads of comedy and you do get to learn some interesting stuff. For instance, did you know that after the moon landing, Buzz Aldrin worked as a car salesman in a Cadillac dealership and didn't sell a single car in six months? This, of course, led the panel to question whether it was a good decision to have the dealership on the moon, but that's a different story. The name of the podcast comes from the fact that the common ancestor of what we call fish is also the common ancestor of all four-legged, land vertebrates and therefore, there is really no such thing as a fish.

So, that's all from me for now. Do you have any podcasts that you'd recommend for me to add on? I'm still on the lookout for a daily news program that is available between 4AM and 7AM SGT, so that I can get some current news on my way to work. Something like BBC Asia would work well but I can't seem to find a good, relevant podcast. All suggestions are welcome!

Friday, 21 November 2014

A Picture is Worth A Thousand Words (Literally)

I have a healthy skepticism of the part of the job interview process where applicants are asked to write a super-efficient piece of code to create what amounts to a cog in a much larger machine. In my experience as a researcher and developer, most of my time is spent finding the cogs bringing the machine to a grinding halt, also known as debugging, also known as staring at sections of code for minutes on end as your monitor goes to sleep.

One of my favorite methods of debugging is to visualize what I am dealing with wherever possible. Indeed, one of the reasons I got into the field of image processing and computer vision was because generally, when you don't know what's going on, you can look at an image output and get a pretty good idea of what's happening.

Recently, I have been evaluating a classification technique that I have been working with on a text categorization problem. I am familiar with the fundamentals of text classification, as visual classification has long been inspired by it, but I have never worked in the domain before. In order to make what follows a bit easier for non-CS people to follow, I'll try to explain the classification problem in simpler, qualitative terms. Feel free to skim through it if you're comfortable with the field.

Who's who?

So here I am looking at a collection of about 5,000 webpages crawled from the websites of multiple universities. The collection is split into a training set and a testing set. As the names imply, the former is used for teaching our classification algorithm what it is to recognize and the latter is used for testing its performance. The actual categorization problem is a bit more complex, but for the purpose of explanation I'll say we want to distinguish webpages belonging to faculty (the positive class) from webpages belonging to non-faculty (the negative class, comprising students, projects, clubs, etc.). Each document is labelled with its class, and the labels are supplied to the classifier for training, but only used for comparing its outputs in testing.

How do we go about this? We need to structure each document into a form from which its meaning can be easily extracted. One of the most popular techniques from years of research in text analysis is to simply take all the words appearing in a document and count their occurrences. The order of the words is ignored, and the output we get is basically a count of how many times each word in an arbitrary vocabulary or dictionary appears in the document. This is termed a document vector. There may be additional preprocessing and normalization steps involved, but I'll skip those for now. As an example, a document like:
apple banana apple papaya
will result in a vector:
apple=2 banana=1 orange=0 papaya=1 watermelon=0 ...
for a fruit-based dictionary. Each vector thus supplies the classifier with information about which words occur in the document and how frequently. The classifier then looks for common words in documents belonging to the same class, and uses them to try to assign new, unseen documents to the correct classes.
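For the curious, here is a minimal Python sketch of that counting step. The function name `document_vector` and the five-word `VOCABULARY` are made up for illustration; this is not the code I actually used, and it skips the preprocessing and normalization mentioned above.

```python
from collections import Counter

# Hypothetical fruit vocabulary, matching the example above.
VOCABULARY = ["apple", "banana", "orange", "papaya", "watermelon"]


def document_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text, ignoring word order."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words it has never seen.
    return [counts[word] for word in vocabulary]


vector = document_vector("apple banana apple papaya", VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

Words that are not in the vocabulary are simply dropped, which is why the choice of vocabulary matters so much.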

So what's this got to do with visualization?

Now, the problem I was running into was that my algorithm kept giving me excellent performance on the training set, but not on the testing set. This is very bad for any classifier because it means that it is not able to work on anything it hasn't seen before, making it useless. It is not an uncommon issue though. The classifier ends up overfitting the data: it places so much emphasis on the data it has been given that it leaves little wiggle room to deal with ambiguity in unseen data. This is like the anecdote of the kid who complains to his mother that he could not solve an addition problem in class because he had been taught to do it with apples and the problem asked him to add oranges. Similarly, what we want the classifier to realize is that it is certain words that matter, not all of them.

So, I went about trying to fix the apparent overfitting. This is done by something called regularization, which is a fancy word for penalizing your algorithm when it tries to improve its training performance by fitting the training data too closely. I tried different values for the regularization parameters, different regularization techniques, and different parameters for other components of my algorithm. Nothing worked.
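To give a flavour of what such a penalty looks like, here is a small sketch of one common form, an L2 penalty on the model weights. I am not saying this is the technique I tried; the function name `regularized_loss`, the `strength` parameter and all the numbers are invented for illustration.

```python
def regularized_loss(training_loss, weights, strength):
    """Training loss plus an L2 penalty that grows with the size of the weights."""
    penalty = strength * sum(w * w for w in weights)
    return training_loss + penalty


# Two hypothetical models: the first fits the training data better,
# but only by using much larger weights.
tight_fit = regularized_loss(0.05, [4.0, -3.0], strength=0.1)
loose_fit = regularized_loss(0.20, [0.5, -0.5], strength=0.1)
print(tight_fit, loose_fit)
```

With the penalty switched on, the model with the smaller weights comes out ahead even though its raw training loss is worse. Raising or lowering `strength` is the kind of knob I was turning.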

Puzzled but quite sure that something should have worked by now, I thought of checking the data. But 5,000 documents drawing on a vocabulary of some 7,500 words are not easy to wade through. That's when my instincts kicked in, and I went to the comforting familiarity of images.

Visualization of the positive training set

I visualized the entire positive class of document vectors in the training set as the image above. The vertical dimension consists of the words in the vocabulary, the horizontal dimension consists of the documents, and the brightness of each pixel tells us how frequently a particular word occurs in the corresponding document. There is some normalization done as well, which is responsible for the relatively uniform brightness you see in the horizontal dimension.
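If you want to try something similar, the following is a rough sketch of how document vectors could be turned into such an image using nothing but the Python standard library. It assumes one particular normalization (scaling each document by its largest count), which may not match what I did, and it emits a plain-text PGM greyscale image, a format most image viewers can open. The names `to_pixel_rows` and `to_pgm` are made up.

```python
def to_pixel_rows(doc_vectors):
    """Return 0-255 brightness values: one row per word, one column per document."""
    n_words = len(doc_vectors[0])
    columns = []
    for vec in doc_vectors:
        peak = max(vec) or 1  # avoid dividing by zero for an empty document
        columns.append([round(255 * count / peak) for count in vec])
    # Transpose so that words run vertically and documents horizontally.
    return [[col[w] for col in columns] for w in range(n_words)]


def to_pgm(pixel_rows):
    """Serialize the pixel rows as a plain-text (P2) PGM image, one value per line."""
    height, width = len(pixel_rows), len(pixel_rows[0])
    values = [str(v) for row in pixel_rows for v in row]
    return "\n".join(["P2", f"{width} {height}", "255"] + values) + "\n"


# Toy data: two documents over a three-word vocabulary.
toy_vectors = [[4, 1, 0], [0, 0, 3]]
rows = to_pixel_rows(toy_vectors)
image_text = to_pgm(rows)
```

Writing `image_text` to a file with a `.pgm` extension gives an image that is two pixels wide and three pixels tall; with real data it would be one column per document and one row per vocabulary word.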

Notice there are some bright rows in the picture. I looked back into the vocabulary for the words these rows correspond to, and they turned out to be words such as "professor", "faculty", "department", "research", etc. - dead giveaways for the positive class, especially when more than one of them occurs on the webpage.
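Looking up the bright rows can also be done programmatically. The sketch below ranks vocabulary words by their total count across all documents, which is one simple stand-in for "brightness of a row"; the toy vocabulary and counts are invented, and `brightest_words` is a hypothetical helper, not something from my actual code.

```python
def brightest_words(doc_vectors, vocabulary, top_k=3):
    """Return the top_k vocabulary words with the highest total count."""
    totals = [sum(vec[i] for vec in doc_vectors) for i in range(len(vocabulary))]
    ranked = sorted(range(len(vocabulary)), key=lambda i: totals[i], reverse=True)
    return [vocabulary[i] for i in ranked[:top_k]]


# Invented toy data: three "faculty" pages over a four-word vocabulary.
toy_vocabulary = ["professor", "research", "pizza", "club"]
toy_vectors = [
    [3, 2, 0, 0],
    [2, 1, 1, 0],
    [4, 3, 0, 1],
]
print(brightest_words(toy_vectors, toy_vocabulary, top_k=2))
```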

Now, let's take a look at the testing set of the positive class.

Visualization of the positive testing set

Notice the lack of bright horizontal lines or any significant structure for that matter. This immediately violates one of the preconditions for successful classification - the distribution of words in the training and testing sets should be approximately the same (the other condition is that there should be some distinction between the positive and negative classes). No classifier can do well on such a problem, because there is no information about the testing set that can be learned from the training set.
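I spotted the mismatch by eye, but a crude numeric check is possible too. The sketch below compares the average document vector of the training set with that of the testing set using cosine similarity. This is an illustrative stand-in rather than what I actually did, and the toy numbers (including the "corrupt" test set) are invented.

```python
import math


def mean_vector(doc_vectors):
    """Average count of each word across a set of documents."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented toy data over the same four-word vocabulary.
train = [[3, 2, 0, 0], [4, 3, 0, 1]]
healthy_test = [[2, 2, 0, 0], [3, 1, 1, 0]]
corrupt_test = [[0, 0, 3, 2], [0, 1, 2, 4]]

healthy_score = cosine_similarity(mean_vector(train), mean_vector(healthy_test))
corrupt_score = cosine_similarity(mean_vector(train), mean_vector(corrupt_test))
print(healthy_score, corrupt_score)
```

A score near 1 means the two sets use words in roughly the same proportions; a score near 0 is the numeric equivalent of the structureless image above.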

That long-winded explanation is how I realized that there was something wrong with the data. I checked and yup, there was some corrupt data that had caused the document vectors to be improperly calculated. Once that was fixed, testing performance started tracking training performance right away.

Moral of the story? Don't trust the data, especially if you're in unfamiliar territory.

Monday, 10 November 2014

Greetings, planet!


What? Did you really think I wasn't going to open with some variation on "Hello, world!"?

Anyway, I just thought of starting a blog, because I certainly don't tweet or share as much as I used to. Ironic, considering that I am working on social media analysis these days. Moreover, I read this wonderful article on Lifehacker the other day, which says that no matter how little you think you know about something, you are still an expert to someone. Therefore, I decided to start getting some of my thoughts out there, and hopefully generate some new ideas.

I really don't know what this blog is going to be about. Maybe some personal stuff, maybe some Formula 1, maybe some computer vision stuff. Without any clear idea of what I want this to be, but not wanting to add another blog to the catch-all of "Random Thoughts", I decided to go a bit technical with the title.

I'll try to post regularly and see how I take to blogging, so...