Friday, September 9, 2011

Rate of FemINism amongst LinkedIn members?

Have you joined LinkedIn or edited your profile lately? Below the lines for your first and last names, there is a line for "Former/Maiden Name:".
The evidence

The obvious question is, "How and when did it come to be there?".

The interesting meta-question is, "Why has this so very 19th-century concept persisted well into the 21st century, and what does it tell us about the LinkedIn membership?"

Here's my answer (thank you, Megrah!): it is because the rate of femINism amongst LinkedIn members is very low, in fact about two per million, or one in 500,000 in the best-case scenario.

My assumptions:
1) Had they seen "Former/Maiden Name", any feminist worth their (sorry, Megrah!) salt would have objected to that language ...
2) ... and taken action --the difference between feminism and namby-pamby "humanism" is action-- for example by bringing it to the attention of LinkedIn ...
3) ...and LinkedIn would have removed the sexist language.
(As of 9 September 2011, that line was still there on the "edit profile" page.)
4) Every former or current LinkedIn member has seen that line at least once, during registration.

Worst-case analysis: There are currently upwards of 120 million LinkedIn members. Only one "took action" - walking 100 feet and chatting. So in the worst case, the rate of femINism is 1 in 120 million.

But wait, LinkedIn hasn't always had 120 million members! And how long did it take that one person (myself) to act?

That calls for a more sophisticated analysis, taking into account the duration of exposure of the sexist line to members and the time it took me from when I could have first noticed it till I acted.

Let's do the math, Barbie! Take LinkedIn's membership numbers: 4.5K in June 2003, increasing to 120M in August 2011. Assume 5) exponential growth, and calculate the time constant (in base-10 it is about 1/2 per year). Then simply integrate the membership over time and you get about 1 million man-years! Ohh ... OK ... people-years.
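The time constant - and the factor-of-ten-in-two-years fact - can be checked in a few lines. A sketch in Python, using only the membership figures quoted above and assumption 5):

```python
import math

# Membership figures quoted above; assume pure exponential growth between them.
n_start = 4.5e3     # members, June 2003
n_end = 120e6       # members, August 2011
years = 8 + 2 / 12  # June 2003 to August 2011

# N(t) = n_start * 10**(k * t)  =>  k = log10(n_end / n_start) / years
k = math.log10(n_end / n_start) / years
print(f"base-10 time constant: {k:.2f} per year")          # ~0.54, i.e. about 1/2
print(f"growth over any 2-year window: {10**(2*k):.0f}x")  # ~12x, about a factor of 10
```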

I've been a member of LinkedIn for about 2 years, and it took me that long to notice and take (minor) action - so that is 2 people-years in the numerator. And there you have it:

Rate of FemINism amongst LinkedIn membership is at most 1 in 500,000!  

One possible quibble, and why it beautifully doesn't matter because of exponential growth: What if that line was only introduced later, not at the very beginning but, say, when membership was already 10 times as large as at the beginning? Wouldn't that mean that the rate of feminism is actually much better, 2 per 100,000 (one tenth of a million), which is really 1 out of 50,000? That doesn't look too bad! Is the analysis so sensitive to assumptions about initial conditions?

Well, no. It takes LinkedIn only 2 years to increase its membership by a factor of 10. Over those early 2 years with exponentially fewer members, the loss in member exposure turns out to be only 135,000 people-years. So even with that ameliorative assumption, the rate of femINism comes to 2 per (1 million - 135,000), or about 1 per 430,000 as opposed to 1 per 500,000. Does that really make you feel better?

Full Disclosure: I am an intern at LinkedIn (and proud of it) and a femINist (and proud of it).

Thursday, September 1, 2011

Cosine Similarity Murdabad!


On the use of the “Cosine Similarity” as a quantifier of the relatedness of elements of a linear space.

The Cosine Similarity is one of the quantifiers used in numerical text analysis for judging the similarity between texts, as follows:
Consider a text, either a web document, an essay or a book. Perform an e e cummings on it, which is jargon for the following set of operations: remove all punctuation and other non-letter keyboard characters, lowercase everything and then parse it on whitespace. What you are left with is a collection of words. Now remove various small and frequent “stop” words like “a”, “the”, “is” etc., which account for about a third of Shakespeare’s vocabulary. Now count the number of occurrences of the distinct words (or “word-tokens”).
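The pipeline just described can be sketched in plain Python (the tiny stop-word list here is illustrative, not a standard one):

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "or", "that"}  # illustrative subset

def word_histogram(text: str) -> Counter:
    # "e e cummings" the text: replace non-letters with spaces, lowercase, split.
    tokens = re.sub(r"[^A-Za-z]+", " ", text).lower().split()
    # Drop stop words, then count occurrences of the distinct word-tokens.
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(word_histogram("To be, or not to be: that is the question."))
# Counter({'be': 2, 'not': 1, 'question': 1})
```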

If the word tokens are ordered, the list of the corresponding numbers of occurrences in a text is the “word histogram” of the text. The space of word histograms is linear, has a natural origin and is non-negative: it is a whole-number-valued vector space. So let’s call these things “word vectors”.

Any given text is represented by a point in this word vector space. In numerical text analysis (after TF-IDF weighting, if one so chooses) one wants a way to quantify the relatedness or similarity between two texts – each represented as a vector – and the simplest thing one can do with two vectors is … the scalar product! Which, as one knows, is the product of the magnitudes of the two vectors and the Cosine of the angle between them. Now as far as “similarity” is concerned, the magnitudes of the word vectors don’t matter (half of “Romeo and Juliet” is similar to the whole text).

So we have the numerical “Cosine Similarity”: it is close to 1 for vectors in almost the same direction and 0 for perpendicular vectors. Since the space is non-negative, there are no pairs of anti-parallel vectors with Cosine Similarity = -1.
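A minimal sketch of the computation, which also shows the magnitude invariance noted above: halving every count in a word vector leaves the Cosine Similarity unchanged.

```python
import math
from collections import Counter

def magnitude(v: Counter) -> float:
    return math.sqrt(sum(c * c for c in v.values()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Scalar product over the shared word-tokens, divided by both magnitudes.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    return dot / (magnitude(a) * magnitude(b))

whole = Counter({"love": 4, "night": 2, "death": 2})
half = Counter({"love": 2, "night": 1, "death": 1})  # half of every count
print(cosine_similarity(whole, half))  # ~1.0: magnitude doesn't matter
print(cosine_similarity(Counter({"love": 3}), Counter({"war": 5})))  # 0.0: perpendicular
```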

Cosine similarity as a cardinal number: You can’t add it, you can’t subtract it, in fact you can’t do any of the numerical operations on it. It isn’t a distance function, nor are its additive or multiplicative inverses. From the point of view of trying to do any geometry on this space, it is quite useless.

If what one is really trying to quantify is the angle between the two vectors, Cosine is a very poor substitute. Since its slope at 0° is zero, it discriminates very poorly between close neighbours, which is precisely the region of interest. Other absurdities are manifest: vectors 60 degrees apart have a Cosine Similarity of 50%, and vectors 30 degrees apart (a third of the way to perpendicular) have a Cosine Similarity of 86%! Bit vectors with only half their bits in common have a Cosine Similarity of 70%, but they also have a Sine Dissimilarity of 70%!
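A quick numerical check of those figures, and of how flat Cosine is near 0°:

```python
import math

for deg in (1, 5, 30, 60, 90):
    print(f"{deg:2d} deg -> cosine similarity {math.cos(math.radians(deg)):.4f}")
# 1 deg -> 0.9998 and 5 deg -> 0.9962: close neighbours are barely distinguished,
# while 30 deg still scores 0.8660 ("86%") and 60 deg scores 0.5000 ("50%").
```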

Now it turns out that the ArcSine of the square root of (1 minus the square of the Similarity) is a good metric distance. OK, so this is just the angle that Cosine is the Cosine of. So Cosine has some possible value for calculating the angle between the vectors -- except for that pesky vanishing slope, which causes an amplification of errors at small angles. More on this when we consider the ordinal virtues of Cosine.

So one should use the angle: it is the metric distance on projective word vector space, and one can use it to do all kinds of geometrical stuff – cluster density is an example.
However, to calculate it there are better, less error-prone trigonometric means: whereas Cosine(theta) = a . b for unit vectors,
Sine(theta/2) = |a - b| / 2 is monotonically increasing in theta and strictly monotonic except at theta = 180°, which is in some sense the least interesting angle.
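A sketch comparing the two routes for unit vectors. The half-chord identity |a - b| = 2 Sine(theta/2) is standard trigonometry; at small angles the ArcCosine route loses precision because of the vanishing slope, while the half-chord route does not:

```python
import math

def angle_via_arccos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(dot))

def angle_via_chord(a, b):
    # |a - b| = 2 * sin(theta / 2) for unit vectors a and b
    chord = math.dist(a, b)
    return math.degrees(2 * math.asin(chord / 2))

theta = 0.01  # degrees: two nearly identical word vectors
a = (1.0, 0.0)
b = (math.cos(math.radians(theta)), math.sin(math.radians(theta)))
print(angle_via_arccos(a, b))  # recovers 0.01 to only ~8 significant digits
print(angle_via_chord(a, b))   # recovers 0.01 essentially to full precision
```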

What about that theorem in Manning and Schütze[1], which proves that Cosine leads to the same ordering as does the angle? Yes, it is mathematically true – and obvious, I might add, since Cosine is monotonic. However, it is not true computationally: an error of 0.025 in the Cosine spells the difference between 2 degrees and 12 degrees, which could completely invert the ordering!
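The figures in that example are easy to verify:

```python
import math

c2, c12 = math.cos(math.radians(2)), math.cos(math.radians(12))
print(f"cos(2 deg) = {c2:.5f}, cos(12 deg) = {c12:.5f}")  # 0.99939 and 0.97815
print(f"difference = {c2 - c12:.4f}")  # 0.0212 -- comfortably within an error of 0.025
```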

In conclusion: Cosine Similarity is no good as a cardinal quantifier, no good as an ordinal quantifier, and horrendous for calculating theta.

As far as what I propose to do, you’ll just have to wait for the movie.

Cosine Similarity Murdabad![2]

[1] An excellent introductory text on “Foundations of Statistical Natural Language Processing”.
[2] Roughly translates as “Die! Die! Die!”