COSINE SIMILARITY MURDABAD!
On the use of the “Cosine Similarity” as a quantifier of the relatedness of elements of a linear space.
The Cosine Similarity is one of the quantifiers used in numerical text analysis for judging the similarity between texts, as follows:
Consider a text, either a web document, an essay or a book. Perform an e e cummings on it, which is jargon for the following set of operations: remove all punctuation and other non-letter keyboard characters, lowercase everything and then parse it on whitespace. What you are left with is a collection of words. Now remove various small and frequent “stop” words like “a”, “the”, “is” etc., which account for about a third of Shakespeare’s vocabulary. Now count the number of occurrences of the distinct words (or “word-tokens”).
If the word tokens are ordered, the list of the corresponding number of occurrences in a text is the “word histogram” of the text. The space of word histograms is linear, has a natural origin and is non-negative, it is a whole-number valued vector space. So lets call these things “word vectors”.
Any given text is represented by a point in this Word vector Space. In numerical text analysis (after TFLDF if one so chooses) one wants a way to be able to quantify the relatedness or similarity between two texts – each represented as a vector, and the simplest thing one can do with two vectors is … scalar product! Which, as one knows, is the product of the magnitudes of the two vectors and the Cosine of the angle between them. Now as far as “similarity” is concerned, the magnitudes of the word vectors don’t matter (Half of “Romeo and Juliet” is similar to the whole text.).
So we have the numerical “Cosine Similarity”: it is close to 1 for vectors in almost the same direction and 0 for perpendicular vectors. Since the space is non-negative, there are no pairs of anti-parallel vectors with Cosine Similarity = -1.
Cosine similarity as a cardinal number: You can’t add it, you can’t subtract it, in fact you can’t do any of the numerical operations on it. It isn’t a distance function, nor are its additive or multiplicative inverses. From the point of view of trying to do any geometry on this space, it is quite useless.
If what one is really trying to quantify is the angle between the two vectors, Cosine is a very poor substitute. Since its slope at 0o is zero, it discriminates very poorly between close neighbours, which is precisely the region of interest. Other absurdities are manifest: vector 60 degrees apart from each other have a Cosine Similarity of 50%, vectors 30 degrees (a third of the way to perpendicular to each other) have a Cosine Similarity of 86%! Bit vectors with only half their bits in common have a Cosine Similarity of 70%, but they also have a Sine Dissimilarity of 70%!
Now it turns out that ArcSine of the square-root of (1 minus the square of the Similarity) is a good metric distance. OK, so this is just the angle that Cosine is the Cosine of. So Cosine has possible value for calculating the angle between the vectors. Except for that pesky vanishing slope, which causes an amplification of errors at small angles. More on this when we consider the ordinal virtues of Cosine.
So one should use the angle, it is the metric distance on projective word vector space and one can use it to do all kinds of geometrical stuff – cluster density is an example.
However, to calculate it there are better, less error-prone trigonometric means: whereas Cosine(theta) = a . b for unit vectors,
Sine(theta/2) = |a – b|/2 is monotonic increasing in theta and strictly monotonic except at theta =180, which is in some sense the least interesting angle.
What about that theorem in Manning and Schütze, which proves that Cosine leads to the same ordering as does the angle? Yes, it is mathematically true, and obvious, I might add, since Cosine is monotonic. However, it is not true computationally: an error of 0.025 in the Cosine spells the difference between 2 degrees and 12 degrees, which could completely invert the ordering!
In conclusion: Cosine Similarity is no good as a cardinal quantifier, no good as an ordinal quantifier, and horrendous for calculating theta.
As far as what I propose to do, you’ll just have to wait for the movie.
Cosine Similarity Murdabad!