Friday, September 9, 2011

Rate of FemINism amongst LinkedIn members?

Have you joined LinkedIn, or edited your profile lately? Below the lines for your first and last names, there is a line for "Former/Maiden Name:".
The evidence: [screenshot of the "Former/Maiden Name:" line on the profile-edit page]
The obvious question is, "How and when did it come to be there?".

The interesting meta-question, Ash, is "Why has this so very 19th-century concept persisted well into the 21st century, and what does it tell us about the LinkedIn membership?"

Here's my answer (thank you, Megrah!): it is because the rate of femINism amongst LinkedIn members is very low, in fact about two per million, or one in 500,000, in the best case.

My assumptions:
1) Had they seen "Former/Maiden Name", any feminist worth their salt (sorry, Megrah!) would have objected to that language ...
2) ... and taken action --the difference between feminism and namby-pamby "humanism" is action-- for example by bringing it to the attention of LinkedIn...
3) ...and LinkedIn would have removed the sexist language.
(As of 9 September 2011, that line was still there on the "edit profile" page.)
4) Every former or current LinkedIn member has seen that line at least once, during registration.

Worst-case analysis: There are currently upwards of 120 million LinkedIn members. Only one "took action" (walking a hundred feet and chatting). So in the worst case, the rate of femINism is 1 in 120 million.

But wait, LinkedIn hasn't always had 120 million members! And how long did it take that one person (myself) to act?

That calls for a more sophisticated analysis, taking into account the duration of the members' exposure to the sexist line, and the time it took me to act from when I could first have noticed it.

Let's do the math, Barbie! Take LinkedIn's membership numbers: 4.5K in June 2003, increasing to 120M in August 2011. Assume (5) exponential growth, and calculate the time constant (in base 10 it is about 1/2 per year). Then simply integrate the membership over time and you get about 1 million man-years! Ohh ... OK ... people-years.
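Here is a minimal sketch of that integration, using only the membership figures quoted above (the elapsed time of about 8.2 years is my reading of June 2003 to August 2011):

import math
# membership figures quoted above
N0, N1 = 4.5e3, 120e6    # June 2003 and August 2011
years = 8.2              # June 2003 to August 2011
# exponential growth N(t) = N0 * 10**(k*t) fixes the base-10 time constant:
k = math.log10(N1/N0) / years
print "base-10 time constant: about %.2f per year" % k
# integrating N(t) over those years gives the cumulative exposure:
print "cumulative exposure: %.2g people-years" % ((N1 - N0) / (k * math.log(10)))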

I've been a member of LinkedIn for about 2 years, and it took me that long to notice and take (minor) action - so that is 2 people-years in the numerator. And there you have it:

Rate of FemINism amongst LinkedIn membership is at most 1 in 500,000!  

One possible quibble, and why it beautifully doesn't matter because of exponential growth: what if that line was only introduced later, not at the very beginning, but say when membership was already 10 times as large as at the start? Wouldn't that mean that the rate of feminism is actually much better, 2 per 100,000 (one tenth of a million), which is 1 in 50,000? That doesn't look too bad! Is the analysis so sensitive to assumptions about initial conditions?

Well, no. It takes LinkedIn only 2 years to increase its membership by a factor of 10. Over those early 2 years with exponentially fewer members, the loss in member exposure turns out to be only 135,000 people-years. So even with that ameliorative assumption, we would get a rate of femINism of 2 per (1 million - 135,000), or about 1 per 430,000 as opposed to 1 per 500,000. Does that really make you feel better?

Full Disclosure: I am an intern at LinkedIn (and proud of it) and a femINist (and proud of it).

Thursday, September 1, 2011

Cosine Similarity Murdabad!


On the use of the “Cosine Similarity” as a quantifier of the relatedness of elements of a linear space.

The Cosine Similarity is one of the quantifiers used in numerical text analysis for judging the similarity between texts, as follows:
Consider a text: a web document, an essay, or a book. Perform an e e cummings on it, which is jargon for the following set of operations: remove all punctuation and other non-letter keyboard characters, lowercase everything, and then parse it on whitespace. What you are left with is a collection of words. Now remove the various small and frequent “stop” words like “a”, “the”, “is”, etc., which account for about a third of all the words in Shakespeare. Now count the number of occurrences of each distinct word (each “word type”).
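A minimal sketch of that pipeline (the stop-word list here is a tiny stand-in; real lists run to hundreds of words):

import re
from collections import Counter

STOPWORDS = set(["a", "an", "the", "is", "of", "and", "to", "in"])  # stand-in list

def word_histogram(text):
  # the "e e cummings": strip non-letters, lowercase, parse on whitespace
  words = re.sub(r"[^A-Za-z\s]", " ", text).lower().split()
  # drop stop words, then count occurrences of each distinct word
  return Counter(w for w in words if w not in STOPWORDS)

print word_histogram("But, soft! what light through yonder window breaks?")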

If the word types are put in a fixed order, the list of their occurrence counts in a text is the “word histogram” of the text. The space of word histograms is linear, has a natural origin, and is non-negative; it is a whole-number-valued vector space. So let's call these things “word vectors”.

Any given text is represented by a point in this word vector space. In numerical text analysis (after TF-IDF weighting, if one so chooses) one wants a way to quantify the relatedness or similarity between two texts, each represented as a vector, and the simplest thing one can do with two vectors is … the scalar product! Which, as one knows, is the product of the magnitudes of the two vectors and the Cosine of the angle between them. Now, as far as “similarity” is concerned, the magnitudes of the word vectors don't matter (half of “Romeo and Juliet” is similar to the whole text).

So we have the numerical “Cosine Similarity”: it is close to 1 for vectors in almost the same direction and 0 for perpendicular vectors. Since the space is non-negative, there are no pairs of anti-parallel vectors with Cosine Similarity = -1.
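As a concrete sketch, over plain lists of counts (toy vectors, not real word histograms):

import math

def cosine_similarity(u, v):
  # scalar product of the two word vectors...
  dot = sum(a*b for a, b in zip(u, v))
  # ...divided by the product of their magnitudes
  return dot / (math.sqrt(sum(a*a for a in u)) * math.sqrt(sum(b*b for b in v)))

# half of a text versus the whole text: same direction, different magnitude
print cosine_similarity([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])   # 1.0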

Cosine Similarity as a cardinal number: you can't add it, you can't subtract it; in fact, you can't do any of the numerical operations on it. It isn't a distance function, and neither are its additive or multiplicative inverses. From the point of view of trying to do any geometry on this space, it is quite useless.

If what one is really trying to quantify is the angle between the two vectors, the Cosine is a very poor substitute. Since its slope at 0° is zero, it discriminates very poorly between close neighbours, which is precisely the region of interest. Other absurdities are manifest: vectors 60 degrees apart have a Cosine Similarity of 50%; vectors 30 degrees apart (a third of the way to perpendicular) have a Cosine Similarity of 86%! Bit vectors with only half their bits in common have a Cosine Similarity of 70%, but they also have a Sine Dissimilarity of 70%!

Now it turns out that the ArcSine of the square root of (1 minus the square of the Similarity) is a good metric distance. OK, so this is just the angle that the Cosine is the Cosine of. So the Cosine does have some value for calculating the angle between the vectors, except for that pesky vanishing slope, which amplifies errors at small angles. More on this when we consider the ordinal virtues of the Cosine.

So one should use the angle: it is the metric distance on projective word vector space, and one can use it to do all kinds of geometrical stuff (cluster density is an example).
However, there are better, less error-prone trigonometric means of calculating it: whereas Cosine(theta) = a . b for unit vectors,
Sine(theta/2) = |a - b| / 2 is monotonically increasing in theta, and strictly monotonic except at theta = 180°, which is in some sense the least interesting angle.
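A small sketch of the two routes to the angle (function names are mine), for unit vectors a tiny angle apart; the flat slope of the Cosine near 0 wipes the angle out entirely, while the half-chord form recovers it:

import math

def angle_via_cosine(u, v):
  return math.acos(sum(a*b for a, b in zip(u, v)))        # theta = ArcCos(a . b)

def angle_via_half_chord(u, v):
  chord = math.sqrt(sum((a-b)**2 for a, b in zip(u, v)))  # |a - b| = 2 Sine(theta/2)
  return 2 * math.asin(chord / 2)

theta = 1e-8
u = (1.0, 0.0)
v = (math.cos(theta), math.sin(theta))
print angle_via_cosine(u, v)      # 0.0 -- the angle has been rounded away
print angle_via_half_chord(u, v)  # ~1e-08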

What about that theorem in Manning and Schütze [1], which proves that the Cosine leads to the same ordering as does the angle? Yes, it is mathematically true (and obvious, I might add, since the Cosine is monotonic). However, it is not true computationally: an error of 0.025 in the Cosine spells the difference between 2 degrees and 12 degrees, which could completely invert the ordering!
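A quick check of that figure:

import math
# the Cosine is so flat near 0 that 2 and 12 degrees differ by only ~0.02:
print math.cos(math.radians(2)) - math.cos(math.radians(12))   # ~0.021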

In conclusion: Cosine Similarity is no good as a cardinal quantifier, no good as an ordinal quantifier, and horrendous for calculating theta.

As far as what I propose to do, you’ll just have to wait for the movie.

Cosine Similarity Murdabad![2]



[1] An excellent introductory text on “Foundations of Statistical Natural Language Processing”.
[2] Roughly translates as “Die! Die! Die!”

Saturday, August 13, 2011

Proof of Conjecture for paths in hypercube lattices

Problem:
  1. a hypercubical lattice in D dimensions
  2. start from O = (0, 0, ..., 0)
  3. allowed moves are only of the form (0, 0, ..., +1, ..., 0, 0), i.e. a unit step in one positive coordinate direction
  4. the end point of the path is (N1, N2, ..., ND)
What is the number of paths?

Answer:
The multinomial coefficient
(N1 + N2 + ... + ND) C (N1)(N2)...(ND) = (N1 + N2 + ... + ND)! / (N1! N2! ... ND!)

Proof:
Let the whole number valued coordinate axes be n_i, i = 1...D.
The end point lies on the hyperplane \sum(n_i) = \sum(N_i).
One has to choose \sum(N_i) times to get to that hyperplane, from among D choices each time (the choice of direction to move in).
To get to the point (N_1, N_2, ..., N_D), exactly N_i of the choices have to have been in the i-th direction.
Hence the number of paths is the number of ways of partitioning the \sum(N_i) choices in exactly that manner.

In 3D, this is the number of ways of choosing 3 Easts, 5 Norths and 2 Ups from 10 choices to reach the point 3E, 5N and 2U.

Hence the multinomial coefficient. QED

As a little extra: let's say we were only interested in paths between diagonally opposite corners of the D-hypercube, say (0, 0, ...) and (N, N, ...).

Then the number of paths is (D*N)!/(N!)**D, which was my conjecture two days ago!!!


Now, the analytical answer can be produced with 2 more lines of code: one to read the dimension from the command line, the other to read the coordinates of the end point. See the sketch below.
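Something like this (the script name and argument convention are mine, not an existing file):

#!/usr/bin/python
""" 'AnalyticalCountPathsD.py D N1 N2 ... ND'
    D-dimensional extension of AnalyticalCountPaths.py: reads the dimension
    and the end point, prints the multinomial coefficient
"""
import sys, math
def main():
  if len(sys.argv) < 3:
    print "usage: $ python AnalyticalCountPathsD.py D N1 [N2 ... ND]"
    return
  D = int(sys.argv[1])                     # extra line 1: read the dimension
  Ns = [int(a) for a in sys.argv[2:2+D]]   # extra line 2: read the end point
  ans = math.factorial(sum(Ns))
  for N in Ns:
    ans //= math.factorial(N)              # exact at every step
  print "Number of paths to", tuple(Ns), "=", ans
if __name__ == '__main__':
  main()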

The codeCounting program would need a bit more: a couple of for loops to write out the D-dimensional Python code (if needed) and then to call it.

But the amount of time it is going to take is going to be ...  HELP?

Last point for today: the factorial function can be extended to the complex plane (except the negative integers) by the Gamma function, via n! = Gamma(n+1).

So now I can legitimately answer the question: how many paths are there to, say, an integer diagonal point in a fractional dimension?

Answer = Gamma(N*D +1)/(N!)**D.
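A hedged little sketch of that (math.gamma is in the standard library from Python 2.7 on):

import math

def fractal_diag_paths(N, D):
  # Gamma(N*D + 1) / (N!)**D, with Gamma standing in for the factorial
  return math.gamma(N*D + 1) / math.factorial(N)**D

print fractal_diag_paths(2, 3)       # integer check: 6!/(2!)**3 = 90
print fractal_diag_paths(2, 1.585)   # a Sierpinski-flavoured "dimension"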

_ : Who cares about your perspective? What practical value can it possibly have?

Here goes: play fast and loose with "fractional" = "fractal". Now we actually have something we can visualize: paths on a square Sierpinski gasket, for example.

And the practical use? Conjecture: the structure of the web, and of networks on the web, is fractal. There is fractal behaviour even in Statistical Natural Language Processing; Benoit Mandelbrot did important work on it in the 1950s.

Let's say that the LinkedIn network is fractal. Maybe the above could be of some use in counting paths, for a better calculation of the clustering coefficient of an egonet.

From the perspective of SNLP, simple extensions of the above counting processes lead to the distributions of the sizes of stochastically generated words, e.g. with truncation on the choice of ' '.

Ranjeet


simplifying the boundary conditions

Last night and this morning, when I woke up, I was thinking about the extension to higher dimensions. One thought was that the number of elif conditions with special-case sums for the points on the boundary would grow as 2**D - 1: all the "lower" edges, faces, vertices, etc. Even if I wrote Python code to
  1. take in the dimension as an argument and
  2. have it write the Python code for the dimension required (which I'll do in a minute),
this would be quite tedious.
Then I realised that the recursive self-calls stop when they encounter a value, so I can just set the boundary conditions for the "parent nodes" on the enveloping lower hyperplanes (of which there are only D) to be 0.

Here is the somewhat more "biutiful" code (I rented it, but ended up watching Terminator 3 instead):

#!/usr/bin/python                                                              
""" 'codeCountPaths1.py Natural [Natural]'                                     
    Single argument -> (n,n)                                                   
    code to count the number of paths                                          
    Assumption 1: starting at (0,0)                                            
    Assumption 2: ending at user input (n,m)                                   
    moving in (+,+) direction only, on a complete Euclidean square lattice.    
    CHANGE from codeCountPaths.py:                                             
    A branch of the recursion will stop when it encounters a value, so         
    BCs can be accounted for by putting in lines of                            
    0s at n= -1 and m = -1 as opposed to                                       
    restricting the sum for boundary points to just those parent nodes         
    which lie within the domain. There is no need for the 0-valued             
    parent nodes to go back to the n+m = 0 line (or soon2B hyperplane)         
"""
import sys, math
def NumPaths(i,j,n,m):
  if i==0 and j==0:
    numpaths = 1 # BC at origin
  elif ((i==-1) and j in range(1,m+1)) or ((j==-1) and i in range(1,n+1)):
    numpaths = 0 # BCs at enveloping lower lines
  else:
    numpaths = NumPaths(i-1,j,n,m) + NumPaths(i,j-1,n,m)
  return numpaths
def main():
  # reads command line arguments for the location of the ending point
  if len(sys.argv) < 2:
    print "usage: $ python codeCountPaths1.py Natural [Natural]"
    return
  elif len(sys.argv) == 2:
    n = int(sys.argv[1])
    m = int(sys.argv[1])
  else:
    n = int(sys.argv[1])
    m = int(sys.argv[2])

  # scolds user for trying to trick poor little machine
  if (n < 0) or (m < 0):
    print "BoundError: n >= 0 and m >= 0. Tricks later, for now positive integers only please"
    return
  # asks for answer, and sits back for an awfully long time
  ans = NumPaths(n,m,n,m)
  print "Code counted number of paths from (0,0) to (", n, ',', m, ") on a complete Euclidean square lattice moving only in the (+,+) direction =", ans
if __name__ == '__main__':
  main()

Question: I have a recursion relation for the number of time-steps needed, but I can't solve it analytically, nor have I yet coded it up correctly to just count each time a call is made.

Can one of you CS gods help put better bounds on "awfully long time"?
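In the meantime, here is one crude empirical probe: a variant of NumPaths (names are mine) with a global call counter bolted on.

calls = 0
def NumPathsCounted(i, j):
  global calls
  calls += 1
  if i == 0 and j == 0:
    return 1
  elif i < 0 or j < 0:
    return 0   # same enveloping-zero BCs as above
  else:
    return NumPathsCounted(i-1, j) + NumPathsCounted(i, j-1)

for n in range(1, 11):
  calls = 0
  NumPathsCounted(n, n)
  print "n = %2d  paths take %7d calls" % (n, calls)

For what it's worth, the recursion tree has at least one leaf per path, so the call count grows at least as fast as the answer itself: "awfully long time" is exponential in n. Caching already-computed values would cut it to order n*m calls, but that spoils the fun of watching it crawl.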

Code for A BEAUTIFUL BEAUTIFUL PROBLEM

Beautiful code!!!

OK: I can't believe
0) yesterday's code was so Uuugly!!
1) I completely missed all sorts of beautiful things, and the joy of having coded this problem (successfully, meaning without crashing my computer or bringing down the whole LinkedIn site, even though I had been careful enough to close all connections before hitting 'return')
2) Yesss! it was also correct!

Yesterday's code:
#!/usr/bin/python
""" 'diagPaths.py' """
import sys, math
def diagN(n):
  # paths between diagonal corners of an n-point-per-side square lattice
  if n < 1:
    print "tricks later, for now positive integer only please"
    return None
  else:
    return math.factorial(2*(n-1))/math.factorial(n-1)**2
def main():
  if len(sys.argv) < 3:
    print "usage: $ python diagPaths.py --n natural number"
  elif sys.argv[1] == '--n':
    n = int(sys.argv[2])
    ans = diagN(n)
    print "Number of paths between diagonal vertices for", n, "-square lattice =", ans
if __name__ == '__main__':
  main()

3) If you want to exploit the beauty and power of recursion, code backwards! The number of paths to get to (i,j) is numpaths(i-1,j) + numpaths(i,j-1). Then just ask for numpaths(whatever you want) and let the code call itself ...

4) ... Modulo the bounds: while thinking about the bounds I realised that the upper bounds don't constrain anything. Which in turn means ...

5) The number of paths to the point (i,j) from (0,0) is just (i+j)C(j). Why? The point lies on the (i+j)th diagonal, j places from one end of it, so the count is just the binomial coefficient given above.

6) Which further means that we have an alternative proof of the identity for the sum of the squares of the binomial coefficients: yesterday I explained that the numpaths to get to (n,n) is the sum of the squares of the numpaths to get to the points on the transvecting diagonal, which is the sum of the squares of the binomial coefficients. Today, I've explained that the numpaths to get to (i,j) is (i+j)C(j). Hence the numpaths to get to (n,n) is (2n)C(n) = (2n)!/(n!)^2 !!!! (That is excitement, not factorials.)
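A quick sanity check of that identity (for n = 5):

import math
C = lambda n, k: math.factorial(n) // (math.factorial(k) * math.factorial(n-k))
print sum(C(5, j)**2 for j in range(6))   # sum of the squared binomials: 252
print C(10, 5)                            # (2n)C(n) for n = 5: also 252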

Analytical answer:

#!/usr/bin/python                                                       
""" 'AnalyticalCountPaths.py' 1 argument -> end point (n,n)                    
     2 arguments -> (n,m)                                                      
     assume complete square Euclidean lattice                                  
     starting point = (0,0)                                                    
"""
import sys, math
def main():
  if len(sys.argv) < 2:
    print "usage: $ python AnalyticalCountPaths.py Natural [Natural]"
    return
  elif len(sys.argv) == 2:
    n = int(sys.argv[1])
    m = int(sys.argv[1])
  else:
    n = int(sys.argv[1])
    m = int(sys.argv[2])
  if (n < 0) or (m < 0):
    print "EEp! Bloink! HAL No Compute, go fall into pole of Riemann Zeta. BoundError"
    print "For now positive integers only please"
    return
  ans = math.factorial(n+m)/(math.factorial(n)*math.factorial(m))
  print "Analytical solution: Number of paths from (0,0) to (", n, ',', m, ") on a complete Euclidean square lattice moving only in the (+,+) direction =", ans
  print "\n Oh my but this problem is beautiful!! Alternate proof of sum of squares of Binomial coefficients!"
if __name__ == '__main__':
  main()

So why do I need to write the code for actually counting this?
Because, I can't believe I am saying this, it is EVEN more beautiful!
AND FINALLY:
#!/usr/bin/python                                                      
""" 'codeCountPaths.py Natural [Natural]'                                      
    Single argument -> (n,n)                                                   
    code to count the number of paths                                          
    Assumption 1: starting at (0,0)                                            
    Assumption 2: ending at user input (n,m)                                   
    moving in (+,+) direction only, on a complete Euclidean square lattice.    
"""
import sys, math
def NumPaths(n,m):
  if (n < 0) or (m < 0):
    print "BoundError: n >= 0 and m >= 0. Tricks later, for now positive integers only please"
    return 0
  elif n==0 and m==0:
    numpaths = 1 # BC at origin
  elif n == 0:
    numpaths = NumPaths(n, m-1) # only one parent node on the lower edges
  elif m == 0:
    numpaths = NumPaths(n-1,m)
  else: # THE MEAT OF THE CODE
    numpaths = NumPaths(n-1,m) + NumPaths(n,m-1)
  return numpaths
def main():
  if len(sys.argv) < 2:
    print "usage: $ python codeCountPaths.py Natural [Natural]"
    return
  elif len(sys.argv) == 2:
    n = int(sys.argv[1])
    m = int(sys.argv[1])
  else:
    n = int(sys.argv[1])
    m = int(sys.argv[2])
  ans = NumPaths(n,m) # here is where I ask for the answer just once!
  print "Code counted number of paths from (0,0) to (", n, ',', m, ") on a complete Euclidean square lattice moving only in the (+,+) direction =", ans
  print "\n Oh my but this problem is beautiful!!"
if __name__ == '__main__':
  main()

I'll confess I was nervous when I ran this: I'd already calculated the answer in my head, and was praying to Pascal that my code-counting would be right!

Did I thank _ enough for this problem?
Tomorrow: we'll relax the Euclidean requirement, then the "completeness".

By The Way: I should remind myself that any second year Numerical Path Integral Physics Grad student is probably doing this in their sleep.

Thursday, August 11, 2011

A BEAUTIFUL BEAUTIFUL PROBLEM
& BEAUTIFUL CODE WITHOUT CODE


The problem, as I recall it being described to me, is to calculate the number of ways to get from one vertex of a square lattice (of size N points per side) to the diagonally opposite one, where along either axis each move is always in the same direction, say ++. It is or used to be a popular interview question to test the interviewee’s coding skills.

My zeroth thought … I’ll come back to that later.

My first thought was, of course, an iterative approach: figure out a process to get from the (k-1)th diagonal to the kth. At each step of the iteration, use the distribution of ways to get to the pth point on the (k-1)th diagonal to calculate the number of ways to get to the qth point on the kth, and store this new distribution; start from k=2 and repeat until k=N. OK, that gets us to the diagonal. Thought 1.1 was divide and conquer: get to the transvecting diagonal between the vertices, square the distribution, and then add.

My second thought was to take a recursive approach for the first part (the distribution on the diagonal):
def diag(k):
  if k == 1:
    return [1]
  else:
    oldlist = [0] + diag(k-1) + [0]  # pad with zeros so the edge points work out
    newlist = [ oldlist[i] + oldlist[i+1] for i in range(0,k) ]
    return newlist
I had forgotten how much more beautiful recursion is than iteration!
All you do is call diag(N), square the elements and sum!
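In code, using the diag above:

def count_diagonal_paths(N):
  # square the distribution on the transvecting diagonal and sum
  return sum(x*x for x in diag(N))

print [count_diagonal_paths(n) for n in range(1, 6)]   # [1, 2, 6, 20, 70]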

I think the above goes as n**3.

Time to implement and try it:
$ python diagonalPaths.py --n natural number

OK, so for n =1,2,3,4,5 we get 1,2,6,20,70.

So how do we know it is right? Let’s say you’ve debugged the code, validated it for  n=1,2,3. It agrees with what the interviewer has. The answer agrees with the consensus, still, how do we know that it is right?

Put another way, can you predict the answer for n = 6 or any n, before you run your code, or at least, in less than n**3 time?
I’ll claim that my code runs in n**1 time. How?

Well, I didn’t code the recursion. Which brings me to my 0th thought: I can do it analytically, so why code it?

One can readily see that the distribution on the diagonal is the binomial, since at every node you are choosing between the two ‘+’ directions, and the number of ‘+’s along one axis gives you the position along the diagonal. If we just wanted to get to the diagonal, the number of ways would be 2**(n-1). So the answer we want is the sum of the squares of the binomial coefficients, which luckily has a closed form: (2N)!/(N!)**2. If ! goes as N-time, so does my code.

Comment 0: I can also ask the question “to calculate the number of ways to get from one vertex of a square lattice (of size N points per side)”, where N is:
Non-positive
Rational
Real (for the computer this should be pseudo-real, or float, I think)
Complex
A matrix (real or complex valued)
Anything which forms a group under multiplication, or under any operation that satisfies the properties of multiplication, possibly with the commutativity requirement relaxed.

Since n! = Gamma(n+1), any of the above can be calculated (convergence for matrices will be harder to prove). 

I wouldn’t have the slightest idea how to code any of the above!

Comment 1: What about lattices of different dimensions? Now here you’ve got me; I can’t avoid coding. If we are only interested in the number of ways of getting to the nearest transvecting diagonal plane, the answer is simply D**(n-1). But it is no longer sufficient to square the multinomial coefficients and add, since we have to get to the other transvecting planes before getting from the last one to the opposite vertex. There are two such planes in 3D, with normal vector (1,1,1), at distances 1/sqrt(3) and 2/sqrt(3) from the origin. Similarly in higher dimensions.

Now if you construct the following D-dimensional solids, by taking two D-simplices and joining them on their common facets, then we are reduced to summing the squares of the multinomial coefficients. But otherwise, the “coding” value of this problem lies in its extension to higher dimensions.

Comment 1.1: I am going to take a reasonable guess and claim that the answer in D dimensions is (Dn)!/(n!)**D. Again, by using the Gamma function (and extending it to the complexes), we can allow both n and D to take on complex values. I have no idea what a complex dimension looks like (though I suspect that John Conway and others do), but a fractional dimension, or at least a real-valued fractal dimension, that I can visualize. The famous Sierpinski gasket (see also http://www.jimloy.com/fractals/sierpins.htm) has a fractal dimension (box-counting = Hausdorff) of 1.585. Similarly, if I construct the square equivalent (taking away 4 of the nine subsquares of a tictactoization of the square at every recursion), I get a lattice of fractal dimension ln(5)/ln(3) ~ 1.465 (5 copies, 1/3 the linear dimension each). I can readily imagine that the constraints of having to pass through certain lattice points reduce the number of paths relative to the number in a square lattice.
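For the record, the two dimensions just quoted:

import math
print math.log(3) / math.log(2)   # Sierpinski gasket: ~1.585
print math.log(5) / math.log(3)   # the "tictactoized" square: ~1.465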

Comment 2: What a beautiful introduction to Path Integrals! The function of paths we are “integrating” above is just 1. But you can readily imagine summing the length L of the path over all paths, or say the phase change over the path, exp(i L), or indeed any other function of paths. If you use L, you are doing geometric optics; if you use some particle action S, why, “welcome to quantum mechanics!”

I’d like to thank _ for suggesting this problem.