theta_etc: musings and solutions to interesting problems, ones that I think are easy but then have a lot of structure. By ClimberT8.

trying to embed pdf (2016-08-25)

Introductory text or abstract, then<br />
to read more:<br />
Here is the <a href="https://plus.google.com/113068923602287446137/posts/YRWaMJ8yQNH" target="_blank">doc</a><br />
<br />
so the steps are:<br />
upload the pdf to Drive<br />
change its sharing settings to public<br />
copy the link<br />
then in Blogger: new post, compose, insert the link and whatever text<br />
then publish; it will appear on Google+.<br />
<br />
Check: as a reader, clicking the link takes you to the pdf on the owner's Google+ or Google Docs page; clicking yet again on that leads to the full pdf.<br />
<br />
If this is what Google does with simple pdf publication, how can we trust them to design a self-driving car?

Video CPM as a Function of Demand - Complete (2015-06-07)
<html>
<title>Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
// Shift the decimal point via exponent notation before rounding,
// which avoids float artifacts like Math.round(1.005*100) === 100.
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on Tiered Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on Tiered Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on Tiered Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">The 13 biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Tier I-III inventory, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGwXGyjCDWwl1Q5kx6ohGODBwCRelZq0donYhwAKlcb9yJBZr8bBUTb8pcy6uz9xydnkUMKTFi8QsWGIBZt_7X_gWDNedaqudEw5FXPnMdWt_P-p5byKbNvSU9wC2gncwOTsDtGgd9cj4L/s1600/BudgetTable.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGwXGyjCDWwl1Q5kx6ohGODBwCRelZq0donYhwAKlcb9yJBZr8bBUTb8pcy6uz9xydnkUMKTFi8QsWGIBZt_7X_gWDNedaqudEw5FXPnMdWt_P-p5byKbNvSU9wC2gncwOTsDtGgd9cj4L/s320/BudgetTable.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">For each of these budget points we have the spend, cost and impressions in the TierI-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRAJKH7xUOYVr7R19uZ2jDrvapILKSVS7JS7aLQU_6PccUpcYDNfIAZGdtOQI4-RkuEeQY_HzkLowKouUhjMUC07N4I5zpMyJLNUG0NVnoj3LgZvlCXLdm-E99WWI_BC2I1XK9wJhm5XUI/s1600/TierI-III_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRAJKH7xUOYVr7R19uZ2jDrvapILKSVS7JS7aLQU_6PccUpcYDNfIAZGdtOQI4-RkuEeQY_HzkLowKouUhjMUC07N4I5zpMyJLNUG0NVnoj3LgZvlCXLdm-E99WWI_BC2I1XK9wJhm5XUI/s320/TierI-III_data.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
From other sources (Roni's presentation) we know that about 15% of all our impressions (50% of all avails and 20% of cost) are in Tier I-III, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAPoaNazxLyqhbINGvll-BuoiOAlKQEILrwwIzGSgMJ4cUZ7dbnCEPys7d04kHyXckcGmMcb0xyO13xPO07icL7s29zE3LUANv_dDdArQ9PRt1pSUkpOlzaP6kfFPr78hzwHEBt1OnkSma/s1600/TierTree.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAPoaNazxLyqhbINGvll-BuoiOAlKQEILrwwIzGSgMJ4cUZ7dbnCEPys7d04kHyXckcGmMcb0xyO13xPO07icL7s29zE3LUANv_dDdArQ9PRt1pSUkpOlzaP6kfFPr78hzwHEBt1OnkSma/s320/TierTree.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">so for all the video tactics which weren't restricted to Tier I-III as part of this test, we uniformly apportioned 20% of spend and cost (assuming the margin was the same) and 15% of impressions in the tested user buckets to Tier I-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNs787ArjVb_ZEdMQYIowou9APKMRvOblXNqT61X8ixQqUsQYGXG2-s89UIJgwYGF_EfSfTK-orcGL4_fHSFmGhayJZ2MQGceyZZ-uuxp8mzXxtknR9coSc5IO4MhQ4nxuZx6j0beNuyKn/s1600/non-test_restricted_bkts_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNs787ArjVb_ZEdMQYIowou9APKMRvOblXNqT61X8ixQqUsQYGXG2-s89UIJgwYGF_EfSfTK-orcGL4_fHSFmGhayJZ2MQGceyZZ-uuxp8mzXxtknR9coSc5IO4MhQ4nxuZx6j0beNuyKn/s320/non-test_restricted_bkts_data.jpg" /></a></div>
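The apportionment amounts to scaling each untested tactic's bucket totals by the tier shares quoted above; a minimal sketch (the function and field names are illustrative, not from the original analysis):

```javascript
// Attribute a Tier I-III share of an unrestricted tactic's daily totals,
// per the text: 20% of spend and cost, 15% of impressions.
// (Function and field names are mine, not from the original data.)
function apportionTierShare(row) {
  return {
    spend: 0.20 * row.spend,             // 20% of spend
    cost: 0.20 * row.cost,               // 20% of cost (same margin assumed)
    impressions: 0.15 * row.impressions  // 15% of impressions
  };
}

// e.g. a hypothetical tactic spending $100K/day at $60K cost, 2M impressions:
console.log(apportionTierShare({ spend: 100, cost: 60, impressions: 2000000 }));
```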
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcxrXa5p1JtmfR_GX4-gczCD0lfz84p6vnT0_wmk1Aj96oAeEYrOh-u27hnol9qDIdfYopFEQ4ZjDWRpsAtgfnSKk858LSKJR28yXP9k7f1GZe14Y1KGFL2vZ0vWb5waCKI-eBbG7AzxV/s1600/restricted_bkts_all_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMcxrXa5p1JtmfR_GX4-gczCD0lfz84p6vnT0_wmk1Aj96oAeEYrOh-u27hnol9qDIdfYopFEQ4ZjDWRpsAtgfnSKk858LSKJR28yXP9k7f1GZe14Y1KGFL2vZ0vWb5waCKI-eBbG7AzxV/s320/restricted_bkts_all_data.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7MdUBSrdQSRjGxSKHP5aiPY3MRDzGd0n1HqP3SWDtn0N0yKl6XCkJGJcopxFsRH6dvWQje5WfnyaYQsqmkJAuysiZDhQdEGgkAgVwJUfxhXn4lTIALkciP1_KcM-4E2BsbecZAiad4RAo/s1600/CPM_elasticity_cost.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7MdUBSrdQSRjGxSKHP5aiPY3MRDzGd0n1HqP3SWDtn0N0yKl6XCkJGJcopxFsRH6dvWQje5WfnyaYQsqmkJAuysiZDhQdEGgkAgVwJUfxhXn4lTIALkciP1_KcM-4E2BsbecZAiad4RAo/s320/CPM_elasticity_cost.jpg" /></a></div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
Why a power law? Because the elasticity of price (CPM) with respect to demand (total cost) is the ratio of their logarithmic derivatives, which for a power law is just the constant exponent. </span></span></h1>
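In symbols, with A the fitted coefficient and &epsilon; the fitted exponent (the calculator's CPM10K and exponent fields):

```latex
\mathrm{CPM}(c) = A\,c^{\epsilon}
\quad\Longrightarrow\quad
\frac{d\ln \mathrm{CPM}}{d\ln c}
  = \frac{c}{\mathrm{CPM}}\,\frac{d\,\mathrm{CPM}}{dc}
  = \epsilon \quad\text{(constant)},
```

so fitting log(CPM) against log(daily cost) with a straight line gives the elasticity directly as the slope.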
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
dailySpend ($1K) = 1000 * annualSpend ($Million) / 365
</span></span></h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
</form>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current Tier I-III Inventory Cost fraction (%) = <input name="costFraction" value="20" /></label><br />
<label>Enter Fraction of Video spend in Tier restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in Tier I-III Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
</form>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">
Cost Based CPM from fit to elasticity curve evaluated at actualDailyCost <br />
Revenue CPM = CostCPM/(1-margin)
</span></span></h1>
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in TierI-III = <input name="CPM10K" value="3.7" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.15" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
Effect of Geo, Site, Device and DealID restrictions</h1>
<div>
<h1>
<span style="font-size: small;"><span style="font-weight: normal;">DealID restrictions were in effect for all 13 line_items, so their effect cannot be measured.</span></span></h1>
</div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;">All restrictions data:</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghudvDX3xDr_FL-lCo6a_2aRWpViHI58zw-ubElwxP6vay86CKGwtSjHYZDgQp7t9W5PyxApMG5QCEJTgBA62hEF7SrW4WDPv8lRqlWynl7v2lMp480cKJE3Z7Mve4oHx2-xRbWUGUCpQm/s1600/restrictions_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghudvDX3xDr_FL-lCo6a_2aRWpViHI58zw-ubElwxP6vay86CKGwtSjHYZDgQp7t9W5PyxApMG5QCEJTgBA62hEF7SrW4WDPv8lRqlWynl7v2lMp480cKJE3Z7Mve4oHx2-xRbWUGUCpQm/s320/restrictions_data.jpg" /></a></div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;">As one can see from the following plots, none of the other three restrictions, which were variously present or absent in the tested line_items, shows any significant effect on either the coefficient (CPM at $10K, a multiplicative effect) or the elasticity (the exponent, an additive effect).</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVblws7cr_u_7OkK5WtfXfKg5IqbpSIl5LLgAWYl7MC1HJcl83sZPKGORpIGMmcpIyE022BTZZQU5OOU0yPb8vjBs8CbnNMeXqim62eTeBT-yjIiZEJSP-6fMpzaT4FKAJ-SCLqOw_8W3T/s1600/geoPlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVblws7cr_u_7OkK5WtfXfKg5IqbpSIl5LLgAWYl7MC1HJcl83sZPKGORpIGMmcpIyE022BTZZQU5OOU0yPb8vjBs8CbnNMeXqim62eTeBT-yjIiZEJSP-6fMpzaT4FKAJ-SCLqOw_8W3T/s320/geoPlot.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYuDh6W-WDo2rTz39b5DsG005ELY1HehYCFSNfRDvMZNG0uekrW1h1735_IeEglUvpGwpEuPCIAT7avHpy_QRY-EAWUrImsYjLzPT8X8MhyphenhypheniBjFaZjZKmOQbU-IyxJnw_8yO-x7oEtHXT/s1600/SitePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYuDh6W-WDo2rTz39b5DsG005ELY1HehYCFSNfRDvMZNG0uekrW1h1735_IeEglUvpGwpEuPCIAT7avHpy_QRY-EAWUrImsYjLzPT8X8MhyphenhypheniBjFaZjZKmOQbU-IyxJnw_8yO-x7oEtHXT/s320/SitePlot.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn77qf87Oi7rGJzJ3OUHpdKly5M46k3Gd_hbeLaboyB9mG4QRm-LULPIhrCEdLlJe0BBgfigNqFzNpOEByFKw3fBglhyThtUYh3p58eXX86O5mpO5s8ppt-w2wi2AVm4MQKUh7pxPtHsUe/s1600/DevicePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn77qf87Oi7rGJzJ3OUHpdKly5M46k3Gd_hbeLaboyB9mG4QRm-LULPIhrCEdLlJe0BBgfigNqFzNpOEByFKw3fBglhyThtUYh3p58eXX86O5mpO5s8ppt-w2wi2AVm4MQKUh7pxPtHsUe/s320/DevicePlot.jpg" /></a></div>
<div>
<span style="font-size: small;"><span style="font-weight: normal;"><br /></span></span></div>
</form>
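Putting the four form handlers together: a standalone sketch of the same arithmetic (no DOM needed), run with the page's default inputs. The function name cpmPipeline is mine; the formulas mirror the handlers above.

```javascript
// Minimal standalone sketch of the calculator pipeline above.
function round(value, decimals) {
  return Number(Math.round(value + 'e' + decimals) + 'e-' + decimals);
}

function cpmPipeline(annualSpendM, marginPct, costFracPct, spendFracPct, cpm10K, exponent) {
  var dailySpend = 1000 * annualSpendM / 365;                  // $1K/day
  var dailyCost = (1 - marginPct / 100) * dailySpend;          // after margin
  var actualDailyCost =
    (spendFracPct / 100 + (1 - spendFracPct / 100) * costFracPct / 100) * dailyCost;
  var raw = cpm10K * Math.pow(actualDailyCost / 10, exponent); // power-law fit
  return {
    dailySpend: Math.round(dailySpend),
    dailyCost: Math.round(dailyCost),
    actualDailyCost: Math.round(actualDailyCost),
    costCPM: round(raw, 2),
    revenueCPM: round(raw / (1 - marginPct / 100), 2)
  };
}

// Page defaults: $80M annual spend, 40% margin, 20% cost fraction,
// 100% spend fraction, $3.70 CPM at $10K daily cost, elasticity 0.15.
console.log(cpmPipeline(80, 40, 20, 100, 3.7, 0.15));
// → { dailySpend: 219, dailyCost: 132, actualDailyCost: 132,
//     costCPM: 5.45, revenueCPM: 9.08 }
```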
</html>
ClickMePushYou (2015-06-03)
<!DOCTYPE html>
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.2/jquery.min.js"></script>
<script>
$(document).ready(function(){
$("p").click(function(){
$(this).next().hide();
});
});
</script>
</head>
<body>
<h2>The Puzzle</h2>
<p><i>If you click on </i>me...</p>
<p><tt><em>I</em> will disappear.</tt></p>
<p><b>Try clicking <em>me</em> away!</b></p>
<h3>No, dooon't!</h3>
<h2>How did you do?</h2>
<p>Your score = 6 - (number of clicks + page_refreshes)</p>
</body>
</html>
Player size restricted Video CPM as a function of Demand (2015-06-02)
<title>Player size restricted Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on PlayerSize > 400X300 Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on PlayerSize > 400X300 Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on PlayerSize > 400X300 Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">The biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Player size > 400 X 300, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively.
For each of these budget points we have the spend, cost and impressions in the player size inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsnKNeo8WFVRIV8ueVAfrFlasoCgTSyyRcFP8BqsLMvPKSemqDsKHa1hqYwVOD03aQZTy75BUoTlFNTtsRr799JLpyLEqxCYpOfy1tOumOVhTiaz2kGi03CFcZUqCxh5olL5fHgjkxS7VK/s1600/player_restricted.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsnKNeo8WFVRIV8ueVAfrFlasoCgTSyyRcFP8BqsLMvPKSemqDsKHa1hqYwVOD03aQZTy75BUoTlFNTtsRr799JLpyLEqxCYpOfy1tOumOVhTiaz2kGi03CFcZUqCxh5olL5fHgjkxS7VK/s320/player_restricted.jpg" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">We are clearly having trouble delivering at scale at the higher budget densities - we are delivering only about 35%. For example, to hit the intended budget-density points, every tactic in a line item (targeting different numbers of user buckets) should have had the same cost, yet on 5/19/15 the cost in "2 buckets" is only 129 instead of 379. Hence, while the goal was to explore a range up to 12X the current total video revenue, in practice we've only been able to get to about 4X (12X × 129/379 ≈ 4X). </span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
From querying our database</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZQbQJ05YwGrdJsfYalRx09-Q07UJr4oX3Y9mo1ur0zWm7Y9nKEu6lOJO-HPcEL5EaOWa9HrIIGKFL0Vqej4RH_HMEnYjYJrSizYPTFkJfrKFyTwUkJhRq492V0wVv0wi1pRVC-Psw3pnI/s1600/ad_width_fraction.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZQbQJ05YwGrdJsfYalRx09-Q07UJr4oX3Y9mo1ur0zWm7Y9nKEu6lOJO-HPcEL5EaOWa9HrIIGKFL0Vqej4RH_HMEnYjYJrSizYPTFkJfrKFyTwUkJhRq492V0wVv0wi1pRVC-Psw3pnI/s320/ad_width_fraction.jpg" width="163" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">we know that about 35% of all our impressions and 35% of video cost are for player size > 400X300, so for all the video tactics which weren't restricted by player size as part of this test, we uniformly apportioned 35% of spend and cost (assuming the margin was the same) and 35% of impressions in the tested user buckets to player-size-restricted inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdQsXf25P2kn0DwwEKDGx_oDDhFmnTb-pVQzA_2HN8XqxfTKn-6zY8T8GRsMuGbqt5Fy-ubwG9Ro1PMIyMfHEEdgUZLza73lPFC3VQfz9m6LAzGWKATjd8DFlb8RSSchSJME8dqte_JkVi/s1600/non-test_tactics_bkts.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="117" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdQsXf25P2kn0DwwEKDGx_oDDhFmnTb-pVQzA_2HN8XqxfTKn-6zY8T8GRsMuGbqt5Fy-ubwG9Ro1PMIyMfHEEdgUZLza73lPFC3VQfz9m6LAzGWKATjd8DFlb8RSSchSJME8dqte_JkVi/s320/non-test_tactics_bkts.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhykJyFpd9YH7by_HFtPMH_zX33zzSiEQIRcigTFgYJ-GLgt_NAEzOUpkGpcV6KUjNCv1qwGgWx33EKpaaICg-tIvJC0j5o53zIlM1LcbUdTxZBbgYbXgmQdpvdR73Zd5P7CuPGmBYGcDqm/s1600/player_restricted_all.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="93" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhykJyFpd9YH7by_HFtPMH_zX33zzSiEQIRcigTFgYJ-GLgt_NAEzOUpkGpcV6KUjNCv1qwGgWx33EKpaaICg-tIvJC0j5o53zIlM1LcbUdTxZBbgYbXgmQdpvdR73Zd5P7CuPGmBYGcDqm/s320/player_restricted_all.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VXVSsNW2Fobu7KQbS2mkH2uRKUWns0hot5QrCj4nkccdD1K-grQHqODSOmss4bTISFFTqe61jxBJc28S2Acv2axp8cryY7jCFCFs2oFtU1RO-AAU-pNuxrJ5NE615Q0XL9iYJfnb8D_3/s1600/plot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VXVSsNW2Fobu7KQbS2mkH2uRKUWns0hot5QrCj4nkccdD1K-grQHqODSOmss4bTISFFTqe61jxBJc28S2Acv2axp8cryY7jCFCFs2oFtU1RO-AAU-pNuxrJ5NE615Q0XL9iYJfnb8D_3/s320/plot.jpg" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Why a power law? Because the elasticity of price (CPM) with respect to demand (total cost) is the ratio of their logarithmic derivatives, which for a power law is just the constant exponent. </span></span></h1>
<h1>
The Calculator</h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current PlayerSize > 400X300 Inventory Cost fraction (%) = <input name="costFraction" value="35" /></label><br />
<label>Enter Fraction of Video spend in PlayerSize > 400X300 restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in PlayerSize > 400X300 Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Cost Based CPM from fit to elasticity curve evaluated at actualDailyCost <br />
Revenue CPM = CostCPM/(1-margin)
</span></span></h1>
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in PlayerSize > 400X300 = <input name="CPM10K" value="4.4" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.44" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
</h1>
</form>
Annual Video spend to Cost CPM calculator (2015-04-27)
<title>Video CPM as a function of Demand</title>
<script type="text/javascript">
function round(value, decimals) {
return Number(Math.round(value+'e'+decimals)+'e-'+decimals);
}
function dailySpend() {
var annualSpend = document.calcForm1.annualSpend.value;
var dailySpendVar = '';
dailySpendVar = Math.round(1000*annualSpend/365.);
document.calcForm1.dailySpendVar.value = dailySpendVar;
console.log("Annual Spend ($Million) = " + annualSpend);
console.log("Daily Spend ($1K) = " + dailySpendVar);
// debugger;
return false;
}
function dailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var dailyCostVar = '';
dailyCostVar = Math.round((1-margin/100.)*(1000*annualSpend/365.)) ;
document.calcForm2.dailyCostVar.value = dailyCostVar;
console.log("Margin (%) = " + margin);
console.log("Daily Cost ($1K) = " + dailyCostVar);
return false;
}
function actualDailyCost() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var dailyCost = '';
var actualDailyCostVar = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCostVar = Math.round((spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost) ;
document.calcForm3.actualDailyCostVar.value = actualDailyCostVar;
console.log("Current Cost Fraction on Tiered Inventory (%) = " + costFraction);
console.log("Expected Spend Fraction on Tiered Inventory (%) = " + spendFraction);
console.log("Actual Daily Cost on Tiered Inventory ($1K) = " + actualDailyCostVar);
return false;
}
function CPM() {
var annualSpend = document.calcForm1.annualSpend.value;
var margin = document.calcForm2.margin.value;
var costFraction = document.calcForm3.costFraction.value;
var spendFraction = document.calcForm3.spendFraction.value;
var CPM10K = document.calcForm4.CPM10K.value;
var exponent = document.calcForm4.exponent.value;
var dailyCost = '';
var actualDailyCost = '';
var costCPM = '';
var revenueCPM = '';
dailyCost = (1-margin/100.)*(1000*annualSpend/365.) ;
actualDailyCost = (spendFraction/100 + (1-spendFraction/100)*costFraction/100)*dailyCost ;
costCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent), 2)
revenueCPM = round(CPM10K * Math.pow(actualDailyCost/10.,exponent)/(1-margin/100.), 2)
document.calcForm4.costCPM.value = costCPM;
document.calcForm4.revenueCPM.value = revenueCPM;
console.log("CPM at $10K daily cost = " + CPM10K);
console.log("elasticity = " + exponent);
console.log("Cost Based CPM = " + costCPM);
console.log("Revenue CPM = " + revenueCPM);
return false;
}
</script><br />
<h1>
Experiment description and Analysis</h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">The 13 biggest client tactics, representing about 9% of total video revenue, were duplicated with all prior restrictions (geo, site, device, etc.), additionally restricted to Tier I-III inventory, and budget-density tested at 1X, 2X, 6X and 12X daily budget on 25, 12, 4 and 2 user buckets respectively.
For each of these budget points we have the spend, cost and impressions in the TierI-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YCaVhjLQ5PoLRF9hoXIPSFPjbkbEoPlx4vTsMgnjfX2kxZagq5r1wpyqblFvKLm66HUPgxkdvNSR8411YhkbZIwD-rnpNlnUJSeRKK-BOrluy8FvuGJYE_TZ88q53o64ih9Dm3atbqri/s1600/TierI-III_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1YCaVhjLQ5PoLRF9hoXIPSFPjbkbEoPlx4vTsMgnjfX2kxZagq5r1wpyqblFvKLm66HUPgxkdvNSR8411YhkbZIwD-rnpNlnUJSeRKK-BOrluy8FvuGJYE_TZ88q53o64ih9Dm3atbqri/s1600/TierI-III_data.jpg" height="196" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
From other sources (Atul's presentation) we know that about 15% of all our impressions (50% of all avails) are in Tier I-III, so for all the video tactics which weren't restricted to Tier I-III as part of this test, we uniformly apportioned 20% of spend and cost (assuming the margin was the same) and 15% of impressions in the tested user buckets to Tier I-III inventory.</span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqSfbFwtsogtlDnQhII38NySDamlenjeTjRzpExEWoREQHw6hIZnXnDN4U5MMuFLM8b9AXjzCqCvETF2_e8guF12GI9WqQfwPim2EednMYbNHiujN54Ru8GqZMCNJo8Mf3Iqtmu2JbokMZ/s1600/non-test_restricted_bkts_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqSfbFwtsogtlDnQhII38NySDamlenjeTjRzpExEWoREQHw6hIZnXnDN4U5MMuFLM8b9AXjzCqCvETF2_e8guF12GI9WqQfwPim2EednMYbNHiujN54Ru8GqZMCNJo8Mf3Iqtmu2JbokMZ/s1600/non-test_restricted_bkts_data.jpg" height="158" width="400" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
The above spend, cost and impressions were added to those for the tested tactics, </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuNzcX6z1J0lfAeMHpVJ6RpQYaSFRaYSnPQ67x40Kil-LXzLk9OSQ4xACu0npNh2VPB7eR8-RZCaID79yAfbLdhmtC1SgB9SxgmvSBaP81PlrqOlkRGcCxJA5dTyr2BcbwSu88coTQkyCc/s1600/restricted_bkts_all_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuNzcX6z1J0lfAeMHpVJ6RpQYaSFRaYSnPQ67x40Kil-LXzLk9OSQ4xACu0npNh2VPB7eR8-RZCaID79yAfbLdhmtC1SgB9SxgmvSBaP81PlrqOlkRGcCxJA5dTyr2BcbwSu88coTQkyCc/s1600/restricted_bkts_all_data.jpg" height="148" width="640" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">and CPM vs daily cost was plotted and fitted to a power law. </span></span></h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh2QUhxaYjzQjMaKKcSjYAtgPjFSy7Lr4_4MTpdkkCZ2MgEq2IIIyboVswZc_lVEX8UJM1SGLVaQu_MNvYHjvom0JY1N1N-lpU9iKZY7WXiCPbVgJKFcCgVB9rW9pRUx9LCv5CaIz3-4h6/s1600/CPM_elasticity_cost.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhh2QUhxaYjzQjMaKKcSjYAtgPjFSy7Lr4_4MTpdkkCZ2MgEq2IIIyboVswZc_lVEX8UJM1SGLVaQu_MNvYHjvom0JY1N1N-lpU9iKZY7WXiCPbVgJKFcCgVB9rW9pRUx9LCv5CaIz3-4h6/s1600/CPM_elasticity_cost.jpg" height="186" width="320" /></a></div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
Why a power law? Because cost (CPM) elasticity of demand (total cost) is the ratio of the logarithmic derivatives, which is just the exponent in a power law. </span></span></h1>
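This can be checked numerically: for a power law CPM(c) = CPM10K * (c/10)^k, the ratio of logarithmic derivatives d log(CPM)/d log(c) is the constant exponent k, whatever the cost. A quick sketch using the page's default fit values (3.7 and 0.15):

```python
import math

def cpm(daily_cost, cpm_10k=3.7, exponent=0.15):
    """Power-law fit: CPM as a function of daily cost ($1K), normalized at $10K."""
    return cpm_10k * (daily_cost / 10.0) ** exponent

# Elasticity = d(log CPM)/d(log cost), estimated with a finite difference:
c, h = 25.0, 1e-6
elasticity = (math.log(cpm(c * (1.0 + h))) - math.log(cpm(c))) / math.log(1.0 + h)
print(round(elasticity, 4))  # 0.15 -- the exponent, independent of c
```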
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">dailySpend = annualSpend/365
</span></span></h1>
<br />
<form action="" name="calcForm1" onsubmit="return dailySpend();">
<label>Enter Annual Spend in all Video ($Million) = <input name="annualSpend" value="80" /></label><br />
<input type="submit" value="Calculate Daily Spend" /><br />
<label>Daily Spend in all Video ($1K) = <input name="dailySpendVar" value="" /></label></form>
<br />
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
dailyCost = (1-margin)*dailySpend
</span></span></h1>
<br />
<form action="" name="calcForm2" onsubmit="return dailyCost();">
<label>Enter Margin (%) = <input name="margin" value="40" /></label><br />
<input type="submit" value="Calculate Daily Cost" /><br />
<label>Daily Cost in all Video ($1K) = <input name="dailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
actualDailyCost = (spendFraction + (1-spendFraction)*costFraction) * dailyCost
</span></span></h1>
<br />
<form action="" name="calcForm3" onsubmit="return actualDailyCost();">
<label>Enter Current Tier I-III Inventory Cost fraction (%) = <input name="costFraction" value="20" /></label><br />
<label>Enter Fraction of Video spend in Tier restricted campaigns (%) = <input name="spendFraction" value="100" /></label><br />
<input type="submit" value="Calculate ..." /><br />
<label>Actual Daily Cost in Tier I-III Video ($1K) = <input name="actualDailyCostVar" value="" /></label><br />
<br /></form>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">
costCPM = CPM10K * (actualDailyCost/10)^exponent, the fit to the elasticity curve evaluated at actualDailyCost <br />
revenueCPM = costCPM/(1-margin)
</span></span></h1>
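The whole calculator chain above can be sketched in Python. This is a transcription of the page's JavaScript logic using the forms' default values; the rounding of intermediate values differs slightly from the on-page calculator:

```python
def video_cpm(annual_spend_mm=80.0, margin_pct=40.0, cost_fraction_pct=20.0,
              spend_fraction_pct=100.0, cpm_10k=3.7, exponent=0.15):
    """Chain the four form calculations into one function."""
    daily_spend = 1000.0 * annual_spend_mm / 365.0            # $1K per day
    daily_cost = (1.0 - margin_pct / 100.0) * daily_spend     # after margin
    spend_frac = spend_fraction_pct / 100.0
    cost_frac = cost_fraction_pct / 100.0
    # Spend in Tier-restricted campaigns counts fully; the rest contributes
    # only its Tier I-III cost fraction.
    actual_daily_cost = (spend_frac + (1.0 - spend_frac) * cost_frac) * daily_cost
    cost_cpm = cpm_10k * (actual_daily_cost / 10.0) ** exponent
    revenue_cpm = cost_cpm / (1.0 - margin_pct / 100.0)
    return round(cost_cpm, 2), round(revenue_cpm, 2)

print(video_cpm())  # (5.45, 9.08) with the form defaults
```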
<br />
<form action="" name="calcForm4" onsubmit="return CPM();">
<label>Enter Cost based CPM at $10K Daily cost in TierI-III = <input name="CPM10K" value="3.7" /></label><br />
<label>Enter Elasticity of CBCPM vs Cost, exponent in power law = <input name="exponent" value="0.15" /></label><br />
<input type="submit" value="Calculate CostCPM and Revenue CPM" /><br />
<label>Cost Based CPM ($) = <input name="costCPM" value="" /></label><br />
<label>Revenue CPM ($) = <input name="revenueCPM" value="" /></label><br />
<br />
<h1>
Effect of Geo, Site, Device and DealID restrictions</h1>
<div>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">DealID restrictions were in effect for all 13 line_items, so their effect cannot be measured.</span></span></h1>
</div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;">All restrictions data:</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB4S20MKX5ovFN62ijPzMwDY12dG6Q8Yx70DQfMcNJNrPRMAJ6OyFhsGA_vYBsnt2nBjXDNAsQzDaCrywxQS5A4XNtExnz_7xDl6lJegL9L7COJg4x2VvHz-V1wThdLp-C4_qX6IaInHv3/s1600/restrictions_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgB4S20MKX5ovFN62ijPzMwDY12dG6Q8Yx70DQfMcNJNrPRMAJ6OyFhsGA_vYBsnt2nBjXDNAsQzDaCrywxQS5A4XNtExnz_7xDl6lJegL9L7COJg4x2VvHz-V1wThdLp-C4_qX6IaInHv3/s1600/restrictions_data.jpg" height="140" width="640" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;">As one can see from the following plots, none of the other three restrictions, which were variously present/absent in the tested line_items, shows any significant effect on either the coefficient (CPM at $10K, multiplicative effect) or the elasticity (exponent, additive effect).</span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyg9GUzsS3yXWaxU615d4-vsxMEMwWimGs0mJljls4jOI9xaadKmsxOUMIIa4rGyBDX7XIC7ky6sMp_xZFlo3Lc6dxjvjJISjhGc14v9p1ZTPJJ2eMYQ0Q9VRNFnHEBkxQXCsGcpc-kRWj/s1600/geoPlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyg9GUzsS3yXWaxU615d4-vsxMEMwWimGs0mJljls4jOI9xaadKmsxOUMIIa4rGyBDX7XIC7ky6sMp_xZFlo3Lc6dxjvjJISjhGc14v9p1ZTPJJ2eMYQ0Q9VRNFnHEBkxQXCsGcpc-kRWj/s1600/geoPlot.jpg" height="195" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEvY0JgDm6Tj6IEyIcOzQKFBjNudvk3IU4bW87zB7L1K6Bq6kbf78K3GJ_z7Y3aw7gzD68GVMsw9_luf10cgFWLNFNzXkgUTDiAY278wJttw1NZx6vgazVsLrlxh5JZgq_YnpVfN9Ri3i7/s1600/SitePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEvY0JgDm6Tj6IEyIcOzQKFBjNudvk3IU4bW87zB7L1K6Bq6kbf78K3GJ_z7Y3aw7gzD68GVMsw9_luf10cgFWLNFNzXkgUTDiAY278wJttw1NZx6vgazVsLrlxh5JZgq_YnpVfN9Ri3i7/s1600/SitePlot.jpg" height="194" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNuAC8YH5LoJCNkXhiJYk8jdjPREflFcmLmJnYnjLfizLJCWljM8gbLnlKSrak3L_2E4pBb2uockAFhi4owb617CnfvYtR7CPHhuWB98zlJIeGlZ3wWG6XIUHC4ju3nwOL8_9MSv4PNpy9/s1600/DevicePlot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNuAC8YH5LoJCNkXhiJYk8jdjPREflFcmLmJnYnjLfizLJCWljM8gbLnlKSrak3L_2E4pBb2uockAFhi4owb617CnfvYtR7CPHhuWB98zlJIeGlZ3wWG6XIUHC4ju3nwOL8_9MSv4PNpy9/s1600/DevicePlot.jpg" height="193" width="320" /></a></div>
<div>
<span style="font-size: large;"><span style="font-weight: normal;"><br /></span></span></div>
</form>
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-25418204713158419682012-03-30T12:52:00.000-07:002012-03-30T12:52:38.378-07:00NTA7:Dimensional Sparseness Or What Tokens To Use?NTA7: Dimensional Sparseness Or What
Tokens To Use?<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Numerical Text Analysis posts</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">Word histogram and Zipf plot of corpus for identifying stop words</a><br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
After
</div>
<ul>
<li><div style="margin-bottom: 0in;">
lowercasing,
</div>
</li>
<li><div style="margin-bottom: 0in;">
depunctuating,
</div>
</li>
<li><div style="margin-bottom: 0in;">
removing single-atom (single-character) words,
</div>
</li>
<li><div style="margin-bottom: 0in;">
removing single occurrence tokens,
</div>
</li>
<li><div style="margin-bottom: 0in;">
stemming and removing a few stop
words (whether the 125 I have chosen or the ~450 in standard lists),</div>
</li>
</ul>
<div style="margin-bottom: 0in;">
we are left with about 14000 tokens
from a bit less than 100 documents.
</div>
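The bulleted pipeline above can be sketched end to end. The stop list, the toy suffix-stripping "stemmer" (a stand-in for a real one such as Porter's), and the two sample documents are all placeholders for illustration:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "are", "and", "of", "to"}  # placeholder stop list

def crude_stem(token):
    """Toy stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())   # lowercase and depunctuate
    words = [w for w in words if len(w) > 1]      # drop single-atom words
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

docs = ["The cats are sleeping.", "A cat sleeps; the dog barks!"]
corpus = Counter()
for doc in docs:
    corpus.update(tokenize(doc))
# finally, drop the single-occurrence tokens from the corpus histogram
corpus = Counter({t: n for t, n in corpus.items() if n > 1})
print(corpus)  # Counter({'cat': 2, 'sleep': 2})
```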
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is the Zipf plot:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjGVADvOSIQ5HZRKM1jQH4B-ZAx95TJ8UtPO2VzMHY3DP9ZYPrxo9PYQSgJ2kjxAe3RNoRbo32f0opjjosQoCHq6c8DWdUreViheW6_295V0YmS0cws6JTeir6wg7SPDniznvqpdK8D02/s1600/UseAll.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjGVADvOSIQ5HZRKM1jQH4B-ZAx95TJ8UtPO2VzMHY3DP9ZYPrxo9PYQSgJ2kjxAe3RNoRbo32f0opjjosQoCHq6c8DWdUreViheW6_295V0YmS0cws6JTeir6wg7SPDniznvqpdK8D02/s640/UseAll.tiff" width="640" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Before going on to the Cosine
Similarity business, let's think about word-vector space. Each token
represents a dimension, and its frequency of occurrence is the
coordinate. Each document represents a point in this space. So what
we have is a 14,000-dimensional space for 100 points. That's
ridiculous! But even if we had a million documents, that would
still be only about 70 points per dimension!
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Now of course this is going to vary
from case to case; e.g., the categorization problem I worked on had
25,000 points in 50,000 dimensions. Hardly a situation where standard
clustering algorithms, most of which are designed for denser
data (at least dimensionally speaking), can be expected to apply. My
solution to this is on SlideShare, and I'll blog about it at length
later.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
My point, as was my point for selecting
stop words, is that some [statistical or numerical] and purpose-based
criteria have to be used to decide which set of tokens to use. If you
look at the Zipf plot above, in my view there are clearly three
families of tokens:
</div>
<ul>
<li><div style="margin-bottom: 0in;">
the high ranking ones that
approximate a Zipf line with a coefficient marginally less than 1</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrY6h3gZzruf7Bj6wzfRSiYSG1AuRBGwrXRrCKzkuDEX0MzKneeEEUpGHe42_rtvpeZb9Cesjq5FOmyinEz1hLUFSeWlHI92OjH8UyL_XWKmMgfDxhrFwOCWa8SZyALJ77SjysxlUdFPLa/s1600/UseHigh.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrY6h3gZzruf7Bj6wzfRSiYSG1AuRBGwrXRrCKzkuDEX0MzKneeEEUpGHe42_rtvpeZb9Cesjq5FOmyinEz1hLUFSeWlHI92OjH8UyL_XWKmMgfDxhrFwOCWa8SZyALJ77SjysxlUdFPLa/s400/UseHigh.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
<li><div style="margin-bottom: 0in;">
the mid ranking tokens at the knee
of the plot which fall off the Zipf line</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidoAkSLDfD165Q2MDxMUps-0TZyYPKJTDoQtqseY6cXfjn1Nmlp8k9a2DJLB9sum8OkPSOb4TUDEtWdwEvk8KEZ1txM4tfUFk19gC94bmLfL9hIYz_UxpKy_ZeFiCJm7lhGnfMJVKKchVt/s1600/UseLow.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidoAkSLDfD165Q2MDxMUps-0TZyYPKJTDoQtqseY6cXfjn1Nmlp8k9a2DJLB9sum8OkPSOb4TUDEtWdwEvk8KEZ1txM4tfUFk19gC94bmLfL9hIYz_UxpKy_ZeFiCJm7lhGnfMJVKKchVt/s400/UseLow.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
<li><div style="margin-bottom: 0in;">
the low ranking tokens whose power
law deviates very strongly from 1</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaZ7Ur0bgla5Akt-2rOMjmxSBqTGvu8UY9Vfbzk7zYI2p-U-6GbXjfzRK_iQpp0E7OZVTPYDQ5SA_OwN1EP3XZ5rOl39KF-JaPquGLw0RAj3cmJkD5FrOv5xP6AaMWV6oJPk49YgBFcw50/s1600/UseMid.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaZ7Ur0bgla5Akt-2rOMjmxSBqTGvu8UY9Vfbzk7zYI2p-U-6GbXjfzRK_iQpp0E7OZVTPYDQ5SA_OwN1EP3XZ5rOl39KF-JaPquGLw0RAj3cmJkD5FrOv5xP6AaMWV6oJPk49YgBFcw50/s400/UseMid.tiff" width="400" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
</li>
</ul>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
My choices for the cutoffs are somewhat
arbitrary, but unimportant for now. Summarizing,
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBSipKLzXHa1TVQ_k5pVFHdAudMdSQwOnwqAykHgTcXRL_eMkUgKVbfT8DkueS9IQjKcNZd8I7UXl6I5QpOTAPAUa4WGn8F57AT1T-6c6fLpQgw9SUHYNZ0LcXAMvDhfeZMwoiZ1n0MqOE/s1600/UseSummary.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBSipKLzXHa1TVQ_k5pVFHdAudMdSQwOnwqAykHgTcXRL_eMkUgKVbfT8DkueS9IQjKcNZd8I7UXl6I5QpOTAPAUa4WGn8F57AT1T-6c6fLpQgw9SUHYNZ0LcXAMvDhfeZMwoiZ1n0MqOE/s640/UseSummary.tiff" width="640" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
What do these mean? Consider the
extremely rare tokens that occur only once in the corpus, and hence in
only one document. They are extremely powerful for distinguishing
one document from another: a reasonably small set of them, plus a
few tokens that occur a few times (since a few documents may
contain no singly occurring tokens), will be sufficient to label
every document uniquely. However, precisely because of their rarity,
we can't use them for finding any similarities between texts. So
these low-ranking tokens lead to a very fine graining of the set of
documents. Rejecting them has the advantage of reducing the
dimensionality of word space by an order of magnitude.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
What about the high-ranking ones? They
suffer from the converse problem: they occur so often in so many
texts that they are unlikely to be able to distinguish well between
texts at all, which is sort of the motivation for throwing out the
stop words in the first place. (Though my suspicion is that stop
words are, or should be, chosen based on the smallness of the
standard deviation of their relative frequencies.) So short of
calculating these frequency standard deviations for the tokens, which
I hope to get to soon, we should reject these stochastic tokens,
the ones that closely follow Zipf's law (or Mandelbrot's
modification). Rejecting these will <i>not</i><span style="font-style: normal;">
reduce the dimensionality of word-vector space by much, but it </span><span style="font-style: normal;"><b>will</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
reduce the radial distance of the document-points from the origin,
the </span></span><i><span style="font-weight: normal;">de facto</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
center for the Cosine Similarity.</span></span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">So
we can use only 1500 mid-ranking tokens (that is “only” 15
dimensions per point, which is better than 144 dimensions per point)
for the next steps. </span></span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">Last
point: before calculating Cosine Similarities, a standard step is to
scale all the token frequencies by the log of the document
frequency for the token. This is a flat metric on wordspace. This
will have the effect of bringing down the weight at the high-ranking
end, and since log(f) increases slower than f, not scaling down as
much the low ranking end. So the overall weights of the tokens may
show precisely the bump in the mid-range that we want. So another
point for comparison. </span></span>
</div>
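The scaling described here is essentially the standard tf-idf weighting; in its usual form, each token frequency is multiplied by log(N/df), the log of the <i>inverse</i> document frequency, which is what down-weights the high-ranking, ubiquitous tokens. A minimal sketch with placeholder documents:

```python
import math
from collections import Counter

def tfidf(doc_histograms):
    """Multiply each token frequency by log(N/df), so tokens present in
    many documents are down-weighted."""
    n_docs = len(doc_histograms)
    df = Counter()
    for hist in doc_histograms:
        df.update(hist.keys())   # document frequency of each token
    return [{t: f * math.log(n_docs / df[t]) for t, f in hist.items()}
            for hist in doc_histograms]

docs = [Counter(a=3, b=1), Counter(a=2, c=5)]
weighted = tfidf(docs)
print(weighted[0]["a"])  # 0.0 -- "a" occurs in every document, log(2/2) = 0
```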
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
So many choices, so little time.</div>
<div style="margin-bottom: 0in;">
<br /></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-8913808865781553722012-03-29T09:20:00.000-07:002012-03-29T11:28:55.179-07:00NTA6: Constructing Histograms For A Corpus<a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Constructing Histograms</a> For A Corpus : Numerical Text Analysis 6<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Numerical Text Analysis Posts </a><br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
For the corpus, or the set of documents
I am going to analyse, I've chosen the top 100 books over the last 30
days (as of 3/27/2012) from <a href="http://www.gutenberg.org/browse/scores/top">Project
Gutenberg</a>. (Incidentally, Vatsayana's <i>Kamasutra</i><span style="font-style: normal;">
is a perennial number 3, and this accords with Amartya Sen's footnote
(pg. 25, </span><i>The Argumentative Indian</i><span style="font-style: normal;">):
</span><i>In fairness to Western expertise on India, it must be
conceded that there has never been any lack of Occidental interest in
what may be called the 'carnal sciences', led by the irrepressible
</i><span style="font-style: normal;">Kamasutra</span><i> and
</i><span style="font-style: normal;">Anangaranga</span><i>.</i><span style="font-style: normal;">)
I</span> looked at the source code for the page, copied it, and did
regular expression searches for the serial numbers of the books, then
wrote Python code to suck them from the website, keeping my fingers
crossed that they wouldn't think I = Robot and block my IP address.
Some of the .txt files turned out to be empty. I chopped off the
beginning (Project Gutenberg info.) and the last bit (license info.)
for each document.</div>
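The chopping step can be sketched as follows; the `*** START OF ... ***` / `*** END OF ... ***` marker patterns are an assumption about how the Project Gutenberg .txt files delimit their boilerplate, so they may need adjusting for individual books:

```python
import re

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the START and END markers, if present.
    The marker patterns are assumptions; adjust them per file."""
    start = re.search(r"\*\*\* ?START OF .*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF .*?\*\*\*", text)
    lo = start.end() if start else 0
    hi = end.start() if end else len(text)
    return text[lo:hi].strip()

sample = ("Project Gutenberg header...\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "Call me Ishmael.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "License info...")
print(strip_gutenberg_boilerplate(sample))  # Call me Ishmael.
```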
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
As a reminder, the steps for processing
text are:</div>
<ol>
<li><div style="margin-bottom: 0in;">
Lowercase, depunctuate and stem
the words in the text of each document.</div>
</li>
<li><div style="margin-bottom: 0in;">
Construct a word
histogram for each document, for all the tokens. For example, for eBook3600:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TtO3eiy_t1ihAWTaBuTKo1ypN_97B_HQ9WEIF_A-oCE3t4K9ropO_qW7r9x5V6UHio-ojI2c-SmhGtP5J_crl_kX0U7kjJUKDu9zkiDpNN5r6uiKvscdUK7R59fHT0x3TT-DfOk8O05v/s1600/_eBook3600StemZipf.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8TtO3eiy_t1ihAWTaBuTKo1ypN_97B_HQ9WEIF_A-oCE3t4K9ropO_qW7r9x5V6UHio-ojI2c-SmhGtP5J_crl_kX0U7kjJUKDu9zkiDpNN5r6uiKvscdUK7R59fHT0x3TT-DfOk8O05v/s1600/_eBook3600StemZipf.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf plot for eBook3600</td></tr>
</tbody></table>
</div>
</li>
<li><div style="margin-bottom: 0in;">
Construct the histogram for the
corpus: for each token, add the occurrence frequencies in all the
documents in the corpus.</div>
</li>
<li><div style="margin-bottom: 0in;">
Identify the rare tokens – those
that occur a total of only one time in the entire corpus. Why remove
any low-frequency tokens at all? There are a large number of
non-standard-English letters (possibly some of the documents are in
languages other than English or there may be a few words from other
languages or simple mitsakes) and non-ASCII and non-UTF-8
punctuation marks (for example Sanskrit diacritical notation from
the book ranked number 3 above) that my algorithm doesn't pick up.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijQ0vBvgdIDccwVQicMyW5MqeYb5Q8SOT1HtLC_hpK9k9-pcjX4g-Pnx0ozG3BbYZPP27I2VbSOdyARfTL77lzjUFBrH-v1D-SIHwVjrEJydxzQFoRL_4kXOKxaEFqpXmbt40tciS7Xh5o/s1600/_CorpusBot20_1.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijQ0vBvgdIDccwVQicMyW5MqeYb5Q8SOT1HtLC_hpK9k9-pcjX4g-Pnx0ozG3BbYZPP27I2VbSOdyARfTL77lzjUFBrH-v1D-SIHwVjrEJydxzQFoRL_4kXOKxaEFqpXmbt40tciS7Xh5o/s1600/_CorpusBot20_1.tiff" /></a></div>
</div>
<div style="margin-bottom: 0in;">
I thought that restricting the allowed
atoms to lowercase unaccented Latin letters would rule out too many
accented letters, and played with removing small frequency words. It
turns out that even for a 100 documents, ruling out the unipresent
tokens is enough since the remaining bipresent tokens seem quite
reasonable.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYmcUgHLmEO2ZuuD-FZWGXppb9nEQBFBMleZZ1aIUqmXX2K4ue_D491tpwlswF9OC5O8F7OdGSXoOULtcBfxKdKPZ75fXPGR5k_AU5ibQVgU0KPhHQqEGoUmlO8uvbj13MEf3hXCwzRw2N/s1600/_Corpus2Bott20_1.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYmcUgHLmEO2ZuuD-FZWGXppb9nEQBFBMleZZ1aIUqmXX2K4ue_D491tpwlswF9OC5O8F7OdGSXoOULtcBfxKdKPZ75fXPGR5k_AU5ibQVgU0KPhHQqEGoUmlO8uvbj13MEf3hXCwzRw2N/s1600/_Corpus2Bott20_1.tiff" /></a></div>
</div>
</li>
<li><div style="margin-bottom: 0in;">
Identify the small tokens, that
is, the single character ones except “a” and “i”, and put
both the small and the rare tokens in a list of raw words to be
removed. The few small tokens can be removed from all the documents
in the corpus; but since the rarity of a token can change as
documents are added to the corpus, the rare tokens should be left in
place, even though the number of rare tokens will increase almost
linearly with the number of words in the corpus, whereas the number
of higher-frequency words will increase much more slowly.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFQf-sIiDLUtZ9B4j53rc50l7lY3ANl1xOu75KI38a8Hw-xhNzTKcEHcj9IHxDlteBUaFstW8WaTkTFtTrrWDGTluTn61a3elsdyhXsW1C2vS6OHqHe26eXUhcKyLrPU8gDSUqXNWKl_lx/s1600/_CorpusSumm.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFQf-sIiDLUtZ9B4j53rc50l7lY3ANl1xOu75KI38a8Hw-xhNzTKcEHcj9IHxDlteBUaFstW8WaTkTFtTrrWDGTluTn61a3elsdyhXsW1C2vS6OHqHe26eXUhcKyLrPU8gDSUqXNWKl_lx/s1600/_CorpusSumm.tiff" /></a></div>
</div>
<div style="margin-bottom: 0in;">
From which you can see<br />
<span style="font-style: normal;"></span><br />
<ul>
<li>The number of unipresent tokens is
about half of <i>all</i><span style="font-style: normal;"> tokens,
but they represent only a marginal fraction of the words.</span></li>
<li><span style="font-style: normal;">There
are a surprisingly large number (132) of unidentified atoms, which
on average occur multiple times.</span></li>
<li><span style="font-style: normal;">If
you think that English is a gender-neutral language, <a href="http://pol-imer.blogspot.com/2012/03/is-english-gender-neutral.html" target="_blank">think again</a>.</span></li>
</ul>
</div>
</li>
</ol>
<ol start="6">
<li><div style="margin-bottom: 0in;">
Zipf plot of the entire corpus,<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs97sT06A1LPiDrI8zWRSxEstVSfPlQ4qZoOKtwvtADGkSPO4U-1XZGUgwkIPSnN-GSerUIQo0DDWaVjEKBOn9wqbB3FhuYWZauyu13S_1SzmdLq_tCxglnw0o-pszXrmBkZb_TV87HqYX/s1600/_CorpusZipf.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs97sT06A1LPiDrI8zWRSxEstVSfPlQ4qZoOKtwvtADGkSPO4U-1XZGUgwkIPSnN-GSerUIQo0DDWaVjEKBOn9wqbB3FhuYWZauyu13S_1SzmdLq_tCxglnw0o-pszXrmBkZb_TV87HqYX/s1600/_CorpusZipf.jpeg" /></a></div>
</div>
<div style="margin-bottom: 0in;">
which should be <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">used to determine</a> the
minimum rank to exclude “stop” words (deviations from flatness
at the high-frequency end). Let's take a look at the high-frequency
end in more detail:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ALKtHZTCRfPH1tctYiwJZumw-WWAkkIuZZdqvgO2jcz_7Fv3CsFqO55pYJNk3HHwIVLAfqtVL_O4M_Ug_o_GooM9MIW1-2FlFgz64XvniG89HhOVv-6ZVeE-l61hLl7D2uzfEe_FMh0E/s1600/_CorpusZipHigh.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="427" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1ALKtHZTCRfPH1tctYiwJZumw-WWAkkIuZZdqvgO2jcz_7Fv3CsFqO55pYJNk3HHwIVLAfqtVL_O4M_Ug_o_GooM9MIW1-2FlFgz64XvniG89HhOVv-6ZVeE-l61hLl7D2uzfEe_FMh0E/s640/_CorpusZipHigh.tiff" width="640" /></a></div>
</div>
<div style="margin-bottom: 0in;">
from which I can see choosing a stop
rank anywhere between 50 and 150, but I don't see at all why the
stop rank should be between 450 and 500 words, which is what
standard stop-word lists consist of. Ignorance => opportunity to
learn. However, for now I am going to go with the list of the
top-ranked 103 as the stopword list, which reverentially saves "god" and "himself", which come in at #124 and #125.</div>
</li>
<li><div style="margin-bottom: 0in;">
Add the set of stop words to the
set of words to be removed. The complement of this subset in the set
of all tokens in the corpus is the set of tokens to be used for the
statistical calculations. The truncations of the histograms to this
“use-set” of words are what we will use for studying Cosine
Similarity, its alternatives, and what should really be done with
the histograms.
</div>
</li>
</ol>
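Steps 3 and 4 above, summing the per-document histograms into a corpus histogram and flagging the unipresent tokens, can be sketched as follows (the two toy documents are placeholders):

```python
from collections import Counter

def corpus_histogram(doc_histograms):
    """Step 3: add each token's occurrence frequencies across all documents."""
    total = Counter()
    for hist in doc_histograms:
        total.update(hist)
    return total

def unipresent_tokens(corpus):
    """Step 4: tokens occurring exactly once in the entire corpus."""
    return {t for t, n in corpus.items() if n == 1}

docs = [Counter(whale=4, sea=2), Counter(sea=1, harpoon=1)]
corpus = corpus_histogram(docs)
print(unipresent_tokens(corpus))  # {'harpoon'}
```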
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
So in the next post, I'll lay out my
arguments comparing the ubiquitous Cosine Similarity to some simple
trigonometric alternatives, with some simple illustrative examples.
The post after that will consider document similarity within the
above corpus using both Cosine and my proposed alternative.</div>
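For reference before that comparison: the ubiquitous Cosine Similarity treats each word histogram as a vector and takes the inner product divided by the product of the norms. A minimal sketch:

```python
import math
from collections import Counter

def cosine_similarity(h1, h2):
    """cos(theta) between two word histograms viewed as vectors."""
    dot = sum(h1[t] * h2[t] for t in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(f * f for f in h1.values()))
    norm2 = math.sqrt(sum(f * f for f in h2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = Counter("to be or not to be".split())
b = Counter("to be".split())
print(round(cosine_similarity(a, b), 4))  # 0.8944
```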
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Next what I'll propose is to go beyond
this approach to similarity, to really situate ourselves well in
word-vector space and see what really good measures we can construct
there, and of course, following all that theory, apply those ideas to
the documents in the corpus. Yum-yum!</div>
<div style="margin-bottom: 0in;">
<br /></div>
<br />ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-46232750399187733242012-03-27T00:47:00.000-07:002012-03-30T12:55:29.390-07:00Guide to Posts on text AnalysisThe posts make most logical sense in the following order:<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/sampletextforanalysis.html" target="_blank">NTA0: Sample Text for Analysis</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">NTA1: Preliminaries</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">NTA2: Text Analysis Illustrative Example</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">NTA3: Selecting Stop Tokens</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">NTA4: Word Histograms</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/wordvectorspaceplay.html" target="_blank">NTA5: WordPlay</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">NTA6: Word Histogram for the Corpus</a><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/nta7dimensional-sparseness-or-what.html" target="_blank">NTA7: What tokens to use, dimensional Sparseness</a><br />
<br />ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com2tag:blogger.com,1999:blog-6519535535434149951.post-57057327701125882772012-03-26T13:05:00.002-07:002012-03-29T11:28:40.360-07:00NTA5: WordPlayNumerical Text Analysis 5<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to All Text Analysis Posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Constructing Word Histograms</a><br />
<br />
This is not serious stuff, so indulge
me here. I am just playing with a cute idea which probably has zipf
practical applications.<br />
Read this after reading about the <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">construction of word histogram space</a>. <br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Since I'll be analysing text, and
writing text to do so, let me distinguish text to be analysed by
putting it on a gray background. For example, the result of the
action of the lower-casing operator on some text can be shown as</div>
<div style="margin-bottom: 0in;">
<b>low</b><span style="font-weight: normal;">:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">ThiS
is tHe TeXt fOR makiNG LowER case</span></div>
<div style="margin-bottom: 0in;">
=</div>
<div style="background: #808080; margin-bottom: 0in;">
this is the text
for making lower case</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
Before
proceeding I just want to point out that it is not really the <i>text</i><span style="font-style: normal;">
I am interested in so much as the histogram of the text. So </span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">LOWER
case </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">and
</span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">case
LOWER </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">are
the same as vectors or word-histograms, as are:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">text
text text</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">=</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;">3
*</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;">text
</span>
</div>
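The order-insensitivity and scalar behaviour described above can be sketched with Python's `collections.Counter` as a minimal stand-in for the word-histogram map (the `histogram` helper is mine, not code from the post):

```python
from collections import Counter

def histogram(text):
    """Word histogram: lower-case, split on whitespace, count tokens.
    Word order is discarded, so only the counts survive."""
    return Counter(text.lower().split())

# 'LOWER case' and 'case LOWER' have identical histograms:
assert histogram("LOWER case") == histogram("case LOWER")

# and repetition acts like scalar multiplication of the 'text' component:
assert histogram("text text text")["text"] == 3 * histogram("text")["text"]
```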
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><i>Eigen Wha?</i>: Just to refresh our memories, a linear operator on a vector space acts on any vector and gives you back another vector in the space. If the result of the action of the operator on a specific vector is to return that same vector multiplied by a scalar, that vector is called an <b>eigenvector</b> of the operator and the corresponding scalar is an <b>eigenvalue</b> of the operator. The determinant of the operator is the product of its eigenvalues. The eigenvalues are solutions to det(Operator - λ Identity) = 0.</span><br />
<br />
<span style="font-style: normal;">An</span>
eigenvector of the <b>low</b><span style="font-weight: normal;">
operator with eigenvalue 1 is</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">,
since</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<b>low</b><span style="font-weight: normal;">:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=
1*</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">upper</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">Let's
just look at the </span><b>low</b><span style="font-weight: normal;">
operator in a bit more detail. If I restrict my word space to only
'a', both lower and upper case, then it is two dimensional. Let me
represent text in this space by the ordered pair of the number of
occurrences of 'a' and 'A', (</span><i><span style="font-weight: normal;">x,
y</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
So the text </span></span>
</div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">a
A A a a </span></span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">is
represented by (</span></span><i><span style="font-weight: normal;">3,2</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
Now the action of the lowering operator is </span></span><span style="font-style: normal;"><b>low</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">x,y</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= (</span></span><i><span style="font-weight: normal;">x+y,0</span></i><span style="font-style: normal;"><span style="font-weight: normal;">),
and the eigenvector with eigenvalue 1 is (</span></span><i><span style="font-weight: normal;">1,0</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
However, if you calculate the determinant of </span></span><span style="font-style: normal;"><b>low</b></span><span style="font-style: normal;"><span style="font-weight: normal;">,
it turns out to be 0! So we know that it has an eigenvector with
eigenvalue 0. In the above representation, this is (</span></span><i><span style="font-weight: normal;">1,-1</span></i><span style="font-style: normal;"><span style="font-weight: normal;">).
But what would '-1' occurrences mean? We </span></span><i><span style="font-weight: normal;">know</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
that a word can't occur -1 times in a text! </span></span>
</div>
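A minimal sketch of this calculation, with a hypothetical helper `low` standing in for the lowering operator on the two-dimensional ('a', 'A') space:

```python
def low(v):
    """The lowering operator on (x, y) = (count of 'a', count of 'A'):
    low(x, y) = (x + y, 0)."""
    x, y = v
    return (x + y, 0)

# 'a A A a a' is (3, 2) and lowers to 'a a a a a' = (5, 0):
assert low((3, 2)) == (5, 0)

# (1, 0) is an eigenvector with eigenvalue 1 ...
assert low((1, 0)) == (1, 0)

# ... and since det(low) = 1*0 - 1*0 = 0, there is an eigenvector with
# eigenvalue 0, namely (1, -1):
assert low((1, -1)) == (0, 0)
```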
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">Or,
do we?</span></span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-style: normal;"><span style="font-weight: normal;">My
</span></span><span style="color: white;"><span style="font-style: normal;"><span style="font-weight: normal;">word</span></span></span><span style="font-style: normal;"><span style="font-weight: normal;">
but it can </span></span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">So
what are the rules? Well there is only one:</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">word
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">+</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="color: white;"><span style="font-weight: normal;">word</span></span><span style="font-weight: normal;">
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">word
</span><span style="color: white;"><span style="font-weight: normal;">word</span></span><span style="font-weight: normal;">
</span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">=</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">So
the 0-eigenvalue eigenvector of </span><b>low</b><span style="font-weight: normal;">
is the following text!</span></div>
<div style="background: #808080; margin-bottom: 0in;">
<span style="font-weight: normal;">a
</span><span style="color: white;"><span style="font-weight: normal;">A</span></span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">When we <i>read</i> text like this, we are only sensitive to the magnitude. </span></div>
<div style="background: none repeat scroll 0% 0% transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: none repeat scroll 0% 0% transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">What
we've accomplished is that we have extended word histogram space
(linear space but only in the positive 2</span><sup><span style="font-weight: normal;">n</span></sup><span style="font-weight: normal;">-ant)
to be a genuine, full-fledged, authentic VECTOR space! </span>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">This
raises the possibility –which, since I am a self-acknowledged
ignoramus in this field, I am sure has already been explored– of
using vector space and operator analysis to look at text. However, I
doubt that there is anything new there beyond what is already known
from standard statistical text analysis.</span></div>
<div style="background: transparent; margin-bottom: 0in;">
<br /></div>
<div style="background: transparent; margin-bottom: 0in;">
<span style="font-weight: normal;">In
any case, I hope this was as good for you as it was for me. </span><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank"><span style="font-weight: normal;">Word Histograms for the Corpus </span></a>
</div>
<div style="background: transparent; margin-bottom: 0in;">
<br />
Next: Cosine Theta, Murdabad!</div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-55569515531910527292012-03-26T01:52:00.002-07:002012-03-29T11:28:10.371-07:00NTA3: StopAndRareWordsToBeOrNotToBeZipfExactly, that <i>is</i><span style="font-style: normal;">
the question!</span><br />
<br />
<span style="font-style: normal;">Numerical Text Analysis 3</span><br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank"><span style="font-style: normal;">Guide to All Text Analysis Posts</span></a><br />
<br />
<span style="font-style: normal;">Prev: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">Text Analysis with Illustrative Example</a></span>
<br />
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<span style="font-style: normal;">Let's stop and think about stop
words. What is the reason to remove them before doing numerical text
analysis? Surely the reason is not that they account for 50% of all
the words in a text (they do), and hence that removing them will
result in data savings. These are words like 'the', 'and', 'to' and
'of' that occur with very high frequency in the corpus of all texts
and whose relative frequencies across texts are approximately
constant. As such </span><span style="font-style: normal;"><b>stop
words do not serve to distinguish one text from another.</b></span><span style="font-style: normal;">
(I was about to say that they do not have any semantic content, but
I'll carefully ... step ... around ... that ... mine.) So we can now
choose to drop some number of the most frequent words. Note that this
decision should absolutely be corpus-based, not based on some
canonical list of stop words that some grad student came up with.
For example, the most frequent words in the corpus(Wm. Shakespeare's
works), corpus(Charles A. Dodgson's works), corpus(LinkedIn profiles)
or corpus(Match.com profiles) will likely be different.</span><br />
<br />
<br />
<br />
So if there isn't a canonical set (nor even <i>number</i>) of stop
words, how should one choose the cut-off? For this let's consider
another, more numerical reason to drop stop words.
<br />
<div style="margin-bottom: 0in;">
<span style="font-style: normal;">In
Numerical Text Analysis, there is a law that captures the general
distribution of word frequencies in texts. For any text or corpus,
rank all the word tokens by decreasing order of frequency, so 'the'
is ranked 1 (relative frequency is 0.05) and something like
'zitterbewegung' –with relative frequency of 10</span><sup><span style="font-style: normal;">-9</span></sup><span style="font-style: normal;">
even within physics texts– is ranked 250,000. <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf's Law</a>
essentially states that the product of the rank and the frequency is
a constant function of rank. Mandelbrot –who got his start in
science doing numerical text analysis until he tried to analyse Jorge
Luis Borges' works and switched to chaos theory– made a
<a href="http://en.wikipedia.org/wiki/Zipf%E2%80%93Mandelbrot_law" target="_blank">modification to this law</a> that the exponent for frequency as a
function of rank is less than -1, and took into account deviations
near the ends. In any case, we then expect a plot of log</span><sub><span style="font-style: normal;">10</span></sub><span style="font-style: normal;">(frequency
* Rank) vs </span><span style="font-style: normal;">log</span><sub><span style="font-style: normal;">10</span></sub><span style="font-style: normal;">(Rank)</span><span style="font-style: normal;"> to approximate a straight line. So let's look at a
sample Zipf plot for Alice:</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s320/_ZipfAliceOrig.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Alice in Wonderland</i> original text</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
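For concreteness, here is a sketch of how the plotted quantity can be computed in Python (which the next post uses for its processing); the function name `zipf_points` and the toy text are mine, not code from the post:

```python
import math
from collections import Counter

def zipf_points(text):
    """Return (log10(rank), log10(frequency * rank)) pairs for each token,
    ranked by decreasing frequency. Under Zipf's law the second
    coordinate is roughly constant in the first."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    points = []
    for rank, (token, n) in enumerate(counts.most_common(), start=1):
        freq = n / total
        points.append((math.log10(rank), math.log10(freq * rank)))
    return points

# A perfectly Zipfian toy text: the i-th token occurs proportionally to 1/i,
# so frequency * rank is the same for every token and the plot is flat.
text = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
ys = [y for _, y in zipf_points(text)]
assert all(abs(y - ys[0]) < 1e-9 for y in ys)
```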
<div style="margin-bottom: 0in;">
Note that the way I am plotting the data is slightly different from Zipf's. Typically (meaning every instance I've seen so far) log(frequency) is plotted vs. log(rank), and one looks for a slope of -1. What that implies is that Frequency * Rank = constant, hence if I plot log(Frequency * Rank), data that satisfies Zipf's law should be flatlined, and deviations from Zipf's law will be zoomed up.<br />
<br />
"What's with the discontinuities?" The text has only 50K words and, after stemming, 2,000 tokens. There is a very large number of words (~700) that occur only once in the text. As we go up this (unsorted) subset, the frequency remains the same but the numerical rank keeps reducing, until we get to the top end and enter the subset of words which occur <i>twice</i>: at this point the rank only changes by 1, but the frequency doubles, adding 0.3 on the log10 scale. Note that this discontinuity is almost half the <i>y</i>-axis range! For a large corpus the number of genuine words that occur only a few times will shrink and their ranks will be pushed to larger numbers on a wider <i>x</i>-axis range, so even though the first discontinuity will still be log10(2), the plot will appear smoother.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
We clearly see the
deviations from the expected power-law behaviour at low ranks, i.e.
for frequently occurring words. Roughly speaking, from the 100th token
onwards a fitted line has almost zero slope, whereas the more frequent
tokens deviate from that behaviour. If we wanted to be aggressive about
removing stop words, we would drop the 359 most frequent tokens; if we
wanted to be less aggressive, we would perhaps choose a cut-off rank of 99. See the <a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">application of this idea</a> to a corpus.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
For the full corpus
of all texts, we should also expect deviations from linearity at the
low frequency end of the plot, and this should be used to find a cut
off for the rare words.</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>Conclusion
for stop and rare words</b></span><span style="font-style: normal;"> </span>
</div>
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="font-style: normal; margin-bottom: 0in;">
For the corpus of
texts under consideration:
</div>
<ol>
<li><div style="font-style: normal; margin-bottom: 0in;">
Do a Cummings
and stem the words in the texts
</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Count the
frequencies of occurrence in the corpus of all the word tokens, and
rank the word tokens by order of frequency.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Plot
log(frequency * rank) vs. log(rank) of the word tokens – the Zipf Plot
for the corpus.
</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Look for
deviations from the expected straight line behaviour and choose a
cut-off rank at both ends.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
The set of
stop words comprises the word tokens ranked above (more frequent than)
the corresponding cut-off, and the rare words are those ranked below it
(less frequent).</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
Remove these
sets of stop and rare words from all texts under consideration.</div>
</li>
<li><div style="font-style: normal; margin-bottom: 0in;">
The resulting
truncated histograms are the word-histograms for the texts that one
should then proceed to use for further numerical text analysis.<br />
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/wordhistogramsorhamletquestion1.html" target="_blank">Word Histograms </a></div>
</li>
</ol>
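The steps above can be sketched in Python, with hypothetical cut-off ranks standing in for ones read off a real corpus Zipf plot (stemming and the plotting step are omitted here):

```python
from collections import Counter

def truncate_histogram(corpus_counts, text_counts, stop_cutoff, rare_cutoff):
    """Drop a text's tokens that are ranked above stop_cutoff (too
    frequent in the corpus) or at/below rare_cutoff (too rare).
    The cut-offs are ranks chosen from the corpus Zipf plot; the
    numbers used below are invented for illustration."""
    ranked = [tok for tok, _ in corpus_counts.most_common()]
    keep = set(ranked[stop_cutoff:rare_cutoff])
    return Counter({t: n for t, n in text_counts.items() if t in keep})

corpus = Counter({"the": 100, "cat": 30, "hat": 20, "zitterbewegung": 1})
text = Counter({"the": 9, "cat": 3, "zitterbewegung": 1})

# 'the' is dropped as a stop word, 'zitterbewegung' as a rare word:
assert truncate_histogram(corpus, text, 1, 3) == Counter({"cat": 3})
```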
<div style="font-style: normal; margin-bottom: 0in;">
</div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-55209999556445959412012-03-26T01:49:00.001-07:002012-03-29T11:27:52.486-07:00NTA2: NumericalTextAnalysisOrHamletQuestionNumerical Text Analysis 2<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis posts</a><br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">Preliminaries </a><br />
How the first line of Hamlet's
soliloquy at the beginning of Act 3 Scene 1 of the eponymous play
becomes {question: 1}.
<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Consider a text, either a web document,
an essay, a personal profile or a book, e.g. <a href="http://books.google.com/books/about/Alice_in_Wonderland.html?id=ID5P7xbmcO8C"><i>Alice
in Wonderland</i></a>. It is a sequence of words (woRds, wodrs, wd.s,
'words?' and 9) separated by whitespace. Before doing any numerical
work, we have to standardize the text using the text operations we
described in <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">NumericalTextAnalysisPreliminaries</a>.
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is what we know or what we should
follow:</div>
<div style="margin-bottom: 0in;">
Don't create more word tokens.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Lower</b><span style="font-weight: normal;">,
since it commutes with all other operators and can be applied to the
whole text, should be first.</span></div>
<div style="margin-bottom: 0in;">
<b>Stem</b><span style="font-weight: normal;">
needs to come after </span><b>DePunctuation</b><span style="font-weight: normal;">,
and since it operates on words, after </span><b>Parse</b><span style="font-weight: normal;">.
</span><b>Stop[rank]</b><span style="font-weight: normal;"> should be
used after all the rest of the cleaning-up, and since the rank is to
be determined based on word frequencies, should be after </span><b>Parse</b><span style="font-weight: normal;">
and </span><b>Histogram</b><span style="font-weight: normal;">.
</span><b>Histogram</b><span style="font-weight: normal;"> should come
after </span><b>Stem</b><span style="font-weight: normal;">. </span><b>Chop[n]</b><span style="font-weight: normal;">
is not needed. I don't know what to do about abbreviations and
mis-spellings, so I'll ignore those for now.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Standardization</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<ol>
<li><div style="margin-bottom: 0in;">
<b>DePunctuate</b><span style="font-weight: normal;">
(All punctuation but '-' deleted, '-' replaced by single space.)</span><b>
</b> These first two steps can act on either text or lists.
</div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Lower</b>. The first two
together are <b>cummings</b><span style="font-weight: normal;">.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Parse (</b><span style="font-weight: normal;">on
whitespace)</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Stem </b><span style="font-weight: normal;">(Porter2
from <a href="http://pypi.python.org/pypi/stemming/1.0">stemming
package</a> in Python by Matt Chaput)</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Histogram</b><span style="font-weight: normal;">
Can act on either text or word list. </span>
</div>
</li>
<li><div style="margin-bottom: 0in;">
<b>Stop[rank]</b><span style="font-weight: normal;">
Look at the Zipf plots of the histogram for the whole corpus to
determine the rank of the word token below which words are to be
dropped.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
<b>DeRarify</b><span style="font-weight: normal;">
</span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in; text-align: left;">
First let me show you the Zipf plots
for the text as we sequentially process it. Recall from the post on <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">To Zipf or not to Zipf</a> that it is a plot of the log of the product of the frequency and rank for any token vs log(rank). We expect it to be linear, indicating a negative power law dependence of the frequency on the rank.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
All processing is done using Python code, which also creates data files and calls R for plotting. When I get my hands on real data, the R program can also fit the data to Zipf's law or Mandelbrot's law, which will help with identifying the deviations from expected linear behaviour and the ranks at which we establish the 'stop' and 'rare' cut-offs. (To see more details of the figures and tables, click on them.)<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz9XghiNWEw8uz7L5ZeYx4JE4EgxIW-YNQAElmrE_CBV3nwdC4VffYsQqqo42zhQBe_YzytWbZu1CPgPUR1Vepa0fFAL7lReCRlQDr1_-cpcg5TcTcrheKoFXHtpBErT2V2ybYfiA2Imcb/s1600/_ZipfAliceOrig.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, original text</td></tr>
</tbody></table>
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
From right to left, the different segments correspond to tokens which appear only once, then those that appear twice, thrice etc. The most frequent word 'the' sits at the bottom left corner. With a large corpus, we expect the curve to be smoothed out, especially in the middle. <br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMWQwsaqosj2GKsRppsZXd3owJ8fPiFUhHQG81AoFGZx-byXPFyOYlp_c3uN6puomDz5XYOppSx2K_-nOe6-V5qaBrssUpJprjIY9de2EE-2LGRTNF3PE1SQ3Z4ekSeTkEsQBo1zkhk1KV/s1600/_ZipfAliceDePunc.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMWQwsaqosj2GKsRppsZXd3owJ8fPiFUhHQG81AoFGZx-byXPFyOYlp_c3uN6puomDz5XYOppSx2K_-nOe6-V5qaBrssUpJprjIY9de2EE-2LGRTNF3PE1SQ3Z4ekSeTkEsQBo1zkhk1KV/s1600/_ZipfAliceDePunc.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, text with punctuation removed</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY5TL3tEcBt6fHy4777wCdpQO-_iU52U5U30PGoslIAPH-zRUCLYBtfeCQSItP2F-fIk8gFOII0ae844YxECr8bNNLIDKIRd1DbNA3AD6Eal2zsdUmyYITPrHkFTteYwyX0RWGH5zYrMGI/s1600/_ZipfAliceLow.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY5TL3tEcBt6fHy4777wCdpQO-_iU52U5U30PGoslIAPH-zRUCLYBtfeCQSItP2F-fIk8gFOII0ae844YxECr8bNNLIDKIRd1DbNA3AD6Eal2zsdUmyYITPrHkFTteYwyX0RWGH5zYrMGI/s1600/_ZipfAliceLow.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, lowercased text</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivjFHUf1dE2cFF6NP7Bh_Wq0JwGvSWjHOAj2FCqL_YLEEzAB-bThygzXRWVMGX01JWN7GKnuQaS6GNARZcHIfK31l9j5Lc3fNRLyBsl-dViaJ_yg0F8t57yRgKW4LSj3_n-wO0wiM1wD4F/s1600/_ZipfAliceStem.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivjFHUf1dE2cFF6NP7Bh_Wq0JwGvSWjHOAj2FCqL_YLEEzAB-bThygzXRWVMGX01JWN7GKnuQaS6GNARZcHIfK31l9j5Lc3fNRLyBsl-dViaJ_yg0F8t57yRgKW4LSj3_n-wO0wiM1wD4F/s1600/_ZipfAliceStem.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram of stemmed tokens</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUGnK6KmuCyhkPtb3Yhx4ANY19tIs19sbtJGMwZRlgVOaOTacxlDQ16LQwHu3IYOvTuJAtcGbAK9_t2cUku1QFNMc8T8PVRRuLGCdIrGGo_VDCU9aTZZwiGi1tzgodWXAjFfvzUUwuDzTk/s1600/_ZipfAliceStpd.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUGnK6KmuCyhkPtb3Yhx4ANY19tIs19sbtJGMwZRlgVOaOTacxlDQ16LQwHu3IYOvTuJAtcGbAK9_t2cUku1QFNMc8T8PVRRuLGCdIrGGo_VDCU9aTZZwiGi1tzgodWXAjFfvzUUwuDzTk/s1600/_ZipfAliceStpd.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram with 100 stop words removed </td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</li>
<li>These 100 'stop' words are just the 100 most frequent words in the text itself. As I have explained in the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Zipf post</a>, the cut off should be based on the histogram of the entire corpus.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="margin-bottom: 0in;">
Again, this is for illustration purposes only.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNlgi6xwZvGvs7xeLZWag0W7uufNypyyJQ7_csdUWhi73sBDPE6OAVCO33i9GcNE8cjYu4Bv8hDJjKAhMNCSJP1DvgkT7JplKkGx8W_tfK1HwztcK7HZdIUzXkfnmqOKSzoH2I_XM27MUI/s1600/_ZipfAliceDeRr.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNlgi6xwZvGvs7xeLZWag0W7uufNypyyJQ7_csdUWhi73sBDPE6OAVCO33i9GcNE8cjYu4Bv8hDJjKAhMNCSJP1DvgkT7JplKkGx8W_tfK1HwztcK7HZdIUzXkfnmqOKSzoH2I_XM27MUI/s1600/_ZipfAliceDeRr.jpeg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Zipf Plot of <i>Alice</i>, histogram with 20 rare words removed </td></tr>
</tbody></table>
</div>
</li>
</ol>
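A minimal sketch of steps 1 through 5 of the standardization pipeline, with a toy suffix-stripper standing in for Porter2 (the post uses Matt Chaput's `stemming` package, which is not reproduced here):

```python
import re
from collections import Counter

def depunctuate(text):
    """Delete all punctuation except '-', which becomes a single space."""
    text = text.replace("-", " ")
    return re.sub(r"[^\w\s]", "", text)

def stem(word):
    """Toy stand-in for the Porter2 stemmer: strip a few common
    suffixes from sufficiently long words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def standardize(text):
    """DePunctuate -> Lower -> Parse (on whitespace) -> Stem -> Histogram."""
    words = depunctuate(text).lower().split()
    return Counter(stem(w) for w in words)

# The first line of the soliloquy, on its way to {question: 1}:
h = standardize("To be, or not to be: that is the question!")
assert h["question"] == 1 and h["be"] == 2
```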
<div style="margin-bottom: 0in; text-indent: -0.25in;">
</div>
<ol start="7"><div style="margin-bottom: 0in;">
Now, let's take a look at the 20 top
and 20 bottom ranked tokens in Alice, and see how this list changes
with processing.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6qQ5bw0VLjyEI4KgdkE8jVUU4g-uk-N1t0KJVP39b5kUhgdx8Rdqnld9JaXZ7GL9imG7oryXKACEdhA6AU2UAVMrWQNt6-6F20o0Hivgb4fVWBNYk_83ieY5PeXhiSzoxPWtt71QMw3Uw/s1600/AliceTop20.tiff" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="387" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6qQ5bw0VLjyEI4KgdkE8jVUU4g-uk-N1t0KJVP39b5kUhgdx8Rdqnld9JaXZ7GL9imG7oryXKACEdhA6AU2UAVMrWQNt6-6F20o0Hivgb4fVWBNYk_83ieY5PeXhiSzoxPWtt71QMw3Uw/s640/AliceTop20.tiff" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Top 20 Tokens</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgizBihxwEMN1XcKrG-q0KsFaGV4TnWr5MqydiwD_chYom4i6Bxv7KvIX8B0HvikQMvNZS-KFw0OqyJaObPhp-dfRYiEkFJYm1O9f4ChjhTiEVj2DygTX-5tlELM6SwfK80JccAP2k-6RdV/s1600/AliceBot20.tiff" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="267" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgizBihxwEMN1XcKrG-q0KsFaGV4TnWr5MqydiwD_chYom4i6Bxv7KvIX8B0HvikQMvNZS-KFw0OqyJaObPhp-dfRYiEkFJYm1O9f4ChjhTiEVj2DygTX-5tlELM6SwfK80JccAP2k-6RdV/s640/AliceBot20.tiff" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Bottom 20 Tokens</td></tr>
</tbody></table>
<div style="margin-bottom: 0in;">
Summarizing,</div>
</ol>
<ol start="7"><div style="margin-bottom: 0in;">
The above steps suffice for spell-checked and edited works; the following steps need to be
implemented for user-provided text.</div>
<li><div style="margin-bottom: 0in;">
Expand any recognizable abbreviation to
its most likely form, so there is no remaining ambiguity. (I am not
sure about this step, but presumably it involves looking up some
standard dictionary.) From the remaining sequences of letters,
identify the non-words and check if there are any likely words in
the dictionary that could have been misspelled as the non-word. For
example, consider the non-word 'mnager'. Let's say that 'manager' is
likely to be misspelled as 'mnager' (dropped letter 'a') 5% of the
time (where the probabilities over all result words – spellings
and misspellings – of the source word 'manager' sum to 1), whereas
'manger' is likely to be misspelled as 'mnager' ('a' and 'n'
transposed dyslexically) 10% of the time. These are <i>not</i><span style="font-style: normal;">
the probabilities to consider and thence assign 'mnager' to
'manger'. Since 'manager' is a much more frequently used word than
is 'manger', 'mnager' is much more likely to be a misspelling of
'manager'. So we want to look at the probabilities that 'mnager'
results as a misspelling of a source word ('manager' or 'manger'),
where the sum over all source words equals 1, and then select the
most likely source. Hence 'mnager' would be assigned to 'manager',
unless one were to have additional information, e.g. that the text
is biblical in nature.</span></div>
</li>
<li><div style="margin-bottom: 0in;">
Toss out 'fzdurq' and other
non-words, so that all remaining sequences of letters are words.</div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;">These
last two complicated steps can be replaced by simply
removing the letter sequences that are least frequent in the
corpus, which are either terms from highly specialized jargon,
misspellings or abbreviations. </span>
</div>
</li>
</ol>
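The misspelling probabilities in the 'mnager' discussion above can be sketched as a tiny noisy-channel model. All the numbers below are illustrative stand-ins: the error probabilities are the ones quoted in the text, and the word frequencies are made up to reflect only that 'manager' is far more common than 'manger'.

```python
# Noisy-channel spelling correction, sketched with the 'mnager' example above.
# P(typo | word) values come from the text; the word frequencies are invented.

# P(observed typo | intended word)
p_typo_given_word = {
    ("mnager", "manager"): 0.05,  # dropped 'a'
    ("mnager", "manger"): 0.10,   # 'a' and 'n' transposed
}

# Relative corpus frequency of each candidate word (hypothetical numbers)
p_word = {"manager": 0.0005, "manger": 0.00001}

def correct(typo, candidates):
    """Pick the source word maximizing P(word | typo) ~ P(typo | word) * P(word)."""
    return max(candidates,
               key=lambda w: p_typo_given_word.get((typo, w), 0.0) * p_word[w])
```

Even though 'manger' is more easily garbled into 'mnager' in this toy error model, the much higher frequency of 'manager' makes it the more likely source, which is exactly the point of the argument above.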
<div align="LEFT" style="margin-bottom: 0in;">
So you see how “To be, or
not to be, that is the question:” is rendered as “question”. The
corpus of the text remains, ready for some numerical, but not
literary, analysis.
</div>
<div style="margin-bottom: 0in;">
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-19012011322535922662012-03-26T01:26:00.002-07:002012-03-29T11:27:31.057-07:00NTA1: NumericalTextAnalysisPreliminariesNumerical Text Analysis 1<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis Posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/sampletextforanalysis.html" target="_blank">Sample Text </a><br />
<br />
Definitions and Text Operations:
<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text atoms: Uppercase letters (<i>L</i>),
lowercase letters (<i>u</i>), punctuation marks (<i>?</i>), numerals
(<i>4</i><span style="font-style: normal;">).</span>
</div>
<div style="margin-bottom: 0in;">
Accent Marks: None in English; just
consider accented letters as independent letters?</div>
<div style="margin-bottom: 0in;">
Whitespace: space, tab, newline,
end-of-line, end-of-file.</div>
<div style="margin-bottom: 0in;">
Word: A sequence of text atoms; in
text, preceded and succeeded by whitespace.</div>
<div style="margin-bottom: 0in;">
Text: A sequence of text atoms and
whitespace. Equivalently, a sequence of words separated by
whitespace. Text space is a linear space.</div>
<div style="margin-bottom: 0in;">
Corpus: A collection of texts.</div>
<div style="margin-bottom: 0in;">
Word list: a list of words; order
usually does not matter. It is generated either from a text or from
another list.</div>
<div style="margin-bottom: 0in;">
Dictionary word: A word found in a
dictionary or in one of a set of dictionaries.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Parse</b><span style="font-weight: normal;">:
(By default on whitespace, but a parse map can be defined for any
sequence of words.) This is an operation that acts on </span><i><span style="font-weight: normal;">text</span></i><span style="font-weight: normal;">
and generates a </span><i><span style="font-weight: normal;">list</span></i><span style="font-weight: normal;">
of words, so </span><b><span style="font-style: normal;">parse</span></b><span style="font-weight: normal;">
is not in fact a text operation. It maps a point from one space into
another.</span></div>
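A minimal sketch of the default whitespace parse (no depunctuation or lowercasing yet; note that punctuation stays attached to its word):

```python
def parse(text):
    """Default parse: split a text on whitespace into a list of words."""
    return text.split()
```

As defined above, this maps a point of text space into list space, e.g. `parse("To be, or not")` yields `["To", "be,", "or", "not"]`.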
<div style="font-weight: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Various operations can be carried out
on text (or on lists, since each element of a list is text); the ones
relevant for our purposes are:</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Lower</b>: Acts on any piece of
text, converts all uppercase letters to the corresponding lowercase
ones.</div>
<div style="margin-bottom: 0in;">
<b>lower</b><span style="font-weight: normal;">(</span><i><span style="font-weight: normal;">upper's
and(?) LOWER</span></i><span style="font-weight: normal;">) = </span><i><span style="font-weight: normal;">upper's
and(?) lower</span></i></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>DePunctuate</b>: Almost all
punctuation marks in normal English text are juxtaposed with whitespace;
call these whitespace-juxtaposed punctuation marks (WJPMs). WJPMs can
either be deleted or replaced with whitespace; in either case one is
left with just whitespace. The two exceptions
are the hyphen '-' and the apostrophe “'”. The hyphen occurs in
hyphenated words like 'super-talented', which are composed of
independent words; hence the hyphen should be replaced by
whitespace. The apostrophe occurs in possessives and contractions,
e.g. “dog's”, “she'd” and “isn't”. In order not to create
single-letter words and peculiarities like “isn”, the apostrophe
should be deleted. Note that stemming leaves all three of
'isn', 'isnt' and 't' unchanged, and hence can't be used to make the choice. With
either choice for eliminating the WJPMs, we will have one exception,
either the hyphen or the apostrophe. For simplicity, let us delete
the WJPMs without replacement, delete the apostrophe without
replacement as well, and replace the hyphen with whitespace.
</div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>DePunctuate</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">“The
dog's bone was super-tasty! Wasn't it?” she asked.</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= </span></span><i><span style="font-weight: normal;">The dogs bone
was super tasty Wasnt it she asked</span></i></div>
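The conventions chosen above (delete WJPMs, delete apostrophes, replace hyphens with whitespace) can be sketched with regular expressions; this is one possible implementation, not the only one:

```python
import re

def depunctuate(text):
    """DePunctuate with the conventions chosen above."""
    text = text.replace("-", " ")           # hyphens become whitespace
    text = re.sub(r"[’']", "", text)        # apostrophes deleted outright
    text = re.sub(r"[^\w\s]", "", text)     # remaining (whitespace-juxtaposed) marks deleted
    return " ".join(text.split())           # tidy up the leftover whitespace
```

Applied to the example sentence it reproduces the result above: 'dog's' becomes 'dogs', 'super-tasty' splits into two words, and 'Wasn't' becomes 'Wasnt'.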
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Chop[n]</b>: Eliminates words of
length <i>n</i><span style="font-style: normal;"> or less, with
possible exceptions like 'i' and 'a' for </span><i>n</i><span style="font-style: normal;">
= 1.</span></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"> </span><span style="font-style: normal;"><b>Chop[2]</b></span><span style="font-style: normal;"><span style="font-weight: normal;">(</span></span><i><span style="font-weight: normal;">To
be or not to be, that is the question.</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
= </span></span><i><span style="font-weight: normal;">not be, that the
question.</span></i><span style="font-style: normal;"> </span>
</div>
<div style="margin-bottom: 0in;">
I thought of this as a shortcut to <b>stop</b>. With the choice of <b>depunctuate</b> above, whereby punctuation marks (except for '-', which is replaced by whitespace) are simply deleted, there shouldn't be single-letter words other than 'i' and 'a'. Having rethought <b>stop</b>, <b>chop</b> became superfluous.</div>
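Superfluous or not, the <b>Chop[2]</b> example above can be checked with a one-line sketch. Note that a punctuation mark still attached to a word (as in 'be,') counts toward its length, which is why 'be,' survives while 'be' does not:

```python
def chop(text, n, exceptions=("i", "a")):
    """Chop[n]: eliminate words of length n or less, with possible exceptions."""
    return " ".join(w for w in text.split() if len(w) > n or w in exceptions)
```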
<div style="font-style: normal; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-style: normal;"><b>Stem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">:
Acts on a word and returns its stem. Conjugations of verbs, plurals,
adjectives, adverbs and other modifiers often have the same stem.
Note that the result of </span></span><span style="font-style: normal;"><b>stem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
on a regular English word is not necessarily a dictionary word – it
is only the sequence of letters identified as the stem. There are
many stemming algorithms. I've used Porter2 from the <a href="http://pypi.python.org/pypi/stemming/1.0">Python
implementation</a> by Matt Chaput.</span></span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzeBGAzxbGlXxCx7Ew2aifWSddfIl1aJqaJJmqinLjhvomURwY_Iz5TuDg7gtASZfZ4mihUgEnuCAtcZ0FPkvbvpILLgHJ9W6T1ZbqJ5Zj1x0tFdnxzTLOmmaqxgAgekN7ILYg8KdUj4sR/s1600/stemtest.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzeBGAzxbGlXxCx7Ew2aifWSddfIl1aJqaJJmqinLjhvomURwY_Iz5TuDg7gtASZfZ4mihUgEnuCAtcZ0FPkvbvpILLgHJ9W6T1ZbqJ5Zj1x0tFdnxzTLOmmaqxgAgekN7ILYg8KdUj4sR/s320/stemtest.tiff" width="159" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
I had misunderstood the purpose of
<b>stem</b><span style="font-weight: normal;">, and when I saw words
like </span><i><span style="font-weight: normal;">veget</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
and </span></span><i><span style="font-weight: normal;">readabl</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
I wrote </span></span><span style="font-style: normal;"><b>ifstem</b></span><span style="font-style: normal;"><span style="font-weight: normal;">
so it would only replace a word by its stem if the stem was a
dictionary word. </span></span>
</div>
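The <b>ifstem</b> idea can be sketched as a wrapper around any stemmer. The `toy_stem` function and the four-word dictionary below are hypothetical stand-ins for a real Porter2 implementation (such as the Python `stemming` package mentioned above) and a real word list:

```python
# Sketch of ifstem: replace a word by its stem only if the stem is itself a
# dictionary word. Both the dictionary and the stemmer are toy stand-ins.

dictionary = {"read", "reading", "readable", "vegetable"}  # toy word list

def toy_stem(word):
    # Stand-in for Porter2: crudely strip a few common suffixes.
    for suffix in ("able", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ifstem(word, stem=toy_stem, words=dictionary):
    s = stem(word)
    return s if s in words else word
```

So 'reading' is replaced by its stem 'read', but 'vegetable' is left alone because its stem 'veget' is not a dictionary word, exactly the case that motivated <b>ifstem</b>.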
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Stop[rank]</b><span style="font-weight: normal;">:
Eliminates words whose tokens rank higher (are more frequent) than
</span><i><span style="font-weight: normal;">rank</span></i><span style="font-style: normal;"><span style="font-weight: normal;">.
See </span></span>
</div>
<div style="margin-bottom: 0in;">
the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a> post.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Derarify[rank from bottom]</b>:
Eliminates words whose tokens are very low-ranking by frequency. <span style="font-style: normal;"><span style="font-weight: normal;">See
</span></span>
</div>
<div style="margin-bottom: 0in;">
the <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a> post.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>SpellCheck</b><span style="font-weight: normal;">:
I have not thought about this yet. Eliminating the least frequent tokens
(DeRarifying) should take care of some misspellings.</span></div>
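<b>Stop</b> and <b>Derarify</b> can both be sketched from a single frequency count; the thresholds below (top 1 token stopped, tokens occurring once removed) are illustrative, not recommendations:

```python
from collections import Counter

def stop_and_derarify(words, stop_rank=10, rare_count=1):
    """Drop the stop_rank most frequent tokens (Stop) and any token
    occurring rare_count times or fewer (Derarify)."""
    counts = Counter(words)
    stops = {w for w, _ in counts.most_common(stop_rank)}
    return [w for w in words if w not in stops and counts[w] > rare_count]
```

For example, with `stop_rank=1` and `rare_count=1`, "the cat sat on the mat the cat ran" loses 'the' as a stop token and the singletons as rarities, leaving only the two occurrences of 'cat'.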
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>ImProper</b><span style="font-weight: normal;">:
Identify and eliminate proper nouns. While any one given text may
have a few names that occur with high frequency (for example in </span><i><span style="font-weight: normal;">Alice
in Wonderland</span></i><span style="font-weight: normal;"> the name
Alice is the 10</span><sup><span style="font-weight: normal;">th</span></sup><span style="font-weight: normal;">
ranked word), over the whole corpus most proper nouns should be
relative rarities and can be eliminated by DeRarifying the text. The
most common proper nouns can perhaps be listed and eliminated
separately. </span>
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<b>Histogram</b><span style="font-weight: normal;">:
Again, this is not a text operation; it acts on a wordlist
and returns a list of tuples or a dictionary (a list of key:value
pairs) whose first elements (or keys) are the members of the set
constructed from the elements of the list, and whose second elements (or
values) are the number of times each occurs in the list. The space of
word histograms is linear, has a natural origin and is non-negative;
it is a whole-number-valued “vector space”. So let's call these
things “word vectors”. For any given text, then, one can
construct its word vector.</span></div>
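In Python, <b>Histogram</b> is essentially the standard library's `collections.Counter`:

```python
from collections import Counter

def histogram(wordlist):
    """Histogram: map a word list to its word vector (token -> count)."""
    return Counter(wordlist)
```

For example, `histogram("to be or not to be".split())` gives the word vector `{to: 2, be: 2, or: 1, not: 1}`.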
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
The above operations don't commute with
each other, hence the order in which they are carried out will affect
the result.
</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Note for further thought: the above
operations are linear operators on a linear space, and they do have
eigenvalues and eigenvectors. But the operators are not all
symmetric, and some of their eigenvectors lie outside text space.
E.g. purely lowercase texts are eigenvectors of <b>lower</b><span style="font-weight: normal;">
with eigenvalue 1. A word vector consisting of lowercase text </span><i><span style="font-weight: normal;">low</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
and</span></span><span style="font-weight: normal;"> equal and
opposite frequencies of occurrence of text whose lowercase is </span><i><span style="font-weight: normal;">low</span></i><span style="font-style: normal;"><span style="font-weight: normal;">
(e.g. </span></span><i><span style="font-weight: normal;">lOw</span></i><span style="font-style: normal;"><span style="font-weight: normal;">,
</span></span><i><span style="font-weight: normal;">LOW</span></i><span style="font-style: normal;"><span style="font-weight: normal;">)
is an eigenvector with eigenvalue 0. (See my future post on how to
extend text space to make some of these operators symmetric.)</span></span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Next: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysisorhamletquestion_26.html" target="_blank">Text Analysis with Example</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-37009526422053840362012-03-26T00:25:00.001-07:002012-03-29T11:27:13.037-07:00NTA0: SampleTextForAnalysisNumerical Text Analysis 0<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to all Text Analysis posts </a><br />
<br />
Try to guess the text I am going to
use as an illustrative example. At a certain stage of
text-processing, it has somewhat fewer than 2,000 word-tokens and a
total length of a bit less than 27,000 words. I rank these 2,000 word-tokens
by frequency of appearance in the text. Here are some
of the salient tokens and their ranks in increasing order: scan the
tokens one at a time, and stop when you guess the text they are from.
Give yourself points corresponding to the rank of the word you stopped at,
and report back to me via a comment.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAPUv4yFEi_zccnhemiyBDSlZLzeNABZOxoNM364x1zs-R3Ib33EU319X6EyMX826FK69gw9OzNEqOqcD50HfKn5Jxgc3Q7cyr_E5h6r-2X2ujKKD60ZNQosr_vl0IEHieFFP2BYQk9PDj/s1600/Ecila.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAPUv4yFEi_zccnhemiyBDSlZLzeNABZOxoNM364x1zs-R3Ib33EU319X6EyMX826FK69gw9OzNEqOqcD50HfKn5Jxgc3Q7cyr_E5h6r-2X2ujKKD60ZNQosr_vl0IEHieFFP2BYQk9PDj/s320/Ecila.tiff" width="180" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Here is more information about the
words:</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp9ythSDEpPuyNq7Mfl0T5qAZpzkCEY75IY4pnx4XxRYmU7YkIHD1MnZvpvKYAX0K78ErTXTxXIJtmfedbSWHIkCEXgsvV2xxzlBN2eNAMxaaHPVv6YYhjnNjZyPESV4SnoR-GX1GgWd9s/s1600/AliceWds.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp9ythSDEpPuyNq7Mfl0T5qAZpzkCEY75IY4pnx4XxRYmU7YkIHD1MnZvpvKYAX0K78ErTXTxXIJtmfedbSWHIkCEXgsvV2xxzlBN2eNAMxaaHPVv6YYhjnNjZyPESV4SnoR-GX1GgWd9s/s320/AliceWds.tiff" width="319" /></a></div>
<br />
The word 'head' appears only 4/5 of the
times that the word 'queen' does. Assuming that 'head' appears half
the time in neutral circumstances, the Queen is only yelling “Off
with his head!” 2/5 of the times she makes an appearance.<br />
<br />
The full text can be found<a href="http://books.google.com/books/about/Alice_in_Wonderland.html?id=ID5P7xbmcO8C" target="_blank"> here</a>. <br />
<div style="margin-bottom: 0in;">
<br />
Next: <a href="http://tatetech.blogspot.com/2012/03/numericaltextanalysispreliminaries.html" target="_blank">Preliminaries</a></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com1tag:blogger.com,1999:blog-6519535535434149951.post-23832538609869221232012-03-22T02:24:00.000-07:002012-03-29T11:28:25.022-07:00NTA4: WordHistogramsOrHamlet{question:1}Numerical Text Analysis 4<br />
<br />
<a href="http://tatetech.blogspot.com/2012/03/guide-to-posts-on-text-analysis.html" target="_blank">Guide to All text Analysis posts</a><br />
<br />
Prev: <a href="http://tatetech.blogspot.com/2012/03/stopandrarewordstobeornottobezipf.html" target="_blank">Selecting Stop Tokens</a><br />
<br />
WordHistogramsOrHamlet{question:1}<br />
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Consider the space of word-vectors:
</div>
<ol start="0">
<li><div style="margin-bottom: 0in;">
If there are D different word
tokens in the entire collection of texts under consideration, then
word-vector space is D-dimensional.</div>
</li>
</ol>
<ol>
<li><div style="margin-bottom: 0in;">
It is a linear space, but it is
not exactly a vector space: though there is an origin or 0-vector
(null-texts), word-vectors do not have additive inverses that I can
interpret, so there is no subtraction defined on the space of
word-vectors. Hence word-vector space is the positive
1/2<sup>D</sup>-ant of D-dimensional space.</div>
</li>
</ol>
<ol start="2">
<li><div style="margin-bottom: 0in;">
Two texts are equivalent if they
have the same word histogram, i.e. the same distribution of the
frequencies of occurrence of the stemmed, non-stop word-tokens.
Word-vectors are then equivalence classes of texts. </div>
</li>
</ol>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
For example, of the following 3 texts,
Text A and B are equivalent to each other, but not to Text C.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text A: This “teXt” is equivalent
to the following text, Not the preceding one?</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text B: The preceding text IS
equivalent to “this one”, but not to the following texts.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
Text C: “This text” is not the same
as the preceding one, and has no following text.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
After removing punctuation,
lowercasing, stemming and removing stop words, their histograms are
A: {text: 2, preced: 1, follow: 1, equival: 1}, </div>
<div style="margin-bottom: 0in;">
B: {text: 2, preced:
1, follow: 1, equival: 1} and </div>
<div style="margin-bottom: 0in;">
C: {text: 2, preced: 1, follow: 1,
equival: 0}.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<ol start="3">
<li><div style="margin-bottom: 0in;">
Two word-vectors can be added to
each other, and they can be scaled by positive integers. If we
define the '+' of two texts to be the text formed by writing one
text followed by the other, then</div>
<div style="margin-bottom: 0in;">
vec(Text D) + vec(Text E) = vec(Text D
'+' Text E)</div>
</li>
<li><div style="margin-bottom: 0in;">
You can effectively <i>divide</i>
a word-vector V by a positive integer <i>q</i><span style="font-style: normal;">
(and hence scale by positive rationals) by leaving V alone and
scaling all the rest by </span><i>q</i><span style="font-style: normal;">.
</span>
</div>
</li>
</ol>
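Property 3 above can be checked directly with Counter-based word vectors (whitespace parse only, no other processing):

```python
from collections import Counter

def vec(text):
    """Word vector of a text: histogram of its whitespace-parsed words."""
    return Counter(text.split())

def concat(d, e):
    """The '+' of two texts: one text written after the other."""
    return d + " " + e
```

`Counter` supports `+`, so `vec(d) + vec(e) == vec(concat(d, e))` holds for any two texts `d` and `e`, which is exactly the additivity claimed above.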
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
If we normalize all word-vectors to
unit magnitude (i.e. as vectors, so the sum of the squares of the
frequencies is 1), then the resulting space is the positive
1/2<sup>D</sup>-ant of the unit (D-1)-sphere, and texts are now points on this piece of the sphere.</div>
<div style="margin-bottom: 0in;">
<br /></div>
However, we could also normalize the word-histograms, so that the sum of
the frequencies themselves – <i>not</i><span style="font-style: normal;">
the sum of their squares – </span>is 1. In this case the resulting
space is the (D-1)-simplex, which meets the unit sphere at its vertices,
the coordinate unit vectors. Note that the space of probabilities for the occurrence of D
mutually exclusive events is exactly this (D-1)-simplex, and that word-histograms
have a natural probabilistic interpretation but not an obvious
interpretation as vectors. So in most cases we will prefer to live on
the simplex rather than the sphere.<br />
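The two normalizations can be sketched side by side; the word vector below is the one computed for Texts A and B earlier:

```python
import math

def normalize_vector(hist):
    """Scale so the squared frequencies sum to 1: a point on the unit sphere."""
    norm = math.sqrt(sum(v * v for v in hist.values()))
    return {w: v / norm for w, v in hist.items()}

def normalize_simplex(hist):
    """Scale so the frequencies sum to 1: a point on the probability simplex."""
    total = sum(hist.values())
    return {w: v / total for w, v in hist.items()}
```

Both maps rescale a word vector along its ray from the origin, which is why the angles between texts as viewed from the origin are unchanged by either choice.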
<br />
What does text-space look like? I
want you to imagine yourself at the origin, at the point
corresponding to the null-text. Small texts with few words are close
by and large texts are further away. All 'normal' texts are in the
positive 2<sup>-D</sup>-ant (<span style="font-family: Times New Roman,serif;">π</span><span style="font-family: Times New Roman,serif;">/2
radians in 2D and </span><span style="font-family: Times New Roman,serif;">π/2</span><span style="font-family: Times New Roman,serif;">
steradians in 3D, etc.). Similar texts are along the same line of
sight and very dissimilar texts will be at large angles to each
other, up to a maximum of 90 degrees. Whether we've normalized the
texts as vectors (on the unit sphere) or as histograms (on the unit
simplex), the angles between them </span><span style="font-family: Times New Roman,serif;"><i>as
viewed from the origin</i></span><span style="font-family: Times New Roman,serif;"><span style="font-style: normal;">
are the same. </span></span><br />
<span style="font-family: Times New Roman,serif;"><span style="font-style: normal;"> </span></span>
<br />
<span style="font-family: Times New Roman,serif;"><span style="font-style: normal;">The word histogram for the first line of Hamlet's
soliloquy at the beginning of Act 3 Scene 1 of the eponymous play is
{question: 1}.</span></span>
<br />
<div style="margin-bottom: 0in;">
<br />
This still requires only fairly simple undergrad or even high school CS skills (judging by the really smart interns at LinkedIn during the summer of 2011), and perhaps a graduate numerical text analysis course.<br />
<br />
Next: An aside <a href="http://tatetech.blogspot.com/2012/03/wordvectorspaceplay.html" target="_blank">Word Play</a><br />
or see the <a href="http://tatetech.blogspot.com/2012/03/constructing-histograms-for-corpus.html" target="_blank">word histogram for the corpus</a><br />
<br /></div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-85594314359636538212012-01-23T16:14:00.001-08:002015-04-27T17:03:38.804-07:00Factorial calculator<title>Factorial Test - Javascript</title>
<script type="text/javascript">
// Recursive factorial; for n > 170 the result overflows to Infinity in IEEE doubles.
function factorial(f) { return ((f <= 1) ? 1 : f * factorial(f - 1)); }
function fact() {
var mana = document.calc_form.mana.value;
var answer = '';
if (mana !== '') {
answer = factorial(Number(mana)); // convert the input string to a number
}
document.calc_form.answer.value = answer;
return false;
}
</script>
<br />
<h1>
Factorial - JavaScript </h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">Little test bits as I build up to implementing the computational and analytical algorithms in JavaScript. This is the first dynamic webpage I've ever written!</span></span></h1>
<h1>
<span style="font-size: large;"><span style="font-weight: normal;">Please enter a natural number less than 171. Feel free to try a larger number, but don't trust answers for non-natural numbers: this is NOT the Gamma-function extension to the reals or complexes. </span></span></h1>
<form action="" name="calc_form" onsubmit="return fact();">
<label>Natural <input name="mana" /></label><br />
<input type="submit" value="Calculate" /><br />
<label>Factorial <input name="answer" value="" /></label><br />
<br />
Thank you Chris!
</form>
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-83971008438342846812012-01-20T10:04:00.000-08:002012-01-23T17:19:45.068-08:00test<title>Mana Vita Will - Javascript</title>
<script type="text/javascript">
function calculate() {
var mana = document.calc_form.mana.value;
var vita = document.calc_form.vita.value;
var will = document.calc_form.will.value;
var answer = '';
if (mana !== '' && vita !== '' && will !== '') {
answer = (30 + (mana * 1.25) / 10 + (vita * 0.75) / 10) * (will / 10);
}
document.calc_form.answer.value = answer;
return false;
}
</script>
<br />
<h1>
Mana Vita Will - Javascript</h1>
<form action="" name="calc_form" onsubmit="return calculate();">
<label>Mana <input name="mana" /></label><br />
<label>Vita <input name="vita" /></label><br />
<label>Will <input name="will" /></label><br />
<input type="submit" value="Calculate" /><br />
<label>Answer <input name="answer" value="" /></label>
</form>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-1670311755007512792011-09-09T22:21:00.000-07:002011-09-11T14:11:26.928-07:00Rate of FemINism amongst LinkedIn members?Have you recently joined LinkedIn or edited your profile lately? Below the lines for your first and last names, there is a line for "Former/Maiden Name:".<br />
<span style="font-size: large;"><u>The evidence</u></span><br />
<u> </u> <br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMLgZP5G6ffTMf_fl7IMrLNULQosS3q5pLEWJlKqeu_BXOBZegaVgEp4uKT6Wyi89qEwmwLqg-zS1vq2tG-7KNjzGsFC58u432RTzGmdlKKakWDf8-XSu_rp6DcXuCgwOtysbzyzLhyphenhyphen2kq/s1600/femINist.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMLgZP5G6ffTMf_fl7IMrLNULQosS3q5pLEWJlKqeu_BXOBZegaVgEp4uKT6Wyi89qEwmwLqg-zS1vq2tG-7KNjzGsFC58u432RTzGmdlKKakWDf8-XSu_rp6DcXuCgwOtysbzyzLhyphenhyphen2kq/s320/femINist.jpg" width="283" /></a></div><br />
The obvious question is, "How and when did it come to be there?".<br />
<br />
<span style="font-size: large;"><u>The question</u></span> The interesting meta-question, Ash, ish "Why has this so very 19th century concept persisted well into the 21st century, and what does it tell us about the LinkedIn membership?"<br />
<br />
<span style="font-size: large;"><u>The answer</u></span> Here's (Thank you Megrah!) my answer: It is because the rate of fem<span style="color: magenta;">IN</span>ism amongst LinkedIN members is very low, in fact about two per million or <b>one in 500,000</b> in the best case scenario.<br />
<br />
<span style="font-size: large;"><u>My assumptions</u></span>:<br />
1) Had they seen "Former/Maiden Name" <i>any</i> feminist worth their (Sorry, Megrah!) salt would object to that language ...<br />
2) ... and have taken action --the difference between feminism and namby-pamby "humanism" is <b>action</b>-- for example by bringing it to the attention of LinkedIn...<br />
3) ...and LinkedIn would have removed the sexist language.<br />
(As of 9 September 2011, that line was still there on the "edit profile" page.)<br />
4) Every former or current LinkedIn member has seen that line at least once, during registration.<br />
<br />
<span style="font-size: large;"><u>Worst case analysis:</u></span> There are currently upwards of 120 million LinkedIn members. Only one "took action" - walking a 100' and chatting. So in the worst case analysis, the rate of femINism is 1 in 120 million.<br />
<br />
But wait, LinkedIn hasn't always had 120 million members! And how long did it take that one person (myself) to act?<br />
<br />
That calls for a more sophisticated analysis, taking into account the duration of exposure of the sexist line to members and the time it took me from when I could have first noticed it till I acted.<br />
<br />
<span style="font-size: large;"><u>Let's do the math, Barbie!</u></span> Take LinkedIn's membership numbers: 4.5K in June 2003, increasing to 120M in August 2011. Assume 5) exponential growth, and calculate the time constant (in base 10 it is about 1/2 per year). Then simply integrate the membership over time and you get about 1 million man-years! Ohh ... OK ... people-years.<br />
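The time constant quoted above can be checked in a couple of lines; the 8.2-year span is my rounding of June 2003 to August 2011:

```python
import math

# LinkedIn membership: ~4.5K in June 2003, ~120M in August 2011 (~8.2 years apart).
n0, n1, years = 4.5e3, 120e6, 8.2

# Base-10 time constant k in N(t) = n0 * 10**(k * t)
k = math.log10(n1 / n0) / years
```

This gives k of roughly 0.54 per year, i.e. "about 1/2 per year" in base 10, as claimed.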
<br />
I've been a member of LinkedIn for about 2 years, and it took me that long to notice and take (minor) action - so that is 2 people-years in the numerator. And there you have it:<br />
<br />
<span style="color: magenta;"><b><span style="font-size: large;">Rate of FemINism amongst LinkedIn membership is at most 1 in 500,000!</span></b><span style="color: black;"><b><span style="font-size: large;"> </span></b> </span></span><br />
<br />
<span style="color: magenta;"><span style="color: black;"><span style="font-size: large;"><u>One possible quibble</u></span> and why it beautifully doesn't matter, because of exponential growth: What if that line was only introduced later, not at the very beginning, but say when membership was already 10 times as large as at the start? Wouldn't that mean the rate of feminism is actually much better, 2 per 100,000 (one tenth of a million), which is really 1 out of 50,000? That doesn't look too bad! Is the analysis so sensitive to assumptions about initial conditions?</span></span><br />
<span style="color: magenta;"><span style="color: black;"> </span></span><br />
<span style="color: magenta;"><span style="color: black;">Well, no. It takes LinkedIn only 2 years to increase its membership by a <i>factor of 10</i>. Over those early 2 years with exponentially fewer members, the loss in member exposure turns out to be only 135,000 people-years. So even with that ameliorative assumption, we would get the rate of femINism to be 2 per (1 million - 135,000), or about 1 per 430,000 as opposed to 1 per 500,000. <b>Does that really make you feel better?</b></span></span><br />
<br />
<span style="font-size: large;"><u>Full Disclosure</u></span>: I am an intern at LinkedIn (and proud of it) and a femINist (and proud of it).ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com1tag:blogger.com,1999:blog-6519535535434149951.post-8853183396840702752011-09-01T00:42:00.000-07:002011-09-01T00:42:02.730-07:00Cosine Similarity Murdabad! <br />
<div class="MsoNormal">COSINE SIMILARITY MURDABAD!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">On the use of the “Cosine Similarity” as a quantifier of the relatedness of elements of a linear space.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">The Cosine Similarity is one of the quantifiers used in numerical text analysis for judging the similarity between texts, as follows:<o:p></o:p></div><div class="MsoNormal">Consider a text, either a web document, an essay or a book. Perform an e e cummings on it, which is jargon for the following set of operations: remove all punctuation and other non-letter keyboard characters, lowercase everything and then parse it on whitespace. What you are left with is a collection of words. Now remove various small and frequent “stop” words like “a”, “the”, “is” etc., which account for about a third of Shakespeare’s vocabulary. Now count the number of occurrences of the distinct words (or “word-tokens”). <o:p></o:p></div><div class="MsoNormal"><br />
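As a concrete illustration of the pipeline just described, a sketch in Python (the stop-word list here is a tiny toy stand-in of my own, not a standard one):

```python
import re
from collections import Counter

STOP_WORDS = frozenset(["a", "an", "the", "is", "of", "to", "and"])  # toy list

def word_histogram(text):
    # the "e e cummings" treatment: keep letters only, lowercase,
    # parse on whitespace, drop stop words, count distinct word-tokens
    tokens = re.sub(r"[^A-Za-z]+", " ", text).lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

print(word_histogram("The cat sat on the mat; the cat slept."))
# Counter({'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'slept': 1})
```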
</div><div class="MsoNormal">If the word tokens are ordered, the list of the corresponding number of occurrences in a text is the “word histogram” of the text. The space of word histograms is linear, has a natural origin and is non-negative; it is a whole-number-valued vector space. So let’s call these things “word vectors”.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Any given text is represented by a point in this word vector space. In numerical text analysis (after TF-IDF weighting, if one so chooses) one wants a way to quantify the relatedness or similarity between two texts – each represented as a vector – and the simplest thing one can do with two vectors is … the scalar product! Which, as one knows, is the product of the magnitudes of the two vectors and the Cosine of the angle between them. Now as far as “similarity” is concerned, the magnitudes of the word vectors don’t matter (half of “Romeo and Juliet” is similar to the whole text).<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So we have the numerical “Cosine Similarity”: it is close to 1 for vectors in almost the same direction and 0 for perpendicular vectors. Since the space is non-negative, there are no pairs of anti-parallel vectors with Cosine Similarity = -1.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Cosine similarity as a cardinal number: You can’t add it, you can’t subtract it, in fact you can’t do any of the numerical operations on it. It isn’t a distance function, nor are its additive or multiplicative inverses. From the point of view of trying to do any geometry on this space, it is quite useless. <o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">If what one is really trying to quantify is the angle between the two vectors, Cosine is a very poor substitute. Since its slope at 0<sup>o</sup> is zero, it discriminates very poorly between close neighbours, which is precisely the region of interest. Other absurdities are manifest: vectors 60 degrees apart from each other have a Cosine Similarity of 50%; vectors 30 degrees apart (a third of the way to perpendicular) have a Cosine Similarity of 86%! Bit vectors with only half their bits in common have a Cosine Similarity of 70%, but they also have a Sine Dissimilarity of 70%!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Now it turns out that the ArcSine of the square-root of (1 minus the square of the Similarity) <i>is</i> a good metric distance. OK, so this is just the angle that Cosine is the Cosine of. So Cosine does have some value for calculating the angle between the vectors, except for that pesky vanishing slope, which causes an amplification of errors at small angles. More on this when we consider the ordinal virtues of Cosine.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So one <i>should</i> use the angle: it is the metric distance on projective word vector space, and one can use it to do all kinds of geometrical stuff – cluster density is an example.<o:p></o:p></div><div class="MsoNormal">However, to calculate it there are better, less error-prone trigonometric means: whereas Cosine(theta) = <b>a</b> . <b>b</b> for unit vectors, <o:p></o:p></div><div class="MsoNormal">Sine(theta/2) = |<b>a</b> – <b>b</b>|/2 is monotonically increasing in theta and strictly monotonic except at theta = 180, which is in some sense the least interesting angle.<o:p></o:p></div><div class="MsoNormal"><br />
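To make the last two paragraphs concrete, here is a small sketch (plain Python, no libraries) comparing the two routes to the angle, and demonstrating the flat-slope problem near 0 degrees:

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angle_via_cosine(a, b):
    a, b = unit(a), unit(b)
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against round-off

def angle_via_chord(a, b):
    # Sine(theta/2) = |a - b|/2 for unit vectors, as in the post
    a, b = unit(a), unit(b)
    half_chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / 2
    return 2 * math.asin(min(1.0, half_chord))

# Both routes give 45 degrees here:
print(math.degrees(angle_via_cosine([1, 0], [1, 1])))  # ~45
print(math.degrees(angle_via_chord([1, 0], [1, 1])))   # ~45

# The flat slope of Cosine near 0: a change of only ~0.02 in the Cosine
# spans the entire gap between 2 degrees apart and 12 degrees apart.
print(math.cos(math.radians(2)) - math.cos(math.radians(12)))  # ~0.021
```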
</div><div class="MsoNormal">What about that theorem in Manning and Sch<span>ü</span>tze<a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftn1" name="_ftnref1" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[1]</span></span><!--[endif]--></span></span></a>, which proves that Cosine leads to the same ordering as does the angle? Yes, it is mathematically true, and obvious, I might add, since Cosine is monotonic. However, it is not true computationally: an error of 0.025 in the Cosine spells the difference between 2 degrees and 12 degrees, which could completely invert the ordering!<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">In conclusion: Cosine Similarity is no good as a cardinal quantifier, no good as an ordinal quantifier, and horrendous for calculating theta.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">As far as what I propose to do, you’ll just have to wait for the movie.<o:p></o:p></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Cosine Similarity Murdabad!<a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftn2" name="_ftnref2" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[2]</span></span><!--[endif]--></span></span></a> <o:p></o:p></div><div class="MsoNormal"><br />
</div><div><!--[if !supportFootnotes]--><br clear="all" /> <hr align="left" size="1" width="33%" /> <!--[endif]--> <div id="ftn1"> <div class="MsoFootnoteText"><a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftnref1" name="_ftn1" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[1]</span></span><!--[endif]--></span></span></a> An excellent introductory text on “Foundations of Statistical Natural Language Processing”.<o:p></o:p></div></div><div id="ftn2"> <div class="MsoFootnoteText"><a href="http://www.blogger.com/post-create.g?blogID=6519535535434149951&pli=1#_ftnref2" name="_ftn2" title=""><span class="MsoFootnoteReference"><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><span style="font-family: Cambria; font-size: 12pt;">[2]</span></span><!--[endif]--></span></span></a> Roughly translates as “Die! Die! Die!”<o:p></o:p></div></div></div><!--EndFragment--> ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-58946346712890936512011-08-13T13:16:00.000-07:002011-08-13T13:16:36.372-07:00Proof of Conjecture for paths in hypercube lattices<b>Problem:</b><br />
<ol><li>hypercubical lattice in D dimensions</li>
<li>start from O = (0,0...)</li>
<li>allowed moves are only of the form (0, 0, ..., +1, ..., 0, 0), i.e. a unit step in one coordinate direction</li>
<li>End point of path is (N1, N2,...ND)</li>
</ol>What is the number of paths?<br />
<br />
<b>Answer:</b><br />
The multinomial coefficient<br />
C(N1+N2+...+ND; N1, N2, ..., ND) = (N1+N2+...+ND)! / (N1! N2! ... ND!)<br />
<br />
<b>Proof:</b><br />
Let the whole number valued coordinate axes be n_i, i = 1...D.<br />
The end point lies on the hyperplane \sum(n_i) = \sum(N_i).<br />
One has to choose \sum(N_i) times to get to that hyperplane, from between D choices each time (the choice of direction to move in).<br />
To get to the point (N_1, N_2, ..., N_D), exactly N_i choices have to have been in the i-th direction.<br />
Hence the number of paths is the number of ways of choosing the above partition from its sum.<br />
<br />
In 3D, this is the number of ways of choosing 3 Easts, 5 Norths and 2 Ups from 10 choices to reach the point 3E, 5N and 2U.<br />
<br />
Hence the multinomial coefficient. QED<br />
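The result is easy to check numerically. A sketch in Python comparing the multinomial formula against a direct recursive count, using the 3E/5N/2U example above:

```python
from math import factorial

def multinomial(counts):
    # (N1 + N2 + ... + ND)! / (N1! N2! ... ND!)
    result = factorial(sum(counts))
    for n in counts:
        result //= factorial(n)
    return result

def count_paths(point):
    # direct recursion: sum the path counts over all "parent" lattice nodes
    if all(c == 0 for c in point):
        return 1
    return sum(count_paths(point[:i] + (c - 1,) + point[i + 1:])
               for i, c in enumerate(point) if c > 0)

print(multinomial((3, 5, 2)))   # 2520
print(count_paths((3, 5, 2)))   # 2520
```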
<br />
As a little extra: let's say we were only interested in paths between diagonally opposite points of the D-hypercube, say (0,0,...) and (N,N,...).<br />
<br />
Then the number of paths is (D*N)!/(N!)**D, which was my conjecture two days ago!!!<br />
<br />
<br />
Now, the analytical answer can be produced with two more lines of code: one to read the dimension from the command line, the other to read the coordinates of the end point.<br />
<br />
The codeCounting program would need a bit more, a couple of <b>for</b> loops to write out the D-dimensional python code (if needed) and then to call it. <br />
<br />
But the amount of time it is going to take is going to be ... HELP?<br />
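The "HELP?" is justified: the plain recursion recomputes every interior lattice point once per path through it, so the running time grows like the path count itself. Memoizing the recursion (not part of the original posts; `functools.lru_cache` is one stock way to do it) makes the cost proportional to the number of lattice points instead:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_paths(point):
    # point is a tuple (n1, ..., nD); each distinct point is evaluated once
    if all(c == 0 for c in point):
        return 1
    return sum(num_paths(point[:i] + (c - 1,) + point[i + 1:])
               for i, c in enumerate(point) if c > 0)

# (20, 20) has ~1.4e11 paths; the memoized count returns instantly,
# where the unmemoized recursion would need on the order of 1e11 calls.
print(num_paths((20, 20)))  # 137846528820
```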
<br />
Last point for today: The factorial function can be extended to the complex plane, except at the negative integers, by the <a href="http://en.wikipedia.org/wiki/Gamma_function">Gamma function</a>.<br />
<br />
So now I can legitimately answer the question: How many paths are there to a (say, integer) diagonal point in a <b>fractional</b> dimension?<br />
<br />
Answer = Gamma(N*D +1)/(N!)**D.<br />
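Under that Gamma-function extension, the formula is one line of Python (a sketch; whether "number of paths" means anything in a fractional dimension is exactly the speculation below):

```python
from math import gamma, factorial

def diagonal_paths(N, D):
    # Gamma(N*D + 1) / (N!)**D; for whole-number D this is (N*D)! / (N!)**D
    return gamma(N * D + 1) / factorial(N) ** D

print(diagonal_paths(2, 3))    # 90.0, the ordinary 3-D count 6!/(2!2!2!)
print(diagonal_paths(2, 2.5))  # ~21.2 "paths" in dimension 2.5
```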
<br />
Objection: Who cares about your perspective? What practical value can it possibly have?<br />
<br />
Here goes: Play fast and loose with "fractional" = "fractal". Now we actually have something we can visualize: paths on a square Sierpinski gasket, for example.<br />
<br />
And the practical use? Conjecture: The structure of the web, or of networks on the web, is <b>fractal</b>. There is fractal behaviour even in Statistical Natural Language Processing; Benoit Mandelbrot did important work on it in the 1950s.<br />
<br />
Let's say that the LinkedIn network is fractal. Maybe the above could be of some use in counting paths, for a better calculation of the clustering coefficient of an egonet.<br />
<br />
From the perspective of SNLP, simple extensions of the above counting processes lead to the distributions of sizes of stochastically generated words, e.g. with truncation on the choice of ' '. <br />
<br />
Ranjeet<br />
<br />
<br />
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-36495477532830172342011-08-13T12:35:00.000-07:002011-08-13T12:35:37.231-07:00simplifying the boundary conditionsLast night and this morning when I woke up I was thinking of the extension to higher dimensions. One thought was that the number of elif conditions with special-case sums for the points on the boundary would grow as 2**D - 1 (all the "lower" edges, faces, vertices, etc.). Even if I wrote Python code to<br />
<ol><li>take in the dimension as an argument and </li>
<li>have it write python code for the dimension required (which I'll do in a minute)</li>
</ol>this would be quite tedious.<br />
Then I realised that the recursive self-calls stop when they encounter a value, so I can just set boundary conditions for "parent nodes" on the enveloping lower hyper-planes (of which there are only D) to be 0.<br />
<br />
Here is the somewhat more "biutiful" code (I rented it, but ended up watching Terminator 3 instead):<br />
<br />
#!/usr/bin/python <br />
""" 'codeCountPaths1.py Natural [Natural]' <br />
Single argument -> (n,n) <br />
code to count the number of paths <br />
Assumption 1: starting at (0,0) <br />
Assumption 2: ending at user input (n,m) <br />
moving in (+,+) direction only, on a complete Euclidean square lattice. <br />
CHANGE from codeCountPaths.py: <br />
A branch of the recursion will stop when it encounters a value, so <br />
BCs can be accounted for by putting in lines of <br />
0s at n= -1 and m = -1 as opposed to <br />
restricting the sum for boundary points to just those parent nodes <br />
which lie within the domain. There is no need for the 0-valued <br />
parent nodes to go back to the n+m = 0 line (or soon2B hyperplane) <br />
"""<br />
import sys, math<br />
def NumPaths(i,j,n,m):<br />
if i==0 and j==0:<br />
numpaths = 1 # BC at origin<br />
elif ((i==-1) and j in range(1,m+1)) or ((j==-1) and i in range(1,n+1)):<br />
numpaths = 0 # BCs at enveloping lower lines<br />
else:<br />
numpaths = NumPaths(i-1,j,n,m) + NumPaths(i,j-1,n,m)<br />
return numpaths<br />
def main():<br />
# reads command line arguments for the location of the ending point <br />
if len(sys.argv) < 2:<br />
print "usage: $ python codeCountPaths1.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
<br />
# scolds user for trying to trick poor little machine <br />
if (n < 0) or (m < 0):<br />
print "BoundError: n >= 0 and m >= 0. Tricks later, for now non-negative integers only please"<br />
return<br />
# asks for answer, and sits back for an awfully long time <br />
ans = NumPaths(n,m,n,m)<br />
print "Code counted number of paths from (0,0) to (",n,', ',m,") on a complet\<br />
e Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
if __name__ == '__main__':<br />
main()<br />
<br />
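For the record, the same zero-boundary trick generalizes to any dimension D with no code generation at all — pass the point as a tuple and decrement each coordinate in turn. A Python 3 sketch (the memoization via functools.lru_cache is my addition, to keep it tractable):

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def num_paths_d(point):
    """Monotone lattice paths from the origin to `point` in any dimension D.
    A coordinate of -1 lies on one of the D enveloping lower hyperplanes,
    where the value is simply 0, so the recursion stops there."""
    if all(c == 0 for c in point):
        return 1          # BC at the origin
    if any(c == -1 for c in point):
        return 0          # BCs on the D enveloping hyperplanes
    return sum(num_paths_d(point[:k] + (point[k] - 1,) + point[k + 1:])
               for k in range(len(point)))
```

num_paths_d((2, 2)) gives 6, and num_paths_d((N,) * D) agrees with the closed form (N*D)!/(N!)**D.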
Question: I have a recursion relation for the number of time-steps needed, but I can't solve it analytically, nor have I managed to write code that correctly counts each time a call is made.<br />
<br />
Can one of you CS gods help put better bounds on "awfully long time"?<br />
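One way to pin down "awfully long time" empirically is to instrument the recursion with a call counter. The call count T(n,m) obeys T(n,m) = 1 + T(n-1,m) + T(n,m-1) with T = 1 on the boundaries, so it grows at least as fast as the answer (n+m)C(n) itself — exponential along the diagonal. A sketch (same recursion as codeCountPaths1.py, ported to Python 3):

```python
calls = 0

def num_paths(i, j):
    """The recursion from codeCountPaths1.py, instrumented with a call counter."""
    global calls
    calls += 1
    if i == 0 and j == 0:
        return 1          # BC at origin
    if i == -1 or j == -1:
        return 0          # BCs at the enveloping lower lines
    return num_paths(i - 1, j) + num_paths(i, j - 1)

def count_calls(n, m):
    """Return (number of paths, number of recursive calls made)."""
    global calls
    calls = 0
    return num_paths(n, m), calls
```

For example, reaching (2,2) takes 27 calls for an answer of 6, and (3,3) takes 99 calls for an answer of 20.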
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-8867469543179218272011-08-13T03:04:00.000-07:002011-08-13T03:16:08.875-07:00Code for A BEAUTIFUL BEAUTIFUL PROBLEMBeautiful code!!!<br />
<br />
OK: I can't believe<br />
0) yesterday's code was so Uuugly!!<br />
1) I completely missed all sorts of beautiful things and the joy of having coded this problem (successfully, meaning without crashing my computer or bringing down the whole LinkedIn site, even though I had been careful enough to close all connections before hitting 'return')<br />
2) Yesss! It was also correct!<br />
<br />
Yesterday's code:<br />
#!/usr/bin/python <br />
""" 'diagPaths.py' """<br />
import sys, math<br />
def diagN(n):<br />
if n < 1:<br />
print "tricks later, for now positive integer only please"<br />
else:<br />
return math.factorial(2*(n-1))/math.factorial(n-1)**2<br />
def main():<br />
if len(sys.argv) < 3:<br />
# <br />
print "usage: $ python diagPaths.py --n natural number"<br />
elif sys.argv[1] == '--n':<br />
n = int(sys.argv[2])<br />
ans = diagN(n)<br />
print "Number of paths between diagonal vertices for ",n, "-square lattice \<br />
= ", ans<br />
if __name__ == '__main__':<br />
main()<br />
<br />
3) If you want to exploit the beauty and power of recursion, code backwards! The number of paths to get to (i,j) is = numpaths(i-1,j) + numpaths(i,j-1). Then just ask for numpaths(whatever you want) and let the code call itself ...<br />
<br />
4) ... Modulo the bounds: while thinking about the bounds I realised that the upper bounds don't constrain anything. Which in turn means ...<br />
<br />
5) The number of paths to the point <i>(i,j)</i> from <i>(0,0)</i> is just <i>(i+j)C(j)</i>. Why? It is the <i>i+j</i> th diagonal, and the point is the <i>j</i>th from one end of it, so it is just the Binomial coefficient given above.<br />
<br />
6) Which further means that we have an alternative proof of the sum of the squares of the binomial coefficients: yesterday I explained that the numpaths to get to (n,n) is the sum of the squares of the numpaths to get to points on the transvecting diagonal, which is the sum of the squares of the Binomial coefficients. Today, I've explained that the numpaths to get to (i,j) is <i>(i+j)C(j)</i>. Hence to get to (n,n) the numpaths is <i>(2n)C(n)</i> = (2n)!/(n!)^2 !!!! (that's excitement, not factorials).<br />
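Both 5) and 6) are easy to sanity-check mechanically. A quick Python 3 sketch (math.comb needs Python 3.8+):

```python
from math import comb

def num_paths(i, j):
    """Backwards recursion from point 3): paths into (i, j) arrive from the
    left and from below; along either axis there is only one way."""
    if i == 0 or j == 0:
        return 1
    return num_paths(i - 1, j) + num_paths(i, j - 1)

n = 6
# 5) closed form: the number of paths to (i, j) is (i+j)C(j)
assert all(num_paths(i, j) == comb(i + j, j) for i in range(n) for j in range(n))
# 6) the alternative proof: sum of squares of binomial coefficients = (2n)C(n)
assert sum(comb(n, j) ** 2 for j in range(n + 1)) == comb(2 * n, n)
```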
<br />
Analytical answer:<br />
<br />
#!/usr/bin/python <br />
""" 'AnalyticalCountPaths.py' 1 argument -> end point (n,n) <br />
2 arguments -> (n,m) <br />
assume complete square Euclidean lattice <br />
starting point = (0,0) <br />
"""<br />
import sys, math<br />
def main():<br />
if len(sys.argv) < 2:<br />
print "usage: $ python AnalyticalCountPaths.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
if (n < 0) or (m < 0):<br />
print "EEp! Bloink! HAL No Compute, go fall into pole of Riemann Zeta. BoundError"<br />
print "For now positive integers only please"<br />
else:<br />
ans = math.factorial(n+m)/(math.factorial(n)*math.factorial(m))<br />
print "Analytical solution: Number of paths from (0,0) to (",n,', ',m,") on a\<br />
complete Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
print "\n Oh my but this problem is beautiful!! Alternate proof of sum of squ\<br />
ares of Binomial coefficients!"<br />
if __name__ == '__main__':<br />
main()<br />
<br />
so why do I need to write the code for actually counting this?<br />
Because, I can't believe I am saying this, it is EVEN more beautiful!<br />
<b>AND FINALLY</b>:<br />
#!/usr/bin/python <br />
""" 'codeCountPaths.py Natural [Natural]' <br />
Single argument -> (n,n) <br />
code to count the number of paths <br />
Assumption 1: starting at (0,0) <br />
Assumption 2: ending at user input (n,m) <br />
moving in (+,+) direction only, on a complete Euclidean square lattice. <br />
"""<br />
import sys, math<br />
def NumPaths(n,m):<br />
if (n < 0) or (m < 0):<br />
print "BoundError: n >= 0 and m >= 0. Tricks later, for now positive integer only please"<br />
return 0<br />
elif n==0 and m==0:<br />
numpaths = 1<br />
elif n == 0:<br />
numpaths = NumPaths(n, m-1)<br />
elif m == 0:<br />
numpaths = NumPaths(n-1,m)<br />
else: # THE MEAT OF THE CODE<br />
numpaths = NumPaths(n-1,m) + NumPaths(n,m-1)<br />
return numpaths<br />
def main():<br />
if len(sys.argv) < 2:<br />
print "usage: $ python codeCountPaths.py Natural [Natural]"<br />
return<br />
elif len(sys.argv) == 2:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[1])<br />
elif len(sys.argv) ==3:<br />
n = int(sys.argv[1])<br />
m = int(sys.argv[2])<br />
ans = NumPaths(n,m) # <b>here is where I ask for the answer just once!</b><br />
print "Code counted number of paths from (0,0) to (",n,', ',m,") on a complet\<br />
e Euclidean square lattice moving only in the (+,+) direction = ", ans<br />
print "\n Oh my but this problem is beautiful!!"<br />
if __name__ == '__main__':<br />
main()<br />
<br />
I'll confess I was nervous when I ran this: I'd already calculated the answer in my head, and was praying to Pascal that my code counting would be right!<br />
<br />
Did I thank _ enough for this problem?<br />
Tomorrow: we'll relax the Euclidean requirement, then the "completeness".<br />
<br />
By The Way: I should remind myself that any second year Numerical Path Integral Physics Grad student is probably doing this in their sleep.<br />
<br />
ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0tag:blogger.com,1999:blog-6519535535434149951.post-60917044229211761592011-08-11T21:25:00.000-07:002011-08-11T21:25:01.906-07:00A BEAUTIFUL BEAUTIFUL PROBLEM<div dir="ltr" style="text-align: left;" trbidi="on"><br />
<!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/> <w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/> <w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/> <w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/> <w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/> <w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/> <w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/> <w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/> <w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/> <w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/> <w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/> <w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/> <w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/> <w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/> <w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/> <w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/> <w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/> <w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/> <w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/> <w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/> <w:LsdException Locked="false" Priority="37" Name="Bibliography"/> <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/> </w:LatentStyles> </xml><![endif]--> <style>
<!--
/* Font Definitions */
@font-face
{font-family:"MS 明朝";
panose-1:0 0 0 0 0 0 0 0 0 0;
mso-font-charset:128;
mso-generic-font-family:roman;
mso-font-format:other;
mso-font-pitch:fixed;
mso-font-signature:1 134676480 16 0 131072 0;}
@font-face
{font-family:"MS ゴシック";
panose-1:0 0 0 0 0 0 0 0 0 0;
mso-font-charset:128;
mso-generic-font-family:modern;
mso-font-format:other;
mso-font-pitch:fixed;
mso-font-signature:1 134676480 16 0 131072 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-536870145 1107305727 0 0 415 0;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-520092929 1073786111 9 0 415 0;}
@font-face
{font-family:Cambria;
panose-1:2 4 5 3 5 4 6 3 2 4;
mso-font-charset:0;
mso-generic-font-family:auto;
mso-font-pitch:variable;
mso-font-signature:-536870145 1073743103 0 0 415 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:"MS 明朝";
mso-fareast-theme-font:minor-fareast;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
h1
{mso-style-priority:9;
mso-style-unhide:no;
mso-style-qformat:yes;
mso-style-link:"Heading 1 Char";
mso-style-next:Normal;
margin-top:24.0pt;
margin-right:0in;
margin-bottom:0in;
margin-left:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan lines-together;
page-break-after:avoid;
mso-outline-level:1;
font-size:16.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#345A8A;
mso-themecolor:accent1;
mso-themeshade:181;
mso-font-kerning:0pt;}
h2
{mso-style-priority:9;
mso-style-qformat:yes;
mso-style-link:"Heading 2 Char";
mso-style-next:Normal;
margin-top:10.0pt;
margin-right:0in;
margin-bottom:0in;
margin-left:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan lines-together;
page-break-after:avoid;
mso-outline-level:2;
font-size:13.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#4F81BD;
mso-themecolor:accent1;}
span.Heading1Char
{mso-style-name:"Heading 1 Char";
mso-style-priority:9;
mso-style-unhide:no;
mso-style-locked:yes;
mso-style-link:"Heading 1";
mso-ansi-font-size:16.0pt;
mso-bidi-font-size:16.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#345A8A;
mso-themecolor:accent1;
mso-themeshade:181;
font-weight:bold;}
span.Heading2Char
{mso-style-name:"Heading 2 Char";
mso-style-priority:9;
mso-style-unhide:no;
mso-style-locked:yes;
mso-style-link:"Heading 2";
mso-ansi-font-size:13.0pt;
mso-bidi-font-size:13.0pt;
font-family:Calibri;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:major-latin;
mso-fareast-font-family:"MS ゴシック";
mso-fareast-theme-font:major-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:major-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:major-bidi;
color:#4F81BD;
mso-themecolor:accent1;
font-weight:bold;}
.MsoChpDefault
{mso-style-type:export-only;
mso-default-props:yes;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:"MS 明朝";
mso-fareast-theme-font:minor-fareast;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:"Times New Roman";
mso-bidi-theme-font:minor-bidi;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.WordSection1
{page:WordSection1;}
-->
</style> <!--[if gte mso 10]> <style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;}
</style> <![endif]--> <!--StartFragment--><o:p></o:p> <br />
<h2>BEAUTIFUL CODE WITHOUT CODE</h2><div class="MsoNormal"><br />
</div><div class="MsoNormal">The problem, as I recall it being described to me, is to calculate the number of ways to get from one vertex of a square lattice (of size <i>N</i> points per side) to the diagonally opposite one, where each move is a unit step in the positive direction along one of the two axes. It is, or used to be, a popular interview question for testing a candidate’s coding skills.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">My zeroth thought … I’ll come back to that later.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">My 1.0th thought was, of course, an <b>iterative</b> approach – figure out a process to get from the <i>k-1</i>th diagonal to the <i>k</i>th: at each step of the iteration, use the distribution of ways of reaching the <i>p</i>th point on the <i>k-1</i>th diagonal to calculate the number of ways of reaching the <i>q</i>th point on the <i>k</i>th, and store the new distribution. Start from <i>k=2</i> and repeat until <i>k=N</i>. OK, that gets us to the diagonal. Thought 1.1 was <b>divide and conquer</b>: get to the transvecting diagonal between the vertices, square the distribution, and then add.</div><div class="MsoNormal"><br />
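</div><div class="MsoNormal">The iterative scheme just described can be sketched in a few lines of Python (a minimal sketch in my own naming, not the post’s actual code):</div>

```python
# Iterative version: push the distribution of path counts from one
# diagonal to the next, Pascal's-triangle style, then square and sum.

def next_diagonal(prev):
    """Path counts on diagonal k, given those on diagonal k-1."""
    padded = [0] + prev + [0]           # zeros handle the two boundary points
    return [padded[i] + padded[i + 1] for i in range(len(prev) + 1)]

def count_paths(n):
    """Monotone corner-to-corner paths on an n-points-per-side lattice."""
    diag = [1]                          # one way to stand at the start corner
    for _ in range(2, n + 1):           # walk out to the main diagonal
        diag = next_diagonal(diag)
    return sum(x * x for x in diag)     # divide and conquer: square and add
```

<div class="MsoNormal">For n = 1, …, 5 this reproduces 1, 2, 6, 20, 70.</div><div class="MsoNormal"><br />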
</div><div class="MsoNormal">My second thought was to take a <b>recursive</b> approach for the first part (the distribution on the diagonal):</div><div class="MsoNormal">def diag(k):</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>if k == 1:</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>return [1]</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>old = [0] + diag(k - 1) + [0]</div><div class="MsoNormal"><span>&nbsp;&nbsp;&nbsp; </span>return [old[i] + old[i + 1] for i in range(k)]</div><div class="MsoNormal">I had forgotten how much more beautiful recursion is than iteration!</div><div class="MsoNormal">All you do is call diag(N), square the elements, and sum!</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I think the above goes as n**3 (roughly n**2/2 additions, on counts that grow to O(n) digits).</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Time to implement and try it:</div><div class="MsoNormal">$ python diagonalPaths.py --n <i>natural number</i></div><div class="MsoNormal"><br />
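</div><div class="MsoNormal">I don’t have the original diagonalPaths.py, so here is a hedged reconstruction (the filename and the --n flag come from the post; everything else is my guess):</div>

```python
# Hypothetical reconstruction of diagonalPaths.py -- not the original.
import argparse

def diag(k):
    """Path counts along the k-th diagonal (a row of Pascal's triangle)."""
    if k == 1:
        return [1]
    old = [0] + diag(k - 1) + [0]       # pad so the boundary sums work
    return [old[i] + old[i + 1] for i in range(k)]

def main():
    parser = argparse.ArgumentParser()
    # The post shows --n as required; defaulting to 5 here for convenience.
    parser.add_argument("--n", type=int, default=5)
    args, _ = parser.parse_known_args()
    print(sum(x * x for x in diag(args.n)))  # square the diagonal and sum

if __name__ == "__main__":
    main()
```

<div class="MsoNormal"><br />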
</div><div class="MsoNormal">OK, so for n = 1, 2, 3, 4, 5 we get 1, 2, 6, 20, 70.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">So how do we know it is right? Let’s say you’ve debugged the code and validated it for n = 1, 2, 3, and it agrees with what the interviewer has. The answer agrees with the consensus; still, how do we know that it is right?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Put another way: can you predict the answer for n = 6, or any n, before you run your code – or at least in less than n**3 time?</div><div class="MsoNormal">I’ll claim that my code runs in n**1 time. How?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Well, I didn’t code the recursion. Which brings me to my 0<sup>th</sup> thought: I can do it analytically, so why code it?</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">One can readily see that the distribution on the diagonal is the binomial, since at every node you are choosing which of the two axes to step along, and the number of steps taken along one axis gives your position on the diagonal. If we just wanted to get to the diagonal, the number of ways is 2**(n-1). So the answer we want is the sum of the squares of the binomial coefficients, which luckily has a closed form: (2N)!/(N!)**2, with N = n-1 moves per side. If ! can be computed in N time, so can my code.</div><div class="MsoNormal"><br />
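</div><div class="MsoNormal">The closed form is easy to sanity-check numerically against the sum-of-squares identity (my check, using only the standard library):</div>

```python
# Check the identity sum_k C(N,k)^2 == C(2N,N) == (2N)!/(N!)**2,
# the closed form quoted above, with N = n - 1 moves per side.
from math import comb, factorial

def closed_form(n):
    N = n - 1  # moves per side on an n-points-per-side lattice
    return factorial(2 * N) // factorial(N) ** 2

for n in range(2, 8):
    N = n - 1
    assert sum(comb(N, k) ** 2 for k in range(N + 1)) == closed_form(n)
```

<div class="MsoNormal"><br />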
</div><div class="MsoNormal">Comment 0: I can also ask the question “calculate the number of ways to get from one vertex of a square lattice (of size <i>N</i> points per side)”, where <i>N</i> is:</div><div class="MsoNormal">Non-positive</div><div class="MsoNormal">Rational</div><div class="MsoNormal">Real (for the computer this should be pseudo-real, or float, I think)</div><div class="MsoNormal">Complex</div><div class="MsoNormal">A matrix (real- or complex-valued)</div><div class="MsoNormal">Anything that forms a group under multiplication – or under any group operation satisfying the properties of multiplication, with the commutativity requirement relaxed.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Since n! = Gamma(n+1), any of the above can be calculated (convergence for matrices will be harder to prove). </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I wouldn’t have the slightest idea how to code any of the above!</div><div class="MsoNormal"><br />
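</div><div class="MsoNormal">That said, the real-valued case at least is codable via the gamma function – a sketch under the assumption that we only need real N (math.gamma accepts neither complex nor matrix arguments):</div>

```python
# Evaluate the path-count formula (2N)!/(N!)**2 as
# Gamma(2N+1)/Gamma(N+1)**2, which makes sense for real N >= 0.
from math import gamma

def paths_real(N):
    return gamma(2 * N + 1) / gamma(N + 1) ** 2

# At integer N this reproduces the central binomial coefficients
# (N = 4 gives 70.0); at N = 2.5 it interpolates between the
# integer values 6 (N = 2) and 20 (N = 3).
```

<div class="MsoNormal"><br />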
</div><div class="MsoNormal">Comment 1: What about lattices of different dimensions? Now here you’ve got me; I can’t avoid coding. If we are only interested in the number of ways of getting to the nearest transvecting diagonal plane, the answer is simply D**(n-1). But it is no longer sufficient to square the multinomial coefficients and add, since we have to get through the other transvecting planes before getting from the last one to the opposite vertex. There are two such planes in 3D, with normal vector (1,1,1), at distances 1/sqrt(3) and 2/sqrt(3) from the origin; similarly in higher dimensions.</div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Now if you construct the following D-dimensional solids – take two D-simplices and join them on their (D-1)-dimensional faces – then we are reduced to summing the squares of the multinomial coefficients. But otherwise:</div><div class="MsoNormal"><b>the “coding” value of this problem lies in its extension to higher dimensions.</b></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Comment 1.1: I am going to take a reasonable guess and claim that the answer in D dimensions is (Dn)!/(n!)**D. Again, by using the gamma function (and extending it to complex arguments), we can allow <i>both</i> n and D to take on complex values. I have no idea what a complex dimension looks like (though I suspect that John Conway and others do), but a fractional dimension, or at least a real-valued fractal dimension, I can visualize. The famous <a href="http://en.wikipedia.org/wiki/Sierpinski_triangle">Sierpinski gasket</a>, also <a href="http://www.jimloy.com/fractals/sierpins.htm">http://www.jimloy.com/fractals/sierpins.htm</a>, has a fractal dimension (box counting = Hausdorff) of ln(3)/ln(2) ~ 1.585. Similarly, if I construct the square equivalent (taking away 4 of the nine subsquares of a tictactoization of the square at every recursion), I get a lattice of fractal dimension ln(5)/ln(3) ~ 1.465 (5 copies, 1/3 linear dimension each). I can readily imagine that the constraint of having to pass through certain lattice points reduces the number of paths relative to the number in a square lattice.</div><div class="MsoNormal"><br />
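</div><div class="MsoNormal">The guess (Dn)!/(n!)**D can be brute-force checked in three dimensions (my check, not from the post):</div>

```python
# Count monotone paths from (0,0,0) to (n,n,n) by dynamic programming
# and compare against the conjectured closed form (Dn)!/(n!)**D.
from math import factorial
from itertools import product

def count_3d(n):
    """Monotone paths from (0,0,0) to (n,n,n), counted by DP."""
    ways = {(0, 0, 0): 1}
    # Process points in order of coordinate sum: each move raises it by 1.
    for pt in sorted(product(range(n + 1), repeat=3), key=sum):
        if pt == (0, 0, 0):
            continue
        x, y, z = pt
        ways[pt] = (ways.get((x - 1, y, z), 0)
                    + ways.get((x, y - 1, z), 0)
                    + ways.get((x, y, z - 1), 0))
    return ways[(n, n, n)]

def guess(n, D=3):
    """The conjectured closed form (Dn)!/(n!)**D."""
    return factorial(D * n) // factorial(n) ** D
```

<div class="MsoNormal"><br />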
</div><div class="MsoNormal">Comment 2: What a beautiful introduction to path integrals! The function of paths we are “integrating” above is just 1. But you can readily imagine summing the length L of the path over all paths, or say the phase change over the path, exp(i L), or indeed any other function of paths. If you use L, you are doing geometric optics; if you use some particle action S, why, “welcome to quantum mechanics!”</div><div class="MsoNormal"><br />
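</div><div class="MsoNormal">A toy version of the path-integral idea (my construction, not from the post): enumerate every monotone path on the n-per-side lattice as a sequence of R/U moves and sum an arbitrary functional F over them.</div>

```python
# Sum a functional F over all monotone paths; F = 1 recovers the count.
from itertools import permutations

def all_paths(n):
    """Distinct orderings of (n-1) R-moves and (n-1) U-moves."""
    return set(permutations("R" * (n - 1) + "U" * (n - 1)))

def path_sum(n, F):
    return sum(F(p) for p in all_paths(n))

def corners(p):
    """Number of direction changes along the path."""
    return sum(a != b for a, b in zip(p, p[1:]))
```

<div class="MsoNormal">With F = 1 you recover the plain path count; with F = corners (or exp(i L), or exp(i S)) you are summing a genuine functional of the path.</div><div class="MsoNormal"><br />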
</div><div class="MsoNormal">I’d like to thank _ for suggesting this problem.</div><div class="MsoNormal"><br />
</div><!--EndFragment--> </div>ClimberT8http://www.blogger.com/profile/12302281220153714545noreply@blogger.com0