Tuesday, March 27, 2012

Guide to Posts on Text Analysis

The posts make most logical sense in the following order:

NTA0: Sample Text for Analysis

NTA1: Preliminaries

NTA2: Text Analysis Illustrative Example

NTA3: Selecting Stop Tokens

NTA4: Word Histograms

NTA5: WordPlay

NTA6: Word Histogram for the Corpus

NTA7: What Tokens to Use, Dimensional Sparseness


  1. Tate,

    All good stuff. Stemming is really cool - I wonder how Porter came up with it. I don't think it applies to other languages.

    About 5-6 years back, to understand search technologies better, I implemented a search engine. It's still active - you can try it out at Line Spout, www.linespout.com. For obvious reasons of resources and time, it's a "limited" search engine, in that it looks only at "new" content from a set of socially curated sources (aka real-time search).

    Apart from text analysis, search involves things like relevance ranking, accounting for social feedback, and relevant-advertisement auctions, all of which I implemented in Line Spout.

    In terms of text analysis alone, there are a few further "hard" problems:
    (a) conflations introduced by stemming - e.g., "Federer" and "Federal" stem to the same thing, so whenever I search for Federer on Line Spout, I get news about the Federal Govt. Resolving this needs a dictionary (probably socially curated). The dictionary can also come in handy for synonyms etc. for query expansion.
    (b) phrase match - needs a lot more computation and storage.
    (c) grep (which I have implemented - see pattern search under options) - can get really slow. An online MapReduce could help here (MapReduce is generally used on the offline indexing side).
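    The conflation in (a) is easy to reproduce. Here is a toy suffix-stripping stemmer - a deliberately simplified sketch of the idea, not the actual Porter algorithm - that shows how "Federer" and "Federal" collapse into the same index term:

    ```python
    # Toy suffix-stripper: a simplified illustration of stemming
    # conflation, NOT the real Porter algorithm.
    SUFFIXES = ("ation", "ing", "ed", "er", "al", "s")

    def toy_stem(word):
        """Strip the first matching suffix, keeping a stem of >= 3 letters."""
        word = word.lower()
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    # Both words reduce to the same term, so an index keyed on stems
    # cannot tell a tennis query from a government query:
    print(toy_stem("Federer"))  # feder
    print(toy_stem("Federal"))  # feder
    ```

    (The real Porter stemmer produces the same collision on this pair, which is exactly why a curated dictionary of protected terms helps.)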

    I am reasonably convinced that "dumb" index search is probably just as effective as any more sophisticated approach because
    (i) getting the best top-ten search results does not have a life-or-death, or even a financial-accounting, level of importance, and
    (ii) users retry searches with different queries until they get roughly what they want.

    My suggestion to you would be to focus on your new non-cosine approach, which can make matching the query to the content more robust, or on any other new "angles" that improve user experience and/or reduce implementation complexity - emphasis on new and different.


  2. Hi Ranjeet! It's so cool that you're getting into this field. I do similar work but from a much more 'symbolic', as opposed to 'statistical', vantage point.

    I just read the first post and it made me wonder if you've ever read anything on formal language theory. Here's a primer: http://www.helsinki.fi/esslli/courses/readers/K10.pdf

    Also, since you mention Python, there is an old book called Text Processing in Python by David Mertz that you might find interesting. Mertz favors a *very* functional approach to text processing (his Python looks as much like Haskell/ML as anything I've seen). The whole book is online for free here: http://gnosis.cx/TPiP/
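    To give a flavor of that functional style (this is my own illustration, not code from the book): text processing becomes a pipeline of small pure functions composed together, with no mutable state threaded through.

    ```python
    # A functional-style word-histogram pipeline: tokenize, count,
    # and take the top-N, expressed as composed pure functions.
    import re
    from collections import Counter
    from functools import reduce

    tokenize = lambda text: re.findall(r"[a-z']+", text.lower())
    count = lambda tokens: Counter(tokens)
    top = lambda n: lambda hist: hist.most_common(n)

    def compose(*fns):
        """Right-to-left function composition: compose(f, g)(x) == f(g(x))."""
        return reduce(lambda f, g: lambda x: f(g(x)), fns)

    top_words = compose(top(2), count, tokenize)
    print(top_words("the cat sat on the mat and the cat slept"))
    # [('the', 3), ('cat', 2)]
    ```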

    Also, you might find NLTK useful: http://www.nltk.org/