More word bagging
Following on from my bag of words post, folks suggested I also look at n-gram word lists and their frequency distribution. I did that. I'm not sure the results are terribly interesting, but it wasn't much extra work, so that's okay. I suspect my sample size was too small ...
The Wikipedia page on N-grams might be more interesting, perhaps?
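For anyone wanting to try the same thing, here is a minimal sketch of n-gram counting in Python. It is not the code used for this post: the function names, the whitespace tokenising and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Yield each run of n consecutive words as a tuple."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def ngram_counts(text, n):
    """Count n-grams over a lowercased, whitespace-split text (assumed tokenising)."""
    words = text.lower().split()
    return Counter(ngrams(words, n))


def frequency_distribution(counts):
    """Map an occurrence count to the number of distinct n-grams seen that often."""
    return Counter(counts.values())


# Hypothetical sample text, not taken from the real data set.
sample = "click the ok button then click the cancel button"
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(1))
print(sorted(frequency_distribution(bigrams).items()))
```

With a sample this small almost every n-gram occurs exactly once, which is roughly the "sample size was too small" problem in miniature.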
(the post that started all this is back in the archives a bit)
Posted by smg on August 30, 2005 at 04:46 PM IST #
Total number of rows for
bi-grams: 50282
tri-grams: 64781
4-grams: 62190
5-grams: 55139
Posted by Simos on August 30, 2005 at 04:59 PM IST #
Posted by Tim Foster on August 30, 2005 at 05:02 PM IST #
Posted by smg on August 30, 2005 at 05:19 PM IST #
Heh, seems running the latest Solaris doesn't get you everything :-)
Interesting stat from the spreadsheet. Leaving out the 1, 2 and 3 range of occurrences from each n-gram, there's a similar average of occurrences for each n-gram:
Though would need to see higher n-grams to be more clear. And since it's some stats, how does this mean anything ?-P
Posted by smg on August 30, 2005 at 05:21 PM IST #
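One way to read the stat in the comment above is as the mean occurrence count over n-grams seen at least four times. The sketch below computes that under this assumed reading; `mean_occurrences`, the `min_count` parameter and the counts are invented for illustration and do not come from the spreadsheet.

```python
def mean_occurrences(counts, min_count=4):
    """Average occurrence count over n-grams seen at least min_count times.

    With the default min_count=4, n-grams seen 1, 2 or 3 times are left out.
    Returns 0.0 when nothing passes the threshold.
    """
    kept = [c for c in counts.values() if c >= min_count]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical bigram counts, not the real data.
example_counts = {
    ("x", "y"): 1,
    ("y", "z"): 3,
    ("the", "file"): 4,
    ("file", "menu"): 8,
}
print(mean_occurrences(example_counts))
```

Running the same function over the bi-gram, tri-gram, 4-gram and 5-gram tables would show whether the averages really are similar.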
the trick is to filter out the stopwords to get rid of the garbage like "the" or "to" - just suppress such n-grams from the output, and you'll see more interesting results.
And the bigger the list, the nicer the results. See this awk script:
BEGIN {
    FS = "\t"
    # Stopword pattern, anchored so it only matches a first field that is exactly one stopword.
    stop = "^(a|about|above|accordingly|after|again|against|ah|all|also" \
        "|although|always|am|an|and|and/or|any|anymore|anyone|are" \
        "|as|at|away|b|be|been|begin|beginning|beginnings|begins|begone" \
        "|begun|being|below|between|but|by|c|ca|can|cannot|come|could" \
        "|d|did|do|doing|during|each|either|else|end|et|etc|even|ever|far" \
        "|ff|following|for|fro|from|further|get|go|goes|going|got|had|has" \
        "|have|he|her|hers|herself|him|himself|his|how|i|if|in|into|is|it|its" \
        "|itself|last|lastly|less|ll|many|may|me|might|more|must|my|myself" \
        "|nay|near|nearly|never|new|next|no|not|now|o|of|off|often|oh|on" \
        "|only|or|other|otherwise|our|ourselves|out|over|perhaps|put|puts" \
        "|quite|s|said|saw|say|see|seen|shall|she|should|since|so|some|such" \
        "|t|than|that|the|their|them|themselves|then|there|therefore|these" \
        "|they|this|those|thou|though|throughout|thus|to|too|toward|unless" \
        "|until|up|upon|us|ve|very|was|we|were|what|whatever|when|where" \
        "|which|while|who|whom|whomever|whose|why|with|within|without" \
        "|would|yes|your|yours|yourself|yourselves" \
        "|button|field|menu|option|net|date|today)$"
}
# Print the first two fields of every row whose first field is not a stopword.
tolower($1) !~ stop { print $1 "\t" $2 }

I think that stopwords should be customized from the TM you're using, i.e. just use the top entries from the word frequency list of the TM.
Unfortunately, I have no time to experiment with it myself, but I used the kfNgram program some time ago to experiment. I'm not sure whether it's Win32-only or written in Java.
Best, Marcin
Posted by Marcin on September 02, 2005 at 12:44 PM IST #
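The two suggestions in the comment above (suppress stopword n-grams, and take the stopwords from the top of the word frequency list) could be sketched in Python roughly as follows. This assumes "suppress such n-grams" means dropping any n-gram that contains a stopword, which is broader than matching the whole first field. The function names, the `top_k` cutoff and the sample strings are hypothetical.

```python
from collections import Counter


def top_words_as_stopwords(texts, top_k):
    """Treat the top_k most frequent words in the texts as stopwords."""
    word_counts = Counter(w for t in texts for w in t.lower().split())
    return {w for w, _ in word_counts.most_common(top_k)}


def drop_stopword_ngrams(ngram_counts, stopwords):
    """Keep only n-grams in which no word is a stopword (assumed reading)."""
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if not any(word in stopwords for word in ngram)
    }


# Hypothetical strings standing in for entries from the data set.
texts = ["open the file menu", "close the file", "save the file to the disk"]
stopwords = top_words_as_stopwords(texts, top_k=2)

# Hypothetical bigram counts.
counts = {("the", "file"): 3, ("file", "menu"): 1, ("save", "changes"): 2}
print(drop_stopword_ngrams(counts, stopwords))
```

How large `top_k` should be is a judgement call: too small and "the" and "to" still dominate, too large and domain words worth keeping get thrown away.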
Best, Marcin
Posted by Marcin on September 02, 2005 at 12:46 PM IST #