Following on from my bag of words post, folks suggested I look at n-gram word lists as well, and at their frequency distribution. I did that. I'm not sure the results are terribly interesting, but it wasn't much extra work, so that's okay. I suspect my sample size was too small ...

The Wikipedia page on N-grams might be more interesting, perhaps?

(the post that started all this is back in the archives a bit)
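
For anyone who wants to try this at home, here is a minimal sketch of counting word n-grams and then looking at how the counts are distributed. It is not the code behind the spreadsheet; the function names, the tokenising regex and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    # Assumption: a "word" is a run of lowercase letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(text, n):
    """Map each distinct n-gram to the number of times it occurs."""
    return Counter(word_ngrams(text, n))


def frequency_distribution(freqs):
    """Map each occurrence count to how many distinct n-grams have that count."""
    return Counter(freqs.values())


sample = "the cat sat on the mat and the cat sat on the rug"
bigrams = ngram_frequencies(sample, 2)
print(bigrams.most_common(3))
print(frequency_distribution(bigrams))
```

On a real corpus, the second print is the interesting one: it shows how many n-grams occur once, twice, and so on.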

Comments:

Seems you hit the 32k row limit with the spreadsheet. How many n-grams of freq == 1 were there in total ?

Posted by smg on August 30, 2005 at 04:46 PM IST #

Works with OOo 2.0 Beta 2 :)

Total number of rows for
bi-grams: 50282
tri-grams: 64781
4-grams: 62190
5-grams: 55139

Posted by Simos on August 30, 2005 at 04:59 PM IST #

Yep, running a beta of StarOffice 8 here, and it looks grand too (not surprising since it's OOo2-based as well)

Posted by Tim Foster on August 30, 2005 at 05:02 PM IST #

Heh, seems running the latest Solaris doesn't get you everything :-) Interesting stat from the spreadsheet. Leaving out the 1, 2 and 3 range of occurrences from each n-gram there's a similar average of occurrences for each n-gram: N-gram Average w/o 3,2,1 Bi 11.63 Tri 8.47 4-grams 8.7 5-grams 9.97 Though would need to see higher n-grams to be more clear. And since it's some stats, how does this mean anything ?-P

Posted by smg on August 30, 2005 at 05:19 PM IST #

Let's try that again

Heh, seems running the latest Solaris doesn't get you everything :-)

Interesting stat from the spreadsheet. Leaving out the 1, 2 and 3 range of occurrences from each n-gram there's a similar average of occurrences for each n-gram:

N-gram	Average w/o 3,2,1
Bi	11.63
Tri	8.47
4-grams	8.7
5-grams	9.97

Though would need to see higher n-grams to be more clear. And since it's some stats, how does this mean anything ?-P
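
One way to read "average w/o 3,2,1" is the mean occurrence count over only those n-grams that occur more than three times. A rough sketch under that assumption follows; the function name and the sample counts are hypothetical and are not taken from the spreadsheet.

```python
def average_above(counts, threshold=3):
    """Mean of the counts strictly greater than threshold (0.0 if none remain)."""
    kept = [c for c in counts if c > threshold]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical occurrence counts for seven distinct n-grams.
example_counts = [1, 2, 3, 4, 6, 1, 1]
print(average_above(example_counts))
```

Because singletons dominate most n-gram lists, dropping the low counts changes the average a lot, so the threshold is worth stating alongside any number quoted.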

Posted by smg on August 30, 2005 at 05:21 PM IST #

Tim,

the trick is to filter out the stopwords to get rid of the garbage like "the" or "to" - just suppress such n-grams from the output, and you'll see more interesting results.

And the bigger the list, the nicer the results. See this awk script:

# Print only rows whose first field is not a stopword.
# Input is tab-separated; the first two fields are passed through unchanged.
# The regex is split with backslash-newline, which gawk accepts but not every
# awk does. Continuation lines start in column 1 so that no stray space ends
# up inside an alternative.
BEGIN { FS = "\t" }

{
if (
(tolower($1)!~/^(a|about|above|accordingly|after|again|against|ah|all|also\
|although|always|am|an|and|and\/or|any|anymore|anyone|are\
|as|at|away|b|be|been|begin|beginning|beginnings|begins|begone\
|begun|being|below|between|but|by|c|ca|can|cannot|come|could\
|d|did|do|doing|during|each|either|else|end|et|etc|even|ever|far\
|ff|following|for|fro|from|further|get|go|goes|going|got|had|has\
|have|he|her|hers|herself|him|himself|his|how|i|if|in|into|is|it|its\
|itself|last|lastly|less|ll|many|may|me|might|more|must|my|myself\
|nay|near|nearly|never|new|next|no|not|now|o|of|off|often|oh|on\
|only|or|other|otherwise|our|ourselves|out|over|perhaps|put|puts\
|quite|s|said|saw|say|see|seen|shall|she|should|since|so|some|such\
|t|than|that|the|their|them|themselves|then|there|therefore|these\
|they|this|those|thou|though|throughout|thus|to|too|toward|unless\
|until|up|upon|us|ve|very|was|we|were|what|whatever|when|where\
|which|while|who|whom|whomever|whose|why|with|within|without\
|would|yes|your|yours|yourself|yourselves\
|button|field|menu|option|net|date|today)$/  ))	print $1"\t"$2
}

I think the stopwords should be customized for the TM you're using, i.e. just use the top entries from the TM's word frequency list.
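
A rough sketch of that idea, in Python rather than awk: take the k most frequent words as the stopword list, then drop every n-gram that contains one of them. The function names, the choice of k and the sample data are assumptions for illustration only. Dropping an n-gram when any of its words is a stopword is also a choice made here; the awk script above instead compares the whole first field against the list.

```python
from collections import Counter


def top_words(words, k):
    """Return the k most frequent words as a set, to be used as stopwords."""
    return {word for word, _ in Counter(words).most_common(k)}


def drop_stopword_ngrams(ngram_counts, stopwords):
    """Keep only n-grams in which no word is a stopword.

    ngram_counts maps a space-separated n-gram string to its count.
    """
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if not any(word in stopwords for word in ngram.split())
    }


# Hypothetical data standing in for a TM's word list and bigram counts.
words = "the cat sat on the mat the end".split()
stopwords = top_words(words, 1)
bigram_counts = {"the cat": 2, "cat sat": 2, "on the": 1}
print(drop_stopword_ngrams(bigram_counts, stopwords))
```

How large k should be depends on the corpus, so it is worth eyeballing the top of the frequency list before settling on a cutoff.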

Unfortunately, I have no time to experiment with it myself. But I used the kfNgram program some time ago to experiment. I'm not sure if it's only for Win32 or in Java.

Best, Marcin

Posted by Marcin on September 02, 2005 at 12:44 PM IST #

Right, one thing I forgot :( was that it was wordgrams, not only n-grams, that you need to test for... Or any other method of finding frequent 2-, 3- or n-word phrases.

Best, Marcin

Posted by Marcin on September 02, 2005 at 12:46 PM IST #


This blog copyright 2009 by timf