There are two text files, each with 10 million lines and about 100 MB in size. We need to know how many lines the two files have in common, in other words, the number of lines that exist in both files. Each file has already been de-duplicated, so neither contains any repeated rows. A Python set can do this very easily, and more efficiently than shell or awk.
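#!/usr/bin/python
# Read each file into a set of lines, then count the lines present in both.
a = set(open("data.uniq.1"))
b = set(open("data.uniq.2"))
# "&" is set intersection; print() with one argument works in Python 2 and 3.
print(len(a & b))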
I also found a blog post in Chinese that describes this tip.
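If holding both files in memory at once is a concern, one possible variation is to load only the first file into a set and stream the second file line by line. This is just a sketch, not something from the original script: the helper names `count_common_lines` and `write_lines` are made up for illustration, and it relies on the same assumption as above, that neither file contains duplicate lines. It also strips the trailing newline so that a final line without a newline still matches.

```python
import os
import tempfile


def count_common_lines(path_a, path_b):
    # Hypothetical helper: counts lines that appear in both files.
    # Assumes each file has no duplicate lines (as in the post).
    with open(path_a) as fa:
        seen = set(line.rstrip("\n") for line in fa)
    # Stream the second file instead of building a second set.
    with open(path_b) as fb:
        return sum(1 for line in fb if line.rstrip("\n") in seen)


def write_lines(lines):
    # Demo helper: write the given lines to a temp file, return its path.
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path


if __name__ == "__main__":
    p1 = write_lines(["apple", "banana", "cherry"])
    p2 = write_lines(["banana", "cherry", "durian"])
    try:
        print(count_common_lines(p1, p2))  # prints 2
    finally:
        os.remove(p1)
        os.remove(p2)
```

With the real data this would be called as count_common_lines("data.uniq.1", "data.uniq.2"). Only one set is kept in memory, so peak memory use should be roughly half that of the two-set version, at the cost of one membership test per line of the second file.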
Comments:

I don't believe that Python is faster than AWK, and even if it were, which I *know* it isn't, I can always compile AWK code into a straight binary executable, so as long as Python doesn't get a state-of-the-art compiler, it will *NEVER* be faster than AWK!

Posted by UX-admin on March 11, 2009 at 02:19 AM CST #


This blog copyright 2009 by williamxue