An Effective way of Removing duplicates in a file
I remember just a few days back I was asked to design a program to remove duplicates from a file. Well the problem seems quite straight forward and simple to approach, but the real hidden question here lies in, how effective can anyone come up with an algorithm to solve this problem. Initially thinking on the tracks of using hashtables, maps, sets etc. I finally landed up designing a custom 'tree' implementation for getting almost the optimal solution which would remove duplicates from a file in a single iteration! (or in O(n) time, for my geek friends! )
Well while surfing internet, I later found that the tree concept I used to solve this problem is pretty much similar to a standard implementation called 'Suffix tree'. Here I am not going to explain you about suffix trees, you can find plenty of matter on this once you google it, rather, I would be explaining how my solution to this problem worked. Well, I don't claim it to be the most optimal implementation, I am sure, someone out there may come up with a solution better than mine. But this one solves the purpose just fine.
The logic here is to keep building the custom tree (which now we know is similar to suffix tree) as we go through the text. Here is the basic algorithm:
- Iterate through the text and do the following steps
- As the beginning of a new word is encountered, start matching it starting from the root node to its children, successively. along with buffering the current word as we proceed.
- if a match is found, try to match the next character in the word with its children
- If a match is not found, add another node representing the current character as a child to the current node.
- When the word ends, and the last character is matched, check its 'final' tag.
- if its true, it means the current word is the repeated one, so don't output the current buffer word
- if its false, it means the current word is encountered for the first time, so output the current buffer and flush the buffer.
Eg. let the text be "CAT CAN BAT". The tree for this sentence will be -
ROOT
/\
C B
/ \
A A
/ \ \
T* N* T*
Note: the (*) mark represents the final state.
Here goes my C/C++ implementation:
Here is the final code to download
Posted at 10:46AM Nov 08, 2009 by Anurag Sharma in Personal | Comments[1]
Nice ..... It can be used to compress the data when very large files are being considered.....I know a tool named UHARC which can compress files very efficiently......I don know the principle behind it but......Been trying to find out.....Many game rippers use it to RIP game files and convert it to smaller size....
I am sure that almost no data loss is there..........
Posted by Vinyas on November 15, 2009 at 08:21 PM IST #