Dimitris Andreou suggested that I try the Boyer Moore string search algorithm (instead of String.search) to see how it would impact the fast grep described earlier. It turns out that while Boyer Moore searching takes much less time than String.contains, the bottleneck in the fast grep is not in the searching step. But still, considering that BoyerMoore is faster, I have incorporated it in the fast grep.
First, here is the source code of a class that performs BoyerMoore searching (initially copied from here):
package com.subhajit.util.grep; import java.util.ArrayList; import java.util.Arrays; import java.util.List; /** * String matching Boyer-Moore. Copied from * http://es.geocities.com/luisja80/rep/StringMatchingBM.java, followed by * refactoring. */ public class BoyerMoore { public static final int ALPHABET_SIZE = 256 ; private int [] bmBC; private int [] bmGs; public int comparisons; private final String pattern; private final char [] patternChars; private final int patternCharsLen; public BoyerMoore ( String pattern ) { super () ; this .pattern = pattern; patternChars = this .pattern.toCharArray () ; patternCharsLen = patternChars.length; // pattern length preBmBc ( patternChars ) ; preBmGs ( patternChars ) ; } public void preBmBc ( char [] x ) { int m = x.length; bmBC = new int [ ALPHABET_SIZE ] ; Arrays.fill ( bmBC, m ) ; for ( int i = 0 ; i < m - 1 ; i++ ) { bmBC [ x [ i ]] = m - i - 1 ; } } public int [] suffixes ( char [] x ) { int m = x.length; int suff [] = new int [ m ] ; int g = m - 1 ; int f = m - 1 ; suff [ m - 1 ] = m; for ( int i = m - 2 ; i >= 0 ; --i ) { if ( i > g && ( i + m - 1 - f ) < m && suff [ i + m - 1 - f ] < i - g ) { suff [ i ] = suff [ i + m - 1 - f ] ; } else { g = i; f = g; while ( g >= 0 && x [ g ] == x [ g + m - 1 - f ]) { --g; } suff [ i ] = f - g; } } return suff; } public void preBmGs ( char [] x ) { int m = x.length; bmGs = new int [ m ] ; int suff [] = suffixes ( x ) ; Arrays.fill ( bmGs, m ) ; int j = 0 ; for ( int i = m - 1 ; i >= - 1 ; --i ) { if ( i == - 1 || suff [ i ] == i + 1 ) { for ( ; j < m - 1 - i; ++j ) { if ( bmGs [ j ] == m ) { bmGs [ j ] = m - 1 - i; } } } } for ( int i = 0 ; i < m - 1 ; i++ ) { bmGs [ m - 1 - suff [ i ]] = m - 1 - i; } } public List<Integer> search ( String text ) { char [] textChars = text.toCharArray () ; int textLen = textChars.length; // string length List<Integer> resultado = new ArrayList<Integer> () ; int j = 0 ; int i = 0 ; comparisons = 0 ; while ( j <= textLen - patternCharsLen ) { for ( i = patternCharsLen - 1 ; i >= 0 && patternChars [ i ] == textChars [ i + j ] ; i-- ) { comparisons++; } if ( i < 0 ) { resultado.add ( j ) ; j += bmGs [ 0 ] ; } else { j += Math.max ( bmGs [ i ] , bmBC [ textChars [ i + j ]] - patternCharsLen + 1 + i ) ; } } return resultado; } public boolean isPresentIn ( String text ) { char [] textChars = text.toCharArray () ; int textLen = textChars.length; // string length int j = 0 ; int i = 0 ; comparisons = 0 ; while ( j <= textLen - patternCharsLen ) { for ( i = patternCharsLen - 1 ; i >= 0 && patternChars [ i ] == textChars [ i + j ] ; i-- ) { comparisons++; } if ( i < 0 ) { return true ; } else { j += Math.max ( bmGs [ i ] , bmBC [ textChars [ i + j ]] - patternCharsLen + 1 + i ) ; } } return false ; } } |
The original source code has been changed in the following ways:
1. The cost of initializing a "BoyerMoore" object has been moved to the constructor.
2. Several key data members have been declared final with a view to making the "match" method thread safe.
3. A new method "isPresentIn" has been added that returns whether the pattern being sought is present in the given text or not, without returning all the positions where it occurs (as does the "match" method).
4. Some variables have been renamed for easier understanding.
The Grep method has been changed to use "BoyerMoore.isPresentIn(...)" instead of String.contains(...):
// Setup the output queue containing FileContent objects // representing matches. The output queue is populated by the file // finders object. final BoyerMoore bm = new BoyerMoore ( text ) ; final BlockingQueue<FileContent> outputQueue = new LinkedBlockingQueue<FileContent> () ; final QueueProcessor<FileContent, FileContent> fileFinders = new QueueProcessor<FileContent, FileContent> ( contentQueue, outputQueue, 5 , new IProcessor<FileContent, FileContent> () { public FileContent process ( FileContent input ) throws InterruptedException, InvocationTargetException { if ( bm.isPresentIn ( input.getContent ()) ){ return input; } else { return null ; } } }) ; |
As I said, this makes a marginal improvement in the overall execution of grep on my setup. Your mileage might vary.