In this post, I present two approaches to directory scanning, one serial and one parallel. Both approaches return a collection of all files and directories contained by a given directory, including the contents (recursively) of each sub-directory thereof. For example, scanning the following directory tree:

…/dir/

…/dir/picture0.jpg

…/dir/picture1.jpg

…/dir/vacation-pictures/

…dir/vacation-pictures/grand-canyon1.jpg

returns a collection containing four entries, viz. …/dir/picture0.jpg, …/dir/picture1.jpg, …/dir/vacation-pictures/ and …/dir/vacation-pictures/grand-canyon1.jpg.

Serial Scanning

In serial scanning, a recursive method lists all files in a given directory, adds these to a list, and, if the file being added is a directory, invokes itself: The SerialDirScanner class, shown below, performs this task:

public class SerialDirScanner {
    private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File, Boolean>();

    public SerialDirScanner() {
        super();
    }

    public Set<File> scan(File dir) throws InterruptedException,
            ExecutionException {
        scan0(dir);
        return files.keySet();
    }

    private void scan0(File dir) {
        for (final File file : dir.listFiles()) {
            files.putIfAbsent(file, Boolean.TRUE);
            if (file.isDirectory()) {
                scan0(file);
            }
        }
    } 
    public static Set<File> listAllContentsUnder(File dir)
            throws InterruptedException, ExecutionException {
        SerialDirScanner scanner = new SerialDirScanner();
        return scanner.scan(dir);
    }
}

The “scan” method invokes the recursive “scan0” method, which does the real work of scanning directories. That method fills up a concurrent map with files and directories as it works. Finally, the “scan” method returns all keys of the concurrent map.

Parallel scanning

Looking at the “scan0” method of the serial scanner shown above, many opportunities for parallelization become apparent. For example, the listing of different sibling directories (that is, directories that both have the same parent) can proceed in parallel, as can the listing of directories that are unrelated to each other (that is, one directory is not an ancestor of the other). The “DirScan” class shown below performs parallel scanning:

public class DirScanner {
    private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File, Boolean>();
    private final ExecutorService executor;
    private final Semaphore sem = new Semaphore(1);
    private final AtomicInteger count = new AtomicInteger();

    public DirScanner(int threads) {
        super();
        executor = Executors.newFixedThreadPool(threads);
    }

    public Set<File> scan(File dir) throws InterruptedException,
            ExecutionException {
        sem.acquire();
        scan0(dir);
        sem.acquire();
        return files.keySet();
    }

    private void scan0(File dir) {
        for (final File file : dir.listFiles()) {
            files.putIfAbsent(file, Boolean.TRUE);
            if (file.isDirectory()) {
                count.incrementAndGet();
                executor.submit(new Runnable() {
                    public void run() {
                        DirScanner.this.scan0(file);
                    }
                });
            }
        }
        if (count.decrementAndGet() < 0) {
            sem.release();
        }
    }
    public void close(){
        executor.shutdown();
    }
    public static Set<File> listAllContentsUnder(File dir, int threads)
            throws InterruptedException, ExecutionException {
        DirScanner scanner = new DirScanner(threads);
        return scanner.scan(dir);
    }
}

In common with the “SerialDirScanner” class, the “DirScanner” class has a public method named “scan” that takes a directory as an argument and returns a Set of File objects. The “scan” method acquires a semaphore of count 1, calls the “scan0” method, and waits to re-acquire the same semaphore. The “scan0” method performs the actual work of scanning directory contents. The “DirScanner” class contains an ExecutorService object named “executor” that is used to schedule scanning tasks.

Note that the “DirScanner” class accepts an integer constructor parameter named “threads”. This parameter is used to initialize the ExecutorService with a fixed thread pool of this size.

The “scan0” method submits a new task to the executor whenever it finds a sub-directory while performing a directory listing. The new “task”, in this case, happens to be a recursive invocation of itself. The “scan0” method increments a counter (an AtomicInteger instance object named “counter”) before the task is actually submitted. Each time the “scan0” method completes listing a directory, it decrements the counter, releasing the semaphore when the counter’s value falls below zero. This allows the “scan” method, which has been waiting to acquire this semaphore, to “wake up” and continue, effectively returning a Set of File objects from the key set of the concurrent map “files”.

Performance comparison and thread counts

To test the performance improvement, if any, of the parallelized directory scanner vis-a-vis the serial scanner, I wrote the following “main” method:

public static void main(String[] args) {
    int status = 0;
    try {
        Set<File> files0 = null;
        Set<File> files1 = null;
        long t0 = System.nanoTime();
        files0 = SerialDirScanner
                .listAllContentsUnder(new File(args[0]));
        t0 = System.nanoTime() - t0;
        System.out.println(t0 + " ticks.");
        t0 = System.nanoTime();
        files1 = DirScanner.listAllContentsUnder(new File(args[0]), 5);
        t0 = System.nanoTime() - t0;
        System.out.println(t0 + " ticks.");
        System.out.println(files0.size() + "," + files1.size());
        System.out.println(files0.size() == files1.size());
    } catch (Throwable exc) {
        status = 1;
        exc.printStackTrace();
    } finally {
        System.exit(status);
    }
}

The argument I pass in to the main method is a directory that is well populated and quite deep. Running this on my development laptop (with the specs Core 2 Duo 2.26 MHz, 4GB, running Windows 7 Evaluation build No. 7100), I get:

1. Initially, before the directory structure is cached by the operating system:

3666120093 ticks.
2375654123 ticks.
39592,39592
true

2. Subsequently, after the directory structure has been cached:

3551936784 ticks.
2340802880 ticks.
39592,39592
true

So, there is a performance gain of roughly 33% using the parallel scanner using five threads (note that the “listAllContentsUnder” method accepts a thread count as the second argument). Using three threads, I get:

3594733052 ticks.
2335042793 ticks.
39592,39592
true

which is not much different from the readings obtained from using five threads. Using nine threads, I get:

3563738170 ticks.
2368012499 ticks.
39592,39592
true

which shows that the performance has actually been reduced by increasing the number of threads. The explanation for this curious observation is that the concurrent map, into which all threads write data, faces increased contention as the number of threads goes up.

Another interesting observation is that the parallel scanner is actually slower at scanning shallow directories. Consider the following output while scanning an example:

2602454 ticks.
10136304 ticks.
17,17
true

Conclusion

Using the parallel directory scanner speeds up the scanning of deep (and well populated) directories by roughly 33%. Shallow directories are better scanned with the serial directory scanner.

Comments:

Can't you make each thread create a map? Then at the very end all of these maps are combined. This would reduce the contention of one single map being targeted. Just thinking out loud; I'm sure this is a bad idea.

Posted by Josef on May 12, 2009 at 09:10 AM CDT #

Oops, just saw your comment. I was working along the same lines that you suggested (one list per thread), and am just about to post the revised code with performance information...just so you know beforehand, there is not much movement in the performance data I have been able to capture, though :-( Anyway, thanks for your input.

Posted by Subhajit Dasgupta on May 12, 2009 at 11:58 AM CDT #

You must check for symbolic links (otherwise you will recurse into the symbolic link). You can use the following snippet:

public static boolean isLink(File file) {

try {
if (!file.exists()) {
return true;
}
else {
String cnnpath = file.getCanonicalPath();
String abspath = file.getAbsolutePath();

return !abspath.equals(cnnpath);
}
}
catch(IOException ex) {
return true;
}
}

Posted by reha gur on May 12, 2009 at 12:40 PM CDT #

Nice post, I think I'll take a crack at implementing your Parallel style into my "FileGrabber".
http://gregoire.org/2009/02/14/file-grabber/

Posted by Mondain on May 12, 2009 at 01:42 PM CDT #

Your problem with 9 threads most probably is *not* the contention ConcurrentHashMap.

It's mostly a problem of unfair work partition. The last, the shallow dir example in the bottom, is the highlight of that: only one thread of the pool get to do all the work (like the serial case), all the rest are idling.

You could use a ThreadFactory and create special threads with a counter in them, which you could increment in the task execution, and count how many files each thread gets to scan, you could make a nice plot out of that. Because measuring the time of a "parallel" execution without saying how much parallelism you actually have isn't too meaningful, each different directory layout gives different parallelism with this approach.

Posted by Dimitris Andreou on May 13, 2009 at 03:27 AM CDT #

Dimitris,

You are correct about unfair work partition, since each thread gets to list *all* of the files in each directory - "fuller" directories result in the tread doing more work than "sparser" ones. But that is the nature of the beast, since you do not know (or are not supposed to know) about the directory you are scanning at the commencement of the scan.

By the way, please see my updated post (which follows immediately afterwards) in which I have removed all contention (by means of a custom thread factory that endows each thread with its private list of files).

Thanks,

Subhajit

Posted by Subhajit Dasgupta on May 13, 2009 at 05:08 AM CDT #

(First of all, sorry about all the typos of the previous message, that was written very fast). Now lets conteniu! :)

So, did reducing contention make it faster? For which amount of threads? (And how many processors you have available?)

I also thought before posting that "damn...how *would* you partition this thing?", but didn't mention anything because I don't see any simple solution (only the problem statement was simple :)). But then, if you don't do any real processing per file, it would be slower to send it to some other thread instead of handling it directly. So if this is the case, you are right that this is the nature of the problem. (It would be different though with some non-trivial processing, e.g. searching for a string in a file).

Posted by Dimitris Andreou on May 13, 2009 at 05:23 AM CDT #

Dimitris,

On thinking about your "fairness" comment, I have thought of one way we can partition the problem and thereby load each thread almost equally. I have the basic mechanism ready, and will post by later today once I have had time to gather some statistics. As a quick note, I have made the partitioned the files in a directory listing (the value returned by dir.listFiles() ) between threads, where each thread works on an fixed "installment" of listing entries....

The preliminary results are marginally better, but I will post the complete solution as soon as I have had time to play with this more.

More later, and thanks very much for the time you have spent over this.

Subhajit

Posted by Subhajit Dasgupta on May 13, 2009 at 10:45 AM CDT #

Oh, right, that was easy after all, since you get an array of Files, not merely an iterator. There should still be a lot of tuning parameters (how big the indivisible chunks are? Finer should be better, but then it can increase contention in the task queue...it should have an optimal value somewhere).

"Marginally" better makes perfect sense. From there, any increase on the per-file processing should increase the relative performance gain.
--
That's what I like about concurrent programming. You first have to solve the problem hundreds of times, so the hundredth-plus-one time you can solve it optimally :)

Thanks Subhajit for an interesting discussion!

Posted by Dimitris Andreou on May 13, 2009 at 11:17 AM CDT #

Dimitris,

Please see my follow on post for the "FairScanner" class at http://blogs.sun.com/adventures/entry/fast_directory_scanning_and_fairness incorporating these ideas. I am still gathering data for the two tuning parameters (so far), namely thread count and "installment size".

Subhajit

Posted by Subhajit Dasgupta on May 13, 2009 at 02:16 PM CDT #

Just after reboot (and I believe on any cold caches scenario) on my machine (XP) parallel file scanning with 2 threads is 50 % slower, like following: 179138 files - 284890 ms conventional file scan, 405515 ms parallel file scan

Posted by Maxim Mossienko on May 14, 2009 at 07:18 AM CDT #

Maxim,

Thanks for trying this out. I did mention in the post that parallel scanning works best for "deep" and "well populated" directories. Sparsely populated directories, and those that are not very deep, do not benefit from parallel scanning. Again, thanks for trying this out. BTW, I wonder if you have had a chance to try out the "FairScanner" presented in a latter post http://blogs.sun.com/adventures/entry/fast_directory_scanning_and_fairness

Subhajit

Posted by Subhajit Dasgupta on May 14, 2009 at 07:58 AM CDT #

Post a Comment:
  • HTML Syntax: NOT allowed

This blog copyright 2009 by Subhajit Dasgupta