Fast directory scanning and fairness
Dimitris Andreou commented on my previous posting on fast directory scanning by suggesting that the individual threads doing the work of listing directories were not evenly loaded. In other words, the solution presented did not represent an even partitioning of the work between the threads. The uneven distribution results from the files and directories each directory contains: the more this number, the more “work” the thread that processes the directory has to do.
To understand how to didvide the work between the threads, it is important to note that these threads do two things: obtain directory listings (via File.listFiles()) and examine each entry in the array of files returned by that method. The first method (listFiles()) is a library method which must be used as it is (which is another way to say that it cannot be subdivided into constituent parts at the application level). The other task performed by each thread, that is the examination of each directory entry, is the only candidate for partitioning.
One obvious way to perform this partitioning is to have each thread examine a fixed number of directory listing entries, instead of all of these. This fixed number, called an “installment”, is an input parameter to the directory scan. Accordingly, I wrote the following class to “fairly” load the threads. Note that the FairScanner class presented below still does not guarantee that each thread examines the same number of directory elements: all it does is increase the likelihood that this happens at runtime.
Without further ado, here is the code:
public class FairScanner {
private final Semaphore sem = new Semaphore(1);
private final AtomicInteger waitingCount = new AtomicInteger(0);
private final List<ThreadWithList> threads = new ArrayList<ThreadWithList>();
private final int installmentSize;
private static class DirInfo {
private final File dir;
private final File[] listing;
private int index;
public DirInfo(File dir) {
super();
this.dir = dir;
this.listing = dir.listFiles();
this.index = 0;
}
public int getIndex() {
return index;
}
public void setIndex(int index) {
this.index = index;
}
public File getDir() {
return dir;
}
public File[] getListing() {
return listing;
}
}
private final BlockingQueue<DirInfo> queue;
private final class ThreadWithList extends Thread {
private final List<File> files;
private int useCount = 0;
public ThreadWithList() {
super();
files = new ArrayList<File>();
}
public List<File> files() {
return this.files;
}
public void incrementUseCount() {
useCount++;
}
public int getUseCount() {
return useCount;
}
public void run() {
while (true) {
if (FairScanner.this.scan0()) {
break;
}
}
}
}
private FairScanner(int threads, int installmentSize) {
super();
this.installmentSize = installmentSize;
queue = new LinkedBlockingQueue<DirInfo>();
for (int i = 0; i < threads; i++) {
ThreadWithList t = new ThreadWithList();
this.threads.add(t);
}
}
public Collection<File> scan(final File dir) throws InterruptedException,
ExecutionException {
sem.acquire();
queue.add(new DirInfo(dir));
for (ThreadWithList t : threads) {
t.start();
}
sem.acquire();
List<File> ret = new ArrayList<File>();
for (ThreadWithList t : threads) {
ret.addAll(t.files);
}
return ret;
}
private boolean scan0() {
waitingCount.incrementAndGet();
// Remove the next item from the queue.
ThreadWithList thread = ((ThreadWithList) Thread.currentThread());
DirInfo dirInfo = null;
try {
if (waitingCount.get() == threads.size() && queue.isEmpty()) {
sem.release();
return true;
}
dirInfo = queue.take();
} catch (InterruptedException exc) {
Thread.currentThread().interrupt();
return true;
} finally {
waitingCount.decrementAndGet();
}
int index = dirInfo.getIndex();
File[] listing = dirInfo.getListing();
int upperBound = Math.min(index + installmentSize, listing.length);
for (int i = index; i < upperBound; i++) {
thread.files.add(listing[i]);
if (listing[i].isDirectory()) {
DirInfo subdirInfo = new DirInfo(listing[i]);
try {
queue.put(subdirInfo);
} catch (InterruptedException exc) {
Thread.currentThread().interrupt();
return true;
}
}
}
if (upperBound != listing.length) {
dirInfo.setIndex(upperBound);
queue.add(dirInfo);
}
thread.useCount += (upperBound - index);
return false;
}
public void close() {
for (ThreadWithList t : threads) {
if (t.isAlive()) {
t.interrupt();
}
}
for (ThreadWithList t : threads) {
try {
t.files.clear();
t.join();
System.out.print(t.getUseCount() + " ");
} catch (InterruptedException exc) {
Thread.currentThread().interrupt();
return;
}
}
System.out.println();
queue.clear();
}
private static final List<FairScanner> scanners = new ArrayList<FairScanner>();
public static Collection<File> listAllContentsUnder(File dir, int threads,
int installmentSize) throws InterruptedException,
ExecutionException {
FairScanner scanner = new FairScanner(threads, installmentSize);
scanners.add(scanner);
return scanner.scan(dir);
}
public static void flush() {
for (FairScanner scanner : scanners) {
scanner.close();
}
}
}
Note the following items about this class:
1. There is a bounded queue (named “queue”) that holds “DirInfo” objects. These objects contain information about a directory such as a reference to the directory, an array of files representing the directory listing, and an “index” into this array representing the next item to be examined.
2. There is a custom Thread (called “ThreadWithList”) that endlessly loops and calls “scan0”.
3. The “scan0” method (which is endlessly called by instances of ThreadWithList), takes a DirInfo object from the queue, examines the next so many elements of the list of files, adds entries to the list of files, and adds new DirInfo objects to the queue for those entries that happen to be directories.
4. The scan is deemed to be complete when all the threads are waiting to take an element from the queue, and the queue is empty.
5. There are two arguments that can be used to fine tune the behavior of the FairScanner: the “threads” argument specifies the number of threads to use, and the “installmentSize” argument specifies the number of directory listing elements each thread examines at a time.
I have changed the “main” method used to test these classes as follows:
public static void main(String[] args) {
int status = 0;
try {
Set<File> files0 = new HashSet<File>();
Collection<File> files1 = null;
for (int threads = 5; threads < 20; threads++) {
for (int installmentSize = 15; installmentSize < 20; installmentSize++) {
int iterations = 5;
long t0 = System.nanoTime();
for (int i = 0; i < iterations; i++) {
files0 = SerialDirScanner
.listAllContentsUnder(new File(args[0]));
}
t0 = System.nanoTime() - t0;
long ticks0 = (t0 / 10);
System.out.println((t0 / 10) + " ticks.");
t0 = 0;
for (int i = 0; i < iterations; i++) {
long t1 = System.nanoTime();
files1 = FairScanner.listAllContentsUnder(new File(
args[0]), threads, installmentSize);
t1 = System.nanoTime() - t1;
t0 += t1;
FairScanner.flush();
}
System.out.println("Threads = " + threads
+ ",InstallmentSize = " + installmentSize);
long ticks1 = (t0 / 10);
System.out.println((t0 / 10) + " ticks.");
System.out.println(files0.size() + "," + files1.size());
System.out.println(((ticks1 * 100) / ticks0) + " %");
System.out.println();
}
}
} catch (Throwable exc) {
status = 1;
exc.printStackTrace();
} finally {
System.exit(status);
}
}
Observations
Some general observations made so far:
1. Increasing the installment size to a size of fifteen to twenty results in the best performance.
2. A thread count of about ten seems to result in the best performance.
I will post detailed data once I collect it.
Finally, note that this is a work in progress, so please do suggest improvements, shortcomings, etc.. Thanks to Josef and Dimitris for suggesting improvements.
In my experiments (on XP) using Fair scanner with 2 threads / 20 installments produces nice results when files scanned are in os cache and 50% worse results otherwise (usually first time scanning run for the directory), nearly the same as described in my comment http://blogs.sun.com/adventures/entry/fast_directory_scanning#comment-1242303504000
Posted by Maxim Mossienko on May 14, 2009 at 10:01 AM CDT #
Hmm...so caching (or the lack thereof) is playing a part. My initial hypothesis is that the File.listFiles() method is responsible for this behavior, since it is the only method that uses JNI to access the file system. Let me dig deeper and see what I find. Thanks for pointing this out.
Posted by Subhajit Dasgupta on May 14, 2009 at 10:24 AM CDT #