Illustrating Recursive Methods by Flattening Directory Hierarchies in Java
Introduction
Recently, I was pair programming in Java with a less experienced partner when we came to a point where we needed to ensure that a collection of files was ‘flattened,’ containing no nested directories, in preparation for ingesting those files into a digital preservation system. I recognized it as a problem of recursion and was prepared to begin constructing a unit test for this while my partner expressed caution, and a belief that this was a difficult problem that would require much more planning to get right. It made me realize that the type of recursive processing I was envisioning could be difficult to initially wrap your mind around.
I am sure I once experienced this same difficulty. I suppose I quickly learned to be comfortable with this type of programming problem when I began writing a custom product in Zope. By default, Zope uses an object oriented database that is hierarchically organized by a filesystem metaphor, and I ended up walking a hierarchy of objects on a number of occasions, and I was perhaps lucky that I was working in Python; I could leverage Python’s built-in functional programming constructs, and even in more involved cases in which I had to write larger functions, the code was clear and concise. (And of course, when it comes to normal file processing, Python can walk file directories for you.)
File processing is an obvious example where you might naturally apply recursive processing. But basically any hierarchical, or nested data structure is a candidate, such as multi-dimensional arrays or XML documents. (In the case of XML documents, there are probably easier alternatives to recursion such as XPath queries or methods in a DOM library that can slice and dice documents for you.) Basically, if you need to perform an operation on an object that itself may contain similar objects that would require the same operation be performed on them, you may be able to use recursion to create an elegant, robust implementation of that operation.
Our Implementation
In our Java code, I was envisioning a method that would be passed an initial directory that needed to be flattened. As the method iterated over the contents, it would check for sub directories, and upon finding any, the method would
Here is a stripped-down version of our recursive method, which concentrates on the essence of the process:
public void flatten(File directory, File destination) {
for (File aFile: directory.listFiles()) {
if (aFile.isDirectory()) {
this.flatten(aFile, destination);
aFile.delete();
} else {
aFile.renameTo(new File(destination, aFile.getName()));
}
}
}
As you can see, this method is conceptually extremely simple. If it sees a file, it moves it to the root directory and if it sees a directory, it calls itself on that directory, and upon returning, deletes it. In this way, the recursive method takes a depth-first approach to processing the directory hierarchy, which is exactly what we want. That way, we can call the aFile.delete() on that directory immediately after each recursive method call returns and know that everything below that directory has been handled properly.
If this type of method is new to you, you may be wondering if we could just pass the initial directory to the flatten() method and not have the second parameter. This actually highlights an important point about recursive methods: often, you will have at least one variable or object whose purpose is to maintain state information across recursive calls for various reasons. For example, you may want to pass along a logging object that collects messages from each of the method calls. Or you may want to sequentially number each object you are processing, and so, you would pass a variable containing that value to each method call. Or, you may need to pass along information about the state of the initial method call to subsequent recursive method calls, and that is the case in this example, in which subsequent calls to flatten need to know the top most directory to which nested files should be copied.
Here is the final (well, current) version of our method with more fleshed out checks and error handling:
import java.io.File;
import java.io.IOException;
...
public void flatten(File directory, File destination) throws IOException {
if (directory == null || ! directory.exists()) {
throw new IOException("The source directory is null or does not exist.");
}
if (destination == null) {
destination = directory;
}
for (File aFile: directory.listFiles()) {
if (aFile.isDirectory()) {
this.flatten(aFile, destination);
aFile.delete();
} else {
// is the file already at the top level of the hierarchy?
if (! aFile.getParentFile().getAbsolutePath().equals(destination.getAbsolutePath())) {
// is a file by the same name already there?
File movedFile = new File(destination, aFile.getName());
if (movedFile.exists()) {
String filename = aFile.getName();
String checksum = String.valueOf(getChecksumValue(aFile));
if (aFile.getName().contains(".")) {
filename = aFile.getName().split("[.]")[0] + "_" + checksum + "." + aFile.getName().split("[.]")[1];
} else {
filename = filename += "_" + checksum;
}
movedFile = new File(destination, filename);
}
if (! aFile.renameTo(movedFile)) {
throw new IOException("Could not move file from: " + aFile.getAbsolutePath() + " to: " + destination.getAbsolutePath());
}
}
}
}
}
For convenience, we added code to allow you to send null as the second parameter, which provides a shortcut for calling this method initially, although the readability of that calling code will suffer a little. We also added a check to skip the file moving process if the file is already in the root directory, so we don’t waste our time on the first/topmost call to this method.
We also added a check for the existence of the source directory, in this case, just throwing an exception if it doesn’t exist, which made sense in our code. File operations tend to be fairly low-tech with many opportunities for triggering exceptions, so it pays to be careful, even though it adds significant code to an otherwise clear and concise algorithm. And as you can see, in Java, some file operations return a boolean value for success or failure instead of throwing an exception, so you need to be familiar with the File API to know what to check and how to return informative messages to the calling code. In our case, we decided to throw IOExceptions with customized messages to make the error handling of this method more consistent.
Of course, to accommodate our exception throwing, we updated the method declaration accordingly. Exceptions that occur within nested recursive calls to this method will propagate up the call stack to the original call to this method.
Finally, although it may be unlikely, we wanted to account for cases in which a file somewhere in the directory hierarchy may have the same name as a file somewhere else in the directory hierarchy, in which case, without further intervention, there would be a collision that might go unnoticed as one file overwrites another. We decided to check for the existence of a file in the root directory before moving it, and if found, to add that file’s checksum to its base file name before moving it. This is ultimately not a true guarantee that there won’t be this type of collision, but it was relatively easy to implement, and we decided it was ‘good enough.’ For completeness, here is the checksum method that is being called:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.util.zip.Checksum;
import java.util.zip.CRC32;
import java.util.zip.ZipFile;
import java.util.zip.ZipEntry;
...
public long getChecksumValue(File aFile) {
Checksum checksum = new CRC32();
try {
BufferedInputStream is = new BufferedInputStream(
new FileInputStream(aFile));
byte[] bytes = new byte[1024];
int len = 0;
while ((len = is.read(bytes)) >= 0) {
checksum.update(bytes, 0, len);
}
is.close();
}
catch (IOException e) {
e.printStackTrace();
}
return checksum.getValue();
}
Conclusion
Although the final code may look a little daunting, the fundamental approach to implementing a recursive method as shown in the first code sample is simple and elegant. In this case, what was added was exception handling code to make it more robust and easier to debug.
And although this example flattens a directory hierarchy in place, you should now see how to change the code to either flatten a directory in place or create a flattened version of a directory in another location, leaving the original directory intact. But that, as they say, is an exercise left to the reader.
By walking through the implementation as I have, I may have given the impression that the code grew only from careful planning and consideration, which is not the case. We took a test-driven, divide-and-conquer strategy for implementing this method. In the interests of providing a complete example, here is the current test code for this method:
private void setUpDirectoryHierarchy() throws Exception {
/* create the following directory structure
* testFlattenFileStructure
* |-testfile1
* |-testfile2
* |-testdirectory1
* | |-testfile3
* | |-testfile4
* |-testdirectory2
* | |-testfile5
* | |-testfile6
* | |-testsubdirectory3
* | | |-testfile7
* | | |-testfile8
*/
File testDirectory = getHierarchicalTestDirectory();
assertTrue(new File(testDirectory, "testfile1").createNewFile());
assertTrue(new File(testDirectory, "testfile2").createNewFile());
File testSubDirectory1 = new File(testDirectory, "testsubdirectory1");
assertTrue(testSubDirectory1.mkdirs());
assertTrue(new File(testSubDirectory1, "testfile3.txt").createNewFile());
assertTrue(new File(testSubDirectory1, "testfile4").createNewFile());
File testSubDirectory2 = new File(testDirectory, "testsubdirectory2");
assertTrue(testSubDirectory2.mkdirs());
assertTrue(new File(testSubDirectory2, "testfile5").createNewFile());
assertTrue(new File(testSubDirectory2, "testfile6").createNewFile());
File testSubDirectory3 = new File(testSubDirectory2, "testsubdirectory3");
assertTrue(testSubDirectory3.mkdirs());
assertTrue(new File(testSubDirectory3, "testfile7.txt").createNewFile());
assertTrue(new File(testSubDirectory3, "testfile8").createNewFile());
}
private File getHierarchicalTestDirectory() throws URISyntaxException {
String testPath = "resources" + File.separator + "archiveObjects"+ File.separator + "testFlattenFileStructure";
URL testPathURL = this.getClass().getResource(testPath);
File testDirectory = new File(testPathURL.toURI());
return testDirectory;
}
public void testFlattenDirectory() throws Exception {
//create directory hierarchy for testing
setUpDirectoryHierarchy();
// flatten file hierarchy in place
ingest.flatten(getHierarchicalTestDirectory(), null);
ArrayList<String> knownFileNames = new ArrayList<String>();
for (int x=1; x<9; x++) {
knownFileNames.add("testfile"+x);
}
int y = 0;
for (File aFile: getHierarchicalTestDirectory().listFiles()) {
// make sure there are no directories
assertFalse(aFile.isDirectory());
// make sure each file is one we expect to be there
boolean matchedFileName = false;
for (String knownName: knownFileNames) {
if (aFile.getName().startsWith(knownName.split("[.]")[0])) {
matchedFileName = true;
knownFileNames.remove(knownName);
break;
}
}
assertTrue(matchedFileName);
// clean up
aFile.delete();
y++;
}
// make sure we saw all the expected files
assertEquals( y, 8 );
}
As you can see, its somewhat refactored, and we decided it would be better to create the test hierarchy dynamically in Java rather than create it statically on the filesystem and somehow maintain it there. This coding was a little tedious, and we used a lot of method chaining to keep it concise, but ultimately, I think this is a more robust approach. And because creating these files returns a boolean value to indicate success or failure, we made the somewhat unusual decision of wrapping the setup code within JUnit assertions just to be sure that we wouldn’t somehow get a false green bar if these files weren’t created in the first place.
We did create a static directory on the filesystem alongside our test code within which our test files are created, but again, we chose a dynamic method of locating that directory using a trick I often implement in these cases, which I think I originally learned from the O’Reilly Java Cookbook.
The important thing to note about the test data is that we tried to anticipate commonly encountered situations and create a fairly arbitrary arrangement of files and directories: Don’t forget that files may or may not have extensions. Don’t forget to add files to the top level directory to make sure they are handled properly. Remember to add subdirectories several levels deep and also to add multiple files and subdirectories at the same level somewhere in the hierarchy.
I also think its important in cases like this to have two or three instances of each situation in your test code. If you only have one instance of each case you’re testing, you may find yourself writing code that passes your unit test but only works when it finds one (vs. a collection) of items. This approach may also help you discover ‘off-by-one’ errors more easily.

Minor point: getChecksumValue() should have been made private.
Thanks for the feedback. Can you elaborate?
Yes, I would (I was in the process) but then I saw that you already used private in many places shown in your last code snippet. So I think you already understand the difference between public and private methods in a class. Is this supposed to be a trick question?
No trick. (Am I being tricked? How can “can you elaborate” be a trick?
) It just wasn’t clear to me why you thought that particular method would need to be private, although depending on your particular implementation, it might make sense.
Oh, thank you for detailed and helpful article. Yes, using recursive traversal over file system (which obviously has hierarchical nature) is the simplest and convenient way.
However, the approach you’ve suggested may be even more generalized.
For example, some time you need to perform some actions not with all directories, with files of specific type or name pattern and so on.
Some time ago, we developed quite small and simple utility which allows us to handle such a file-system related tasks.
The code of it looks like this (actually, there are several possible implementations, that one is simplest but highlights overall idea):
public class DirectoryTreeVisitor
{
public DirectoryTreeVisitor()
{
}
public static void processDirectoryTree(File aRootDirectory, FileVisitor aVisitor)
{
processDirectoryTree(aRootDirectory, DUMMY_FILTER, aVisitor);
}
public static void processDirectoryTree(File aRootDirectory,
FileFilter aFilter,
FileVisitor aVisitor)
{
processDirectory(aRootDirectory, aFilter, aVisitor);
}
protected static void processDirectory(File aDirectory,
FileFilter aFilter,
FileVisitor aVisitor)
{
File[] files = aDirectory.listFiles();
for (int i = 0; i
That class performs only basic directory traversal, and relies on two additional interface to perform specific processing - FileVisitor which provides implementation of actual actions which should be performed on particular directory or file, and FileFilter which accepts or denies passed file/directory for visiting.
Idea there is quite simple, but by our experience is very convenient.
I hope that approach we use could be helpful for someone else.
Sorry for the mess with formatting on previous post, I’ll try to insert the code again. This is main method used to traverse file system
protected static void processDirectory(File aDirectory,
FileFilter aFilter,
FileVisitor aVisitor)
{
File[] files = aDirectory.listFiles();
for (int i = 0; i
sorry again, just realized that internally “<” was not quoted. Here is the listing:
public class DirectoryTreeVisitor
{
public DirectoryTreeVisitor()
{
}
public static void processDirectoryTree(File aRootDirectory,
FileVisitor aVisitor)
{
processDirectoryTree(aRootDirectory, DUMMY_FILTER, aVisitor);
}
public static void processDirectoryTree(File aRootDirectory,
FileFilter aFilter,
FileVisitor aVisitor)
{
processDirectory(aRootDirectory, aFilter, aVisitor);
}
protected static void processDirectory(File aDirectory,
FileFilter aFilter,
FileVisitor aVisitor)
{
File[] files = aDirectory.listFiles();
for (int i = 0; i < files.length; i++)
{
File file = files[i];
if (file.isDirectory())
{
if (aFilter.accept(file))
{
aVisitor.visitDirectory(file);
}
processDirectory(file, aFilter, aVisitor);
}
else
{
if (aFilter.accept(file))
{
aVisitor.visitFile(file);
}
}
}
}
public static final FileFilter DUMMY_FILTER = new DummyFileFilter();
static class DummyFileFilter
implements FileFilter
{
public boolean accept(File pathname)
{
return true;
}
}
public static class FileOnlyVisitor
implements FileVisitor
{
public void visitDirectory(File aFile)
{
}
public void visitFile(File aFile)
{
}
}
}
Looks like a very nice implementation, thanks for sharing. (And sorry about the formatting issues.)