A Possible Way Forward For Developing Cornell’s OAIS Infrastructure
This post is a recent, non-confidential internal memo I wrote to a co-worker as a sort of post-mortem on a long-term digital preservation project. It’s in very rough form and alludes to some things that won’t make sense to those outside of Cornell University Library. I also tend to be pretty loose in how I deal with some topics (the relationship between performance and scalability, for example), and I assume some familiarity with certain technologies. Still, I thought there might be enough meat here to interest others working in the area of digital preservation.
Problem:
While trying to create a standard OAIS for Cornell, we took an object-oriented (OOP) approach to what is fundamentally a functional or task based problem.
Simply put, there are two broad sets of tasks in preservation processing. First, the files to be ingested (the data) are “normalized”, which might mean doing such things as decompressing files and flattening directory hierarchies. Second, information is gathered to produce a METS XML file (the metadata). When these two tasks are completed, the data and metadata can be ingested into a preservation archive, which, in our case, is aDORe.
Our OOP approach made sense initially. OOP is the paradigm we were trained to use, and it was not until we tried to preserve several collections that we had a concrete grasp of how to best approach coding the preservation of arbitrary collections. Just as others like Portico have reported, it was taking us weeks to preserve each new collection. We hadn’t achieved the level of reuse we were hoping for, and with each new collection, we ended up overriding and wrapping large amounts of code. We did achieve reuse in the areas where we refactored code into “helper” classes, including much of our basic file handling and XML generation. At that point it was time to “throw one away” and re-envision how to code a more effective digital preservation framework.
It wasn’t until the Microsoft large scale digitization effort (LSDI) that we also encountered scaling issues which hadn’t been a concern for us up to that point, as we were just trying to get the basic approach right. We observed scaling issues along several dimensions that included both processing speed and memory usage.
Proposed Solution, the Java Helper Layer:
We want to leverage our existing Java code (including comprehensive unit tests) so we don’t have to start from scratch. We can begin by continuing to refactor the code into helper classes/packages that can be called statically for certain tasks as needed for a collection.
We should also aggressively jettison the custom classes we created, such as DigitalObjects, ComponentFiles, etc., and use smaller data structures, such as HashMaps, that are not only perfectly adequate to handle our data needs but also have the advantage of being built into the Java standard library. So, when it comes to normalizing files and generating a METS file, I would like to see us pass around only the top-most directory containing the data, instead of a DigitalObject composed of ComponentFiles that wrap Java File objects, and just work directly with the File objects themselves. (To optimize file I/O, reading files should be an implicit part of the glue layer described later.) And while we do associate other bits of data with each file, such as an original name, a Handle, etc., this may be better handled by a smaller data structure built into Java, such as a two-dimensional HashMap keyed first by file name or file path, then by known attribute. I think this would greatly reduce the memory footprint of the application and increase speed because far fewer custom objects are being created, going a long way toward solving our current scaling issues.
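To make the two-dimensional HashMap idea concrete, here is a minimal sketch. The class name FileAttributes, its method names, and the attribute keys are hypothetical illustrations, not part of our existing code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: names are illustrative, not existing code.
final class FileAttributes {
    private FileAttributes() {}

    // Outer key: file path relative to the collection's top-most directory.
    // Inner key: attribute name, e.g. "originalName" or "handle".
    static void put(Map<String, Map<String, String>> table,
                    String filePath, String attribute, String value) {
        table.computeIfAbsent(filePath, k -> new HashMap<>()).put(attribute, value);
    }

    // Returns null when either the file or the attribute is unknown.
    static String get(Map<String, Map<String, String>> table,
                      String filePath, String attribute) {
        return table.getOrDefault(filePath, Collections.<String, String>emptyMap())
                    .get(attribute);
    }
}
```

A caller would create a plain `new HashMap<>()` once per collection and pass it to these static methods. If the table were shared across threads, a ConcurrentHashMap would probably be the safer choice for the outer map.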
Influenced by the REST architectural style and reading about functional programming, specifically the actor based model implemented by Erlang and Scala, I would propose that the functions defined in this helper layer be static, idempotent, stateless, and without side effects, and that as much as possible they allow for asynchronous execution. Additionally, data structures should be planned to be thread safe from the start. Although these can be difficult goals to achieve, I think the radically simplified approach I have outlined for the helper layer makes these ideals more easily reachable and will allow these functions to be invoked in a simpler and more flexible way.
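As a rough illustration of what a static, stateless, idempotent helper might look like, the sketch below computes flattened file names without touching the disk. The Normalizer class and the underscore naming scheme are assumptions made for the example, not our actual normalization rules.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the naming scheme is an assumption for illustration.
final class Normalizer {
    private Normalizer() {}

    // Pure function: maps a relative path such as "a/b/c.tif" to "a_b_c.tif".
    // Applying it to its own output returns the same value (idempotent).
    static String flattenedName(String relativePath) {
        return relativePath.replace('\\', '/').replace('/', '_');
    }

    // Computes a source-to-target plan; no files are read or moved here,
    // so the glue layer decides when (and on which thread) to act on it.
    static Map<String, String> planFlatten(List<String> relativePaths) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String path : relativePaths) {
            plan.put(path, flattenedName(path));
        }
        return plan;
    }
}
```

Because neither method holds state or changes anything outside its return value, both can be called from any thread in any order. One caveat of this particular naming scheme is that "a/b_c" and "a_b/c" would collide, so a real helper would need a collision check.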
Which brings us to the next stage. This helper layer would give us the building blocks for digital preservation: reusable chunks of code that perform specific tasks a given collection may need performed on its data. At that point, we need an efficient way (from both the standpoint of coding and execution) to specify for a given collection which steps need to be performed in what order. In other words, what will be the “glue” that binds these tasks together for each collection?
Proposed Solution, the Glue Layer:
Because digital preservation as I’ve been talking about it is very task or activity oriented, it makes sense to look at a functional paradigm instead of an OOP paradigm for implementing the glue. (Actually, several of the options I will explore, such as Scala and JRuby, implement both paradigms.) Also, specifying the collection specific tasks that are consumed and executed by the glue should be as declarative or configuration oriented as possible. Ideally, the glue layer could become a standardized domain specific language (DSL) for preservation, or even digital asset management in general, and could shape thinking about implementing repositories well beyond CUL.
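To give a feel for what "declarative" could mean here, the sketch below lets a collection name its steps as a simple list, which the glue looks up in a registry and runs in order. Everything in it is hypothetical: the DeclarativeGlue class, the step names, and the string transforms that stand in for real helper calls.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical glue: step names and behavior are placeholders.
final class DeclarativeGlue {
    private DeclarativeGlue() {}

    // Registry of named steps. Each step here just appends a marker to a
    // string; a real step would call into the Java helper layer instead.
    static final Map<String, UnaryOperator<String>> STEPS;
    static {
        Map<String, UnaryOperator<String>> m = new LinkedHashMap<>();
        m.put("decompress", s -> s + "|decompressed");
        m.put("flatten", s -> s + "|flattened");
        m.put("generateMets", s -> s + "|mets");
        STEPS = Collections.unmodifiableMap(m);
    }

    // Runs the named steps in the order the collection declares them.
    static String run(List<String> stepNames, String input) {
        String current = input;
        for (String name : stepNames) {
            UnaryOperator<String> step = STEPS.get(name);
            if (step == null) {
                throw new IllegalArgumentException("unknown step: " + name);
            }
            current = step.apply(current);
        }
        return current;
    }
}
```

In this arrangement a collection's configuration is nothing more than an ordered list of step names, which could just as easily be read from a properties or text file as written in code.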
I think there are a number of potentially compelling options for the glue, and I will present them, considering the following:
- can we focus on coding intelligence into the application, rather than solving technical problems (most technical problems will be dealt with in the helper layer)
- how simple is it to implement declaratively
- how testable is the approach
- how well does it scale
- what is the potential learning curve as well as sustainability issues in our organization
Of these considerations, scalability is probably the trickiest to consider. Projects like LSDI are probably few and far between, and although we want the preservation code to scale to handle many collections, I don’t think we can assume that the performance profile will be the same. In LSDI, we have a relatively small number of relatively large digital objects, but in our more typical scenario, we will be working with collections that tend to have larger numbers of smaller digital objects. We might be better off optimizing for the latter scenario and treating projects like LSDI as special cases, perhaps even assuming that we will have to create more customized code in those cases. (Large scale digitization projects are not only few and far between, but they tend to have the funding and organizational commitment to support this approach, whereas our bread and butter collections don’t. Also, there seems to be an assumption that such projects are more static, maybe because of their storage requirements and subsequent costs, so maintaining a custom code base may be less of an issue as a result.)
In our typical scenario, the static Java helper layer goes a long way toward optimal scalability. Beyond that, I believe that simply coding the glue layer to allow for concurrency will get us most of the way toward our scalability goals. In some cases, we may not need to deal with this explicitly. Erlang handles multiple threads implicitly, for example, and from what little I have learned so far, Scala may do something similar with thread pooling. In cases where we need to deal with multithreading explicitly, I would propose that instead of programming thread handling directly, we implement a thread pooling approach. This is a well understood pattern that is easy to implement by comparison, and one with which we already have experience. It is also a more finely tunable approach that I believe can provide the best performance.
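The thread pooling approach can be sketched with the standard java.util.concurrent classes. The PooledGlue class and its runAll method are hypothetical names; the step passed in stands for any stateless helper function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical glue helper: names are illustrative.
final class PooledGlue {
    private PooledGlue() {}

    // Applies one stateless step to every item on a fixed-size pool and
    // returns the results in the same order as the input list.
    static <T, R> List<R> runAll(List<T> items, Function<T, R> step, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T item : items) {
                Callable<R> task = () -> step.apply(item);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is the single tuning knob here, which is what makes this approach easier to adjust than hand-written thread handling. It only works safely if the step really is stateless, as proposed for the helper layer.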
By the same token, although our design goal is to have each function or component be as isolated and stateless as possible, this may not be realistic. In this case, some of the approaches considered have the notion of interprocess communication at the language level. Other options might include something like JMS, or even just using an old fashioned relational database. In JDK 6, for example, the open source Derby relational database is now bundled with the JDK (as Java DB) and may be appropriate for this purpose.
The options I will consider for the glue layer are:
- shell scripting
- Groovy
- JRuby and Rake
- vanilla Java, Ant
- ANTLR
- Scala
- Erlang
Shell Scripting. There is not much to say about this approach; I wanted to add it as an illustration more than anything else. On the one hand, this is probably the simplest approach to implement. A collection’s shell script can be launched from a cron job and iterate over a directory, passing directories to functionally independent Java applications for processing. However, the approach is not easily testable, not easily scalable, contains no intelligence and could present maintenance issues.
Groovy. Along with Scala, and to a lesser extent JRuby, Groovy is often discussed as the “new Java syntax.” It is Java based, has a more powerful syntax, and supports both OOP and functional approaches. Its performance is currently lacking compared to similar approaches like JRuby and Scala, and will probably continue this way for the foreseeable future. Although Groovy was the first JVM scripting language to become an official Java standard, it didn’t see widespread adoption until the release of Grails, the Groovy port of Rails. So, its future adoption may be greatly influenced by how well Grails is adopted compared to Ruby on Rails.
JRuby and Rake. This is the most compelling alternative when considering everything except possibly scalability, where it falls short. Because the C implementation of Ruby (MRI) is not concerned with performance and JRuby is, JRuby, quite surprisingly, already outperforms MRI. Like Groovy and Scala, it is Java based and supports OOP and functional programming. It is the Java scripting language that gets the most attention, although ironically, it is not discussed as the “new Java syntax” as often as Groovy and Scala are, but instead is most often discussed as the best deployment option for Rails.
As a language, it has the most programming expressiveness and flexibility, and is perhaps the best option for easily creating an internal DSL. In fact, Rake, the Ruby build language, is an existing internal DSL that could possibly just become the preservation DSL we are seeking. It is task based, and intelligent about task dependencies. It also has special file tasks, which are obviously very applicable in our situation, and file tasks are also intelligent about not running unless they need to (when the output has not yet been created). Also, because Rake is an internal DSL, meaning that Rake files are also valid Ruby files, any custom glue functions for collections can be coded directly in the Rake file, resulting in nicely maintainable packaging of collection logic. The execution of Rake may not provide the scalability we are looking for (similar to the shell scripting approach), but we could possibly start with Rake and extend it with our own execution model that includes thread pooling.
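Rake itself is a Ruby DSL, so the sketch below is not Rake. It only mirrors the file task idea in Java, to match the helper layer: a step runs only when its output is missing or older than its input. The FileTask class and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that imitates the "run only if needed" behavior
// of a build tool's file task; it is not Rake's actual implementation.
final class FileTask {
    private FileTask() {}

    // True when the target does not exist yet, or the source has been
    // modified more recently than the target.
    static boolean needsRun(Path source, Path target) {
        try {
            if (!Files.exists(target)) {
                return true;
            }
            return Files.getLastModifiedTime(source)
                        .compareTo(Files.getLastModifiedTime(target)) > 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the action only when needed and reports whether it ran.
    static boolean runIfNeeded(Path source, Path target, Runnable action) {
        if (!needsRun(source, target)) {
            return false;
        }
        action.run();
        return true;
    }
}
```

A check like this is also what makes a step safe to re-run after a failure partway through a collection, since finished outputs are simply skipped.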
Judging JRuby’s performance and how that might impact scalability is complicated. Although performance is a goal of JRuby, it is below that of vanilla Java, and likely will be for the foreseeable future. On the other hand, it is only being used for the glue layer executing vanilla Java helpers, albeit still within a JRuby context. And admittedly, at this stage, it’s hard to know exactly what our performance needs will be when we are trying to preserve many collections, or where the performance bottlenecks will appear.
Sustainability is also a potential issue. Traditionally, Java scripting languages (most notably Jython and BeanShell) have had mixed acceptance and support in the Java community, but this seems to be changing with the rise of Rails. Because JRuby is currently tied so closely to Rails, its future may be significantly affected by whether Rails or Grails wins the hearts and minds of Java Web developers. Then there is Scala, which is positioned to take off as concurrency becomes a more common need/desire and may undercut JRuby adoption in the process. When people discuss a “new Java syntax,” Scala is the language most often mentioned.
Vanilla Java and Ant. The analysis of this option is pretty obvious. What I am calling “vanilla” or plain old Java is the least risky choice from a sustainability standpoint. Java is also strong in performance, arguably representing the best combination of enterprise adoption and performance. But Java is also probably the least desirable from the standpoint of programming simplicity, expressiveness and flexibility. Ant, Java’s primary build “language,” is task and dependency based like Rake, but uses XML syntax and so doesn’t have the packaging advantages I mentioned for Rake. Ant is also easily extendable, but perhaps not as easily as Rake, because of the disconnect between the XML syntax and the Java task definitions.
ANTLR. Not much to say here except that this tool provides the most flexibility when it comes to creating a domain specific language for preservation. But it’s probably not attractive when considering sustainability or performance.
Scala. Java based, with OOP and functional paradigms. It employs an actor based model for concurrency and message passing and is most often discussed as the “new Java syntax.” However, it appears that the language is still somewhat in flux, and as discussed, it is hard to predict its future vis-a-vis Groovy and JRuby. Its syntax is new, and presents a learning barrier. Also, although Scala directly addresses concurrency at the language level, its scalability is hard to judge. Starting at its foundation, it is a third party scripting language on top of Java, and there are relatively easy and mature ways of implementing concurrency and message passing (such as thread pooling and JMS) even in plain old Java, so it’s unclear to me whether the advantages of Scala outweigh the drawbacks. I also have to admit that this is the option I am the most unfamiliar with, so I think that further investigation is needed to answer these concerns.
Erlang. From a scalability standpoint, Erlang would appear to be the clear winner by far with its massively scalable lightweight threads. Erlang is a functional language with other highly intriguing features like the ability to hot swap code and its extreme reliability. Erlang can integrate Java components through Jinterface.
But it’s questionable whether we would be able to take full advantage of Erlang’s lightweight threads, or whether we need the ability to hot swap code, for example. Because the underlying building blocks of our code will be in vanilla Java, that would seem to mitigate Erlang’s native lightweight threading. (Although again, I don’t know Erlang well enough to say that this is the case, just that it’s a concern I would want to investigate.) Also, the architecture outlined here is one in which functions and components are isolated from one another, and the environment is such that this code is not public facing, nor does it impact public facing or mission critical applications in any visible way. So, it would seem that we should be able to easily program functional components in such a way that they can be taken offline, even for significant periods of time, without negatively impacting other components in the application, or the collections being preserved.
Finally, Erlang has the most unusual syntax and therefore the most obvious learning barrier. Also, like Scala, the actor based model represents its own learning curve. And although Erlang has had some very high profile recent deployments by Amazon and with CouchDB, I suspect its future is largely tied to how many Java developers adopt Scala as the way to solve concurrency.
Therefore, considering all these options, I think the most compelling are JRuby and Scala, and possibly Erlang, although for each one, more in depth research is needed to make an informed choice.
Of course, the fact that the component or helper layer is in Java casts a long shadow on these choices. This “polyglot programming” approach is a strength when considering JRuby as an option, but a weakness when considering Erlang as an option. Although generally speaking, we want to leverage existing code, especially with its comprehensive unit tests, if for example, Erlang started seeing wide adoption, and we had more projects on the scale of LSDI, it might make sense, assuming we have the appropriate amount of organizational support, to rewrite all or parts of the helper layer in Erlang to fully take advantage of its strengths.
Random Issues
In the glue layer, we should add intelligence to the application, such as tracking the amount of time each task takes and automatically adjusting the number of threads to optimize performance.
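One way this might look, as a minimal sketch: a timer that accumulates elapsed time per task name, plus a very naive rule for nudging the pool size up or down. The TaskTimer class, the throughput measure, and the adjustment rule are all assumptions made for illustration; a real policy would need testing against actual collections.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical timing helper: names and the tuning rule are illustrative.
final class TaskTimer {
    // Total elapsed nanoseconds per task name; safe to update from many threads.
    private final Map<String, LongAdder> nanosByTask = new ConcurrentHashMap<>();

    // Runs the task, records how long it took, and returns its result.
    <T> T time(String taskName, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            nanosByTask.computeIfAbsent(taskName, k -> new LongAdder())
                       .add(System.nanoTime() - start);
        }
    }

    long totalNanos(String taskName) {
        LongAdder total = nanosByTask.get(taskName);
        return total == null ? 0L : total.sum();
    }

    // Naive rule (an assumption): if throughput improved since the last
    // measurement, try one more thread; otherwise back off by one.
    static int nextPoolSize(int current, double previousThroughput,
                            double latestThroughput, int max) {
        if (latestThroughput > previousThroughput) {
            return Math.min(max, current + 1);
        }
        return Math.max(1, current - 1);
    }
}
```

The value returned by nextPoolSize could then be applied to a java.util.concurrent.ThreadPoolExecutor through its setCorePoolSize and setMaximumPoolSize methods between batches.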
The more I think about the issue, the more inclined I am to say that we shouldn’t be preserving “virtual” objects, or objects which have no data or metadata of their own, but which serve to represent virtual relationships to other objects. For example, creating an AIP for a journal issue when the articles are what should really be preserved. Because our AIPs use METS with embedded PREMIS and OAI_DC metadata, we can adequately express containment and other logical/conceptual organizing concepts within the metadata of the objects that we are really trying to preserve. This should not only be adequate for recovery, but may even make it easier.
