Today is "User Group" day at Open Repositories 2008. This is where the three main open-source communities, Eprints, DSpace and Fedora Commons, talk about their initiatives.
I started the day listening to the presentation from Michelle Klimpton, executive director of the DSpace foundation. She walked us through the foundation's plans for promotion and awareness of DSpace. They have done or will do webinars, maintain discussion lists and the DSpace website, attend and present at community gatherings, organise and facilitate training events, create marketing materials, etc. Looking forward, they want to create a Global Outreach Committee: a forum whose members take the foundation's material and localize it for their specific regions. Michelle was looking for volunteers to support this effort. Contact her if you are interested. She also wants to get more involved in the coordination of user group meetings. What was my takeaway from her talk ? She has a "we'll do whatever it takes to facilitate an active community" approach. It might sound obvious, but an active community does not happen by accident. It's people like Michelle who work feverishly in the background to grease the wheels. Thanks, Michelle and team.
Next up was the chief developer of DSpace, Scott, talking about the new features in DSpace 1.5.
There were a couple of significant improvements for the DSpace code base. The main ones are :
- Manakin: a new tool to create web graphical user interfaces for DSpace.
- Maven : the Apache build manager for Java projects is now being used to build DSpace.
- Improved workflow : the way you get content into the repository was made more flexible, and can therefore be adjusted more easily to the business processes of DSpace users.
- Improved browsing : better use of searches and indexes.
There were also some major "under the hood improvements" :
- SWORD : integration of the ingestion protocol SWORD.
- Lightweight Network Interface : a protocol for managing content in DSpace. It's kind of what SWORD does.
- Event system : notifies listeners by events whenever an object changes.
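To make the event system idea a bit more concrete, here is a minimal Python sketch of the listener pattern it describes. This is not DSpace's actual API (DSpace is written in Java, and Scott did not show code); the names `Event`, `EventBus` and `Repository`, and the "create"/"modify" event kinds, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical event kinds: "create" or "modify"
    object_id: str


class EventBus:
    """Keeps a list of listeners per event kind and calls them on publish."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, kind: str, listener: Callable[[Event], None]) -> None:
        self._listeners.setdefault(kind, []).append(listener)

    def publish(self, event: Event) -> None:
        # Copy the list so a listener may subscribe others while we iterate.
        for listener in list(self._listeners.get(event.kind, [])):
            listener(event)


class Repository:
    """Toy in-memory repository that announces every change on the bus."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self._items: Dict[str, dict] = {}

    def save(self, object_id: str, metadata: dict) -> None:
        kind = "modify" if object_id in self._items else "create"
        self._items[object_id] = dict(metadata)
        self._bus.publish(Event(kind, object_id))


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("modify", lambda e: print(f"re-index {e.object_id}"))
    repo = Repository(bus)
    repo.save("item-1", {"title": "Draft"})   # "create": no listener registered
    repo.save("item-1", {"title": "Final"})   # "modify": listener is called
```

The appeal of this design is that things like search indexing or notifications can react to changes without the core repository code knowing about them.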
In Scott's Q&A session, he got quite a deluge of questions. Here's a small sample :
- Q: How did the server performance change under DSpace 1.5 ?
- A: We've seen better performance in the area of ingestion. In batch ingestion tests, ingestion times went from half a day to 1 hour.
- Q: What hardware do I need to run DSpace ?
- A: There's guidance in the manual. You do need more than half a GB of memory.
- Q: What did you measure in terms of performance ? What's the metric ?
- A: Scott measures user experience with JMeter, and batch ingestion, as mentioned before.
I swapped sessions, and jumped over to the Eprints user group. The talk was planned to be around "Research Assessment Experience", but when it was about to start, a group decision was made to extend the previous session and talk about new features in Eprints 3.1. Chris Gutteridge, the fast-speaking Eprints chief developer, answered questions about features. I noticed quite a shift towards using Eprints plugins to add new features. Chris' main reason for that was that the Eprints system administrators at Southampton University have to jump through hoops to revise Eprints core services. Adding bug fixes and plugins is much less complex. It was interesting to observe how Chris balances his responsibilities as an employee of Southampton University and his leading role in the Eprints community.
The next Eprints session was on the experiences of Eprints users going through a Research Assessment Exercise. From the RAE website, you will learn that the RAE is :
"The Research Assessment Exercise is conducted jointly by the Higher Education Funding Council for England (HEFCE), the Scottish Funding Council (SFC), the Higher Education Funding Council for Wales (HEFCW) and the Department for Employment and Learning, Northern Ireland (DEL). The primary purpose of the RAE 2008 is to produce quality profiles for each submission of research activity made by institutions."
In a nutshell, this is when the universities are audited by a governing body in the UK. A positive audit result has a positive impact on grants and funding, hence the heightened level of focus in this area. Money talks. As this was not an Eprints technology session, I was about to walk out when the speaker was announced as being from the Open University, an institution I regard highly. So, I stayed and listened. After all, as a supplier, knowing the "scary monsters" of your customers is key.
Here are a couple of challenges I picked up :
- Finding hard evidence for the audit. Example : an art installation that is based on rolling barrels. This was a transient display, and there was hardly any lasting evidence after the installation was dismantled. RAE did not have a category for that.
- Librarians were not skilled in using Eprints, and needed to be trained. After being trained, the good people left. Their market value significantly increased. One of the "culprits" was in the audience and was pointed to by the presenter. Everybody laughed.
- The out-of-the-box metadata for Eprints was insufficient for the RAE. It needed customisation, e.g. adding a field to store the physical location of an arts item. This comment came from a Kingston University employee.
- Researchers needed to be organised to use the repository in line with what the RAE wanted.
I spent the afternoon listening to a series of Fedora presentations. First up was my esteemed Sun colleague, Eric Reid. Eric's job at Sun is working with open-source communities, making sure their technologies integrate and work well on Sun. His presentation was entitled "Fedora and Honeycomb : A New Buzz in Creating and Managing Large Scale Digital Archives". Eric spoke about the systems (or better, platform) aspects of Fedora Commons. He explained to the audience what a Honeycomb system is, what it can do, and how it works. If you want to learn all that yourself, start here. The key points he wanted to make about Honeycomb were its object data store paradigm and its RAIN (Redundant Array of Inexpensive Nodes) architecture. He got quite a reaction from the crowd when he mentioned that the calculated "mean time between data loss" is 2 million years. This is music to the ears of people who are in charge of long-term preservation.
Eric made a great point about the integration. The LLStorePlugin for Fedora is available today, and glues the storage back-end interface to the Honeycomb API, thus making Honeycomb a first-class storage citizen for Fedora. Note: this interface is for content only. Fedora's metadata store is not yet kept on Honeycomb. We're working on that.
Eric got a bunch of questions after his talk. Here are some of them :
- Q: What happens when it breaks ? Disks do not live for 2 million years ?
- A: Honeycomb takes care of this via self-healing, and flagging the faults to the administrator for attention.
- Q: How do you back up ?
- A: I prefer the notion of multiple Fedora data stores for availability, but there is also an NDMP backup interface.
- Q: How many objects can you store in a 64TB Honeycomb ?
- A: There are no practical limits to the number of objects that can be stored.
- Q: 64TB per rack is too much. Do you have anything smaller ?
- A: Yes, a half-cell Honeycomb at 16TB is available today. We have even smaller solutions on the roadmap. We can talk about these under NDA.
By the way, Santosh, a repository research engineer from Microsoft, sat next to me. His feedback on the presentation was : "This is good stuff. Let me see how we can integrate Microsoft SQL Server with Honeycomb". We agreed to exchange email on this topic. I pointed him to the Honeycomb API documentation and the Honeycomb emulator for further study. Cool.
Next up was Andreas Aschenbrenner, State and University Library Goettingen, who talked about "Using Fedora to manage complex objects". He made some very interesting points on how institutions want to share repository infrastructure. I think he referred to what we would call "cloud storage capacity" or "Storage as a Service" à la Amazon S3, or on-demand storage for repositories. I liked his approach here. The image that came to mind was that of a network-based grid repository service that a department can buy storage capacity from, and grow as its data needs expand.
Ben O'Steen, Oxford University, talked about the Fedora-based architecture. For me, this was by far the most exciting presentation of the whole week. Here's why. In my mind, they solved all the architectural problems everybody had spoken about for the last three days. Here are the highlights :
- Scalability : objects can be placed in any object store on the network, and located via their object metadata. This means that scaling over multiple Fedora instances is a no-brainer. Need more storage capacity ? Just add another Fedora instance with another Honeycomb. Need more server resources ? Just add another virtual Fedora server in a VMware/Solaris/xVM container.
- Open remote access : They chose UUIDs to identify objects, which are exposed to the world via these unique identifiers. This includes the RDF relationships of the objects. Ben showed a demo where he created a blog entry about a paper in the Oxford repository. As the blog entry was posted, the repository picked up the fact that an object was being linked to, and added this fact to the metadata store. Cool.
- Extend functionality : They use the JMS interface, which interfaces via iCAL, for any scheduling needs, e.g. scheduling virus scans or logging user logins.
- Ingest : content is ingested via a staging archive, which can be used to moderate data, e.g. cleaning up duplicates, before the "clean" data is moved to the real archive.
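The staging idea in the last bullet, combined with the UUID identifiers, can be sketched in a few lines of Python. This is only my illustration, not Oxford's code: the function name `promote`, the "uuid:" prefix and the use of SHA-256 checksums to spot duplicates are all assumptions (Ben did not say how duplicates are detected).

```python
import hashlib
import uuid
from typing import Dict, List


def promote(staging: List[bytes], archive: Dict[str, bytes]) -> List[str]:
    """Move content from a staging area into the archive, skipping duplicates.

    Duplicates are detected by checksum (an assumption for this sketch).
    Each newly archived object gets a fresh UUID-based identifier.
    """
    known = {hashlib.sha256(content).hexdigest() for content in archive.values()}
    new_ids: List[str] = []
    for content in staging:
        digest = hashlib.sha256(content).hexdigest()
        if digest in known:
            continue                      # already archived or seen in this batch
        known.add(digest)
        object_id = f"uuid:{uuid.uuid4()}"
        archive[object_id] = content
        new_ids.append(object_id)
    staging.clear()                       # the staging area is emptied after promotion
    return new_ids


if __name__ == "__main__":
    archive: Dict[str, bytes] = {}
    staging = [b"thesis.pdf bytes", b"scan.tiff bytes", b"thesis.pdf bytes"]
    for object_id in promote(staging, archive):
        print("archived", object_id)
```

Because the identifiers are random UUIDs rather than counters tied to one server, the same approach works no matter which Fedora instance ends up holding the object.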
He also explained the object relationship model using the example of how a book is stored. A page of a book (think of a TIFF file of the scanned page) links to a chapter, which in turn links to the book. This is all done via RDF, and is queryable in XML form. These relationships can be defined on the fly.
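Here is a rough Python sketch of that page, chapter, book chain, with the relationships held as plain (subject, predicate, object) triples. It is not real RDF tooling and not Ben's code: the identifiers and the predicate name `isPartOf` are placeholders I picked for illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)

# Hypothetical identifiers; "isPartOf" is an illustrative predicate name.
triples: Set[Triple] = {
    ("page:17", "isPartOf", "chapter:2"),
    ("chapter:2", "isPartOf", "book:1"),
}


def ancestors(subject: str, graph: Set[Triple], predicate: str = "isPartOf") -> List[str]:
    """Follow the predicate upwards from subject and return the chain of parents."""
    chain: List[str] = []
    seen = {subject}
    current = subject
    while True:
        parents = sorted(o for (s, p, o) in graph if s == current and p == predicate)
        if not parents or parents[0] in seen:
            return chain                  # top of the chain, or a cycle
        current = parents[0]
        seen.add(current)
        chain.append(current)


if __name__ == "__main__":
    print(ancestors("page:17", triples))
    # A relationship defined "on the fly" is just one more triple:
    triples.add(("book:1", "isPartOf", "collection:9"))
    print(ancestors("page:17", triples))
```

Nothing about the page object itself changes when a new relationship is added, which is what makes defining relationships on the fly so cheap.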
For further information, see
That was it for me for the day. The evening was a formal awards dinner. After the meal, the winners of the Open Repository Challenge were announced. Guess who won the $5000 prize ? Ben O'Steen and team. In the last two days, Ben and team whipped up the code to exchange the content of an Eprints repository with a Fedora Commons repository. That's what I call interoperability. Well done, Ben and team.