Georg Edelmann's Weblog

Friday Apr 04, 2008

Today was the last day of Open Repositories 2008. I chose to attend the Fedora User Group session today, which centred around users describing real projects they undertake with Fedora. Right up my alley.

Out of the three presentations, I want to highlight the last one. Marcos Santoz, a research assistant from the University of Koblenz-Landau, Germany, talked about project TAS3. TAS3 stands for "Trusted Architecture for Securely Shared Services". What an ominous title !!! The team used Fedora to build a system with the following goals :

  • manage lifelong generated personal information for individuals in the employability and e-Health sector.
  • support "sticky policies" and "break the glass" (I'll explain these later with an example)
  • support lots of different data standards (see their website for details).

Pretty abstract, right ? It was for me, until Marcos talked us through two use cases. Here's number one. A doctor treats a severely wounded patient who has just arrived in the emergency theatre. The doctor needs the patient's medical record. She accesses the patient record repository, and learns that she does not have the access privileges to the patient's medical history ("sticky policies"). She then tries to access the record a second time, which triggers an audit trail review, which, if successful, clears our doctor for access ("break the glass").
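To make that flow a bit more concrete, here is a minimal Python sketch of how I picture it. This is purely my own illustration and not how TAS3 implements it: the `RecordStore` class, the `audit_ok` hook and the user and record names are all made up. The first request is denied by policy, the second one triggers an audit review, and only a successful review opens the record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class RecordStore:
    """Hypothetical record repository with a break-the-glass override."""
    records: Dict[str, str]
    allowed: Set[Tuple[str, str]]            # (user, record_id) pairs the policy permits
    audit_ok: Callable[[str, str], bool]     # hypothetical audit trail review hook
    denied_once: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[str] = field(default_factory=list)

    def read(self, user: str, record_id: str) -> str:
        key = (user, record_id)
        if key in self.allowed:
            # Normal case: the policy attached to the record permits access.
            return self.records[record_id]
        if key not in self.denied_once:
            # First attempt: refuse, and remember that this user was refused.
            self.denied_once.add(key)
            self.audit_log.append(f"DENIED {user} {record_id}")
            raise PermissionError("access denied by policy")
        # Second attempt: run the audit review before breaking the glass.
        if self.audit_ok(user, record_id):
            self.audit_log.append(f"BREAK-GLASS {user} {record_id}")
            return self.records[record_id]
        self.audit_log.append(f"BREAK-GLASS-REFUSED {user} {record_id}")
        raise PermissionError("audit review failed")
```

The point of the sketch is that every step, granted or refused, leaves a line in the audit log, so the override can be reviewed afterwards.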

The second use case was less of a life and death scenario. A graduate applies for a position at a company. Our graduate's learning records are being maintained via IMS-LIP, yet our hiring manager's system works only on HR-XML. Our graduate wants to grant access to his learning records to the hiring manager. TAS3 will take care of the format and protocol mediation, thus allowing the hiring manager to look at our graduate's data without exiting his environment.

That was it. My OR2008 is over. What a blast ! I learned just enough to appreciate what I don't know. What I do know is that, whatever tough problems are being solved here, Honeycomb and Sun's technology line-up is a perfect platform for open repositories.

It's good-bye Southampton from me. See you at OR2009 in Atlanta, Georgia.

Today is "User Group" day at Open Repositories 2008. This is where the three main open-source communities, Eprints, DSpace and Fedora Commons, talk about their initiatives.

I started the day listening to the presentation from Michele Kimpton, executive director of the DSpace Foundation. She walked us through the plans the foundation has around promotion and awareness for DSpace. They have done or will do webinars, maintain discussion lists and the DSpace website, attend and present at community gatherings, organise and facilitate training events, create marketing materials, etc. Looking forward, they want to create a Global Outreach Committee: a forum whose members take the foundation's material and localize it for their specific region. Michele was looking for volunteers to support this effort. Contact her if you are interested. She also wants to get more involved in the coordination of user group meetings. What was my takeaway from her talk ? She has a "we'll do whatever it takes to facilitate an active community" approach. It might sound obvious, but an active community does not happen by accident. It's people like Michele who feverishly work in the background to grease the wheels. Thanks, Michele and team.

Next up was the chief developer of DSpace, Scott, talking about the new features in DSpace 1.5.

There were a couple of significant improvements for the DSpace code base. The main ones are :

  • Manakin: a new tool to create web graphical user interfaces for DSpace.
  • Maven : the Apache build manager for Java projects, Maven, is now being used to build DSpace.
  • Improved workflow : the way you get content into the repository was made more flexible, and can therefore be adjusted more easily to the business processes of DSpace users.
  • Improved browsing : better use of searches and indexes.

There were also some major "under the hood improvements" :

  • SWORD : integration of the ingestion protocol SWORD.
  • Lightweight Network Interface : protocol for managing content in DSpace. It's kind of what SWORD does.
  • Event system : notifies listeners by events whenever an object changes.
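For readers who have not met an event system before, here is a tiny Python sketch of the general idea: listeners register once, and get called whenever an object changes. It is an illustration of the pattern only and not DSpace's actual API (DSpace is written in Java); the `EventBus` class, the event names and the object IDs are invented.

```python
from typing import Callable, List, Tuple

Listener = Callable[[str, str], None]  # (object_id, change) -> None


class EventBus:
    """Hypothetical, minimal publish/subscribe hub for repository events."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # Register a callback to be told about every change.
        self._listeners.append(listener)

    def publish(self, object_id: str, change: str) -> None:
        # Notify all listeners, in registration order.
        for listener in self._listeners:
            listener(object_id, change)


# Example listener: a search indexer that records what it needs to re-index.
reindex_queue: List[Tuple[str, str]] = []

bus = EventBus()
bus.subscribe(lambda object_id, change: reindex_queue.append((object_id, change)))
bus.publish("item:42", "MODIFY")
bus.publish("item:43", "CREATE")
```

The benefit is decoupling: the code that changes an object does not need to know who cares about the change.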

In Scott's Q&A session, he got quite a deluge of questions. Here's a small sample :

  • Q: How did the server performance change under DSpace 1.5 ?
  • A: We've seen better performance in areas of ingestion. Batch ingestion tests moved ingestion times from half a day to 1 hour.
  • Q: What hardware do I need to run DSpace ?
  • A: There's guidance in the manual. You do need more than half a GB of memory.
  • Q: What did you measure in terms of performance ? What's the metric ?
  • A: Scott measures user experience with JMeter, and batch ingestion, as mentioned before.

I swapped sessions, and jumped over to the Eprints user group. The talk was planned to be around "Research Assessment Experience", but when the talk was about to start, a group decision was made to extend the previous session and talk about new features in Eprints 3.1. Chris Gutteridge, the fast-speaking Eprints chief developer, answered questions about features. I noticed quite a bit of shifting towards using Eprints plugins to add new features. Chris' main reason for that was that the Eprints system administrators at Southampton University have to jump through hoops to revise Eprints core services. Adding bug fixes and plugins is much less complex. It was interesting to observe how Chris balances his responsibilities as an employee of Southampton University and his leading role in the Eprints community.

The next Eprints session was on the experiences of Eprints users going through a Research Assessment Exercise. From the RAE website, you will learn that the RAE is :

The Research Assessment Exercise is conducted jointly by the Higher Education Funding Council for England (HEFCE), the Scottish Funding Council (SFC), the Higher Education Funding Council for Wales (HEFCW) and the Department for Employment and Learning, Northern Ireland (DEL). The primary purpose of the RAE 2008 is to produce quality profiles for each submission of research activity made by institutions.

In a nutshell, this is when the universities are being audited by a governing body in the UK. A positive audit result has a positive impact on grants and funding, hence the heightened level of focus in this area. Money talks. As this was not an Eprints technology session, I was about to walk out, when the speaker was announced as being from the Open University, an institution I highly regard. So, I stayed and listened. After all, as a supplier, knowing the "scary monsters" of your customers is key.

Here are a couple of challenges I picked up :

  • Finding hard evidence for the audit. Example : an art installation that is based on rolling barrels. This was a transient display, and there was hardly any lasting evidence after the installation was dismantled. RAE did not have a category for that.
  • Librarians were not skilled in using Eprints, and needed to be trained. After being trained, the good people left. Their market value significantly increased. One of the "culprits" was in the audience and was pointed to by the presenter. Everybody laughed.
  • Out-of-the-box metadata for Eprints was insufficient for RAE. It needed customisation, e.g. an added field to store the physical location of an arts item. This comment came from a Kingston University employee.
  • Researchers needed to be organised to use the repository in line with what the RAE wanted.

I spent the afternoon listening to a series of Fedora presentations. First up, my esteemed Sun colleague, Eric Reid. Eric's job at Sun is working with open-source communities, making sure their technologies integrate and work well on Sun. His presentation was entitled "Fedora and Honeycomb : A New Buzz in Creating and Managing Large Scale Digital Archives". Eric spoke about the systems (or better platform) aspects of Fedora Commons. He explained to the audience what a Honeycomb system is, what it can do, how it works. If you want to learn all that yourself, start here. The key point he wanted to make around Honeycomb was its paradigm of object data store, and its RAIN (Redundant Array of Inexpensive Nodes) architecture. He got quite a reaction from the crowd when he mentioned that the calculated "mean time between data loss" is 2 million years. This is music to the ears of people who are in charge of long-term preservation.

Eric made a great point about the integration. The LLStorePlugin for Fedora is available today, and glues the storage back-end interface with the Honeycomb API, thus making Honeycomb a first class storage citizen for Fedora. Note: this interface is for content only. Fedora's metadata store is not yet kept on Honeycomb. We're working on that.

Eric got a bunch of questions after his talk. Here are some of them:

  • Q: What happens when it breaks ? Disks do not live for 2 million years ?
  • A: Honeycomb takes care of this via self-healing, and flagging the faults to the administrator for attention.
  • Q: How do you backup ?
  • A: I prefer the notion of multiple Fedora data stores for availability, but there is also an NDMP backup interface.
  • Q: How many objects can you store in a 64TB Honeycomb ?
  • A: There are no practical limits to the number of objects that can be stored.
  • Q: 64TB per rack is too much. Do you have anything smaller ?
  • A: Yes, a half-cell Honeycomb at 16TB is available today. We have even smaller solutions on the roadmap. We can talk about these under NDA.

By the way, Santosh, a repository research engineer from Microsoft, sat next to me. His presentation feedback was : "This is good stuff. Let me see how we can integrate Microsoft SQL Server with Honeycomb". We agreed to exchange emails on this topic. I pointed him to the Honeycomb API documentation and the Honeycomb emulator for further study. Cool.

Next up was Andreas Aschenbrenner, State and University Library Goettingen, who talked about "Using Fedora to manage complex objects". He made some very interesting points on how institutions want to share repository infrastructure. I think he referred to what we would call "cloud storage capacity" or "Storage as a Service" à la Amazon S3, or on-demand storage for repositories. I liked his approach here. The image that came to mind was that of a network-based grid repository service that a department can buy storage capacity from, and grow as its data needs expand.

Ben O'Steen, Oxford University, talked about their Fedora-based architecture. For me, this was by far the most exciting presentation of the whole week. Here's why. In my mind, they solved all the architectural problems everybody had spoken about for the last three days. Here are the highlights :

  • Scalability : objects can be placed in any object store on the network, and located via their object metadata. This means that scaling over multiple Fedora instances is a no-brainer. Need more storage capacity ? Just add another Fedora instance with another Honeycomb. Need more server resources ? Just add another virtual Fedora server in a VMware/Solaris/xVM container.
  • Open remote access : They chose UUIDs to identify objects, which are exposed to the world via these unique identifiers. This includes the RDF relationships of the objects. Ben showed a demo where he created a blog entry about a paper in the Oxford repository. As the blog entry was posted, the repository picked up the fact that an object was being linked to, and added this fact to the metadata store. Cool.
  • Extend functionality : They use the JMS interface, which interfaces via iCAL, for any scheduling needs, e.g. scheduling virus scans or logging user logins.
  • Ingest : ingesting content via a staging archive, which can be used to moderate data, e.g. cleaning up duplicates, and then move the "clean" data to the real archive.

He also explained the object relationship model using the example of how a book is stored. A page of a book (think a TIFF file of the scanned page) links to a chapter, which then links to the book. All via RDF, and queryable in their XML form. These relationships can be defined on the fly.
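Here is how I would sketch that page-chapter-book model in a few lines of Python. To keep it self-contained I use plain (subject, predicate, object) tuples instead of a real RDF library, and the `isPartOf` predicate and the identifiers are my own invention, not necessarily what the Oxford team uses.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical relationships for one scanned book.
TRIPLES: List[Triple] = [
    ("page:17", "isPartOf", "chapter:2"),
    ("page:18", "isPartOf", "chapter:2"),
    ("chapter:2", "isPartOf", "book:1"),
]


def ancestors(subject: str, triples: List[Triple], predicate: str = "isPartOf") -> List[str]:
    """Follow the predicate upwards, e.g. page -> chapter -> book."""
    parents = {s: o for s, p, o in triples if p == predicate}
    chain: List[str] = []
    seen = {subject}
    while subject in parents:
        subject = parents[subject]
        if subject in seen:  # guard against accidental cycles
            break
        seen.add(subject)
        chain.append(subject)
    return chain


def members(obj: str, triples: List[Triple], predicate: str = "isPartOf") -> List[str]:
    """Everything that points directly at obj, e.g. the pages of a chapter."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)
```

"Defined on the fly" then simply means appending another tuple to the list; nothing about the stored objects themselves has to change.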

For further information, see

That was it for me for the day. The evening was a formal awards dinner. After the meal, the winners of the Open Repository Challenge were announced. Guess who won the $5000 prize ? Ben O'Steen and team. In the last two days, Ben and team whipped up the code to exchange the content of an Eprints repository with a Fedora Commons repository. That's what I call interoperability. Well done, Ben and team.

Day 2 started with a series of extremely interesting presentations on national and international perspectives of open repositories, followed by six talks describing a wide selection of scientific repositories. The afternoon was occupied by talks about models, architectures and frameworks, and a section on usage. I'll try to pick a representative sample of the day here.

Simon Coles from the University of Southampton gave a fun-packed talk about his experiences with repositories and blogs in laboratories. His R4L project aims to address the gap between actual laboratory experiments and the publication of papers. I got quite a bit of contextual understanding around academic life from his talk. Here's one example. Simon said : "40 years ago a PhD student would determine 3 crystal structures during the course of their study, this can now be done in one day." Now, that's what we call data explosion !

Christian Gumpenberger, Novartis, gave the audience a deep understanding of the trials and tribulations of introducing an Eprints-based pharmaceutical repository corporate-wide at Novartis. His talk stood out for me, as it was one of the very few sessions in which a commercial entity took it upon themselves to organize their knowledge in a repository. Project OAK (Open Access to Knowledge) was a master's lesson on how to navigate the corporate world when it comes to implementing a central knowledge data bank. Challenges were many, most of which were in keeping the project going after a successful start. On the technology side, Eprints was, according to Christian, the right choice for Novartis. A good thing to say when you present at the "Home of Eprints".

I then jumped onto the "Models, Architecture and Framework" track for two sessions. One of them was a presentation by Herbert van de Sompel, Los Alamos National Laboratory (LANL), on the aDORe Federation Architecture. This was a brilliant talk. Herbert explained how LANL designed and implemented an architecture to federate repositories for scale. Scale at LANL means 100 million objects in 9200 repositories. Massive scale, I'd say. Tons of ideas popped into my mind here. I could see how one could build hardware platform building blocks that would support the idea of scaling repositories by federation in a completely transparent manner.

My last session of the day was entitled "MESUR: Implications of usage-based evaluations of scholarly status for open repositories" by Johan Bollen, Los Alamos National Laboratory. Just reading the show brochure, this looked like a less interesting topic. Statistics and numbers, right ? Not so. Johan is a skilled presenter, and his fast-paced talk was a blast. The project mined a wide choice of journals and used their citations to create a graphical model of how the publications (and therefore the sciences) interconnect. Very interesting. For me, their work was one of the best visualisations of huge datasets I have ever seen. Check out the project's website.

Before I forget, I also attended the Microsoft session from Lee Dirks, Santosh Balasubamanian and Savas Parastatidis. We met the guys in the hotel earlier the same day, and got talking. The folks are working on a research project around using Microsoft technologies for repositories. Built on top of Microsoft SQL Server, their platform was shown by Santosh and Savas in a series of impressive demos centered around the ease of development for repository software. From what I have seen during the last couple of days, this is probably the most complete development environment, even at this early stage of the project. It does require the developer to stay within the well-padded Microsoft environment, and as the question and answer session illustrated, cross-platform (read non-Windows) deployment does present a challenge. What did surprise me was the presenters' sincere commitment to being open. Have the winds shifted to a more open-source attitude at Microsoft ? I wondered.

This was a long day. Off to the pub for some well-deserved pints of London Pride.