Kristen's Musings: collected thoughts for no particular reason

Friday Apr 18, 2008

The most important step in refining data is writing down what makes it tick. In my last blog entry I outlined the process by which my team refined the data flow for Sun Web sites and mentioned the Unified Product Data Model (UPDM). UPDM is a formal means of organizing the information for Sun Web sites, with an emphasis on product data. The goal in developing UPDM was to reduce the redundancy and inefficiency in the data sets for the Starlight publishing platform, and then to extend the benefits to the e-commerce sites, each of which happens to be hosted on a different platform.

UPDM provides definitions, basic business rules, and relationships between facets of product data, including the hierarchy and the attributes of each category, product, and part. It’s maintained in a simple XML format that can be transformed to HTML, a spreadsheet, or even a UML diagram. The following listing is a snippet from the actual core UPDM 1.0 model used in Starlight.

<?xml version="1.0" encoding="UTF-8"?>
<data-model>
  <label>Data Model Browser - UPDM 1.0</label>
  <explanation>
   <p>The UPDM Data Model Browser describes concepts and attributes that are 
      core to the <b>Sun Unified Product Data Model</b>. This 
      version [UPDM 1.0] covers product data elements as they are represented 
      in <b>Sun.com, shop.sun.com, and Sun Catalogue</b>. Product elements 
      that describe transactions, implementation, or presentation are not 
      included in UPDM...</p>
  </explanation>

  <concept id="product">
    <label>Product</label>
    <explanation>Actual product entity.  Representation of the unit offered 
                 to the market by Sun (i.e. Sun Java System Application Server 
                 Platform Edition 9.0; Sun SPARC Enterprise M5000 Server.)
    </explanation>
    <implementation-guideline>
      Use the id as a stand-in for the product itself
    </implementation-guideline>
    <association ref="swordfish-id"/>
    <association ref="name">
      <constraint>Strictly syndicated through SwoRDFish</constraint>
    </association>
    <association ref="description"/>
    <association ref="image"/>
    <association>
      <concept id="plc-date">
        <label>Product life cycle date</label>
        <explanation>A date on which a change of PLC status occurs</explanation>
        <implementation-guideline>
          <data type="date"/>
        </implementation-guideline>
[...]
        <example>2006-10-10</example>
        <comment>Related to the price effectivity date</comment>
      </concept>
    </association>
  </concept>

  <concept id="industry">
    <label>Industry</label>
    <explanation>Industry for which suited or targeted</explanation>
    <association ref="swordfish-id"/>
    <implementation-guideline>
      Use the SwoRDFish ID as a stand-in for the industry itself
    </implementation-guideline>
  </concept>
[...]
</data-model>
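
As a rough illustration of the "transformed to HTML" idea, here is a minimal Python sketch. It is not the transformation Starlight used (the post does not say what that was). The function name `updm_to_html`, the cut-down sample document, and the choice of output markup are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# A cut-down, well-formed fragment modeled on the listing above (assumed sample).
UPDM_XML = """<data-model>
  <label>Data Model Browser - UPDM 1.0</label>
  <concept id="product">
    <label>Product</label>
    <explanation>Actual product entity.</explanation>
    <association ref="swordfish-id"/>
    <association ref="name"/>
  </concept>
  <concept id="industry">
    <label>Industry</label>
    <explanation>Industry for which suited or targeted</explanation>
    <association ref="swordfish-id"/>
  </concept>
</data-model>"""


def updm_to_html(xml_text: str) -> str:
    """Render each top-level concept as a heading, a paragraph, and a list of associations."""
    root = ET.fromstring(xml_text)
    parts = [f"<h1>{escape(root.findtext('label', default=''))}</h1>"]
    for concept in root.findall("concept"):
        concept_id = concept.get("id", "")
        label = concept.findtext("label", default=concept_id)
        # Collapse the indentation whitespace that the XML layout introduces.
        explanation = " ".join(concept.findtext("explanation", default="").split())
        parts.append(f'<h2 id="{escape(concept_id)}">{escape(label)}</h2>')
        parts.append(f"<p>{escape(explanation)}</p>")
        refs = [a.get("ref") for a in concept.findall("association") if a.get("ref")]
        if refs:
            items = "".join(f'<li><a href="#{escape(r)}">{escape(r)}</a></li>' for r in refs)
            parts.append(f"<ul>{items}</ul>")
    return "\n".join(parts)


html_page = updm_to_html(UPDM_XML)
print(html_page)
```

The same walk over `concept` and `association` elements could just as well emit spreadsheet rows or diagram nodes, since the model itself stays the single source.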

We developed UPDM 1.0 at a time when there were not as many good options for expressing such data models. RDF and XMI carried too much baggage, and we wanted something simple and clear, although RDF does play an important role in how things are bound together in the implementation, as I’ll discuss another day. Again, we can generate all these other representations as needed. As an example, the following picture is a UML class diagram generated from the XML above.

OK, a bit of an eye chart. But when you develop a broad data model such as UPDM, it’s a real eye-opener, and you gain more than just the end product. You learn a lot about which business problems and business rules are not really well expressed anywhere, and are only to be found in someone’s head. Sometimes you learn about the key tensions between how different roles and departments interpret and process information.

To take one example, at Sun the hardware we actually sell, the SKUs, are called “parts” in the marketing department, including e-commerce. In many other departments, and in a lot of the vendor software we use, these are called “products”. We don’t really market at this level, though. We market families of these, such as “SunFire T2000”, and these are what we call “products” and what others call “product families”.

UPDM itself doesn’t provide any magic to reconcile such differences. It does provide the best you can hope for – a framework for writing down the knowledge so it’s open, shared, and even accessible through code. Then you have half a chance to build some magic on top of the model.
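
To make "accessible through code" a little more concrete, here is a minimal Python sketch that reads a model fragment and reports what a concept is associated with, along with any constraint written on the association. The function name `describe_associations` and the trimmed sample XML are assumptions for illustration. The element names follow the UPDM listing, but this is not how Starlight itself consumed the model.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment modeled on the UPDM 1.0 listing (assumed sample).
UPDM_XML = """<data-model>
  <concept id="product">
    <label>Product</label>
    <association ref="swordfish-id"/>
    <association ref="name">
      <constraint>Strictly syndicated through SwoRDFish</constraint>
    </association>
    <association>
      <concept id="plc-date">
        <label>Product life cycle date</label>
      </concept>
    </association>
  </concept>
</data-model>"""


def describe_associations(xml_text, concept_id):
    """Return (associated concept id, constraint or None) pairs for one concept."""
    root = ET.fromstring(xml_text)
    concept = root.find(f"concept[@id='{concept_id}']")
    if concept is None:
        raise KeyError(concept_id)
    result = []
    for assoc in concept.findall("association"):
        # An association either points at a concept by ref or defines one inline.
        nested = assoc.find("concept")
        target = assoc.get("ref") or (nested.get("id") if nested is not None else None)
        result.append((target, assoc.findtext("constraint")))
    return result


associations = describe_associations(UPDM_XML, "product")
print(associations)
```

Here `associations` holds three pairs: `swordfish-id` and `plc-date` with no constraint, and `name` with the SwoRDFish syndication rule. Once rules like that one are written down in the model, a tool can check them rather than relying on someone remembering them.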

Tuesday Apr 01, 2008

Captain Data Model Chronicles

When I started as data architect for Sun’s Web Publishing Engineering (WPE) department we were just coming out of pilot for Starlight, a unified content platform for Sun’s Web sites. (Back then, my team, Content Management Engineering - CME - didn't even exist yet!) As we began building Starlight, we had a few key goals: increased reuse, improved standardization and globalization. The system is a document-driven platform, which is a good thing, but back in its early days it suffered a bit from lack of a data modeler’s touch. I saw pretty quickly that it wouldn’t scale very well across the wide variety of applications on Sun’s Web space. We had to take a number of steps to improve the system to meet its ambitions.

At the beginning stage the document database was in what you can think of as preschool form. Documents and their authoring and presentation templates were designed pretty much ad hoc, as business needs drove them. There was no real organization to any of this, so very often, when similar needs came along later on, people ended up reinventing the wheel. This led to inefficiency throughout the content life-cycle. On one end, authors would have to get used to multiple templates to create similar documents. On the other end, those developing Web applications on Starlight had to create multiple overlapping presentation templates. As a result the workflow was quite a tangle, as I illustrate below.

The first step was to formalize document design. Starlight initially focused on marketing Sun products, and we were exploring how we could better share content and data between the product marketing sites and the various e-commerce sites worldwide. This became an effort to create a Unified Product Data Model (UPDM). We then applied UPDM to formalize the design of many of the Starlight documents, and we used the extensibility of UPDM to establish a model for pages not as closely identified with products. I’d love to discuss UPDM more, because it was a core achievement that provided a foundation for so much of what followed, but for now I’ll continue with the main story.

Once we’d formalized document structures we could identify redundant templates and combine them, and we could also make the workflow much clearer and more efficient. We were able to accomplish this back in 2005 smoothly enough that you probably didn’t notice. For the most part, we made no change to the CMS toolkit or personnel, nor to the resulting pages. All we did was apply data architecture to make these more efficient. Call it data grammar school, illustrated in the following diagram.

Much less confusion. The document design is clearly defined, which makes it easier for Web applications to identify and pull the content they require, and makes useful middleware out of the ad hoc authoring and presentation support tools. But regardless of how carefully you try to control the proliferation of document templates, you can’t fit every new need into existing ones. You may never regress all the way back to the chaos of preschool, but over time you can definitely lose some of the benefits of all the careful organization. As Starlight grew in application, we quickly saw that we needed to organize things even further.

A Web space as broad as Sun’s may have thousands of different permutations of source documents and presentation pages, but for the most part there are basics that you use over and over again. You have titles, links, keywords, images, personal and organizational contact information, prose snippets, and so on. We turned that into a library of document components, defined in RELAX NG and incorporating UPDM and other standards from inside and outside Sun (as an example of the latter, we heavily reused components from XHTML 2.0). As a result, most of the document templates became nothing more than a bunch of components snapped together, so that the complexity of the middleware and application queries no longer has to scale as dramatically with the number of document templates.
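
The scaling argument can be sketched in a few lines of Python. Everything here is hypothetical: the component names, the two templates, and the helper functions are invented for illustration. The real library was defined in RELAX NG, not in Python.

```python
# Hypothetical component library; the names are illustrative, not Starlight's.
COMPONENT_LIBRARY = {"title", "link", "keywords", "image", "contact", "prose"}

# Each template is nothing more than a list of library components snapped together.
TEMPLATES = {
    "product-overview": ["title", "image", "prose", "link"],
    "press-release": ["title", "prose", "contact", "keywords"],
}


def validate_templates(templates, library):
    """Return {template name: [unknown components]} for templates that stray from the library."""
    problems = {}
    for name, components in templates.items():
        unknown = [c for c in components if c not in library]
        if unknown:
            problems[name] = unknown
    return problems


def extract(document, component):
    """Pull every value of one component type from a document, whatever its template."""
    return [value for kind, value in document["components"] if kind == component]


doc = {
    "template": "press-release",
    "components": [
        ("title", "Example title"),
        ("prose", "Example body text"),
        ("contact", "press@example.com"),
    ],
}

print(validate_templates(TEMPLATES, COMPONENT_LIBRARY))
print(extract(doc, "title"))
```

Note that `extract` never looks at the template name. A query written against components keeps working when a new template is added, which is the sense in which complexity stops scaling with the number of templates.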

At the same time, we had to re-engineer our rendering technology to take advantage of componentization of content. Parallel to our library of document components there is a separate Sun project, Web Design Standards, that defines in fine detail how Sun material should be presented on the Web. We organized our rendering templates so that they gather the needed content components as input, and generate the needed Web Design components as output. The result is a system where building blocks can be readily identified and reused throughout the publication process. Call it data college.
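
One way to picture "content components in, Web Design components out" is a table of small renderers keyed by component type. The Python sketch below is an assumption-laden illustration: the component names, CSS class names, and the `render` function are invented here and do not come from Starlight or from Sun's Web Design Standards.

```python
from html import escape

# Hypothetical renderers: each maps one content component to one design component.
RENDERERS = {
    "title": lambda v: f'<h1 class="page-title">{escape(v)}</h1>',
    "prose": lambda v: f'<p class="body-copy">{escape(v)}</p>',
    "link": lambda v: f'<a class="cta" href="{escape(v)}">{escape(v)}</a>',
}


def render(components):
    """Render a sequence of (component type, value) pairs, one design component each."""
    out = []
    for kind, value in components:
        renderer = RENDERERS.get(kind)
        if renderer is None:
            # Fail loudly so a missing building block is noticed, not silently dropped.
            raise ValueError(f"no renderer for component {kind!r}")
        out.append(renderer(value))
    return "\n".join(out)


page = render([("title", "A & B"), ("prose", "Hello")])
print(page)
```

Because each renderer knows only its own component, a rendering template reduces to choosing which components to gather and in what order.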

At this point we have the foundations for efficiently creating and routing content, allowing publishers to focus on what they want to communicate and enable in their Web applications. Of course this is only the jumping-off point for tackling even harder problems, such as how to better find and sell our products and services, and how to engage the community more readily on the Web. These are the challenges to which we’ve turned our attention in the past year or so, and the most important factor allowing us to deal with these grown-up problems is that finally, at least in CME, Johnny Data can read -- and write.