Friday Oct 03, 2008

The video at www.xdrtb.org is worth watching.   Getting it seen by
millions of people all over the world is a fulfillment of a "TED Prize
wish".   The pictures speak for themselves.   It was not what I expected.


My response to the pictures...

As we move forward in noticing how much Trust is what is called for
between us -- to consciously discern what is trustworthy about trust,
where we can reliably trust each other and the world, where we can let
go of the distrust that eats away,  where we can let go of misplaced
trust in false idols -- it is helpful to see the face of God in each
other.



Peter

-------- Original Message --------
Subject:     37 pictures the world must see
Date:     Fri, 03 Oct 2008 10:30:10 -0500
From:     TED <TED@mail6.subscribermail.com>
Reply-To:     TED <contact@ted.com>
To:     pcudhea@comcast.net



“I’m working on a story that the world needs to know about. I wish for
you to help me break it, in a way that provides spectacular proof of the
power of news photography in the digital age.”

James Nachtwey


Today, this major TED Prize wish is coming true.

I urge you to take three minutes out of your day, and click on this link:

www.xdrtb.org

(I recommend clicking on the arrow at bottom right of the video player
to enlarge it.)

Tonight these pictures will be projected in some 50 cities all over the
world.  We wanted you to see them first.

I will follow up shortly with another note about this extraordinary story.



Chris Anderson, TED Curator



Friday Jul 18, 2008

Solaris iSCSI CHAP and RADIUS Configuration FAQ


Revision History
















Version 0.1 for internal comments    Peter Cudhea    June 26, 2008
Version 0.2 as posted in blog        Peter Cudhea    July 18, 2008





Scope of this FAQ


This document pulls together some well-known and not-so-well-known information about how to configure the CHAP and RADIUS security features on Solaris iSCSI. It is necessarily incomplete, since it does not go into detail about any of the other aspects of iSCSI configuration.


What is iSCSI?


Internet SCSI (iSCSI) is a network protocol for allowing block-based access to remote storage devices across the Internet. Essentially, it allows an initiator to mount block devices hosted on a remote target as if they were local block devices. The iSCSI protocol is described in RFC 3720 and associated RFCs.


In iSCSI terminology, a particular iSCSI initiator system sets up a session to a particular target system requesting access to a particular storage target. A target device can then export multiple Logical Units, usually called LUNs. In iSCSI, each initiator node and each target node has a unique name, in either IQN or EUI format. For now the details of these formats are not relevant. A storage session links one initiator, one target, and one particular LUN, and makes that LUN available on the initiator system.


What iSCSI options does Solaris support?


Solaris supplies an implementation for the iSCSI initiator (which allows a system to import remote storage as if it were a local disk) and an iSCSI target (which allows a system to export local storage to be mountable by remote initiators). Most of this document is about the iSCSI target.


In addition to the classic iSCSI target, an advanced version of the iSCSI target is under development as part of the Common Multi-Protocol SCSI Target (COMSTAR) project on opensolaris.org. The COMSTAR iSCSI target will support iSCSI Extensions for RDMA (iSER). See the iSER and COMSTAR pages at opensolaris.org for more information on this project.


This document only goes into detail about the classic Solaris iSCSI target.


What is iSCSI target discovery?


There are multiple ways to set up a session from the initiator to a particular LUN on the target. This overall process is called target discovery. Target discovery interacts with authentication because targets that are not visible through target discovery do not need to be authenticated. Solaris supports three flavors of target discovery:



  • Static Discovery - an entry on the initiator side directly identifies by name (eui or iqn) the particular target that it wants to connect to.

  • SendTargets Discovery - the initiator is configured with the IP address of a target system. All the targets configured on that system that are visible to that initiator are automatically discovered.

  • iSNS Discovery - the target registers its storage targets with a storage name service server (the iSNS server). The initiator can ask the iSNS server which storage targets are available. This document does not go into detail about iSNS discovery.


What is a "local initiator"?


The Solaris iSCSI target can be configured with information about the initiators that may wish to connect to that target. The information about each initiator is recorded using the commands


iscsitadm create initiator -N <initiator-iqn> <initiator-alias>


iscsitadm modify initiator <options> <initiator-alias>


What is an Access Control List?


The access control list (ACL) is a target-side, per-target parameter that specifies a limited list of initiators that are allowed to see that target. The entries in an iSCSI target ACL should be a list of initiator aliases from the target's set of local initiators.


For a target without an ACL, all non-authenticated or authenticated initiators are allowed to list and to connect to that target.


For a target with an ACL, all initiators whose iqn matches an entry on the ACL are allowed to list that target, i.e. to see it with iscsiadm list discovery-address -v. Only authenticated initiators are allowed to connect to it.


Note that target-side Access Control Lists are not used with iSNS. The work of access control is completely delegated to the iSNS server.


What is iSCSI Authentication?


The initiator and the target must establish a relationship of trust. A target without an access control list automatically "trusts" all non-authenticated and authenticated initiators. A target with an access control list will only "trust" the initiators on that list, i.e. the initiators whose initiator-iqn is already known to the target. But simply knowing the right name is no security.


In classic storage area networks such as Fibre Channel networks, much of the trust between nodes is achieved by the physical security (i.e. isolation) of the nodes and links in the network.


Full iSCSI security depends on full security at the Internet Protocol level, such as is provided using the IPSec internet security suite. This document does not go into detail of how to use IPSec features with iSCSI.


iSCSI Authentication is a method whereby an initiator node and target node can assure themselves that each one has been specifically configured to connect with the other, i.e. that each side shares a "secret" that only the other side should know. iSCSI authentication can be used whether or not IPSec is used.






What is CHAP (Challenge Handshake Authentication Protocol)?


iSCSI uses the Challenge Handshake Authentication Protocol (CHAP) as described in RFC 1994 to verify the identity of the agent on the other side of the wire. In CHAP, each agent is configured with a CHAP name and CHAP secret, which are essentially a username and password for that agent. The other side is then configured with a password database that includes the CHAP name and CHAP secret for each agent that should be allowed to connect.


In CHAP, the secrets themselves never travel over the Internet in the clear. Instead, a Challenge token is sent across the wire. Only an agent that possesses the matching secret can return the appropriate Response. See RFC 1994 for more details on this exchange, and RFC 3720 for more details on how the exchange is embedded in iSCSI login request and response messages.
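To make the challenge/response idea concrete, here is a minimal Python sketch of the computation that RFC 1994 defines for CHAP with MD5: the response is a hash over the one-octet identifier, the shared secret, and the challenge. The function names and the sample secret are illustrative only and are not part of any Solaris tool.

```python
import hashlib
import hmac


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """Compute the CHAP response: MD5(identifier octet + secret + challenge)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def verify_chap_response(identifier: int, secret: bytes,
                         challenge: bytes, response: bytes) -> bool:
    """Recompute the expected response from the locally stored secret and
    compare it with what the peer sent. The secret itself never travels."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)
```

The verifier needs the secret in usable form to recompute the hash, which is why both sides (or a RADIUS server acting for one side) must hold the same CHAP secret.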


What is unidirectional authentication?


The simplest way of configuring CHAP authentication is for the target to authenticate initiators.


On the initiator side, an entry is created to specify the outgoing CHAP name and CHAP password for that initiator node. For the Solaris initiator, this is done with the iscsiadm modify initiator-node command.


iscsiadm modify initiator-node -H <initiator-CHAP-name>


iscsiadm modify initiator-node -C


Enter Secret: <initiator-CHAP-secret>


Repeat Secret: <initiator-CHAP-secret>


iscsiadm modify initiator-node -a CHAP


On the target side, the administrator must specify the user name and password to expect from the initiator. The way to do this on Solaris is to create a local initiator entry on the target for each initiator that might wish to connect via CHAP. The entry specifies the entity name (eui or iqn), and the incoming CHAP name, and the CHAP password for that entity. It is also possible to create a local initiator that does not specify a CHAP secret.


For the Solaris target, creating a local initiator is done with the iscsitadm create/modify initiator command.


iscsitadm create initiator -N <initiator-iqn> <initiator-alias>


iscsitadm modify initiator -H <initiator-CHAP-name> <initiator-alias>


iscsitadm modify initiator -C <initiator-alias>


Enter Secret: <initiator-CHAP-secret>


Repeat Secret: <initiator-CHAP-secret>


Authentication will be successful whenever there is a local initiator on the target that matches the iqn, the CHAP name, and the CHAP secret as configured on the initiator.


Any gotchas? Why is Unidirectional Authentication still failing?


Are you using a version of Nevada earlier than snv_93, or any version of Solaris 10 before at least Update 7?


Be aware of bug 6667315 and its workaround. In earlier releases, even if all the above steps were followed, unidirectional authentication would still fail. The bug was that unless an outgoing CHAP name and CHAP secret were configured on the target system, the CHAP negotiation would never be triggered, and authentication would always fail. The workaround is to configure an outgoing CHAP name and CHAP secret on the target system as described in the section on bidirectional authentication below. The outgoing CHAP information is not actually used for unidirectional authentication; it just has to be configured so that CHAP negotiation is triggered.


What is bidirectional authentication?


In bidirectional authentication, the initiator also gets a chance to authenticate the target (no rogue initiators AND no rogue targets). In this case, the target also configures an outgoing CHAP name and CHAP secret. If the initiator wishes to use bidirectional authentication, then it must request bidirectional authentication, and the initiator must be configured with an incoming CHAP name and CHAP secret for the target that match the target's own information.


For the Solaris iSCSI target to participate in bidirectional CHAP, the target must be configured with an outgoing CHAP name and CHAP secret. The same secret is used for all separate initiators that wish to authenticate the target.


iscsitadm modify admin -H <target-CHAP-name>


iscsitadm modify admin -C


Enter Secret: <target-CHAP-secret>


Repeat Secret: <target-CHAP-secret>


For the Solaris iSCSI initiator, bidirectional CHAP may be configured once the target is visible to the client. Configuration involves requesting bidirectional CHAP as well as setting up the expected incoming CHAP information for that target. There are four separate parameters that must be set.


iscsiadm modify target-param -a CHAP <target-iqn>


iscsiadm modify target-param -H <target-chap-name> <target-iqn>


iscsiadm modify target-param -C <target-iqn>


Enter Secret: <target-CHAP-secret>


Repeat Secret: <target-CHAP-secret>


iscsiadm modify target-param -B enable <target-iqn>


Note that the secrets used in the initiator->target direction and the target->initiator direction must not match, or else a man-in-the-middle attack is possible. iSCSI partners are supposed to check incoming requests to validate that they do not share secrets.


The iscsiadm modify target-param command modifies parameters that are specific to one particular storage target. One interesting fact is that the -a setting from the above sequence overrides the -a setting for the initiator-node. Even if the initiator-node is set to -a none, if the target-param is set to -a CHAP, then CHAP will be requested. Conversely, even if the initiator-node is set to -a CHAP, if the target-param specifies -a none, then CHAP will not be requested.


When CHAP is used in bidirectional authentication, the outgoing CHAP name and CHAP secret are taken from the initiator-node of that initiator. In particular, once a system has been configured to use bidirectional authentication, the CHAP name and CHAP secret that were already defined for that initiator-node can still be used for bidirectional authentication, even if iscsiadm modify initiator-node -a none has since been done to disable unidirectional authentication. This is one observable difference between an initiator system that has never used CHAP and a system that used CHAP at one time and then turned it off.
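The override rule can be summarized in a few lines of Python. This is only a sketch of the behavior described above, not of the initiator's actual implementation; in particular, modeling "no per-target setting" as None and falling back to the initiator-node setting in that case is an assumption.

```python
from typing import Optional


def effective_auth_method(initiator_node: str,
                          target_param: Optional[str]) -> str:
    """Return the authentication method requested for one storage target.

    initiator_node: the node-wide -a setting ("CHAP" or "none").
    target_param:   the per-target -a setting, or None if none was set
                    (assumed here to fall back to the node-wide setting).
    """
    if target_param is not None:
        return target_param  # the per-target setting wins in both directions
    return initiator_node
```

For example, effective_auth_method("none", "CHAP") requests CHAP for that target, while effective_auth_method("CHAP", "none") does not.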


What triggers CHAP authentication?


CHAP authentication is triggered in two different cases:



  1. The initiator system requests authentication=CHAP during the iSCSI login process. For the Solaris initiator, this happens whenever iscsiadm modify initiator-node -a CHAP is selected.

  2. The target system has a local initiator entry that matches the initiator's iSCSI name AND the local initiator is mentioned in the ACL for the storage target that is being logged into AND the local initiator includes a CHAP name and CHAP secret for the initiator.


What is the result of authentication?


For each initiator that attempts to connect to the target, the result of authentication will be one of the following five states:



  • non-authenticated

      • the target did not have a local initiator configured that matched the incoming initiator
      • the initiator was not configured to request CHAP

  • named-but-not-authenticated

      • the target did have a local initiator configured that matched the incoming initiator
      • the local initiator was not configured with an expected incoming CHAP name and CHAP secret
      • the initiator was not configured to request CHAP

  • unidirectional authentication complete

      • the initiator was configured with an outgoing CHAP name and CHAP secret
      • the initiator was configured to request CHAP
      • the target did have a local initiator configured that matched the incoming iqn
      • the target's local initiator was configured to match the initiator's incoming CHAP name and CHAP secret

  • bidirectional authentication complete

      • the initiator was configured with an outgoing CHAP name and CHAP secret
      • the initiator was configured for bidirectional authentication, i.e. it has a target-param entry for the desired storage target, that target-param contains an incoming CHAP name and CHAP secret that match the target's actual CHAP information, and the target-param was configured to request bidirectional authentication
      • the target did have a local initiator configured that matched the incoming iqn
      • the local initiator was configured with an incoming CHAP name and CHAP secret that matched the initiator's
      • the target was configured with an outgoing CHAP name and CHAP secret
      • the secret used in the initiator->target direction did not match the secret used in the target->initiator direction

  • authentication-failed

      • CHAP was requested either at the initiator side or the target side, and one of the above conditions was not met
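The states above can be read as a small decision procedure. The following Python sketch encodes that reading; the LoginFacts fields are hypothetical names for the conditions listed, and cases the list does not cover (for example, a matching local initiator that has CHAP information but is not on the ACL) are resolved here by assumption.

```python
from dataclasses import dataclass


@dataclass
class LoginFacts:
    local_initiator_matches_iqn: bool = False  # target knows this initiator name
    local_initiator_in_acl: bool = False       # and lists it in the target's ACL
    local_initiator_has_chap: bool = False     # and has a CHAP name/secret for it
    initiator_requests_chap: bool = False      # initiator set to -a CHAP
    initiator_chap_matches: bool = False       # initiator's name/secret match the target's record
    bidirectional_requested: bool = False      # initiator asked to authenticate the target
    target_chap_matches: bool = False          # target's outgoing name/secret match the initiator's record
    secrets_differ: bool = True                # the two directions use different secrets


def classify(f: LoginFacts) -> str:
    # CHAP is triggered either by the initiator or by the target's configuration.
    target_triggers = (f.local_initiator_matches_iqn
                       and f.local_initiator_in_acl
                       and f.local_initiator_has_chap)
    if not (f.initiator_requests_chap or target_triggers):
        if f.local_initiator_matches_iqn:
            return "named-but-not-authenticated"
        return "non-authenticated"

    # CHAP was requested: the target must be able to verify the initiator.
    inbound_ok = (f.local_initiator_matches_iqn
                  and f.local_initiator_has_chap
                  and f.initiator_chap_matches)
    if not inbound_ok:
        return "authentication-failed"
    if not f.bidirectional_requested:
        return "unidirectional authentication complete"
    if f.target_chap_matches and f.secrets_differ:
        return "bidirectional authentication complete"
    return "authentication-failed"
```

Walking a failing configuration through a function like this is a quick way to see which single condition is missing.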


How do I configure the target to "require CHAP authentication"?


Putting a few of the above answers together, the only way for a target to require CHAP authentication is to



  • have a local initiator for each expected initiator that will try to connect

  • have an ACL on each possible target that mentions which initiators can connect to it.

  • Each local initiator must specify a CHAP name and CHAP secret for that initiator.


With the above setup, the only initiators that will be allowed to connect to the target are those that are on the ACL and whose CHAP name and CHAP secret match those of the local initiator that matches the initiator-iqn.


If the target administrator does not configure an ACL for a particular target, then as described above non-authenticated initiators are allowed to connect to that target. Hence, "require CHAP authentication" is not achieved.


If the target administrator does not configure a CHAP secret for each expected initiator, then non-authenticated initiators can impersonate one of the named-but-not-authenticated initiators mentioned in the ACLs. So once again, "require CHAP authentication" is not achieved.
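As a rough illustration, the three requirements can be checked mechanically. The sketch below assumes a hypothetical in-memory description of one target's configuration (an ACL as a list of aliases, and local initiators as a dict keyed by alias); it does not read anything from iscsitadm.

```python
from typing import Dict, List


def chap_requirement_gaps(acl: List[str],
                          local_initiators: Dict[str, dict]) -> List[str]:
    """Return the reasons why this target does not yet require CHAP.

    acl:              initiator aliases on the target's ACL (empty = no ACL)
    local_initiators: alias -> {"chap_name": str or None,
                                "chap_secret_set": bool}
    An empty result means all three requirements are met.
    """
    gaps = []
    if not acl:
        gaps.append("target has no ACL, so any initiator may connect")
    for alias in acl:
        entry = local_initiators.get(alias)
        if entry is None:
            gaps.append(f"{alias}: on the ACL but no local initiator exists")
        elif not entry.get("chap_name") or not entry.get("chap_secret_set"):
            gaps.append(f"{alias}: local initiator lacks a CHAP name or secret")
    return gaps
```

A target passes only when the returned list is empty; each message corresponds to one of the failure modes described above.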


The request to have an easier way to configure the target to require CHAP authentication is being tracked in RFE 6707657.


What is RADIUS, and how does it relate to CHAP?


Either the target side or the initiator side can offload the work of managing CHAP names and CHAP secrets to a RADIUS server. In this case, each entity's outgoing CHAP name and CHAP secret are still configured in the normal place (either iscsiadm modify initiator-node for the initiator or iscsitadm modify admin for the target). But the matching incoming CHAP names and CHAP secrets are now configured on the RADIUS server. (Actually, the target side must still have the accurate CHAP names; only the CHAP secret is delegated to RADIUS.)


To make the communications with the RADIUS server secure, a shared "RADIUS secret" is configured between the radius client system (whether initiator or target) and the RADIUS server.


The RADIUS server is then configured with a database of expected usernames and passwords that match the expected set of CHAP names and CHAP secrets.


Note that for a RADIUS server to perform CHAP authentication, the CHAP secrets must be stored in the clear (non-encrypted) in the database on the RADIUS server. This is a constraint built into the CHAP protocol. Only with the password in the clear can the proper CHAP challenges and responses be computed. This characteristic of RADIUS has led some experts to recommend against the use of RADIUS for storing iSCSI CHAP secrets. Many bloggers contend that the convenience of configuring all your secret checking in one place is outweighed by the risk of having a clear-text database of all these usernames and passwords existing in some form on your network. It is a tempting plum to be plucked.


How do I configure the Solaris iSCSI Target to use RADIUS?


This is used on the target side to delegate to the RADIUS server the job of checking incoming CHAP names and CHAP secrets from the initiators. All other configuration information on the target is configured as usual.


iscsitadm modify admin -r <RADIUS-IP-address>


iscsitadm modify admin -P


Enter Secret: <RADIUS-shared-secret-for-target-ip-address>


Repeat Secret: <RADIUS-shared-secret-for-target-ip-address>


iscsitadm modify admin -R enable


How do I "Require CHAP Authentication" when using target-side RADIUS?


Currently (see RFE 6727351), it is difficult to "require CHAP authentication" when using RADIUS. The steps required are:



  • Create a local-initiator for each initiator that will wish to connect. The local initiator must have an incoming CHAP name that matches the initiator system's actual outgoing CHAP name. It also needs to have a CHAP secret that is non-null, but the actual contents of the CHAP secret are ignored.

  • Each target must have an access control list that mentions the specific initiators that should be able to view and mount that target.


In other words, the ONLY part of authentication that is currently being delegated to the RADIUS server is the actual password maintenance. All the rest of CHAP processing on the target, including CHAP name maintenance, is exactly the same whether or not RADIUS is being used.


How do I configure the Solaris iSCSI Initiator to use RADIUS?


This is used in bidirectional authentication to delegate to the RADIUS server the job of checking the target's CHAP name and CHAP secret. All other configuration information for bidirectional authentication should be set up as described above.


iscsiadm modify initiator-node -r <RADIUS-IP-address>


iscsiadm modify initiator-node -P


Enter Secret: <RADIUS-shared-secret-for-initiator>


Repeat Secret: <RADIUS-shared-secret-for-initiator>


iscsiadm modify initiator-node -R enable


Be aware of bug 6679007. The capability for a Solaris iSCSI initiator to use a RADIUS server was not working until release snv_92.


How do I configure the FreeRADIUS server to check CHAP secrets?


I have done most of my testing with FreeRADIUS. FreeRADIUS is not currently shipped as part of Solaris. It can be downloaded from the Internet. I used someone else's FreeRADIUS server, so I didn't have to build it from scratch. I also have personal and dedicated use of this RADIUS server, which facilitates debugging since I can stop and start the server whenever I want, as well as run it in debug mode.


That being said, here is a summary of the steps I took to configure the FreeRADIUS server to authenticate a new iSCSI initiator or target.


Modify clients.conf in /etc/raddb or equivalent directory to include an entry for the IP address of the initiator or target that is using RADIUS. This entry will also include the shared secret that was configured above.


Modify the users file in /etc/raddb to include an entry for the CHAP names and CHAP secrets that should be verified. To RADIUS, these look just like usernames and passwords.


Start the RADIUS server in debug mode and watch the output to see the results of each connection attempt. I manually restarted the RADIUS server each time I modified either of the above configuration files. Though I am not sure that was necessary, it was sufficient for my testing.
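For orientation, the two kinds of entries look roughly like the text produced by the Python helpers below. This is a sketch under assumptions: the exact stanza syntax and the Cleartext-Password attribute name depend on the FreeRADIUS version in use (older releases used a different attribute name), and the IP address, shortname, CHAP name, and secrets shown in the usage note are placeholders.

```python
def client_stanza(ip_address: str, shared_secret: str, shortname: str) -> str:
    """Render a clients.conf entry for one iSCSI initiator or target."""
    return (
        f"client {ip_address} {{\n"
        f"    secret = {shared_secret}\n"
        f"    shortname = {shortname}\n"
        f"}}\n"
    )


def user_entry(chap_name: str, chap_secret: str) -> str:
    """Render a users-file entry; the secret is stored in the clear,
    which CHAP verification requires."""
    return f'{chap_name} Cleartext-Password := "{chap_secret}"\n'
```

For instance, client_stanza("192.0.2.10", "radius-shared-secret", "iscsi-target") and user_entry("initiator-chap-name", "initiator-chap-secret") produce text that can be compared against the sample entries shipped with your FreeRADIUS installation before pasting anything in.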


How do you run a CHAP "experiment"?


This is where target discovery comes in. The best method to experiment is to turn off your discovery method on the initiator system, change some CHAP parameters, and then turn on discovery again. Each time discovery is turned on, the initiator system discovers which storage targets are available, and provisions as many of them as possible to be available to the initiator system's OS level.


For example, to use SendTargets discovery, first tell the initiator system the IP address of the target system:


iscsiadm add discovery-address <IP-address of target system>


Verify the list of visible targets by executing


iscsiadm list discovery-address -v


For example:


-bash-3.2# iscsiadm list discovery-address -v


<snip>


Target name: iqn.1986-03.com.sun:02:361bf4f4-eb16-428c-9bd0-b00b4aecb2b1.tgt-no-acl


Target address: 129.148.168.14:3260, 1




You can set up for a new experiment by disabling discovery:


iscsiadm modify discovery -t disable


This will make all the targets disappear from the initiator side. Note that you cannot disable SendTargets discovery if any of the known targets are currently in use. I tend to use a dedicated initiator and target system for my iSCSI experiments, so I can make sure that no targets are in use. Occasionally, I must reboot the initiator system to convince it that some particular target should not be in use.


You can verify that all the initiator-connections are now gone.


iscsiadm list target


Set up a series of parameters for an experiment, such as setting CHAP parameters on both initiator and target systems, and then re-enable sendtargets discovery:


iscsiadm modify discovery -t enable
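The disable, reconfigure, re-enable cycle can be written down as an ordered list of commands. The Python sketch below only builds that list and does not execute anything; it assumes SendTargets discovery is re-enabled with the same -t flag that disabled it, and it leaves out the -C and -P steps because those prompt interactively for secrets.

```python
from typing import Iterable, List, Sequence


def experiment_commands(configure_steps: Iterable[Sequence[str]]) -> List[List[str]]:
    """Return the initiator-side command sequence for one CHAP experiment."""
    commands = [
        ["iscsiadm", "modify", "discovery", "-t", "disable"],  # drop all targets
        ["iscsiadm", "list", "target"],                        # confirm none remain
    ]
    commands += [list(step) for step in configure_steps]      # the change under test
    commands += [
        ["iscsiadm", "modify", "discovery", "-t", "enable"],   # rediscover targets
        ["iscsiadm", "list", "target", "-S"],                  # check for OS device names
    ]
    return commands
```

For example, experiment_commands([["iscsiadm", "modify", "initiator-node", "-a", "CHAP"]]) yields five commands, which could be run by hand or passed one at a time to subprocess.run on the initiator system.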


How can I tell my storage is connected?


Signs that storage is now connected and usable by other software on the initiator system:



  • On the target system, iscsitadm list target -v shows a connection for each initiator that has connected to the target. Your initiator will show up in this list.


-bash-3.2# iscsitadm list target -v


Target: tgt-no-acl


iSCSI Name: iqn.1986-03.com.sun:02:361bf4f4-eb16-428c-9bd0-b00b4aecb2b1.tgt-no-acl


Connections: 1


Initiator:


iSCSI Name: iqn.1986-03.com.sun:01:00000000782b.4846f53d


Alias: perelandra


ACL list:


TPGT list:


LUN information:


LUN: 0


GUID: 010000101803255600002a004859729e


VID: SUN


PID: SOLARIS


Type: disk


Size: 200M


Status: unknown




  • On the initiator system, the output of iscsiadm list target -S includes an OS device name for each accessible target. If your connection worked, your target will show up in this list, and will have an OS Device name (e.g. /dev/rdsk/c0t2d0s0).


-bash-3.2# iscsiadm list target -S


<snip>



Friday Oct 26, 2007


I recently wrote a "Sun StorageTek 5800 System Concepts" section for the Sun StorageTek 5800 API Programming Guide. It is a bit long for a blog post, but this section stands on its own as an introduction to the programming model for Honeycomb and the 5800 system.   This blog is a good way to make it
available on a wide basis.  So here goes.

The version that ships with the product may be much improved from what goes here, and should be considered the definitive reference.

=========================

5800 System Overview

The Sun StorageTek 5800 System is an object-based storage archive appliance for fixed-content data and metadata. The 5800 system is designed from the ground up to be reliable, affordable, and scalable, and to integrate data storage with intelligent data retrieval and query. It is designed to store huge amounts of data for decades at a time. At that scale, issues of how and where the data is stored--and how that changes over time--can be quite cumbersome. The 5800 system usage model is designed to abstract away those issues, so the customer application can deal with just the data.

A custom Application Programming Interface (the 5800 Client API) is provided so customer applications can take advantage of all the features in the 5800 system usage model. The API supports the capabilities to:


  • store a new object into the archive (storeObject)
  • associate new metadata with an existing object to create a new object (storeMetadata, also known as addMetadata)
  • retrieve the data from an object that was previously stored (retrieveData)
  • retrieve the metadata from an object that was previously stored (retrieveMetadata)
  • delete an object (delete)
  • query for matching objects given a query expression of desired object characteristics (query)

Two distinct realizations of the 5800 system API exist for Release 1.1, a Java API and a C API.  In the following discussion, we use the vocabulary and Class names from the Java API as an aid to exposition. In all cases, a simple equivalent using the C API is available. The Java API and the C API use the same on-the-wire protocol.  The on-the-wire protocol, which is based on client-server HTTP exchange,  is designed as the place where new realizations of the 5800 programming model can be added (e.g. to add bindings for additional programming languages).  For purposes of this document, only the API itself is
relevant regardless of how it is implemented.

A Word About Honeycomb

The original code name for the project that grew into the 5800 system was "Project Honeycomb". This name lives on as the name of an OpenSolaris community that is bringing the Honeycomb software stack into the world of Open Source. The first realization of the Honeycomb storage model as a real product is the 5800 system as described here. As a model for programmable storage systems, however, the Honeycomb API has a broader reach than just this one system. The programming model is designed to scale both up and down to any storage archive system that wishes to abstract away issues of how data is stored from how it is used. As a concession to both the past and the future, the string "honeycomb" and the initials "hc" still live on in certain aspects of the API described here. When the API is used in contexts outside of the 5800 system, it will be called the  "Project Honeycomb API" .

The 5800 System Data Model

Each object in the 5800 archive consists of some arbitrary bytes of data together with associated metadata that describes the data. Once an object is stored, it is immutable. The 5800 programming model does not allow the data or the metadata associated with an object to be changed once the object has been stored, i.e. the system is a Write-Once Read-Multiple (WORM) archive. Each object corresponds to a single stream of data and a single set of metadata; there are no "grouped objects" or "compound objects" other than by application convention. Each object is separated at birth from all the other objects. The customer application is shielded from needing to know how the object is stored; it need only know how to retrieve it. Internally, several objects might "share" the same underlying storage; the application never needs to know or care about this.

A stream of data is stored in the object archive using storeObject. Once stored, each such object is associated with an object identifier or objectid. The storeObject operation takes both a stream of data and an optional set of metadata information and returns an objectid. The objectid can be remembered outside the 5800 system and may later be used to retrieve the data associated with that object using the retrieveObject operation. The retrieveObject operation takes an objectid as input and returns a stream of bytes as output that are identical to the bytes stored during the storeObject operation. Both the storeObject and retrieveObject operations handle the data in a streaming manner. Not all of the data need be present in client memory or in server memory at the same time; this is a crucial point for dealing with huge objects.

From a customer application, the store of an object into the archive is an all-or-nothing event. Either the object is stored or it isn't; there are no partial stores. If a store operation is interrupted, the entire storeObject call fails with an exception. Once an object id is returned to the customer application, the object is known to be durable. (In the face of an outage that causes some data loss, the system should be no more likely to lose a newly stored object than any other object.) There are no transactional semantics between different stores: there is no way to tie together two different store operations so that both either succeed together or fail together. [A stored object may or may not immediately be queryable; see below for the discussion of the 5800 system query integrity model.]

The 5800 system stores and retrieves arbitrary binary data.   For release 1.1, data sizes up to 400 GB are tested and supported. Using sizes even smaller than this may be appropriate as a "best practice".

The 5800 System Metadata Model

Metadata means "data about the data": it characterizes the data and helps to determine how the data should be interpreted. In addition, metadata can be used to facilitate querying the 5800 system for objects that match a particular set of search criteria.

For the 5800 system, the supported metadata option is in the form of name-value fields stored with each object. The set of possible fields is defined in the metadata schema. Setting up a metadata schema is an important system administration task that is described in the 5800 System Administration Guide. It is analogous to the process of database design that goes into creating a data management application. The metadata schema determines what field names, types, and lengths may be used with the metadata stored with each object. In addition, the layout of fields into tables within the schema, together with the definition of views that speed certain searches, determines which kinds of queries about that metadata will be both possible and effective. As such, the metadata schema should match the characteristics of the expected range of applications that will deal with the stored data.

The underlying software is designed to support multiple different kinds of metadata to aid in searching. For example, eventually there might be a specialized index to facilitate full-text search within the data objects. The API guide describes only the API for dealing with name-value metadata.

Fields in the schema can be either queryable or non-queryable. The values for non-queryable fields may be retrieved later but may not be used in queries. The 5800 system supports only single-valued fields. Each object can have only a single name-value pair of a given name. There is no built-in support for multiple valued fields, such as a list of separate authors of a book stored as multiple values of a single named field, nor for the queries that would make sense with multi-valued fields.

Each data object is associated with a set of name-value pairs at the time the object is stored.

Some metadata, the system metadata, is assigned by the 5800 system as each object is stored. For example, each object contains an "object creation time" (system.object_ctime) and an objectid (system.object_id), both of which are assigned by the system at the time an object is created. Some metadata, the computed metadata, is implicit in the stored data, and is made explicit at the time of the object store. For example, the system exposes the object data length as a metadata field (system.object_size). In addition, the 5800 system computes a SHA1 hash of the stored data as the data is stored and stores the hash as a metadata field (system.object_hash). There is also an associated field (system.object_hash_alg) to specify which hash algorithm was used in computing the system.object_hash. It is currently always set to "sha1".

Finally, some metadata, the user metadata, is supplied by the customer application in the API call at the time an object is stored. Each store operation is allowed to include a NameValueRecord that indicates a set of name-value pairs to be associated with the data object as metadata. Each name in the name-value record must match a field name from the metadata schema; in addition, the data value supplied for each field must match the type and length for the field as specified in the schema. If the names or values supplied for the user metadata do not match the active schema, then an exception is generated and the object is not stored.
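To make the schema check concrete, here is a minimal Python sketch of the kind of validation described above. It is not the 5800 client API: the SCHEMA dictionary, the book.* field names, the SchemaMismatch exception, and both helper functions are hypothetical names invented for this illustration, and the real check is performed by the 5800 system itself.

```python
# Hypothetical schema for illustration: field name -> (type, maximum length or None).
SCHEMA = {
    "book.title": (str, 64),
    "book.author": (str, 32),
    "book.pages": (int, None),
}


class SchemaMismatch(Exception):
    """Raised when a name-value record does not match the schema."""


def validate_record(record, schema=SCHEMA):
    """Check every name-value pair against the schema; raise on the first mismatch."""
    for name, value in record.items():
        if name not in schema:
            raise SchemaMismatch(f"unknown field: {name}")
        expected_type, max_length = schema[name]
        if not isinstance(value, expected_type):
            raise SchemaMismatch(f"wrong type for {name}")
        if max_length is not None and len(value) > max_length:
            raise SchemaMismatch(f"value too long for {name}")


def matches_schema(record, schema=SCHEMA):
    """Boolean convenience wrapper around validate_record."""
    try:
        validate_record(record, schema)
    except SchemaMismatch:
        return False
    return True
```

Note that a record does not have to supply every field in the schema; it only has to avoid supplying anything the schema does not allow.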

The metadata associated with an object is immutable. There is no operation to modify the metadata associated with an object after the object has been stored. Instead, the storeMetadata operation can be used to create a completely new object by associating new user metadata with the underlying data and system metadata of an existing object. The storeMetadata operation does NOT merge the new metadata in with the metadata from the original OID; it replaces the metadata with a completely new set. To accomplish a merge of new field values into existing metadata, the customer application must manually retrieve the existing metadata from the original object, perform the merge into a single NameValueRecord on the client side, and then call storeMetadata to create a new object with the merged metadata.
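The retrieve, merge, and store sequence can be sketched as follows. ToyArchive is an in-memory stand-in written for this post, not the real client library; its method names only loosely mirror the operations named above, and for simplicity its retrieve_metadata returns only user metadata (the real retrieveMetadata also returns system fields, which a real merge would have to leave out).

```python
import itertools


class ToyArchive:
    """In-memory stand-in for the archive; NOT the real 5800 client API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._data = {}       # object id -> data bytes
        self._metadata = {}   # object id -> user metadata dict

    def store_object(self, data, metadata):
        oid = f"oid-{next(self._ids)}"
        self._data[oid] = data
        self._metadata[oid] = dict(metadata)
        return oid

    def retrieve_metadata(self, oid):
        return dict(self._metadata[oid])

    def store_metadata(self, oid, metadata):
        """Create a NEW object over the same data; metadata is replaced, not merged."""
        new_oid = f"oid-{next(self._ids)}"
        self._data[new_oid] = self._data[oid]
        self._metadata[new_oid] = dict(metadata)
        return new_oid


def store_merged_metadata(archive, oid, changes):
    """Client-side merge: read the old metadata, overlay the changes, store a new object."""
    merged = archive.retrieve_metadata(oid)
    merged.update(changes)
    return archive.store_metadata(oid, merged)
```

The original object is untouched by all of this; the merge produces a second object with its own object id.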

When creating a new object using storeMetadata, both the system.object_id and the system.object_ctime are replaced, to indicate that a new object has been created. The metadata computed from the object data itself (system.object_length, system.object_hash_alg, and system.object_hash) does not change. Both the storeObject and the storeMetadata operations return a SystemRecord value that includes all of the system-assigned fields. While retrieving the objectid is the most common use of the SystemRecord, the other system fields can also be helpful. For example, the customer application might use the system.object_length, the system.object_hash_alg and the system.object_hash fields to verify that the data as stored matches the data as present in the customer application. If a hash independently computed on the client matches the hash stored on the 5800 system, then the data store has been validated.
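The client-side validation described above might look like this minimal Python sketch. The verify_store helper and its parameter names are hypothetical, and it assumes the stored hash is available to the application as a hexadecimal string; how the real SystemRecord exposes these values is not shown here.

```python
import hashlib


def verify_store(local_data, stored_length, stored_hash_alg, stored_hash_hex):
    """Return True if the client's copy matches the length and hash reported by the store."""
    if len(local_data) != stored_length:
        return False
    # Recompute the hash locally using the algorithm named by the store, e.g. "sha1".
    digest = hashlib.new(stored_hash_alg, local_data).hexdigest()
    return digest == stored_hash_hex.lower()
```

Checking the length first is cheap and catches truncation before any hashing is done.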

The metadata values associated with an object can be retrieved using the retrieveMetadata operation. The retrieveMetadata operation takes an objectid as input, and returns the entire set of user, system, and system-computed metadata. The retrieved metadata is in the form of a NameValueRecord that contains the value of each field as originally stored. The system fields occur using their field names, e.g. the field system.object_ctime contains the object creation time. There is no operation to retrieve just a single field or a subset of fields by supplying a list of field names. The retrieveMetadata operation retrieves the values of both queryable and non-queryable fields.

The 5800 System Query Model

One of the primary methods for retrieving data is to specify the characteristics of the desired data and then let the system find it for you. In the 5800 system, a query expression specifies a set of conditions on metadata field values. The system then returns a list of all the objects whose metadata values match the query conditions. Each object is considered individually without reference to any other objects. There are no queries that compare fields in one object with fields in a different object.

Query expressions can use much of the power of Structured Query Language (SQL). See chapter 4 of the API guide for a detailed description of query syntax and query semantics, including a description of exactly what it means for an object to match a query. For now, the following points are enough. Each query expression combines SQL functions and operators, field names from the metadata schema, and literal values. There are no query expressions that select objects based on the data stored in the object itself; all queries apply only to the metadata fields associated with the object. Only queryable fields can be used in query expressions. For an object to show up in a query result set, the object must have a value for each of the fields mentioned in the query (i.e. there is an implicit INNER JOIN between the fields in the query).

A query may optionally specify that the result set should include not just the objectid of each matching object, but also the values from a set of selected fields of each matched object. The value retrieved by Query With Select for some field may be a canonical equivalent of the value originally stored in that field. For example, values in numeric fields may have been converted to standard numeric format. Trailing spaces at the end of string fields will have been truncated. (The value that is returned will be some value that would match the original data as stored, in the SQL sense.) To be included in the result set, an object must include values for all queried fields and all selected fields. In other words, there is an implicit INNER JOIN between all the fields in the query and in the select list.
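The effect of the implicit INNER JOIN is easy to miss, so here is a toy Python model of it. This is only an illustration of the matching rule, not how the query engine is implemented; query_with_select and its arguments are hypothetical names, and the "query" is an ordinary Python predicate rather than a real query expression.

```python
def query_with_select(objects, query_fields, predicate, select_fields):
    """Return (object id, selected values) for each matching object.

    objects: dict of object id -> metadata dict.
    predicate: function of the metadata dict; it is only called when every
    field in query_fields and select_fields is present.
    """
    required = set(query_fields) | set(select_fields)
    results = []
    for oid, metadata in sorted(objects.items()):
        if not required.issubset(metadata):
            continue  # a missing field excludes the object (the implicit inner join)
        if predicate(metadata):
            results.append((oid, {name: metadata[name] for name in select_fields}))
    return results
```

For example, an object that has an author but no title matches a query on the author alone, yet drops out of the result set as soon as the title is added to the select list.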

There are significant limitations on which queries may be executed efficiently, or at all. Please refer to both Chapter 4 of the API guide on query syntax and the section on Best Practices and Programming Considerations for details of these limitations.

There are no ordering guarantees between queries and store operations that are proceeding at the same time. If an object is added to the 5800 system while a query is being performed, and the object matches the query, then the object may or may not show up in the query result set.

The 5800 System Query Integrity Model

The result set of any query will only ever return results that match the query. But will it return ALL the matching results? That is the notion of query completeness, referred to here as query integrity. 100% query integrity for a result set is defined as a state in which the result set contains all the objects in the 5800 system that match that particular query. The 5800 system is not always in a state of 100% query integrity. Various system events can induce a state in which the set of objects that are available for query is smaller than the total set of objects stored in the archive. Each query result set supports an operation (isQueryComplete) whereby the customer application can ask, once all the results from the query result set have been processed, whether that set of results constitutes a complete set.

The 5800 system notion of query integrity as actually implemented is somewhat looser than the notion of 100% query integrity. Even if a query result set indicates the result set is complete, we allow certain objects, known as store index exceptions, to be missing from the query result set, as long as they were communicated to the customer application at the time the object was stored. A store index exception is communicated to the customer application at the time of store by means of the method SystemRecord.isIndexed. A value of false from isIndexed means that the object is not immediately available for query. A store index exception is said to be resolved when the object becomes available for query. The checkIndexed method can be used to attempt to resolve a store index exception under program control.
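One way a customer application might react to a store index exception is sketched below. The ensure_queryable helper is hypothetical, and the check_indexed argument is a stand-in callable: the sketch assumes that the check reports success as a boolean, which may not match the exact signature of the real checkIndexed method.

```python
import time


def ensure_queryable(oid, is_indexed, check_indexed, attempts=5, delay_seconds=0.0):
    """Try to resolve a store index exception for one object.

    is_indexed: the flag reported when the object was stored.
    check_indexed: callable taking an object id and returning True once the
    object is available for query (a stand-in for the checkIndexed operation).
    Returns True if the object is known to be queryable.
    """
    if is_indexed:
        return True  # not a store index exception; nothing to do
    for _ in range(attempts):
        if check_indexed(oid):
            return True
        time.sleep(delay_seconds)
    return False  # still unresolved after all attempts
```

An application that gets False back would typically remember the object id and try again later.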


Some people find it helpful to consider the actual implementation details. The format of records as stored in the reliable and scalable object archive is not suitable for fast query. To enable searching, the queryable fields from the metadata are indexed in a query engine that can provide fast and flexible query services. The query engine is basically an SQL database. This is why the 5800 system's query language can borrow so heavily from SQL. At various times, the data as indexed in the query engine can get out of date compared to what is stored in the archive. When this happens, query result sets are not known to be complete until the contents of the query engine can be brought back up to date with the actual contents of the archive again. A store index exception is an object for which the original store of the object into the archive succeeded, but at least some part of the insert into the query engine (database) did not succeed. The object may or may not show up in all of the queries that it matches. The checkIndexed operation checks if the object has been added to the query engine, and attempts to insert it if not. If the insert into the query engine succeeds, the object is thereby restored to full queryability.

All store index exceptions will also eventually be resolved automatically by ongoing system healing. To provide insight into how far the process of ongoing system healing has gotten, each query result set also exports a method getQueryIntegrityTime that can be used to get detailed status on which store index exceptions might still be unresolved. The query integrity time is a time such that all store index exceptions from before that time have been resolved. There is an "ideal" query integrity time, which is the time of the oldest still-unresolved store index exception: an ideal implementation when asked for the query integrity time would always report this ideal value. In the actual implementation, the reported query integrity time might be hours or even days earlier than the ideal query integrity time, depending on how far the ongoing system healing has progressed.
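The definition of the query integrity time can be written down as a small Python sketch. Both functions are hypothetical helpers, times are plain numbers, and returning the current time when nothing is unresolved is an assumption made here for the sketch, not something the API is known to do.

```python
def ideal_query_integrity_time(unresolved_store_times, now):
    """Time of the oldest still-unresolved store index exception.

    Falls back to `now` when nothing is unresolved (an assumption of this sketch).
    """
    return min(unresolved_store_times, default=now)


def may_be_missing(store_time, was_indexed, query_integrity_time):
    """Could an object stored at store_time be missing from a complete result set?"""
    if was_indexed:
        return False  # the object was never a store index exception
    # Exceptions from before the query integrity time are known to be resolved.
    return store_time >= query_integrity_time
```

Because the reported time may be earlier than the ideal one, may_be_missing errs on the side of caution: it can say "maybe" for an object that has in fact already been healed, but never the reverse.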

Deleting Objects from the 5800 System

The 5800 system client API exports an operation to delete a specific object as specified by its objectid. Once a delete operation completes normally, subsequent attempts to retrieve that object will fail with an exception. In addition, the object will stop showing up in query result sets that match the original object metadata. There are no transactional guarantees regarding ordering of queries and delete operations that are occurring at the same time. If an object is being deleted at the same time that a query that matches that object is being performed, then that object may or may not show up in the query result set, with no guarantee either way.

When all objects that share an underlying block of data bytes have been deleted, the block of data bytes itself will be scavenged and returned to the supply of free disk space. But all details of how objects are stored, and how and whether they ever share data -- or ever are scavenged -- are properly speaking outside of the scope of this API.

Delete operations are all-or-nothing, with some caveats. Specifically, if a delete operation fails with an error, it is possible that the object is not fully deleted but is temporarily not queryable. Such an object is in an analogous state to a store index exception. The queryability of such an object will eventually be resolved by automatic system healing. Alternatively, the customer application may choose to re-execute the delete operation until it succeeds, or until it fails with an error that indicates the object is already deleted.
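The re-execute strategy can be sketched in Python as follows. The two exception classes and the delete_until_gone helper are hypothetical; the real client API reports these conditions through its own exception types, which are not reproduced here.

```python
class AlreadyDeleted(Exception):
    """Hypothetical error meaning the object no longer exists."""


class DeleteFailed(Exception):
    """Hypothetical error meaning the delete did not complete."""


def delete_until_gone(delete, oid, attempts=5):
    """Re-execute delete until it succeeds or reports the object already deleted.

    delete: callable taking an object id (a stand-in for the delete operation).
    Returns True once the object is known to be deleted, False if every
    attempt failed.
    """
    for _ in range(attempts):
        try:
            delete(oid)
            return True
        except AlreadyDeleted:
            return True  # an earlier attempt actually got through
        except DeleteFailed:
            continue  # state unknown; try again
    return False
```

Bounding the number of attempts keeps a persistent outage from turning into an endless loop; anything still unresolved is left to system healing.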

====================== 

The 5800 System Virtual View Model

I realized after writing the above that another topic of frequent confusion is the interaction between the "API model" as described here and the "virtual view model" as used in the 5800 system's WebDAV interface. There is a good description of setting up a WebDAV view in the 5800 System Admin guide. But that section does not discuss the semantics of how objects may appear when they are accessed both via the view model and by the API model at the same time. So I will put in a few quick outline points here. This really deserves to be a blog post of its own. But for now, some of the key points are:

  • The view definition corresponds to a pre-defined query. Objects that contain values for ALL of the fields mentioned in the view will match the query and hence will show up in the view.
  • Conversely, objects that are missing a value for one of the fields in a view will not show up in the view.
  • There is a "canonical mapping" from typed data values as used in the API to string data values (in URL format) as used in WebDAV.
  • When given a filename to browse, WebDAV runs a query that selects ANY ONE OBJECT that matches the query to display as part of the view. There is no support to display multiple objects that all match the query.
  • When storing an object via a writable view, the URL of the reference is used to determine values for fields that are specified in the view. No other metadata fields are given any value.
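The first four points above can be modeled with a toy Python sketch. Everything here is hypothetical: the view is reduced to an ordered list of field names, and str() plus URL quoting merely stands in for the real canonical mapping, whose exact rules are not described in this post.

```python
from urllib.parse import quote


def view_path(view_fields, metadata):
    """Return the path of an object within a view, or None if it does not appear."""
    if any(name not in metadata for name in view_fields):
        return None  # missing a field mentioned in the view
    # Stand-in for the canonical mapping: string form, then URL quoting.
    return "/".join(quote(str(metadata[name]), safe="") for name in view_fields)


def browse(view_fields, objects, path):
    """Return any one object id whose view path equals `path`, or None."""
    for oid, metadata in objects.items():
        if view_path(view_fields, metadata) == path:
            return oid  # only one object is shown, even if several match
    return None
```

Two objects with identical values for the view fields map to the same path, which is why only one of them can be reached through the view.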
That's enough for now. 

Tuesday Oct 02, 2007

The Spirit of Prophecy came upon me and anointed the Boston Red Sox unto battle.

Honeycomb is hitting the sweet spot its founders originally aimed for. Here we acknowledge the past and set the stage for the future.

Friday Mar 23, 2007

An article on bees as victims of GM crops.

In which I introduce this blog and tell why it's called "A Honeycomb in a Garden".

This blog copyright 2008 by pcudhea