free web site hit counter

The Sun BabelFish Blog

Don't panic !

Wednesday May 07, 2008

BOF-5911: Building a Web 3.0 Address Book

To give everyone a chance to try out the So(m)mer Address Book, I have made it available via Java Web Start: just click on the picture to the right, and try it out.

The Address Book is currently demoware: it shows how one can build virally an open distributed social network client that solves the social network data silo problem (video). No need to have an account on every social networking site on which you have friends, and so maintain your data on each one. You can simply belong to one network and link to all your friends wherever they are. With one click of a button you can publish your social network to your own web server, using ftp, scp, WebDAV, or even Atom. You can then link to other people who have (or not in fact), a foaf file. By pressing the space bar when selecting a friend, the Address Book with then GET their file. So you can browse your social network.

To get going you can explore my social network by dragging my foaf file icon onto the first pane of the application.

In BOF-5911 which I will be presenting on Thursday at 7:30pm I will be presenting the social networking problem, demonstrating how the So(m)mer Address Book solves it, and showing in detail how it is build, what the problems are, and what work remains. I will also discuss how this can be used to create global single sign on based on a network of trust.

Monday Apr 21, 2008

FOAF & SSL: creating a global decentralised authentication protocol

Following on my previous post RDFAuth: sketch of a buzzword compliant authentication protocol, Toby Inkster came up with a brilliantly simple scheme that builds very neatly on top of the Secure Sockets Layer of https. I describe the protocol shortly here, and will describe an implementation of it in my next post.

Simple global ( passwordless if using a device such as the Aladdin USB e-Token ) authentication around the web would be extremely valuable. I am currently crumbling under the number of sites asking me for authentication information, and for each site I need to remember a new id and password combination. I am not the only one with this problem as the data portability video demonstrates. OpenId solves the problem but the protocol consumes a lot of ssl connections. For hyperdata user agents this could be painfully slow. This is because they may need access to just a couple of resources per server as they jump from service to service.

As before we have a very simple scenario to consider. Romeo wants to find out where Juliette is. Juliette's hyperdata Address Book updates her location on a regular basis by PUTing information to a protected resource which she only wants her friends and their friends to have access to. Her server knows from her foaf:PersonalProfileDocument who her friends are. She identifies them via dereferenceable URLs, as I do, which themselves usually (the web is flexible) return more foaf:PersonalProfileDocuments describing them, and pointing to further such documents. In this way the list of people able to find out her location can be specified in a flexible and distributed manner. So let us imagine that Romeo is a friend of a friend of Juliette's and he wishes to talk to her. The following sequence diagram continues the story...

sequence diagram of RDF+SSL

The stages of the diagram are listed below:

  1. First Romeo's User Agent HTTP GETs Juliette's public foaf file located at http://juliette.net/. The server returns a representation ( in RDFa perhaps ) with the same semantics as the following N3:

    @prefix : <#> . 
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix todo: <http://eg.org/todo#> .
    @prefix openid: <http://eg.org/openid/todo#> .
    
    <> a foaf:PersonalProfileDocument;
       foaf:primaryTopic :juliette ;
       openid:server <https://aol.com/openid/service>; # see The Openid Sequence Diagram .
    
    :juliette a foaf:Person;
       foaf:name "Juliette";
       foaf:openid <>;
       foaf:blog </blog>;    
       rdfs:seeAlso <https://juliette.net/protected/location>; 
       foaf:knows <http://bblfish.net/people/henry/card#me>,
                  <http://www.w3.org/People/Berners-Lee/card#i> .
    
    <https://juliette.net/protected/location> a todo:LocationDocument .
    

    Romeo's user agent receives this representation and decides to follow the https protected resource because it is a todo:LocationDocument.

  2. The todo:LocationDocument is at an https URL, so Romeo's User Agent connects to it via a secure socket. Juliette's server, who wishes to know the identity of the requestor, sends out a Certificate Request, to which Romeo's user agent responds with an X.509 certificate. This is all part of the SSL protocol.

    In the communication in stage 2, Romeo's user agent also passes along his foaf id. This can be done either by:

    • Sending in the HTTP header of the request an Agent-Id header pointing to the foaf Id of the user. Like this:
      Agent-Id: http://romeo.net/#romeo
      
      This would be similar to the current From: header, but instead of requiring an email address, a direct name of the agent would be required. (An email address is only an indirect identifier of an agent).
    • The Certificate could itself contain the Foaf ID of the Agent in the X509v3 extensions section:
              X509v3 extensions:
                 ...
                 X509v3 Subject Alternative Name: 
                                 URI:http://romeo.net/#romeo
      

      I am not sure if it would be correct use of the X509 Alternative names field. So this would require more standardization work with the X509 community. But it shows a way where the two communities could meet. The advantage of having the id as part of the certificate is that this could add extra weight to the id, depending on the trust one gives the Certificate Authority that signed the Certificate.

  3. At this point Juliette's web server knows of the requestor (Romeo in this case):
    • his alleged foaf Id
    • his Certificate ( verified during the ssl session )

    If the Certificate is signed by a CA that Juliette trusts and the foaf id is part of the certificate, then she will trust that the owner of the User Agent is the entity named by that id. She can then jump straight to step 6 if she knows enough about Romeo that she trusts him.

    Having Certificates signed by CA's is expensive though. The protocol described here will work just as well with self signed certificates, which are easy to generate.

  4. Juliette's hyperdata server then GETs the foaf document associated with the foaf id, namely <http://romeo.net/> . Romeo's foaf server returns a document containing a graph of relations similar to the graph described by the following N3:
    @prefix : <#> . 
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix wot: <http://xmlns.com/wot/0.1/> .
    @prefix wotodo: <http://eg.org/todo#> .
    
    <> a foaf:PersonalProfileDocument;
        foaf:primaryTopic :romeo .
    
    :romeo a foaf:Person;
        foaf:name "Romeo";
        is wot:identity of [ a wotodo:X509Certificate;
                             wotodo:dsaWithSha1Sig """30:2c:02:14:78:69:1e:4f:7d:37:36:a5:8f:37:30:58:18:5a:
                                                 f6:10:e9:13:a4:ec:02:14:03:93:42:3b:c0:d4:33:63:ae:2f:
                                                 eb:8c:11:08:1c:aa:93:7d:71:01""" ;
                           ] ;
        foaf:knows <http://bblfish.net/people/henry/card#me> .
    
  5. By querying the semantics of the returned document with a SPARQL query such as
    PREFIX wot: <http://xmlns.com/wot/0.1/> 
    PREFIX wotodo: <http://eg.org/todo#> 
    
    SELECT { ?sig }
    WHERE {
        [] a wotodo:X509Certificate;
          wotodo:signature ?sig;
          wot:identity <http://romeo.net/#romeo> .
    }
    

    Juliette's web server can discover the certificate signature and compare it with the one sent by Romeo's user agent. If the two are identical, then Juliette's server knows that the User Agent who has access to the private key of the certificate sent to it, and who claims to be the person identified by the URI http://romeo.net/#romeo, is in agreement as to the identity of the certificate with the person who has write access to the foaf file http://romeo.net/. So by proving that it has access to the private key of the certificate sent to the server, the User Agent has also proven that it is the person described by the foaf file.

  6. Finally, now that Juliette's server knows an identity of the User Agent making the request on the protected resource, it can decide whether or not to return the representation. In this case we can imagine that my foaf file says that
     @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    
     <http://bblfish.net/people/henry/card#me> foaf:knows <http://romeo.net/#romeo> .  
     
    As a result of the policy of allowing all friends of Juliette's friends to be able to read the location document, the server sends out a document containing relations such as the following:
    @prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .
    @prefix : <http://juliette.org/#> .
    
    :juliette 
        contact:location [ 
              contact:address [ contact:city "Paris";
                                contact:country "France";
                                contact:street "1 Champs Elysees" ]
                         ] .
    

Todo

  • Create an ontology for X509 certificates.
  • test this. Currently there is some implementation work going on in the so(m)mer repository in the misc/FoafServer directory.
  • Can one use the Subject Alternative name of an X509 certificate as described here?
  • For self signed certificates, what should the X509 Distinguished Name (DN) be? The DN is really being replaced here by the foaf id, since that is where the key information about the user is going to be located. Can one ignore the DN in a X509 cert, as one can in RDF with blank nodes? One could I imagine create a dummy DN where one of the elements is the foaf id. These would at least, as opposed to DN, be guaranteed to be unique.
  • what standardization work would be needed to make this

Discussion on the Web

Friday Apr 18, 2008

The OpenId Sequence Diagram

OpenId very neatly solves the global identity problem within the constraints of working with legacy browsers. It is a complex protocol though as the following sequence diagram illustrates, and this may be a problem for automated agents that need to jump around the web from hyperlink to hyperlink, as hyperdata agents tend to do.

The diagram illustrates the following scenario. Romeo wants to find the current location of Juliette. So his semantic web user agent GET's her current foaf file. But Juliette wants to protect information about her current whereabouts and reveal it only to people she trusts, so she configures her server to require the user agent to authenticate itself in order to get more information. If the user agent can prove that is is owned by one of her trusted friends, and Romeo in particular, she will deliver the information to it (and so to him).

The steps numbered in the sequence diagram are as follows:

  1. A User Agent fetches a web page that requires authentication. OpenId was designed with legacy web browsers in mind, for which it would return a page containing an OpenId login box such as the one to the right. openid login box In the case of a hyperdata agent as in our use case, the agent would GET a public foaf file, which might contain a link to an OpenId authentication endpoint. Perhaps with some rdf such as the following N3:
    <> openid:login </openidAuth.cgi> .
    
    Perhaps some more information would indicate which resources were protected.
  2. In current practice a human user notices the login box and types his identifying URL in it, such as http://openid.sun.com/bblfish This is the brilliant invention of OpenId: getting hundreds of millions of people to find it natural to identify themselves via a URL, instead of an email. The user then clicks the "Login button".
    In our semantic use case the hyperdata agent would notice the above openid link and would deduce that it needs to login to the site to get more information. Romeo's Id ( http://romeo.net/ perhaps ) would then be POSTed to the /openidAuth.cgi authentication endpoint.
  3. The OpenId authentication endpoint then fetches the web page by GETing Romeo's url http://romeo.net/. This returned representation contains a link in the header of the page pointing Romeo's OpenId server url. If the representation returned is html then this would contain the following in the header
     <link rel="openid.server" href="https://openid.sun.com/openid/service" />
    
  4. The representation returned in step 3, could contain a lot of other information too. A link to a foaf file may not be a bad idea as I described in foaf and openid. The returned representation in step 3 could even be RDFa extended html, in which case this step may not even be necessary. For a hyperdata server the information may be useful, as it may suggest a connection Romeo could have to some other people that would allow it to decide whether it wishes to continue the login process.
  5. Juliette's OpenId authentication endpoint then sends a redirect to Romeo's user agent, directing it towards his OpenId Identity Provider. The redirect also contains the URL of the OpenId authentication cgi, so that in step 8 below the Identity Provider can redirect a message back.
  6. Romeo user agent dutifully redirects romeo to the identity provider, which then returns a form with a username and password entry box.
  7. Romeo's user agent could learn to fill the user name password pair in automatically and even skip the previous step 6 . In any case given the user name and password, the Identity Provider then sends back some cryptographic tokens to the User Agent to have it redirect to the OpenId Authentication cgi at http://juliette.net/openidAuth.cgi.
  8. Romeo's Hyperdata user agent then dutifully redirects back to the OpenId authentication endpoint
  9. The authentication endpoint sends a request to the Openid Identity provider to verify that the cryptographic token is authentic. If it is, a conventional answer is sent back.
  10. The OpenId authentication endpoint finally sends a response back with a session cookie, giving access to various resources on Juliette's web site. Perhaps it even knows to redirect the user agent to a protected resource, though that would have required some information concerning this to have been sent in stage 2.
  11. Finally Romeo's user agent can GET Juliette's protected information if Juliette's hyperdata web server permits it. In this case it will, because Juliette loves Romeo.

All of the steps above could be automatized, so from the user's point of view they may not be complicated. The user agent could even learn to fill in the user name and password required by the Identity Provider. But there are still a very large number of connections between the User Agent and the different services. If these connections are to be secure they would need to protected by SSL (as hinted at by the double line arrows). And SSL connections are not cheap. So the above may be unacceptably slow. On the other hand it would work with a protocol that is growing fast in acceptance.

It is is certainly worth comparing this sequence diagram with the very light weight one presented in "FOAF & SLL: creating a global decentralised authentication protocol".

Thanks again to Benjamin Nowack for bringing the discussion on RDFAuth to thinking about using the OpenId protocol directly as described above. See his post on the semantic web mailing list. Benjamin also pointed to the HTTP OpenID Authentication proposal, which shows how some of the above can be simplified if certain assumptions about the capabilities of the client are made. It would be worth making a sequence diagram of that proposal too.

Thursday Apr 17, 2008

semantic camp paris

picture of Karima Rafes

A couple of weeks ago I attended the second Semantic Bar Camp which took place at the Orange research labs at Issy les Moulineaux, near Paris. This was a great opportunity to meet many of the French researchers in the Semantic Web space, to take part in the French debate, and to help convince interested parties of the reality of the technology.

Jean Rohmer of the large French defense group Thales played the role of the devil's advocate, arguing that the Semantic Web was just pie in the sky theory without practical applications. We delved into various aspects of the theory of the Semantic Web, and I underlined how the biological/evolutionary aspect of language, the Academie Francaise notwithstanding, was a key aspect in understanding the evolution of the web of data. But the best argument was a simple demonstration of the Beatnik Address Book, which showed how hyperdata could solve the serious problem of 2008: the growing number of closed social networks. At the next camp I hope we will be able to delve much more deeply into how to build real practical applications.

Many thanks to Karima Rafes for organizing this well attended bar camp ( pictures ). Stephane Lauriere from XWiki and who is on the Nepomuk Semantic Desktop project, also posted some photos. And I would like to recommend Alexandre Passant's blog to all french speaking readers.

KiWi: Knowledge in a Wiki

KiWi logo

Last month I attended the European Union KiWi project startup meeting in Salzburg, to which Sun Microsystems Prague is contributing some key use cases.

KiWi is a project to build an Open Source Semantic Wiki. It is based on the IkeWiki [don't follow this link if you have Safari 3.1] Java wiki, which uses the Jena Semantic Web frameworks, the Dojo toolkit for the Web 2.0 functionality, and any one of the Databases Jena can connect to, such as PostgreSQL. KiWi is in many ways similar to Freebase in its hefty use of JavaScript, and its emphasis on structured data. But instead of being a closed source platform, KiWi is open source, and builds upon the Semantic Web standards. In my opinion it currently overuses JavaScript features, to the extent that all clicks lead to dynamic page rewrites that do not change the URL of the browser page. This I feel unRESTful, and the permalink link in the socialise toolbar to the right does not completely remove my qualms. Hopefully this can be fixed in this project. It would be great also if KIWI could participate fully in the Linked Data movement.

The meeting was very well organized by Sebastian Schaffert and his team. It was 4 long days of meetings that made sure that everyone was on the same page, understood the rules of the EU game, and most of all got to know each other. (see kiwiknows tagged pictures on flickr ). Many thanks also to Peter Reiser for moving and shaking the various Sun decision makers to sign the appropriate papers, and dedicate the resources for us to be part of this project.

You can follow the evolution of the project on the Planet Kiwi page.

Anyway, here is a video that shows the resourceful kiwi mascot in action:

Friday Mar 28, 2008

RDFAuth: sketch of a buzzword compliant authentication protocol

Here is a proposal for an authentication scheme that is even simpler than OpenId ( see sequence diagram ), more secure, more RESTful, with fewer points of failure and fewer points of control, that is needed in order to make Open Distributed Social Networks with privacy controls possible.

Update

The following sketch led to the even simpler protocol described in Foaf and SSL creating a global decentralized authentication protocol. It is very close to what is proposed here but builds very closely on SSL, so as to reduce what is new down to nearly nothing.

Background

Ok, so now I have your attention, I would like to first mention that I am a great fan of OpenId. I have blogged about it numerous times and enthusiastically in this space. I came across the idea I will develop below, not because I thought OpenId needed improving, but because I have chosen to follow some very strict architectural guidelines: it had to satisfy RESTful, Resource oriented hyperdata constraints. With the Beatnik Address Book I have proven - to myself at least - that the creation of an Open Distributed Social Network (a hot topic at the moment, see the Economist's recent article on Online social network) is feasible and easy to do. What was missing is a way for people to keep some privacy, clearly a big selling point for the large Social Network Providers such as Facebook. So I went on the search of a solution to create a Open Distributed Social Network with privacy controls. And initially I had thought of using OpenId.

OpenId Limitations

But OpenId has a few problems:

  • First it is really designed to work with the limitations of current web browsers. It is partly because of this that there is a lot of hopping around from the service to the Identity Provider with HTTP redirects. As the Tabulator, Knowee or Beatnik.
  • Parts of OpenId 2, and especially the Attribute Exchange spec really don't feel very RESTful. There is a method for PUTing new property values in a database and a way to remove them that does not use either the HTTP PUT method or the DELETE method.
  • The OpenId Attribute Exchange is nice but not very flexible. It can keep some basic information about a person, but it does not make use of hyperdata. And the way it is set up, it would only be able to do so with great difficulty. A RESTfully published foaf file can give the same information, is a lot more flexible and extensible, whilst also making use of Linked Data, and as it happens also solves the Social Network Data Silo problems. Just that!
  • OpenId requires an Identity Server. There are a couple of problems with this:
    • This server provides a Dynamic service but not a RESTful one. Ie. the representations sent back and forth to it, cannot be cached.
    • The service is a control point. Anyone owning such a service will know which sites you authenticate onto. True, you can set up your own service, but that is clearly not what is happening. The big players are offering their customers OpenIds tied to particular authentication servers, and that is what most people will accept.
As I found out by developing what I am here calling RDFAuth, for want of a better name, none of these restrictions are necessary.

RDFAuth, a sketch

So following my strict architectural guidelines, I came across what I am just calling RDFAuth, but like everything else here this is a sketch and open to change. I am not a security specialist nor an HTTP specialist. I am like someone who comes to an architect in order to build a house on some land he has, with some sketch of what he would like the house to look like, some ideas of what functionality he needs and what the price he is willing to pay is. What I want here is something very simple, that can be made to work with a few perl scripts.

Let me first present the actors and the resources they wish to act upon.

  • Romeo has a Semantic Web Address Book, his User Agent (UA). He is looking for the whereabouts of Juliette.
  • Juliette has a URL identifier ( as I do ) which returns a public foaf representation and links to a protected resource.
  • The protected resource contains information she only wants some people to know, in this instance Romeo. It contains information as to her current whereabouts.
  • Romeo also has a public foaf file. He may have a protected one too, but it does not make an entrance in this scene of the play. His public foaf file links to a public PGP key. I described how that is done in Cryptographic Web of Trust.
  • Romeo's Public key is RESTfully stored on a server somewhere, accessible by URL.

So Romeo wants to find out where Juliette is, but Juliette only wants to reveal this to Romeo. Juliette has told her server to only allow Romeo, identified by his URL, to view the site. She could have also have had a more open policy, allowing any of her or Romeo's friends to have access to this site, as specified by their foaf file. The server could then crawl their respective foaf files at regular intervals to see if it needed to add anyone to the list of people having access to the site. This is what the DIG group did in conjunction with OpenId. Juliette could also have a policy that decides Just In Time, as the person presents herself, whether or not to grant them access. She could use the information in that person's foaf file and relating it to some trust metric to make her decision. How Juliette specifies who gets access to the protected resource here is not part of this protocol. This is completely up to Juliette and the policies she chooses her agent to follow.

So here is the sketch of the sequence of requests and responses.

  1. First Romeo's user Agent knows that Juliette's foaf name is http://juliette.org/#juliette so it sends an HTTP GET request to Juliette's foaf file located of course at http://juliette.org/
    The server responds with a public foaf file containing a link to the protected resource perhaps with the N3
      <> rdfs:seeAlso <protected/juliette> .
    
    Perhaps this could also contain some relations describing that resource as protected, which groups may access it, etc... but that is not necessary.
  2. Romeo's User Agent then decides it wants to check out protected/juliette. It sends a GET request to that resource but this time receives a variation of the Basic Authentication Scheme, perhaps something like:
    HTTP/1.0 401 UNAUTHORIZED
    Server: Knowee/0.4
    Date: Sat, 1 Apr 2008 10:18:15 GMT
    WWW-Authenticate: RdfAuth realm="http://juliette.org/protected/*" nonce="ILoveYouToo"
    
    The idea is that Juliette's server returns a nonce (in order to avoid replay attacks), and a realm over which this protection will be valid. But I am really making this up here. Better ideas are welcome.
  3. Romeo's web agent then encrypts some string (the realm?) and the nonce with Romeo's private key. Only an agent trusted by Romeo can do this.
  4. The User Agent then sends a new GET request with the encrypted string, and his identifier, perhaps something like this
    GET /protected/juliette HTTP/1.0
    Host: juliette.org
    Authorization: RdfAuth id="http://romeo.name/#romeo" key="THE_REALM_AND_NONCE_ENCRYPTED"
    Content-Type: application/rdf+xml, text/rdf+n3
    
    Since we need an identifier, why not just use Romeos' foaf name? It happens to also point to his foaf file. All the better.
  5. Because Juliette's web server can then use Romeo's foaf name to GET his public foaf file, which contains a link to his public key, as explained in "Cryptographic Web of Trust".
  6. Juliette's web server can then query the returned representation, perhaps meshed with some other information in its database, with something equivalent to the following SPARQL query
    PREFIX wot: <http://xmlns.com/wot/0.1/>
    SELECT ?pgp
    WHERE {
         [] wot:identity <http://romeo.name/#romeo>;
            wot:pubkeyAddress ?pgp .
    } 
    
    The nice thing about working at the semantic layer, is that it decouples the spec a lot from the representation returned. Of course as usage grows those representations that are understood by the most servers will create a de facto convention. Intially I suggest using RDF/XML of course. But it could just as well be N3, RDFa, perhaps even some microformat dialect, or even some GRDDLable XML, as the POWDER working group is proposing to do.
  7. Having found the URL of the PGP key, Juliette's server, can GET it - and as with much else in this protocol cache it for future use.
  8. Having the PGP key, Juliette's server can now decrypt the encrypted string sent to her by Romeo's User Agent. If the decrypted string matches the expected string, Juliette will know that the User Agent has access to Romeo's private key. So she decides this is enough to trust it.
  9. As a result Juliette's server returns the protected representation.
Now Romeo's User Agent knows where Juliette is, displays it, and Romeo rushes off to see her.

Advantages

It should be clear from the sketch what the numerous advantages of this system are over OpenId. (I can't speak of other authentication services as I am not a security expert).

  • The User Agent has no redirects to follow. In the above example it needs to request one resource http://juliette.org/ twice (2 and 4) but that may only be necessary the first time it accesses this resource. The second time the UA can immediately jump to step 3. [but see problem with replay attacks raised in the comments by Ed Davies, and my reply] Furthermore it may be possible - this is a question to HTTP specialists - to merge step 1 and 2. Would it be possible for a request 1. to return a 20x code with the public representation, plus a WWWAuthenticate header, suggesting that the UA can get a more detailed representation of the same resource if authenticated? In any case the redirect rigmarole of OpenId, which is really there to overcome the limitations of current web browsers, in not needed.
  • There is no need for an Attribute Exchange type service. Foaf deals with that in a clear and extensible RESTful manner. This simplifies the spec dramatically.
  • There is no need for an identity server, so one less point of failure, and one less point of control in the system. The public key plays that role in a clean and simple manner
  • The whole protocol is RESTful. This means that all representations can be cached, meaning that steps 5 and 7 need only occur once per individual.
  • As RDF is built for extensibility, and we are being architecturally very clean, the system should be able to grow cleanly.

Contributions

I have been quietly exploring these ideas on the foaf and semantic web mailing lists, where I received a lot of excellent suggestions and feedback.

Finally

So I suppose I am now looking for feedback from a wider community. PGP experts, security experts, REST and HTTP experts, semantic web and linked data experts, only you can help this get somewhere. I will never have the time to learn these fields in enough detail by myself. In any case all this is absolutely obviously simple, and so completely unpatentable :-)

Thanks for taking the time to read this

Thursday Mar 20, 2008

how binary relations beat tuples

Last week I was handed a puzzle by Francois Bry: "Why does RDF limit itself to binary relations? Why this deliberate lack of expressivity?".

Logical Equivalence Reply

My initial answer was that all tuples could be reduced to binary relations. So take a simple table like this:

User IDnameaddressbirthdaycoursehomepage
1234Henry Story21 rue Saint Honoré
Fontainebleau
France
29 Julyphilosophyhttp://bblfish.net/
1235Danny AyersLoc. Mozzanella, 7
Castiglione di Garfagnana
Lucca
Italy
14 Jansemwebhttp://dannyayers.com

The first row in the above column can be expressed as a set of binary relations as shown in this graph:

The same can clearly be done for the second row.

Since the two models express equivalent information I would opt aesthetically for the graph over the tuples, since it requires less primitives, which tends to make things simpler and clearer. Perhaps that can already be seen in the way the above table is screaming out for refactoring: a person may easily have more than one homepage. Adding a new homepage relation is easy, doing this in a table is a lot less so.

But this line of argument will not convince a battle worn database administrator. Both systems do the same thing. One is widely deployed, the other not. So that is the end of the conversation. Furthermore it seems clear that retrieving a row in a table is quick and easy. If you need chunks of information to be together that beats the join that seems to be required in the graph version above. Pragmatics beats aesthetics hands down it seems.

Global Distributed Open Data

The database engineer might have won the battle, but he will not win the war [1]. Wars are fought at a much higher level, on a global scale. The problem the Semantic Web is attacking is global data, not local data. On the Semantic Web, the web is the database and data is distributed and linked together. On the Semantic Web use case the data won't all be managed in one database by a few resource constrained superusers but distributed in different places and managed by the stake holder of that information. In our example we can imagine three stake holders of different pieces of information: Danny Ayers for his personal information, Me for mine, and the university for its course information. This information will then be available as resources on the web, returning different representations, which in one way or another may encode graphs such as the ones below. Note that duplication of information is a good thing in a distributed network.

By working with the most simple binary relations, it is easy to cut information up down to their most atomic unit, publish them anywhere on the web, distributing the responsibility to different owners. This atomic nature of relations also makes it easy to merge information again. Doing this with tuples would be unnecessarily complex. Binary relations are a consequence of taking the open world assumption seriously in a global space. By using Universal Resource Identifiers (URIs), it is possible for different documents to co-refer to the same entitities, and to link together entities in a global manner.

The Verbosity critique

Another line of attack similar to the first could be that rdf is just too verbose. Imagine the relation children which would relate a person to a list of their children. If one sticks just with binary relations this is going to be very awkward to write out. In a graph it would look like this.

image of a simple list as a graph

Which in Turtle would give something like this:

:Adam :children 
     [ a rdf:List;
       rdf:first :joe;
       rdf:rest [ a rdf:List;
            rdf:first :jane;
            rdf:rest rdf:nil ];
     ] .

which clearly is a bit unnecessarily verbose. But that is not really a problem. One can, and Turtle has, developed a notation for writing out lists. So that one can write much more simply:

:Adam :children ( :joe :jane ) .

This is clearly much easier to read and write than the previous way (not to speak about the equivalent in rdf/xml). RDF is a structure developed at the semantic level. Different notations can be developed to express the same content. The reason it works is because it uses URIs to name things.

Efficiency Considerations

So what about the implementation question: with tables oft accessed data is closely gathered together. This it seems to me is an implementation issue. One can easily imagine RDF databases that would optimize the layout in memory of their data at run time in a Just in Time manner, depending on the queries received. Just as the Java JIT mechanism ends up in a overwhelming number of cases to be faster than hand crafted C, because the JIT can take advantage of local factors such as the memory available on the machine, the type of cpu, and other issues, which a statically compiled C binary cannot do. So in the case of the list structure shown above there is no reason why the database could not just place the :joe and jane in an array of pointers.

In any case, if one wants distributed decentralised data, there is no other way to do it. Pragamatism does have the last word.

Notes

  1. Don't take the battle/war analogy too far please. Both DB technologies and Semantic Web ones can easily work together as demonstrated by tools such as D2RQ.

Wednesday Mar 19, 2008

Semantic Web for the Working Ontologist

I am really excited to see that Dean Allemang and Jim Hendler's book "Semantic Web for the Working Ontologist" is now available for pre-order on Amazon's web site. When I met Dean at Jazoon 2007 he let me have a peek at an early copy of this book[1]: it was exactly what I had been waiting a long time for. A very easy introduction to the Semantic Web and reasoning that does not start with the unnecessarily complex RDF/XML [2] but with the one-cannot-be-simpler triple structure of RDF, and through a series of practical examples brings the reader step by step to a full view of all of the tools in the Semantic Web stack, without a hitch, without a problem, fluidly. I was really impressed. Getting going in the Semantic Web is going to be a lot easier when this book is out. It should remove the serious problem current students are facing of having to find a way through a huge number of excellent but detailed specs, some of which are no longer relevant. One does not learn Java by reading the Java Virtual Machine specification or even the Java Language Specification. Those are excellent tools to use once one has read many of the excellent introductory books such as the unavoidable Java Tutorial or Bruce Eckel's Thinking in Java. Dean Allemang and Jim Hendler's books are going to play the same role for the Semantic Web. Help get millions of people introduced to what has to be the most revolutionary development in computer science since the development of the web itself. Go and pre-order it. I am going to do this right now.

Notes

  1. the draft I looked at 9 months ago had introductions to ntriples, turtle, OWL explained via rules, SPARQL, some simple well known ontologies such as skos and foaf, and a lot more.
  2. The W3C has recently published a new RDF Primer in Turtle in recognition of the difficulty of getting going when the first step requires understanding RDF/XML.

Friday Mar 07, 2008

Drupal’s future is the semantic web

Dries Buytaert the author of the PHP based Drupal content management system, gave a very interesting presentation at DrupalCon 2008 where he layed out how the future of Drupal in the Semantic Web. See this very interesting Google video for some very clear explanation:

More information from:

Wednesday Mar 05, 2008

Opening Sesame with Networked Graphs

Simon Schenk just recently gave me an update to his Networked Graphs library for the Sesame RDF Framework. Even though it is in early alpha state the jars have already worked wonders on my Beatnik Address Book. With four simple SPARQL rules I have been able to tie together most of the loose ends that appear between foaf files, as each one often uses different ways to refer to the same individual.

Why inferencing is needed

So for example in my foaf file I link to Simon Phipps- Sun's very popular Open Source Officer - with the following N3:

 :me foaf:knows   [ a foaf:Person;
                    foaf:mbox_sha1sum "4e377376e6977b765c1e78b2d0157a933ba11167";
                    foaf:name "Simon Phipps";
                    foaf:homepage <http://www.webmink.net/>;
                    rdfs:seeAlso <http://www.webmink.net/foaf.rdf>;
                  ] .
For those who still don't know N3 (where have you been hiding?) this says that I know a foaf:Person named "Simon Phipps" whose homepage is specified and for which more information can be found at the http://www.webmink.net/foaf.rdf rdf file. Now the problem is that the person in question is identified by a '[' which represents a blank node. Ie we don't have a name (URI) for Simon. So when the Beatnik Address Book gets Simon's foaf file, by following the rdfs:seeAlso relation, it gets among others something like
[] a foaf:Person;
   foaf:name "Simon Phipps";
   foaf:nick "webmink";
   foaf:homepage </>;
   foaf:knows [ a foaf:Person;
                foaf:homepage <http://www.buzzword-compliant.com/>;
                rdfs:seeAlso <http://www.buzzword-compliant.com/foaf.rdf>;
             ] .
This file then contains at least two people. Which one is the same person? Well a human being would guess that the person named "Simon Phipps" is the same in both cases. Networked Graphs helps Beatnik make a similar guess by noting that the foaf:homepage relation is an owl:InverseFunctionalProperty.

Some simple rules

After downloading Simon Phipps's foaf file and mine and placing the relations found in them in their own Named Graph, we can in Sesame 2.0 create a merged view of both these graphs just by creating a graph that is the union of the triples of each .

The Networked Graph layer can then do some interesting inferencing by defining a graph with the following SPARQL rules

#foaf:homepage is inverse functional
grph: ng:definedBy """
  CONSTRUCT { ?a <http://www.w3.org/2002/07/owl#sameAs> ?b .  } 
  WHERE { 
       ?a <http://xmlns.com/foaf/0.1/homepage> ?pg .
       ?b <http://xmlns.com/foaf/0.1/homepage> ?pg .
      FILTER ( ! SAMETERM (?a , ?b))   
 } """^^ng:Query .
This is simply saying that if two names for things have the same homepage, then these two names refer to the same thing. I could be more general by writing rules at the owl level, but those would be but more complicated, and I just wanted to test out the Networked Graph sail to start with. So the above will add a bunch of owl:sameAs relations to our NetworkedGraph view on the Sesame database.

The following two rules then just complete the information.

# owl:sameAs is symmetric
#if a = b then b = a 
grph: ng:definedBy """
  CONSTRUCT { ?b <http://www.w3.org/2002/07/owl#sameAs> ?a . } 
  WHERE { 
     ?a <http://www.w3.org/2002/07/owl#sameAs> ?b . 
     FILTER ( ! SAMETERM(?a , ?b) )   
  } """^^ng:Query .

# indiscernability of identicals
#two identical things have all the same properties
grph: ng:definedBy """
  CONSTRUCT { ?b ?rel ?c . } 
  WHERE { ?a <http://www.w3.org/2002/07/owl#sameAs> ?b .
          ?a ?rel ?c . 
     FILTER ( ! SAMETERM(?rel , <http://www.w3.org/2002/07/owl#sameAs>) )   
  } """^^ng:Query .
They make sure that when two things are found to be the same, they have the same properties. I think these two rules should probably be hard coded in the database itself, as they seem so fundamental to reasoning that there must be some very serious optimizations available.

Advanced rules

Anyway the above illustrates just how simple it is to write some very clear inferencing rules. Those are just the simplest that I have bothered to write at present. Networked Graphs allows one to write much more interesting rules, which should help me solve the problems I explained in "Beatnik: change your mind" where I argued that even a simple client application like an address book needs to be able to make judgements on the quality of information. Networked Graphs would allow one to write rules that would amount to "only believe consequences of statements written by people you trust a lot". Perhaps this could be expressed in SPARQL as

CONSTRUCT { ?subject  ?relation ?object . }
WHERE {
    ?g tr:trustlevel ?tl .
    GRAPH ?g { ?subject ?relation ?object . }
    FILTER ( ?tl > 0.5 )
}
Going from the above it is easy to start imagining very interesting uses of Networked Graph rules. For example we may want to classify some ontologies as trusted and only do reasoning on relations over those ontologies. The inverse functional rule could then be generalized to
  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  PREFIX : <https://sommer.dev.java.net/ontologies/beatnik#>

  CONSTRUCT { ?a owl:sameAs ?b .  } 
  WHERE { 
      GRAPH ?g { ?inverseFunc a owl:InverseFunctionalProperty . }
      ?g a :TrustedOntology .

       ?a ?inverseFunc ?pg .
       ?b ?inverseFunc ?pg .
      FILTER ( ! SAMETERM (?a , ?b))   
 }

Building the So(m)mer Address Book

I will be trying these out later. But for the moment you can already see the difference inferencing brings to an application by downloading the Address Book from subversion at sommer.dev.java.net and running the following commands (leave the password to the svn checkout blank)


> svn checkout https://sommer.dev.java.net/svn/sommer/trunk sommer --username guest
> cd sommer
> ant jar
> cd misc/AddressBook/
> ant run
Then you can just drag and drop the foaf file on this page into the address book, and follow the distributed social network by pressing the space bar to get foaf files. To enable inferencing you currently need to set it in the File>Toggle Rules menu. You will see things coming together suddenly when inferencing is on.

There are still a lot of bugs in this software. But you are welcome to post bug reports, or help out in any way you can.

Where this is leading

Going further it seems to me clear that Networked Graphs is starting to realise what Guha, one of the pioneers of the semantic web, wrote about in this thesis "Contexts: A Formalization and Some Applications", which I wrote a short note on Keeping track of Context in Life and on the Web a couple of years ago. That really helped me get a better understanding of the possibilities of the semantic web.

Thursday Feb 28, 2008

sparqling international calling codes

The other day I was looking for a list of international calling codes. Since most of them are listed in Wikipedia, it occurred to me it would be easy to get all that information in a nice easy to use format by querying DBPedia with SPARQL. So I wrote a very light weight SPARQL client (source code available here). Download the jar and you can then run the following query:

hjs@bblfish:0$ java -jar Sparql.jar > results.n3

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

CONSTRUCT {  ?cntry dbp:callingCode ?code ;
                    rdfs:label ?name . 
} WHERE {
        ?cntry dbp:callingCode ?code .
        OPTIONAL { ?cntry rdfs:label ?name . }
}


^d
That is after typing the command line java -jar Sparrql.jar > results.n3 I pasted the SPARQL query (in blue above) and ended the input with control-d, which on unix is the end-of-file character. This sent the query to DBPedia, and returned a long list of answers which were place in results.n3 of which the first set is
<http://dbpedia.org/resource/Abu_Dhabi_%28emirate%29> <http://www.w3.org/2000/01/rdf-schema#label> "\u963F\u5E03\u624E\u6BD4\u914B\u957F\u56FD\""@zh ,
		"Abu Dhabi (emirate)"@en ,
		"Abu Dhabi (emirato)"@it ,
		"\u0410\u0431\u0443-\u0414\u0430\u0431\u0438 (\u044D\u043C\u0438\u0440\u0430\u0442)\""@ru ;
	<http://dbpedia.org/property/callingCode> "971-2"@en .

In the above case the calling code should proabaly not be tagged with an @en. So the data still needs to be cleaned up a little at present. It would be nice to be able to quickly fix the data when one notices something like this. Most of the other results are in xsd:integer format, which I think is also not quite right. The literal string is a better representation of a calling code I think.

Anyway the data is easy to clean up. And we have an example of a very simple but useful query.

Monday Feb 25, 2008

Semantic Bar Camp London and Flue

Last Saturday early early morning I took the train to London to go to the weekend Semantic Bar Camp that was held at Imperial College, in the computer science department I studied in. I arrived, late, because I had missed the train in Paris by one minute, and so missed getting an overview of the event. On arrival I was asked to put my name down for a presentation and stick the paper on the board on the first empty slot available. 15 minutes later I improvised a talk on Linked Data. I did not realize that there were a lot of microformats people in the audience with little semantic web experience, so I did not take care enough to lay some important foundations, and show how microformats information should be able to work well with information in an RDF database [1]. I demonstrated the Beatnik Address Book and gave an overview of why this was now filling a really important gap, enabling distributed social networks, a topic on which I have written a lot recently. It inspired Dan Brickley who has been working on SPARQL over XMPP to give me some code and show how this could be integrated into Beatnik... It seems pretty easy to do. What would the use case be though...

There were a number of very interesting talks over the weekend. Daniel Lewis collected a few of the blogs covering the event. Ian Davis presented the work he has been leading on Open Data Licences (pic). Yves Raimond and his team presented some interesting work on semantics and music and an advanced inferencing engine based on SWI Prolog called Henry (picture). Tom Shelley from the Economist got us all asking questions on the pros and cons of personal knowledge in a short presentation (picture). The more information is known on us the better services can be offered, but also what are the risks? Is this not a reason one may end up needing agent technology: ie one may prefer programs to move rather than data to move? Georgi Kobilarov gave a nice overview of the very useful Linked Data project DBPedia (picture)...

All during the weekend I felt very tired which I put down for a while to the trip from Paris. On Monday morning as my condition had gotten much worse it became clear that that I had caught a virus. For two days I could hardly get out of bed, struck by a vicious flue, which has only just left me today. On Friday I was too tired to do any thinking work, so I went to see the Du Champ, Man Ray and Picabia exhibition at the Tate Modern, where you can see Du Champ's irreverent rendition of the Mona Lisa - below the picture are written the letters "L.H.O.O.Q" which if pronounced speedily enough sounds like "Elle a chaud au cul".

Notes

  1. All I need is some XSLT or Xquery transform to turn microformatted html into RDF (any well known format will do). Mind you, at a later microformat talks it turns out that this may not be quite so easy, as it seems that that the microformat community has not yet agreed on a clear grammar...

Friday Feb 15, 2008

Proof: Data Portability requires Linked Data

Data Portability requires Linked Data. To show this let me take a concrete and topical example that is the core use case of the Data Portability movement: Jane wants to move her account from social network A to social network B. And she wants to do this in a way that entails the minimal loss of information.

Let us suppose Jane wants to make a rich copy, and that she wants to do this without hyperdata. Ideally she would like to have exactly the same information in the new space as she had in the old space. So if Jane had a network of friends in social network A she would like to have the same network of friends in B. But this implies moving all the information about all her friends from A to B, including their social network too. For after all the great thing about one's friends is how they can help us make new friends. But then would one not want to move all the social network of one's friends too? Where does it stop? As William Blake said so well in Auguries of Innocence

        To see a world in a grain of sand,
	And a heaven in a wild flower,
	Hold infinity in the palm of your hand,
	And eternity in an hour.
the problem is that everything is linked in some way, and so it is impossible to move one thing and all its relations from one place to another using just copy by value, without moving everything. A full and rich copy is therefore impossible.

So what about pragmatically limiting ourselves to some subset of the information? We have to reduce our ambitions. So let us limit the data Jane can move to just her personal data and closest social network. So she copies some subset of the information about her friends over to network B. Nice, but who is going to keep that information up to date? When Jane's friend Jack moves house, how is Jane going to know about this in her new social network? Would Jack not have to keep his information on social Network B up to date too? And now if every one of Jack's 1000 friends moves to a different social network, won't he have to now keep 1000 identities up to date on each of those networks? Making it easy for Jane to move social network is going to make life hell for Jack it seems. Well of course not: Jack is never going to keep the information about himself up to date on these other social networks, however limited it is going to be. And so if Jane moves social network she is going to have to leave her friends behind.

The solution of course is not to try to copy the information about one's friends from one social network to another, but rather to move one's own information over and then link back to one's friends in their preferred social network. By linking by reference to one's friends identity one reduces to a minimum the information that needs to be ported whilst maintaining all the relationships that existed previously. Thus one can move one's identity without loss.

The rest follows nearly immediately from these observations. Since the only way to refer to resources in a global namespace is via URIs ( and the most practical way currently is to do this with URLs ), URI's will play the role of pointers in our space. This is the key architectural decision of the semantic web. So by giving people URLs as names we can point to our friends wherever they are, and even move our data without loss. All we need to do when we move our foaf file is to have the web server serve up a HTTP redirect message at the old URL, and all links to our old file will be redirected to our new home.

Notes

Wednesday Feb 06, 2008

replacing ant with rdf

Tim Boudreau just recently asked "What if we built Java code with...Java?". Why not replace Ant or Maven xml build documents with Java (Groovy/Jruby/jpython/...) scripts? It could be a lot easier to program for Java programmers, and much easier to understand for them too. Why go through xml, when things could be done more simply in a universal language like Java? Good question. But I think it depends on what types of problem one wants to solve. Moving to Java makes the procedural aspect of a build easier to program for a certain category of people. But is that a big enough advantage to warrant a change? Probably not. If we are looking for an improvement, why not explore something really new, something that might resolve some as yet completely unresolved problems at a much higher level? Why not explore what a hyperdata build system could bring to us? Let me start to sketch out some ideas here, very quickly, because I am late on a few other projects I am meant to be working on.

The answer to software becoming more complicated has been to create clear interfaces between the various pieces, and have people specialise in building components to the interfaces. It's the "small is beautiful" philosophy of Unix. As a result though, as software complexity builds up, every piece of software requires more and more pieces of other software, leading us from a system of independent software pieces to networked software. Let me be clear. The software industry has been speaking a lot about software containing networked components and being deployed on the network. This is not what I am pointing to here. No I want to emphasise that the software itself is built of components on the network. Ie. we need more and more a networked build system. This should be a big clue as to why hyperdata can bring something to the table that other systems cannot. Because RDF is a language whose pointer system is build on the Universal Resource Identifier (URI) it eats networked components for lunch, breakfast and dinner. (see my Jazoon presentation).

Currently my subversion repository consists of a lot of lib subdirectories full of jar files taken from other projects. Would it not be better if I referred to these libraries by URL instead? The URL where they can be HTTP gotten from of course? Here are a few advantages:

  • it would use up less space in my SubVersion repository. A pointer just takes up less space than an executable in most cases.
  • it would use up less space on the hard drive of people downloading my code. Why? Because I am referring to the jar via a universal name, a clever IDE will be able to use the local cached version already downloaded for another tool.
  • it would make setting up IDE's a lot easier. Again because each component now has a Universal Name, it will be possible to link up jars to their source code once only.
  • the build process, describing as it does how the code relates to the source, can be used by IDEs to jump to the source (also identified via URLs) when debugging a library on the network. (see some work I started on a bug ontology called Baetle)
  • Doap files can be then used to tie all these pieces together, allowing people to just drag and drop projects from a web site onto their IDE, as I demonstrated with Netbeans
  • as IDE gain knowledge of which components are successors to which other components, from such DOAP files, it is easy to imagine them developing RSS like functionality, where it scans the web for updates to your software components, and alerts you to those updates which you can then test out quickly yourself.
  • The system can be completely decentralised, making it a WEB 3.0 system, rather than a web 2.0 system. It should be as easy as having to place your components and your RDF file on a web server served up with the correct mime types.
  • It will be easy to link up jars or source code ( referred to as usual by URLs ) to bugs (described via something like Baetle ). Making it easy to describe how bugs in one project depend on bugs in other projects.

So here are just a few of the advantages that a hyperdata based build system could bring. They seem important enough in my opinion to justify exploring this in more detail. Ok. Well, let me try something here. When compiling files one needs the following: a classpath and a number of source files.

@prefix java: <http://rdf.sun.com/java/> .

_:cp a java:ClassPath;
       java:contains ( <http://apache.multidist.com/cocoon/2.1.11> <http://openrdf.org/sesame/2.0/> ) .

_:outputJar a java:Jar;
       java:buildFrom <src>;
       java:classpath _:cp .

_:outputJar 
        :pathtemplate "dist/${date}/myprog.jar";
        :fullList <outputjars.rdf> .
If the publication mechanism is done correctly the relative URLs should work on the file system just as well as they do on the http view of the repository. Making a jar would then be a matter of some program following the URLs to download all the pieces (if needed), put them in place and use that to build the code. Clearly this is just a sketch. Perhaps someone else has already had thoughts on this?

Friday Feb 01, 2008

3 semantic web talks for JavaOne 2008

At least 3 semantic web talks were accepted for JavaOne 2008, taking place on May 6-9 in San Francisco. There may be more, but the following I am sure of:

  • A talk by Dean Allemang on practical ontology writing based on his soon to be published book "The Working Ontologist". I am really looking forward to it coming out, as it is a book that should help cut down the learning curve dramatically.
  • Über programmer Tim Boudreau and I will be presenting Beatnik: Building an Open Social Network Browser at a Birds of a Feather session. We will look at both the client and server side components and how the theory developed by Dean can turn into a practical product that solves real problems: the data silo effect of current social networking sites.
  • Finally some key players will be joining the "Developing Semantic Web Applications on the Java™ Platform" panel where we will hopefully start a discussion and get feedback on what can be done to bring many many more of the 5 million Java developers on board the semantic web. This panel discussion ( the list of panelists is not complete yet ) will be hosted by Rob Frost of BEA and I.

Hopefully this should allow the 20 thousand or so attendees joining us at JavaOne to get a good overview of the the practical developments in this area. And if they like it, the Semantic Conference in San Jose will be taking place a week later from the 18th to the 22nd of May where they will be able meet many of the leading companies and researchers in this area.

For detailed session information see my later post.

Tuesday Jan 15, 2008

Data Portability: The Video

Here is an excellent video to explain the problem faced by Web 2.0 companies and what Data Portability means. It is amazing how a good video can express something so much more powerfully, so much more directly than words can. Sit back and watch.


DataPortability - Connect, Control, Share, Remix from Smashcut Media on Vimeo.

Feeling better? You are gripped by the problem? Good. You should now find that my previous years posts start making a lot more sense :-)

Will the Data Portability group get the best solution together? I don't know. The problem with the name they have chosen is that it is so general, one wonders whether XML is not the solution to their problem. Won't XML make data portability possible, if everyone agrees on what they want to port? Of course getting that agreement on all the topics in the world is a never ending process.... Had they retained the name of the original group this stemmed from, Social Network Portability then one could see how to tackle this particular issue. And this particular issue seems to be the one this video is looking at.

But the question is also whether portability is the right issue. Well in some ways it is. Currently each web site has information locked up in html formats, in natural language (or even sometimes in jpegs (see the previous story of Scoble and Facebook), in order to make it difficult to export the data, which each service wants to hold onto as if it was theirs to own.

Another way of looking at this is that the Data Portability group cannot so much be about technology as policy. The general questions it has to address are question of who should see what data, who should be able to copy that data, and what they should be able to do with it. This does indeed involve identity technology insofar as all of the above questions turn around questions of identity ("who?"). Now if every site requires one to create a new identity in order to access one's data one has the nightmare scenario depicted in the video, where one has to maintain one's identity across innumerable sites. As a result the policy issue of Data Portability does require one to solve the technical problem of distributed identity: how can people maintain the minimum number of identities on the web? (ie not one per site) Another issue that follows right upon the first is that if one wants information to only be visible to a select group of people - the "who sees what" part of the question - then one also needs a distributed way to be able to specify group membership, be it friendship based or other. The video again makes that point very clearly why having to recreate one's social network on every site is impractical.

What may be misleading about the term Data Portability is that it may lead one to think that what one wants is to copy one's social information from one social service to another. That would just automate the job of what the video illustrates people having to do by hand currently. But that is not a satisfactory solution. Because one cannot extract a graph of information from one space to another without loss. If I extract my friends from LinkedIn into FaceBook, it is quite certain that Facebook will not recognise a large number of the people I know on LinkedIn. Furthermore the ported information on FaceBook would soon be out of date, as people updated their network and profiles on LinkedIn. Unless of course Facebook were able to make a constant copy of the information on LinkedIn. But that's impossible right? Wrong! That is the difference between copy by value and copy by reference. If FaceBook can refer to people on LinkedIn, then the data will always be as up to date as it can be. So this is how one moves from DataPortability to Linked Data, also known as hyper data.

Sunday Jan 06, 2008

2008: The Rise of Linked Data

Here is my one prediction for 2008. Social Networking's breakdown will lead to the rise of Linked Data. Here is the logic:

  1. Social Networking sites have grown tremendously over the last few years fuelled by huge profits from advertising dollars. When I worked at AltaVista it was well known that the more you knew about your users the more valuable an ad became. If you know all the friends, interests, habits of someone, and you know what they are doing right now, you can suggest exactly the right product at the right time to them. The cost of a simple add on AltaVista was $5 per thousand page views. If you knew a lot about what someone was looking for the value could go up to $50.
  2. The allure of profit is leading to an ever increasing number of players in this space. See the Social Networking 3.0 talk at Stanford earlier in 2007.
  3. This in turn leads to a fracturing of the Social Networking space. As more players enter the space, each ends up with a smaller and partial view of the whole graph or social relations.
  4. Which is leading to the need for Social Network Portability, and more generally Data Portability. Users such as Scoble want to use their data on their own computer and link it together. Social Network Providers such as Plaxo or Facebook have a financial interest in helping their users move with their social network to their service. Facebook helps users extract all the information from GMail. Plaxo wants to help users extract all the information from every other social network.
  5. Privacy concerns will mount tremendously as a result. Each social network will increase in their users the fear of giving their data over to other "spamming" services, to defend their position. But to do this they will make it more and more difficult to extract the data from their service, annoying and so going against their users desires for linking their information. This will seem more and more like an issue for anti trust involvement as the ire of more and more people mount.

The force of the above logic will release the energy needed for an investment in Linked Data tools such as Beatnik, since it solves all the problems mentioned above - at the expense of killing the dream some investors may have had of a world where they own Nineteen Eighty Four like, the world.

Data Portability: Scoble Right or Wrong and beyond

Scoble explains Video

In this video Scoble explains how he got thrown off Facebook.

Here is a short summary, but the video is well worth watching as the emotions come through much better...
Facebook, which asks its users for their Gmail password in order to extract all the contacts someone has from their mail history and build up a possible list of friends, Facebook which scans the web for information to suggests friendships you may have, that same Facebook does not want anyone, including YOU, to be able to extract the data in your account on their web site even were it only into your own electronic address book. To do this they encode all email addresses as images which make it very difficult for a computer to decode, and so makes it tedious to move and use that information. So when Scoble tried to extract his 5000 friends using Optical Character Recognition - an idea suggested by Plaxo which wants to be a hub of people information - , Facebook noticed this and cut off his account. (I think he may have been reinstated now - but whether there is a point in belonging to such a service is a serious question now).As a result Scoble and other have asked people to join the conversation on the Data Portability group.

This clearly is a very important issue. But his solution to the issue was not the best one. By using Plaxo - which wants to be the social graph hub of the web - to extract his data, he would have been able to do what clearly he should be able to do, namely add his contact information easily to Outlook. But he did this at the cost of allowing a third entity to gather a lot of information about him and his contacts. CNET's The Scoble scuffle: Facebook, Plaxo at odds over data portabili