Tags, keywords, and inconsistency
Here's an interesting fact upon which I'll base the rest of my argument: people are horribly inconsistent when assigning keywords to documents. If you give two people the same document and ask them to assign a set of keywords to describe it, then the sets of keywords that they assign will agree only about 20% of the time. This was one of the problems that led to the development of full text indexing systems. If we couldn't choose a few keywords from a document, we would use every word in the document as a keyword!
This so-called inter-indexer inconsistency is kinda-sorta the Halting Problem for Information Retrieval. If you can convince yourself that a problem you're looking at is really the keyword assignment problem, then you can pretty safely say that people will be inconsistent when doing whatever it is you're studying.
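To make that agreement figure a bit more concrete, here's a rough sketch of one way to score how well two indexers agree. I'm assuming a simple overlap ratio (keywords both people chose, divided by all the distinct keywords either of them chose); the studies behind the 20% number may well have measured it differently. The function name and the sample keywords are made up for illustration.

```python
def indexing_consistency(keywords_a, keywords_b):
    """Overlap ratio between two keyword sets (an assumed, Jaccard-style measure)."""
    # Normalize case and whitespace so trivial differences don't count as disagreement.
    a = {k.strip().lower() for k in keywords_a}
    b = {k.strip().lower() for k in keywords_b}
    union = a | b
    if not union:
        # Two empty keyword sets agree trivially.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical keyword sets from two people describing the same document.
indexer_1 = ["tagging", "folksonomy", "information retrieval"]
indexer_2 = ["tags", "information retrieval", "blogs", "search"]

print(round(indexing_consistency(indexer_1, indexer_2), 2))
```

Only one of the six distinct keywords is shared here, so the score comes out low even though both people were plainly describing the same thing.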
For example, people are inconsistent when assigning hypertext links within and between documents. Peter Willett did a study for the British Library that showed that, given a large document, different people will tend to assign hypertext links between different pairs of paragraphs. During my Ph.D., I showed a similar result for links between documents. There's a good summary in an ACM Computing Surveys article from 1999.
So, what does this mean? Let's say we have a system where people manually assign keywords to documents (as far as I can tell, this is what tagging is, but I'm happy to be corrected) and let's also say that people can run queries against this index of keywords. You can think of such a query as an attempt by the searcher to assign keywords to a document that he or she would like to get in response to the query. The problem is that the person who originally assigned a tag to the document and the searcher who "assigned a tag" to the document are going to be inconsistent, so the searcher won't pick quite the same tag.
So, that's why James Governor is wrong and Tim Bray is right about tagging: it's not really a new way of indexing documents, it's actually an old way that didn't work very well.
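The tagger-versus-searcher mismatch is easy to see in a toy version of such a system. This is only a sketch of an exact-match tag index, assuming tags are compared as lowercase strings; the document IDs, tags, and function names are all hypothetical, and I'm not claiming any real tagging service works this way.

```python
from collections import defaultdict


def build_tag_index(tagged_docs):
    """Map each (lowercased) tag to the set of document IDs that carry it."""
    index = defaultdict(set)
    for doc_id, tags in tagged_docs.items():
        for tag in tags:
            index[tag.lower()].add(doc_id)
    return index


def search(index, query_tag):
    """Return the documents whose tagger chose exactly the searcher's tag."""
    return sorted(index.get(query_tag.lower(), set()))


# Hypothetical posts and the tags their authors assigned.
docs = {
    "post-1": ["folksonomy", "blogs"],
    "post-2": ["tagging", "search"],
}
index = build_tag_index(docs)

print(search(index, "tagging"))  # the searcher happened to pick the tagger's word
print(search(index, "tags"))     # a near-synonym finds nothing
```

The second query is the inconsistency problem in miniature: the searcher wants post-2, but "tags" and "tagging" are different keys in the index.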
The only method that has been shown to improve the consistency of keyword assignment is to assign keywords from a fixed vocabulary (see, for example, MeSH, the Medical Subject Headings). Maintaining such a vocabulary is a non-trivial task. Obviously, the Technorati tagging system is not "controlled" in this sense (or possibly in any sense), but I'm wondering whether its web-scale nature can provide some benefit that one would not expect.
Here's what I mean: if hypertext link assignment is inconsistent, how come Google's PageRank does such a good job of finding relevant pages? The answer, at least in part, is that there are a lot of pages and a lot of links out there, so that some agreement can be reached (at least on things that lots of people care about!). If the Technorati tags can be organized and made available in such a way that taggers are facing a recognition problem (as opposed to a recall problem), I'm wondering whether we couldn't get some of the benefits of a controlled vocabulary, at least for the popular tags.
So, that's why James Governor is right and Tim Bray is wrong about tagging: it may be more like assigning keywords from a controlled vocabulary.
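Here's a sketch of what turning tagging into a recognition problem might look like: as someone starts typing a tag, show them the most popular existing tags that match, so they can pick one rather than invent one. The prefix matching, the ranking by raw usage count, and the sample counts are all my assumptions for illustration, not a description of anything Technorati does.

```python
from collections import Counter


def suggest_tags(tag_counts, prefix, limit=3):
    """Suggest existing tags starting with `prefix`, most frequently used first."""
    prefix = prefix.lower()
    matches = [(tag, n) for tag, n in tag_counts.items() if tag.startswith(prefix)]
    # Sort by descending usage count, breaking ties alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in matches[:limit]]


# Hypothetical record of how often each tag has already been used.
tag_counts = Counter(
    ["blog", "blog", "blog", "blogs", "blogging", "blogging", "search"]
)

print(suggest_tags(tag_counts, "blo"))
```

Because the popular tags float to the top, taggers who accept a suggestion drift toward a shared vocabulary without anyone having to maintain one by hand.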
Ultimately, I think Tim's caution is warranted, not that I think my opinion will keep people from tagging. This whole issue needs to be the subject of some actual retrieval evaluations.


Posted by James Governor on May 13, 2005 at 11:16 AM EDT
Ken makes a very good point (probably better than I did :-). Actually, Einat Amitay, a Ph.D. student from my Macquarie University days, did an interesting thesis on extracting descriptions of Web pages from the links pointing to them. If you're interested it's here. Part of what she did was up on labs.google.com for a while, but it appears to be gone now.
I also think that autoclassification could help people do a better job of assigning tags consistently (i.e., using the content of the document as an indicator for tagging) while not stepping on their toes, creativity-wise. But it's possible that's just because I do document autoclassification research!
Posted by Stephen Green on May 13, 2005 at 11:46 AM EDT
Posted by Joel Dinda on May 13, 2005 at 01:00 PM EDT
I had a look at Technorati's tags when I was writing the entry and I was surprised how many morphological variations there were (e.g., blog and blogs are treated separately). If we really want to build folksonomies, it seems like we could at least try to control for this kind of obvious variation...
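As a rough sketch of what controlling for that variation could mean, here's a deliberately naive normalizer that lowercases tags and strips a trailing plural "s" before grouping them. The rule and the function names are my own assumptions for illustration; a real stemmer would handle far more cases, and this one will mangle words like "news".

```python
def normalize_tag(tag):
    """Lowercase a tag and strip a simple trailing plural 's' (naive on purpose)."""
    tag = tag.strip().lower()
    # Leave short tags and words ending in 'ss' (e.g. "press") alone.
    if len(tag) > 3 and tag.endswith("s") and not tag.endswith("ss"):
        return tag[:-1]
    return tag


def merge_variants(tags):
    """Group the raw tags under their normalized form."""
    groups = {}
    for tag in tags:
        groups.setdefault(normalize_tag(tag), []).append(tag)
    return groups


print(merge_variants(["blog", "Blogs", "press", "tags"]))
```

With this, "blog" and "Blogs" land in the same group, which is the kind of obvious merge I have in mind.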
Posted by Stephen Green on May 13, 2005 at 01:59 PM EDT