various bits and bytes about t Jesslog

Tuesday Jan 13, 2009

I've spent quite a bit of time browsing Wikipedia, as I'm sure many others have. There's a seemingly infinite set of articles, each linking to related information. I began to wonder how I wound up on a given page and if there were any patterns in my navigation. To investigate this, I created a small Firefox extension to capture my navigation history when browsing Wikipedia. This provided the raw data I needed which I then post-processed with Perl to create a Graphviz dot file. The result can be seen in this example.

I'm not sure what I'll gain from this, but I'll look at the results periodically. If nothing else, I'll at least learn some relationships between articles that I would otherwise overlook.

If you're interested in trying it out, leave me a comment and I'll zip up the programs and add some instructions.

Tuesday Jul 08, 2008

Earlier today a friend asked me how to go about running a certain SQL query using the NDB API. This tends to be a common question so I figured this would be a good opportunity to show an example. The query in question looks something like this:

    select count(*),emailaddr from email_example group by emailaddr having count(emailaddr) > ?;

This is a fairly basic query as it does not involve any joins. There is also an ordered index on the "emailaddr" field which makes this quite simple. Essentially we perform a scan on the ordered index, and count the number of times we see an email address in a row. If this number is greater than the parameter, then we print the row. Here's the code:

  NdbTransaction trans = ndb.startTransaction();
  /* perform a scan on the ordered index on the email address */
  NdbIndexScanOperation op = trans.getSelectIndexScanOperation("emailaddr",
                                         "email_example", LockMode.LM_Read);
  op.getValue("emailaddr"); /* read only the emailaddr field */
  NdbResultSet rs = op.resultData();
  trans.executeCommit();
  
  int emailCount = 0;
  String email = "";
  /* iterate over addresses in index */ 
  while(rs.next())
  {
    String curEmail = rs.getString("emailaddr");
    /* if we have a new email, finish the previous one */
    if(!curEmail.equals(email) && !email.equals(""))
    {
      /* print if it matches */
      if(emailCount > x)
        System.out.println("Email: " + email + ", count = " + emailCount);
      /* reset the count for the next address */
      emailCount = 0;
    }
  
    /* set the previous to the current value */
    email = curEmail;
    /* increment this email count */
    emailCount++;
  }
  
  /* after the loop we still have to check if the last address matched */
  if(emailCount > x)
    System.out.println("Email: " + email + ", count = " + emailCount);

I've attached the full code example here including some sample data. To verify that our program performs correctly, we can check the results against the output of the SQL query. Here's the output from the example program:

Connected. Running query.
Email: a@a.com, count = 2
Email: c@c.com, count = 2
Email: d@d.com, count = 2
Email: f@f.com, count = 3 

And the same result through MySQL:

+----------+-----------+
| count(*) | emailaddr |
+----------+-----------+
|        2 | a@a.com   |
|        2 | c@c.com   |
|        2 | d@d.com   |
|        3 | f@f.com   |
+----------+-----------+

This program should be pretty simple to understand if you're familiar with the NDB API. The basic element here is the index scan operation. If you find yourself wondering about how to implement an NDB API program for an existing SQL query, keep this operation in mind as it will often be used. In some future posts, I'll cover some examples of the other operations and how to combine them with the index scan.