Wednesday Apr 23, 2008

I've been writing a specialized web crawler for Aardvark, our blog recommender. The crawler is fairly straightforward - pull blog feeds, add them to the Aura DataStore, and look for URLs to other blogs. When the crawler finds a new URL it has to decide whether or not it should crawl that URL to look for blogs. The first part of this decision is to check whether or not we've already crawled the URL. This is simple enough for a small crawler - when a URL is crawled, toss the URL into a HashSet, and whenever it is time to crawl a new URL, first check the HashSet - if the URL is already there, it has already been crawled, so it can be skipped. However, this gets to be a bit trickier when the crawler wants to keep track of millions of URLs. The HashSet will no longer comfortably fit in memory. So what to do? We could keep the URLs in a sorted file that we could quickly search through, or we could use a database to store the URLs and construct a query to see if a URL is already in the database. Or we can just use our search engine. As the old saying goes, when you have a hammer, every problem looks like a nail - and since we have a search engine, to us, everything looks like a search problem. And so with our trusty search engine in hand, we created a persistent hashset that can easily deal with millions and millions of members. And the code was surprisingly easy to write. Here are the core functions:
    /** adds a string to the persistent set */
    public void add(String s) {
        if (!contains(s)) {
            inMemoryCache.add(s);
            if (inMemoryCache.size() > MAX_MEMORY_CACHE_SIZE) {
                SimpleIndexer indexer = searchEngine.getSimpleIndexer();
                for (String cachedString : inMemoryCache) {
                    indexer.startDocument(cachedString);
                    indexer.endDocument();
                }
                indexer.finish();
                inMemoryCache.clear();
            }
        }
    }

    /** tests to see if an item is in the set */
    public boolean contains(String s) {
        if (inMemoryCache.contains(s)) {
            return true;
        } else {
            return searchEngine.isIndexed(s);
        }
    }
The code was surprisingly fast too. With a million items in the set, a 'contains' operation took a couple of hundred microseconds to execute.
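To try the pattern without a search engine handy, here is a minimal, self-contained sketch. The `StringIndex` interface and the `InMemoryIndex` class are hypothetical stand-ins for the search engine (they are not part of our engine's API), and `BatchedPersistentSet` is just an illustrative name. The sketch only shows the buffer-then-flush idea: new members collect in memory and are handed to the index in one batch once the cache passes a threshold.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the search engine: something that can index and look up strings. */
interface StringIndex {
    void indexAll(Set<String> keys);
    boolean isIndexed(String key);
}

/** Toy index backed by a HashSet, only for trying the sketch out. */
final class InMemoryIndex implements StringIndex {
    private final Set<String> store = new HashSet<>();

    @Override
    public void indexAll(Set<String> keys) {
        // Copy the keys: the caller clears its cache right after this call.
        store.addAll(keys);
    }

    @Override
    public boolean isIndexed(String key) {
        return store.contains(key);
    }

    int size() {
        return store.size();
    }
}

/** Buffers new members in memory and flushes them to the index in batches. */
final class BatchedPersistentSet {
    private final Set<String> inMemoryCache = new HashSet<>();
    private final StringIndex index;
    private final int maxCacheSize;

    BatchedPersistentSet(StringIndex index, int maxCacheSize) {
        this.index = index;
        this.maxCacheSize = maxCacheSize;
    }

    /** Adds a string unless it is already in the cache or the index. */
    void add(String s) {
        if (!contains(s)) {
            inMemoryCache.add(s);
            if (inMemoryCache.size() > maxCacheSize) {
                index.indexAll(inMemoryCache);
                inMemoryCache.clear();
            }
        }
    }

    /** Checks the cheap in-memory cache first, then the index. */
    boolean contains(String s) {
        return inMemoryCache.contains(s) || index.isIndexed(s);
    }

    int cachedCount() {
        return inMemoryCache.size();
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        BatchedPersistentSet seen = new BatchedPersistentSet(index, 2);
        seen.add("http://example.com/a");
        seen.add("http://example.com/b");
        seen.add("http://example.com/c"); // third add pushes the cache past 2 and flushes
        System.out.println(seen.contains("http://example.com/a"));
    }
}
```

Anything still sitting in the cache is lost if the process dies before the next flush, so a real crawler would also want to flush on shutdown.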

So, this is one way you can abuse a search engine (the other way is to make it crawl ;)

Update: Steve has blogged about the rest of the story.

Friday Feb 01, 2008

Sometime around 2:30AM this morning, we received notification that our JavaOne proposal has been accepted. WooHoo.  Now it is time to make sure that all the work (and it is a lot of work) gets done before March 14 when the slides are due. Yikes - that's 6 weeks from now. Time to get going!

Friday Oct 26, 2007

I just received this in my inbox:


Dear Paul,

The 2008 JavaOne Conference Call for Papers is Open!

JavaOne, Sun's 2008 Worldwide Developer Conference, is seeking proposals for technical sessions and Birds-of-a-Feather (BOFs) sessions for this year's Conference.

Attracting over 15,000 developers and leaders in the developer community. From Industry leaders, to experienced developers to developers starting out - this conference is one that brings together some of the industry's best and brightest.

The JavaOne conference is your opportunity to reach this specialized community by educating and sharing your experience and expertise with the developer community. Please go to http://www.cplan.com/sun/javaone08/cfp to review the guidelines and instructions to submit your proposal(s).

If you have questions regarding the Call for Papers process, please contact j1papers@sun.com .


Wednesday Sep 12, 2007

It is time to start thinking about  JavaOne 2008. The JavaOne call-for-papers is coming soon, so if you are doing something new and/or cool with Java technology, it is time to start thinking about that JavaOne talk proposal. Giving a JavaOne talk can be quite a bit of work (and a bit of stress too), but it is also a lot of fun - and an excellent way to get exposure for one's work.  Here's the talk I gave at JavaOne 2007:

Search Inside the Music: Using Signal Processing, Machine Learning, and 3-D Visualizations to Discover New Music

Thursday Aug 30, 2007

Project Darkstar has just issued its first open-source release. Project Darkstar is the game industry’s first open source, enterprise grade, highly scalable, online game server.

Tuesday Aug 14, 2007

I did quite a bit of coding over the weekend while off the 'net - which is a bit like walking the tightrope without a net, since I can't check my changes into the code repository.  This morning my app just stopped working - bad news - without any CVS history to save me.  Ah, but then I remembered NetBeans 6 has a new local history feature that lets you see your changes and revert to previous 'versions' of the code, even if you haven't checked the code in yet.  With this feature, I was quickly able to find the place where I accidentally munged the web.xml file while trying to add some Security Constraints.  I reverted to the working version and Voila! - I was back in business. Thanks NetBeans!

 

Wednesday Aug 01, 2007

Getting a cascading style sheet to work properly is always difficult for me.  Sometimes (most of the time) things just don't work the way I think they should.  This problem gets magnified when working on a web app - the edit/build/deploy/test cycle can take 60 seconds or more. Add to that the problem of the browser or web server occasionally caching the CSS, so I don't even see the changes that I thought I made, and it can get downright frustrating.

 Coming to the rescue is the Web Developer Toolbar.  This Firefox plugin adds a toolbar with lots of nifty tools useful for developing and debugging web apps.  The toolbar has some really useful features for debugging CSS.  Turn on "View Style Information" and whenever you mouse over an element in your browser, it puts a box around it and shows you the style information like so:

html > body > div #banner > div .bannerStatusBox > table .bannerStatusBox > tbody > tr > td .bannerLeft

This can be very helpful if you are using a toolkit like GWT that attaches its own stylenames to widgets.

Perhaps the most powerful tool in the Web Developer Toolbar is the CSS Edit feature.  This feature allows you to live-edit the CSS of the current page. Make a change to the CSS and instantly see how it affects the page.  The edit/build/deploy/test loop drops from 60 seconds to 0 seconds.  It is an extremely useful way to debug and experiment.  When you are happy with the updated CSS, you can save it back to your project directory, so that the next time you do the formal build/deploy your CSS changes will be there. 

 There are lots of other nifty features in the Web Developer Toolbar. If you are doing any web development, I don't see how you can live without it.

Wednesday May 30, 2007

Bruce Johnson, tech lead of the Google Web Toolkit posts that the GWT 1.4 release candidate is now available for download.  Highlights in this release (from my point of view) are:

  • Size and speed improvements - the GWT team is doing everything they can to reduce the startup time, including using ImageBundles to reduce the number of HTTP requests needed to fetch static images.  Their goal is 300ms maximum for an application load time.
  • Some new widgets - RichTextArea, SuggestBox, PushButton, DisclosurePanel and Splitters.
  • Support for mouse wheel events
  • Lots of bug fixes (although I've yet to encounter a GWT bug).
See the release notes and the complete list of enhancements.

Wednesday May 16, 2007

Some music highlights from JavaOne 2007:

  • 'Real' bumper music - In previous years, the music at the keynotes was canned, non-commercial music.  The type of music you could buy a license for and play forever.  I can still remember the trombone solo in one of the frequently played tracks.  This year, they played 'real' music - that is, commercial music - songs by OK Go, U2 w. Mary J. Blige, songs that most people would recognize.  Same for the tech sessions - the real music was nice.  I did notice that on the keynote webcasts I've watched, they've replaced the commercial music with the canned music, sigh ... such is the state of music licensing on the web.
  • DJ Anon - The opening session started with electronic music by DJ Anon - the music set a good vibe for the day.
  • Tech Sessions -  There were only two tech sessions directly related to music: a talk on the fascinating jFugue, and my talk on Search Inside the Music.
  • Mini music BOF - A bunch of us that have an interest in music and Java went out for dinner one evening.  Next year, we should have a real BOF around music and Java.
  • School of Rock - one of the music highlights of JavaOne was during the walk back to the hotel one evening.  There was a stage set up in the middle of Union Square, and a bunch of kids were playing music.   I stopped and watched 3 bands play - these kids were remarkable - playing some extremely difficult songs (like Zappa).  The kids were from the School Of Rock. Awesome stuff.


Tuesday May 15, 2007

At JavaOne this year I went to a lot of talks, at least 30 of them.  After a few of them I started to notice some very bad patterns, antipatterns if you will, in many of the talks.  These are things one should never hear or see in a talk. JavaOne was filled with them, even from some very high profile speakers, from some very visible companies (and yes, many of them were from Sun).  If you are going to give a talk at JavaOne, please don't adopt these practices:

  • Give the same talk as last year.  Yep, I saw two very well attended talks (1,000+ attendees) where the subject matter (and even the slides) of the talk was the same as last year's talk.
  • Tell us about your broken demo - if you are not showing  a demo because it is broken, don't even bother telling us about it.  One talk had 4 demos in the slides, but 3 of them were missing, broken, or never written.
  • Tell us your demo is cool.  I'll be the judge of what I think is  cool.  One talk started with the speaker telling us he was going to 'knock our socks off', yawn.
  • Tell us you are skipping content because you ran out of time - If you are running out of time, adapt, but don't narrate the process.
  • Forget what was in your talk.  During one very high profile talk, the speaker got to a slide that said Demo. The speaker said ... hmmm I wrote these slides two months ago,  I'm not sure what I was thinking about when I wrote this slide. My advice - practice your talk, if you find a  slide for a demo that you can't remember, delete the slide (they let you delete slides right up until your talk).

Monday May 14, 2007

During the intense week of JavaOne, I've built up a list of technologies that I want to explore after learning about them in Moscone.  Here are some of the things that are on my post-JavaOne To Do list:

  • Play with JFugue - it's a simple API to use to create music.  Lots of fun.
  • Play with JavaFX Script - looks like lots of fun - but not fully baked yet.
  • Start using NetBeans 6.0 - there are lots of really powerful features in NB 6 - including a whole bunch of new editing features.
  • Learn more about the Java Persistence API and JavaDB
  • Learn more about the Swing Application Framework 
  • Play with Sun Spots (I have a couple on my desk as I type this).
  • Work with enthusiastic interns that do amazing things.
  • Think about ways to use MPk20/Project Wonderland/Darkstar for building a shared music space
  • Internalize the best practices for wildcards, bounded types and bounded wildcards as put forward by Josh Bloch
  • Explore using the JMonkey Engine for the next generation of 3D visualizations.
Add this to the tasks that I delayed during the run-up to JavaOne, plus the inbox with 2,000 emails in it ... and it looks like it will be a while before I get caught up.  But it is all fun.

Thursday May 10, 2007

I was really looking forward to the GWT talk at JavaOne (and apparently so were many others, there seemed to be a thousand in the room during the talk).  The talk was presented by two of the core GWT team members.  They did a super job giving the talk - they were relaxed, were polished in their presentation, used humor, didn't get fazed by AV glitches.  They talked a bit about the philosophy behind GWT. They put some stakes in the ground - 'static typing is good, dynamic typing is not so good', lots of sexy widgets not so important as user experience.  They want to bring software engineering to web development.  They claim (and I agree) that it is hard to engineer a pile of steaming JavaScript.  Java is much better for all its 'abilities' - maintainability, adaptability, etc.

They had some good metrics about how users react to load time.  If your app takes too long, your users will leave.  They have a 300ms goal for a load time of a rich GWT application - and they are always working hard to reduce load time.  They do this in a number of ways - they use the browser cache to help of course, they make the code they ship over the wire very compressible, and they reduce the overall number of HTTP requests (compare that to what is typically done with a traditional Ajax app, where there may be many trips to the server just to get all of the JavaScript).  GWT builds separate JS bundles for the different browsers, so if you are running in Firefox, you don't have any IE-specific JS.  One really neat optimization they are doing is called 'image bundles'.  Apps that have lots of static images can take a long time to load since each image is a separate HTTP request.  GWT can bundle all of these images into a single image. So your app only has to do a single HTTP request to get the images. They can then use windowing within this larger image to display the various images.  A very clever technique.  GWT also includes a timing framework so you can easily instrument your app so you can see how long it takes the app to load.
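The windowing idea behind image bundles can be sketched in plain Java. This is not GWT's ImageBundle API; `ImageStrip` and `Window` are hypothetical names, and the left-to-right packing is an assumption made to keep the illustration short. Each image is placed side by side in one combined strip, and the sketch records the clip window a page would use to show just that image.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: lays images out left to right in one strip and remembers each clip window. */
final class ImageStrip {

    /** The rectangle inside the combined strip that shows a single image. */
    static final class Window {
        final int left;
        final int top;
        final int width;
        final int height;

        Window(int left, int top, int width, int height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    private final Map<String, Window> windows = new LinkedHashMap<>();
    private int stripWidth = 0;
    private int stripHeight = 0;

    /** Appends an image of the given size to the right edge of the strip. */
    void add(String name, int width, int height) {
        if (windows.containsKey(name)) {
            throw new IllegalArgumentException("duplicate image name: " + name);
        }
        windows.put(name, new Window(stripWidth, 0, width, height));
        stripWidth += width;
        stripHeight = Math.max(stripHeight, height);
    }

    /** Returns the clip window for an image, or null if the name is unknown. */
    Window windowFor(String name) {
        return windows.get(name);
    }

    int stripWidth() {
        return stripWidth;
    }

    int stripHeight() {
        return stripHeight;
    }

    public static void main(String[] args) {
        ImageStrip strip = new ImageStrip();
        strip.add("ok.png", 16, 16);
        strip.add("cancel.png", 24, 20);
        Window w = strip.windowFor("cancel.png");
        System.out.println("cancel.png starts at x=" + w.left
                + " in a " + strip.stripWidth() + "x" + strip.stripHeight() + " strip");
    }
}
```

The browser then fetches the one combined image and each widget shows only its own window, which is where the saving in HTTP requests comes from.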

They showed some demos ... I was a bit disappointed here since they just showed the demos that are available on the GWT site. I was also disappointed that they didn't talk much about where they were going, what new features were on the roadmap.

During the Q/A one audience member asked about how to write a GWT app that could be crawled by search engines like Google.  The GWT guys knew about this problem and they are indeed working hard on making GWT sites crawlable.

 All in all, a good session.

 


Sunday May 06, 2007

I arrived in SFO at around 1PM (after sitting 100 yards from the gate for a frustrating hour waiting for another plane to leave the gate). I checked into the hotel (the venerable Sir Francis Drake), and walked down to Moscone for the early registration and to see if anything was going on. There were lots of folks there ready to help. It took all of 2 minutes to register. I got my speaker badge, my nice J1 backpack (only a few goodies inside), and my lunch tickets. I was also able to register for CommunityOne - the combined NetBeans and Glassfish event.



I'm looking forward to a fun day tomorrow.  

Friday Apr 20, 2007

For one of the Search Inside the Music demos that I'll be showing at the Sun Labs open house next week, I've been building up a database of music artists and related info.  I've been gathering this data from many different places on the web using a crawler.  It's not a fast process - it takes about a month to build up enough data to make it useful.  My crawler collects the data and writes it out to a set of text files so that later it can be indexed with our nifty search engine.

I tested the whole process using a small crawl of the web on my Linux laptop.  When I was happy that everything was working fine, I started the crawl running on one of  our large Solaris servers.  Unfortunately, there was a lurking bug ... a Java programming 101 kind of bug that would make the data collected from the month long crawl be wrong.

Music artist names are very international.  There's Björk, there's José Feliciano, there's Mötley Crüe and Motörhead (there's even a whole genre of music called umlaut metal).  My mistake was forgetting that when writing a text file in Java (using a PrintWriter for instance), the default encoding used is the encoding of the operating system.  Now for my Linux laptop, the default encoding is UTF-8, which can handle all of the umlauts and accents.  But for our Solaris server, the default encoding is plain, old ASCII.  With its 7 bits, ASCII can't represent any of the rich characters that are needed to represent all of the artist names.  When I indexed my 30 days of data and started looking at the results I was very sad to see 'bj?rk' and 'm?tley cr?e'.

With our open house demo just 5 days from now, there's no way for me to recrawl the data and save it to disk using the proper encoding.  Luckily, when I did the initial crawl, I resolved all of the artists to a MusicBrainz ID.  I'm able to turn this ID back into the canonical name for the artist, so I am able to patch the names without having to do a recrawl.  Whew!

So my lesson for the day is ... don't rely on the default encoding when reading and writing text. Now, back to getting the rest of the demo to work.
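Here is a minimal sketch of what "not relying on the default encoding" can look like, using the standard `java.nio.file.Files` and `StandardCharsets` classes. The class name `Utf8TextFiles` and its two methods are hypothetical, made up for this illustration; they are not the crawler's actual code. The point is only that the charset is named explicitly on both the write and the read side, so the result does not depend on which machine runs the code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper that always reads and writes text as UTF-8. */
final class Utf8TextFiles {

    /** Writes one name per line as UTF-8, whatever the platform default encoding is. */
    static void writeNames(Path file, Iterable<String> names) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            for (String name : names) {
                out.println(name);
            }
        }
    }

    /** Reads the first line back using the same explicit encoding. */
    static String readFirstName(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("artists", ".txt");
        // Unicode escapes keep this source file itself plain ASCII.
        String bjork = "Bj\u00f6rk";
        writeNames(file, Arrays.asList(bjork, "Mot\u00f6rhead"));
        System.out.println("round trip ok: " + bjork.equals(readFirstName(file)));
        Files.delete(file);
    }
}
```

The same applies when reading the files back for indexing: open them with the encoding they were written in rather than whatever the indexing machine defaults to.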
 

Monday Apr 16, 2007

For an application that I'm writing, I wanted to be able to persist a large number of objects.  I wanted to avoid the complexities of a real database and the opaqueness and fragility of Java object serialization.  What I really wanted was the ability to turn any object into a bit of XML that I could save and load at will, without having to do a lot of work defining the mapping between the Java object and the XML.

There are a number of  standard Java APIs to go back and forth between Java and XML, but they all seem a bit complex for what I wanted to do.   Luckily, I stumbled upon XStream.  XStream is a simple API for doing exactly what I wanted:  serialize objects to XML and back again.  The API is quite simple to use.  To serialize an object to a file I use the code:

    XStream xstream = new XStream();
    BufferedWriter writer = new BufferedWriter(new FileWriter(file));
    xstream.toXML(myObject, writer);
    writer.close();

And to load the object I use:

    BufferedReader reader = new BufferedReader(new FileReader(file));
    myObject = (MyObject) xstream.fromXML(reader);
    reader.close();

It couldn't be any easier than that. 

XStream is fast, simple, and easy to use.  Highly recommended.


 

This blog copyright 2008 by plamere