Thursday, December 22, 2005
Chris Rijk pointed out some caching ideas in his comment on my last post about some of the changes that went into Roller 2.1. I was going to respond as a comment, but it got a bit too long so I decided a new post was more appropriate.
Chris,
I agree that caching data is one way to solve the problem, but it is not the only way. I tend to consider three general approaches to caching in web applications, each with its own benefits ...
1. Data Caching. As you mentioned, this means caching your objects rather than rendered content, so that you can quickly access those cached objects to render a page. This approach is ideal for very dynamic content which could be changing at any minute, like a discussion forum. However, with this approach you also spend time rendering on each request, which lengthens your request processing time.
2. Cache small pieces of rendered content. This is the portal approach where you are rendering lots of little bits of content and caching that, then pulling multiple pieces of your cached content together to make the full page. This is a great approach when your content is only semi dynamic, yet needs to be reused in many different ways on different pages. Unfortunately this is typically the hardest cache system to implement and requires a great level of caching control.
3. Cache fully rendered pages. This is the easiest approach and typically the fastest: you simply render the page and cache it as a whole. The problem with this approach is that your pages can't have any truly dynamic elements that change per request.
I think that, depending on your application, how dynamic the content is, and how much content and memory you have, you will choose some combination of these three cache methods.
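To make approach #3 concrete, here is a minimal sketch of a fully-rendered-page cache with least-recently-used eviction and a hit-rate counter. It is only an illustration, not Roller's actual cache: the `PageCache` class, its method names, and the `renderer` callback are all hypothetical, and it assumes a reasonably recent Java runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical whole-page cache with LRU eviction (approach #3 above). */
class PageCache {
    private final Map<String, String> pages;
    private long hits;
    private long misses;

    PageCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.pages = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used page once the cache is over capacity.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached page for key, rendering and storing it on a miss. */
    synchronized String get(String key, Function<String, String> renderer) {
        String html = pages.get(key);
        if (html != null) {
            hits++;
            return html;
        }
        misses++;
        // Simplification: rendering happens while the lock is held.
        html = renderer.apply(key);
        pages.put(key, html);
        return html;
    }

    /** Drops a page, e.g. after the underlying weblog entry changes. */
    synchronized void invalidate(String key) {
        pages.remove(key);
    }

    /** Fraction of lookups served from the cache so far. */
    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

The fixed entry limit is what makes memory the bottleneck: when there are far more distinct pages than cache slots, entries are evicted before they are requested again and the hit rate drops.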
Now, to relate this specifically to Roller. The caching in Roller is almost entirely #3 at the moment, primarily because the content on a weblog really isn't that dynamic. Remember, a weblog is really just a fancy name for "website" with a really easy publishing system built in. We could possibly use option #2, however our problem is that all the weblog content is user controlled via Velocity templates, so it's hard for us to figure out what pieces of content to cache. And we could definitely use #1.
There will certainly be places in Roller where data caching is ideal, but for right now we'll probably stick with caching the fully rendered content. Our biggest problem is really just memory: we need more room for our caches as the site gets more and more content, and that problem won't change whether we're caching pages or data.

As another side-note, what kind of hit rate does the site get? Apart from the implementation issues with the current caching, I'm rather surprised you're having performance issues given the page based caching. Even with the "slow" per-request rendering required with data-based caching and with old hardware, aceshardware.com still easily handles the infamous "Slashdot Effect". On something like a Sun Fire T2000, it could probably handle well over 1000 hits/sec.
In terms of where most of the time is spent in processing requests on aceshardware.com, for pages showing a lot of posts (i.e. forum index pages), the main overhead is actually the date rendering speed (with Java's built-in date classes). I was seriously tempted to write my own, but it was a bit too much effort. Most other pages render in a few milliseconds - if the site was running on a top-speed Opteron, we'd have to use nanoTime to measure the speed properly. This is with carefully written, though clean and simple, JSPs. I've no idea what kind of page rendering overheads Roller's Velocity templates would have.
Incidentally, there is some extra processing that goes on at aceshardware.com. We compress the HTML output (for supporting clients). It doubles the CPU overhead, but saves us lots of bandwidth. Since you're doing page based caching, I would suggest (assuming you're not doing it already) that you GZIP the HTML on creation, and keep the compressed version in memory. Then, for supporting clients, you simply return the GZIP'd version. For unsupporting clients (a few percent by page hits), you'd uncompress the data on the fly. This could reduce the memory consumption of the cache by about 20x, allowing more pages to be cached. As far as I can tell, you're GZIP'ing the sent HTML, but I've no idea if you're compressing the in-memory data.
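As a rough sketch of that suggestion (not how aceshardware.com or Roller actually implement it), a cache entry could hold only the GZIP'd bytes and inflate them on demand. The `CompressedPage` class and its method names are hypothetical, and the `Accept-Encoding` check is deliberately simplistic: it ignores quality values such as `gzip;q=0`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical cache entry that keeps only the GZIP'd form of a rendered page. */
class CompressedPage {
    private final byte[] gzipped;

    /** Compresses the rendered HTML once, at creation time. */
    CompressedPage(String html) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        this.gzipped = buf.toByteArray();
    }

    /**
     * Returns the response body for a client, given its Accept-Encoding
     * header value (may be null). When the compressed bytes are returned,
     * the caller must also send "Content-Encoding: gzip".
     */
    byte[] bodyFor(String acceptEncoding) throws IOException {
        return acceptsGzip(acceptEncoding) ? gzipped : inflate();
    }

    /** Simplified check: does the header mention gzip at all? */
    static boolean acceptsGzip(String acceptEncoding) {
        return acceptEncoding != null
                && acceptEncoding.toLowerCase(Locale.ROOT).contains("gzip");
    }

    /** Uncompresses on the fly for the minority of clients without gzip support. */
    private byte[] inflate() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Number of bytes this entry occupies in the cache. */
    int storedSize() {
        return gzipped.length;
    }
}
```

How much memory this saves depends entirely on how repetitive the HTML is, so the ratio is worth measuring on real pages before resizing the cache around it.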
Regarding object caching, page caching, and sub-page caching, I've done all 3. Certainly which is best depends on many factors - for a start, adding object caching after the fact can take a lot of programming effort. (I didn't mean to imply that caching the data is the only way - actually my true preference would be using Java object databases, meaning the DB can handle the caching automatically). It certainly sounds like, for Roller as it is today, page based caching is the way to go.
The main downside I can see is that it will limit what features can be added in the future. For example, user specific features (even if only based on cookies and not requiring registration) can improve the user experience - driving up interest in the site. But that would break the page caching model, unless you do some funky stuff with Javascript.
Posted by Chris Rijk on December 22, 2005 at 12:55 PM PST #
I agree with pretty much everything you said.
We get excellent hit rates on our smaller caches, like the main page, planet page, and xml feeds ... usually around 90%. The problem is the weblog page cache, which is often below 20%. ugh. That's not to say we can't handle the "Slashdot Effect"; blogs.sun.com has been slashdotted many times, but we can obviously do better. I really wasn't looking to optimize the weblog page cache in Roller 2.1, my main goal was just to lay down the groundwork.
I don't know how efficient Velocity rendering is, but I would guess it's quite reasonable. Probably not as fast as JSPs or servlets, but I bet it's hardly noticeable as long as the templates aren't too complex.
I really like your gzip idea. I haven't given much thought to our gzip filter, but your idea makes complete sense. I'll definitely look into that.
Luckily, Roller has a pretty good backend architecture and currently we are using Hibernate as an ORM tool, so the object level caching can actually be left up to Hibernate. I'm not particularly well versed in Hibernate, so I'm not even sure what the state of our current object level caching is, but it's something I plan to look at for Roller 2.2.
Posted by Allen on December 22, 2005 at 01:17 PM PST #