Caffeinated Bitstream

Bits, bytes, and words.

Roller

Migrating from Apache Roller to Hugo and Isso

After almost ten years of using Apache Roller to power this blog, I'm making the leap to the Hugo static site generator. Roller served me well, but after years of watching Roller+Tomcat use hundreds of megabytes of memory on my server, I decided that it was overkill for my needs.[1] The only major feature which absolutely demands dynamically generated pages is the comments, and I've migrated that functionality to a distinct service using Isso.

My goals are:

  • Reduce the server resource usage.
  • Allow blog posts to be created, managed, and revisioned using the same tools I use to manage software projects — Vim, Git, etc.
  • Reduce the deployment effort of the server-side software. (Like many Go apps, Hugo is a single statically linked binary with no dependencies to worry about.)

For my own future benefit, I'm providing my notes on this migration below.

Migration steps

Most of the migration process was straightforward, although a bit time-consuming. Unfortunately, I don't have a "roller-to-hugo" script that magically converts a blog, as several categories of content assets needed to be manually adapted to the Hugo way of doing things. I can outline the basic steps, though:

  1. Create a Hugo theme. I wanted the blog to look and feel the same in Hugo as it did in Roller, so I needed to create a custom Hugo theme. I was already using custom stylesheets and Velocity templates on Roller, so simply extracting these assets from the database's webpage table into files got me most of the way there. I then needed to touch up the files to convert Velocity markup into Go templates and adapt the pagination scheme.
  2. Port blog posts. Roller stores blog entries in the weblogentry table, and their associated tags in the roller_weblogentrytag table. I wrote a one-off Python script to create Hugo content files out of this data.
  3. Port static content. This was simply a matter of finding the Roller resources directory in the filesystem, and copying it to the Hugo static/resources directory.
  4. Port RSS and Atom feeds. The current version of Hugo does not have built-in support for Atom feeds, so I used a template-generated solution as described here. I also needed to update the <head><link ... /></head> references in my templates to point to the new feed URLs.
  5. Port comments. I used the Isso comment server to support comments. Isso works similarly to Disqus, except it is self-hosted.
    1. Install Isso in a Docker container for isolation and ease of management.
    2. Map the blog's /isso URLs to Isso.
    3. Add the client HTML bits to inject the Isso comments into pages.
    4. Configure Isso: Basic configuration (dbpath, host, [server].listen), logging, SMTP notifications, moderation, and guard settings (rate limits).
    5. Import comments. I wrote a one-off Python script to import comments, paying careful attention to properly initialize the voters Bloom filter bitmask in Isso's SQLite database.
  6. Map old Roller URLs to Hugo URLs. I configured some 301 (permanent) redirects on my web server so that existing links to blog posts and feed URLs will continue to work.
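
The Isso settings named in step 5.4 could be laid out roughly as follows. This is only a sketch that writes an INI-style file with Python's configparser; every path, hostname, address, and rate-limit value is a placeholder assumption, not the configuration actually used here, and the option names should be checked against the Isso documentation for the installed version.

```python
import configparser
import io


def build_isso_config():
    """Return the text of a minimal Isso configuration (placeholder values)."""
    cfg = configparser.ConfigParser()
    cfg["general"] = {
        "dbpath": "/var/lib/isso/comments.db",   # assumed location of the SQLite file
        "host": "https://blog.example.com/",     # assumed public URL of the blog
        "notify": "smtp",                        # send notifications by mail
    }
    cfg["server"] = {"listen": "http://localhost:8080/"}
    cfg["moderation"] = {"enabled": "true"}
    cfg["smtp"] = {
        "host": "localhost",
        "port": "25",
        "to": "me@example.com",                  # placeholder address
        "from": "isso@example.com",              # placeholder address
    }
    cfg["guard"] = {
        "enabled": "true",
        "ratelimit": "2",                        # placeholder: new comments per minute
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(build_isso_config())
```

Generating the file from a script is a design choice rather than a requirement; the same sections can simply be typed into the configuration file by hand.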

I have a few ugly Python scripts for migrating data from Roller to Hugo/Isso, but I'll hold off on posting them unless someone really wants to see them.
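
To give a flavor of step 2 without the actual scripts, here is a minimal sketch of turning one blog entry into a Hugo content file. It is not the script used for this migration. The dictionary keys (title, anchor, pubtime, text) are assumed to mirror columns of the weblogentry table, the tag list is assumed to come from roller_weblogentrytag, and the choice of YAML front matter is also an assumption.

```python
import json
from datetime import datetime, timezone


def hugo_content(entry, tags):
    """Render one entry as a Hugo content file with YAML front matter.

    `entry` is a dict with assumed keys: title, anchor, pubtime (datetime), text.
    """
    front_matter = [
        "---",
        # JSON string quoting is also valid YAML double-quoted syntax.
        "title: " + json.dumps(entry["title"], ensure_ascii=False),
        "date: " + entry["pubtime"].isoformat(),
        "slug: " + json.dumps(entry["anchor"], ensure_ascii=False),
        "tags: " + json.dumps(sorted(tags), ensure_ascii=False),
        "---",
    ]
    return "\n".join(front_matter) + "\n\n" + entry["text"].strip() + "\n"


if __name__ == "__main__":
    sample = {
        "title": "Taming Roller's URL strategy",
        "anchor": "taming_roller_s_url_strategy",  # hypothetical anchor value
        "pubtime": datetime(2008, 3, 1, 12, 0, tzinfo=timezone.utc),  # made-up date
        "text": "<p>When I decided to start this blog...</p>",
    }
    print(hugo_content(sample, ["roller", "java"]))
```

In a real migration the dictionaries would be filled from database rows and each result written to a file under Hugo's content directory.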

Pros and cons

Hugo pros:

  • When I first started working through the Hugo Quickstart Guide, it seemed like a lot of steps. However, after playing around with it for a while, everything seems really easy and straightforward.
  • As expected, the resource usage is low. Hugo is a single, self-contained ~6MB binary. Since static pages are generated, there is no persistent resource usage.
  • Blog posts can be composed completely offline, and tested using Hugo's built-in web server. When ready, I can push the Git commit(s) and rebuild the site on the server. Rebuilding my blog from scratch takes about 200 milliseconds.

Hugo cons:

  • No built-in support for Atom feeds. (But it's easy to add via a template.)
  • It's not obvious how trackbacks would be implemented with Hugo.
  • Dynamic web apps have the luxury of providing the correct MIME type with every document that is delivered. Since Hugo is generating static files, I now rely on the web server to determine the MIME type based on filename extensions. This may be an obstacle to preserving some URL schemes when migrating to Hugo.[2] I ended up restructuring the web site to use Hugo-friendly URLs, and adding permanent redirects to map old Roller URLs to Hugo URLs.
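
The mapping from old to new URLs behind those permanent redirects can be sketched as a small function. Both URL schemes here are assumptions for illustration: the old pattern follows the Roller-style /<weblog>/entry/<anchor> layout, and the new /<weblog>/post/<anchor>/ layout is a hypothetical Hugo-friendly scheme that ends in a directory, so the web server can serve an index.html with the right MIME type. The real rules would live in the web server configuration.

```python
import re

# Assumed old scheme: /<weblog>/entry/<anchor>, with an optional trailing slash.
OLD_ENTRY = re.compile(r"^/(?P<weblog>[^/]+)/entry/(?P<anchor>[^/]+)/?$")


def redirect_target(old_path):
    """Return the new path for an old entry URL, or None if it is not one."""
    match = OLD_ENTRY.match(old_path)
    if match is None:
        return None
    # Hypothetical new scheme: a directory per post.
    return "/{weblog}/post/{anchor}/".format(**match.groupdict())


if __name__ == "__main__":
    for path in ("/my_blog/entry/hello_world", "/my_blog/feed/entries/atom"):
        print(path, "->", redirect_target(path))
```

A function like this is handy for generating or spot-checking a list of 301 rules before putting them on the server.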

Isso pros:

  • As a self-hosted solution, Isso avoids some of the privacy concerns that people have with third-party solutions such as Disqus.
  • Notification and moderation of comments via mail.

Isso cons:

  • Isso is a very simple, no-frills service. It accepts and regurgitates comments, but not much more.
  • There is no visible feedback when the guard rate-limits are hit, so the user doesn't receive any hint about why the comment is not being posted.
  • It doesn't seem practical to add the comment count below each entry on the main page, as I did with Roller.
  • I haven't figured out how to configure Isso to use my correct base URL in mail notifications, so I have to tweak the URLs when approving or deleting comments. (The host option seems to not be useful here.)
  • Isso seems to wake up every 500 milliseconds to do something, even when it is not being actively used:
    # strace -p 4890
    strace: Process 4890 attached
    select(5, [4], [], [], {0, 85673})      = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    
    Perhaps this is a function of Werkzeug. Despite the 500ms wakeup, the CPU utilization seems to be negligible.
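
One plausible source of the 500 ms timer, assuming Isso is being served by Werkzeug's built-in server: that server builds on Python's standard socketserver module, whose serve_forever() loop polls for shutdown requests with a default interval of half a second. The snippet below only inspects that default; it does not prove that this is what Isso is doing in this setup.

```python
import inspect
import socketserver

# Look up the default value of serve_forever()'s poll_interval parameter.
signature = inspect.signature(socketserver.BaseServer.serve_forever)
poll_interval = signature.parameters["poll_interval"].default

print("default poll_interval:", poll_interval, "seconds")
```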

Footnotes

  1. To be fair, there are probably lots of opportunities to tweak the parameters of Tomcat and Roller to tune the resource usage. Perhaps the JVM heap size and/or the size of internal caches could be adjusted. Also, memory usage of specific services on Linux can be notoriously difficult to determine. (Top is currently showing that resident memory usage of my Tomcat server is 286MB.)
  2. I suppose someone could manually add web server configuration rules to match URL patterns to the right MIME types, but this seems needlessly manual and brittle.
  3. Isn't it fun reading through all the footnotes?

Taming Roller's URL strategy

When I decided to start this blog, I installed the Roller 4.0 weblog software. Many different blogs can run in one instance of Roller, and the URLs for the blogs are arranged as subdirectories of a master Roller URL. For instance, if you installed Roller to be /roller, then your blogs might have URLs like /roller/my_blog, /roller/potato_farming_in_pocatello, and /roller/i_like_lettuce. That's fine for many uses, but I prefer to have more concise URLs. I'd like my blog to be referenced from the root of the web site, like /my_blog. Why should I have to conform to how Roller thinks I should set up my web site?

Configuring a web server to remap incoming URLs is a simple matter, and can be easily accomplished with tools like Apache's mod_rewrite. Such remapping techniques are well known and I won't bother going into detail about them here. However, what about outgoing URLs? With the rewrite rules in place, users can easily access the blog at /my_blog, but all the links on the page point back to the ugly URLs. It's a simple matter to redirect these in the web server so that they work, but it's nasty and wasteful to be constantly sending redirect messages, and besides... what would Googlebot think of such shenanigans?

I get stubborn about these sorts of things, so I decided to roll up my sleeves and see what was going on under the hood of Roller. It turns out that Roller provides a URLStrategy interface that can be swapped out programmatically with different implementations to provide different URL behaviors. Also, to my astonishment, Roller 4.0 is hooked together with Guice -- the lightweight dependency injection system developed at Google. I haven't worked with Guice, but I have used the Spring Framework's inversion of control library for dependency injection, so I know that these systems are built to make it easy to swap out components -- just what I'm looking for!

Unfortunately, there's no facility in Roller to configure the Guice dependency injection at runtime -- as far as I can tell, the Guice configuration is hard-wired in the code. (As opposed to an XML configuration file, as is common in Spring applications.) However, there is a runtime property to select the class that does the configuring -- the so-called Guice module. This means that to provide my own URLStrategy, I must supply at least two new classes: a custom replacement for Roller's JPAWebloggerModule class which is used to configure Guice, and the replacement for the MultiWeblogURLStrategy class which currently defines the URL behavior.

I start by configuring Roller to use my custom Guice module, by adding this line to my roller-custom.properties file:

guice.backend.module=com.davidsimmons.roller.CustomWebloggerModule

This property will configure Roller to use my CustomWebloggerModule class, which configures Guice with all the same bindings as Roller's own JPAWebloggerModule class does, except it binds my CustomURLStrategy class to the URLStrategy interface instead of the default MultiWeblogURLStrategy implementation. My CustomURLStrategy extends MultiWeblogURLStrategy, so it is almost the same, except that it knows about special weblogs that I want to be referenced from the root of the web site. For these weblogs, my custom class post-processes the URL to remove the first component of the URL path. Voilà! All my links now reflect the friendly version of the URL.
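
The post-processing step itself is just a small string transformation. The real class is Java and extends MultiWeblogURLStrategy; the sketch below shows only the path rewrite, in Python for brevity. The set of root-level weblog handles and the function name are hypothetical, and the sketch assumes the weblog handle is the second path component.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of weblog handles to be served from the site root.
ROOT_WEBLOGS = {"my_blog"}


def strip_first_component(url, root_weblogs=ROOT_WEBLOGS):
    """Drop the first path component (e.g. "roller") for selected weblogs."""
    parts = urlsplit(url)
    # "/roller/my_blog/entry/x" -> ["", "roller", "my_blog", "entry", "x"]
    segments = parts.path.split("/")
    if len(segments) >= 3 and segments[2] in root_weblogs:
        del segments[1]
        parts = parts._replace(path="/".join(segments))
    return urlunsplit(parts)


if __name__ == "__main__":
    print(strip_first_component("/roller/my_blog/entry/hello"))
    print(strip_first_component("/roller/i_like_lettuce/entry/hello"))
```

URLs for weblogs that are not in the set pass through unchanged, so other blogs in the same Roller instance keep their default links.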

There is one gotcha -- when I reparented the blog, important cookies stopped working because they were tied to the /roller path. To solve this for the JSESSIONID, I configured my Tomcat servlet container to use an empty path by including emptySessionPath="true" in the Connector attributes. There are a few more path-dependent cookies that Roller uses that may come back to bite me... we'll see.

The sample code is available here: simmons-customurl.tar.bz2