

Cursive: Writing terminal applications in Rust

As a learning exercise to sharpen my Rust programming skills, I recently toyed with writing a small program whose terminal-based user interface I built with the Cursive crate developed by Alexandre Bury. Cursive provides a high-level framework for building event-driven terminal applications using visual components such as menu bars, text areas, lists, and dialog boxes. Conceptually, developing with Cursive is one level of abstraction higher than using a library such as ncurses, which provides a rawer interface for managing screen contents and translating updates into the terminal's native language. In fact, Cursive defaults to using ncurses as one of several possible backends, and allows setting themes to customize various text colors and styles.

Why write terminal applications?

In today's software world, no one writes terminal applications expecting them to be a hit with the masses. Graphical applications (e.g. desktop apps or web apps) provide a uniquely intuitive interface model that allows users to quickly become productive with a minimal learning curve, offer a high-bandwidth flow of information to the user, and remain the only reasonable solution for many problem categories. Many applications would simply not be possible or practical without a GUI. However, terminal programs can find a niche audience in technical users such as software developers and system administrators who are often in need of utilities that are a bit more two-dimensional than the command line's standard input and output, but retain the flexibility to be easily used remotely or on devices of limited capability.

Also, terminal apps are often extremely fast — fast enough to maintain the illusion of the computer being an extension of the mind. I find it frustrating that in 2017 I still spend plenty of time waiting for the computer to do something. Occasionally even typing into a text field in a web browser is laggy on my high-end late-model iMac. For every extra cycle the hardware engineers give us, we software engineers figure out some way to soak it up.

The terminal is not for everyone, but lately I've found it's the one environment that is instantaneous enough that my flow is not thrown off. For kicks, I recently installed XUbuntu on a $150 ARM Chromebook with the idea of mostly just using the terminal (and having a throwaway laptop that I'm not scared to use on the bus/train). I expected to mostly be using it as a dumb terminal to ssh into servers, but to my surprise, it has actually proven to be very capable at performing a wide range of local tasks in the terminal with good performance.

The Cursive framework

Anyone who has developed software with a GUI toolkit (e.g. Windows, GTK+, Java Swing, Cocoa, etc.) will find most Cursive concepts to be very familiar. Visual components are called "views" (some toolkits use the terms "widget" or "control" for the same concept), and are installed into a tree which is traversed when rendering. Some views may contain child views and are used for layout (e.g. BoxView and LinearLayout), while others are used as leaf nodes that provide information or interact with the user (e.g. Button, EditView, TextView, SliderView, etc.). Cursive can maintain multiple view trees as "screens" which can be switched between. Each screen's view tree has a StackView as the root element, whose children are subtree "layers" that can be pushed and popped.

Cursive provides an event model where the main program invokes Cursive::run() and the Cursive event loop will render views and dispatch to registered callbacks (typically Rust closures) as needed until Cursive::quit() is called, at which time the event loop exits. Alternately, the main program may choose to exercise more control by calling Cursive::step() as needed to perform a single iteration of input processing, event dispatch, and view rendering. Key events are processed by whichever input view currently has focus, and the user may cycle focus using the tab key.
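
For instance, here's a minimal sketch of the manual-stepping approach, assuming the same Cursive API used in the examples later in this post:

extern crate cursive;

use cursive::Cursive;
use cursive::views::TextView;

fn main() {
    let mut cursive = Cursive::new();
    cursive.add_layer(TextView::new("Press q to quit."));
    cursive.add_global_callback('q', |c| c.quit());

    // Drive the event loop manually instead of calling Cursive::run().
    // Each step() performs one round of input handling, event dispatch,
    // and rendering.
    while cursive.is_running() {
        cursive.step();
    }
}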

Referencing views

Cursive diverges from other UI toolkits with respect to referencing views. In many environments, we would simply store references or pointers to any views that we need to reference later, in addition to whatever references are needed internally by the view tree to form the parent-child relationships. However, Rust's strict ownership model requires us to be very explicit about how we allow multiple references to the same memory.

After the main program instantiates and configures a view object, it generally adds it to the view tree by making it the child of an existing view (e.g. LinearLayout::add_child()) or adding it to a screen's StackView as a layer. Rust ownership of the object is moved at that time, and it is no longer directly accessible to the main program.

To access specific views after they have been integrated into a view tree, views may be wrapped in an IdView via .with_id(&str) which allows them to be referenced later using the provided string identifier. A borrowed mutable reference to the wrapped view may be retrieved with Cursive::find_id() or a closure operating on the view may be invoked with Cursive::call_on_id(). Under the hood, these methods provide interior mutability by making use of RefCell and its runtime borrow checking to provide the caller with a borrowed mutable reference.

The following code demonstrates how views can be referenced by providing a callback which copies text from one view to the other:

extern crate cursive;

use cursive::Cursive;
use cursive::event::Key;
use cursive::view::*;
use cursive::views::*;

fn main() {
    let mut cursive = Cursive::new();

    // Create a view tree with a TextArea for input, and a
    // TextView for output.
    cursive.add_layer(LinearLayout::horizontal()
        .child(BoxView::new(SizeConstraint::Fixed(10),
                            SizeConstraint::Fixed(10),
                            Panel::new(TextArea::new()
                                .content("")
                                .with_id("input"))))
        .child(BoxView::new(SizeConstraint::Fixed(10),
                            SizeConstraint::Fixed(10),
                            Panel::new(TextView::new("")
                                .with_id("output")))));
    cursive.add_global_callback(Key::Esc, |c| {
        // When the user presses Escape, update the output view
        // with the contents of the input view.
        let input = c.find_id::<TextArea>("input").unwrap();
        let mut output = c.find_id::<TextView>("output").unwrap();
        output.set_content(input.get_content());
    });

    cursive.run();
}

Early in my exploration of Cursive, this method of accessing views proved to be somewhat challenging: fetching references to two views in the same lexical scope would result in BorrowMutError panics, because the internals of the second find_id() would try to mutably borrow the first view again while traversing the tree. Cursive's view lookup code has since been adjusted so that this is no longer an issue.
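
The panics come from RefCell's runtime borrow checking rather than from the compiler. Here's a minimal standalone sketch of the same failure mode (plain std Rust, not Cursive's actual internals):

use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(String::from("view"));

    // Holding one mutable borrow...
    let _first = cell.borrow_mut();

    // ...means a second mutable borrow fails at runtime. With borrow_mut()
    // this would panic with a BorrowMutError; try_borrow_mut() surfaces it
    // as an Err instead.
    assert!(cell.try_borrow_mut().is_err());
}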

Model-View-Controller

While developing a full application, I quickly ran into BorrowMutError panics again. With application logic tied to my custom view implementations, and some such code needing to call methods on other custom views, inevitably some code would need to mutably borrow a view that was already borrowed somewhere further up the stack.

My solution was to completely decouple UI concerns from the application logic, resulting in something along the lines of the well-known Model-View-Controller (MVC) design pattern. A Ui struct encapsulates all Cursive operations, and a Controller struct contains all application logic. Each struct contains a message queue which allows one to receive messages sent by the other. These messages are simple enums whose variants may contain associated data specific to the message type.

Instead of calling Cursive::run(), the controller will provide its own main loop where each iteration will operate as follows:

  1. The controller main loop will call Ui::step().
  2. The Ui::step() method will process any messages that the controller may have added to its message queue. These messages allow the controller to change the UI state in various ways.
  3. The Ui::step() method will then step the Cursive UI with Cursive::step(). Cursive will block until input is received. Any pending UI events will be processed and any registered callbacks will be executed. Callbacks may result in messages being posted to the controller's message queue (for example, the contents of a dialog box's form).
  4. The controller main loop will then process any messages that the UI may have added to its message queue. The controller may perform tasks related to these messages, and optionally post messages to the UI's message queue to indicate the outcome.

This scheme worked great for my needs where it's okay for the program to completely block while waiting for user input.

For the message queue, I used Rust's std::sync::mpsc (multi-producer, single consumer FIFO queue), which provides a convenient way for different code components to own a cloned Sender object which inserts elements into a shared queue. The use of mpsc is really overkill for the single-threaded applications I was working with, since any thread synchronization work being performed is wasted.

Here's an example of adapting the above text copy program to such an MVC model. It's admittedly much lengthier.

extern crate cursive;

use cursive::Cursive;
use cursive::event::Key;
use cursive::view::*;
use cursive::views::*;
use std::sync::mpsc;

pub struct Ui {
    cursive: Cursive,
    ui_rx: mpsc::Receiver<UiMessage>,
    ui_tx: mpsc::Sender<UiMessage>,
    controller_tx: mpsc::Sender<ControllerMessage>,
}

pub enum UiMessage {
    UpdateOutput(String),
}

impl Ui {
    /// Create a new Ui object.  The provided `mpsc` sender will be used
    /// by the UI to send messages to the controller.
    pub fn new(controller_tx: mpsc::Sender<ControllerMessage>) -> Ui {
        let (ui_tx, ui_rx) = mpsc::channel::<UiMessage>();
        let mut ui = Ui {
            cursive: Cursive::new(),
            ui_tx: ui_tx,
            ui_rx: ui_rx,
            controller_tx: controller_tx,
        };

        // Create a view tree with a TextArea for input, and a
        // TextView for output.
        ui.cursive.add_layer(LinearLayout::horizontal()
            .child(BoxView::new(SizeConstraint::Fixed(10),
                                SizeConstraint::Fixed(10),
                                Panel::new(TextArea::new()
                                    .content("")
                                    .with_id("input"))))
            .child(BoxView::new(SizeConstraint::Fixed(10),
                                SizeConstraint::Fixed(10),
                                Panel::new(TextView::new("")
                                    .with_id("output")))));

        // Configure a callback
        let controller_tx_clone = ui.controller_tx.clone();
        ui.cursive.add_global_callback(Key::Esc, move |c| {
            // When the user presses Escape, send an
            // UpdatedInputAvailable message to the controller.
            let input = c.find_id::<TextArea>("input").unwrap();
            let text = input.get_content().to_owned();
            controller_tx_clone.send(
                ControllerMessage::UpdatedInputAvailable(text))
                .unwrap();
        });
        ui
    }

    /// Step the UI by calling into Cursive's step function, then
    /// processing any UI messages.
    pub fn step(&mut self) -> bool {
        if !self.cursive.is_running() {
            return false;
        }

        // Process any pending UI messages
        while let Some(message) = self.ui_rx.try_iter().next() {
            match message {
                UiMessage::UpdateOutput(text) => {
                    let mut output = self.cursive
                        .find_id::<TextView>("output")
                        .unwrap();
                    output.set_content(text);
                }
            }
        }

        // Step the UI
        self.cursive.step();

        true
    }
}

pub struct Controller {
    rx: mpsc::Receiver<ControllerMessage>,
    ui: Ui,
}

pub enum ControllerMessage {
    UpdatedInputAvailable(String),
}

impl Controller {
    /// Create a new controller
    pub fn new() -> Result<Controller, String> {
        let (tx, rx) = mpsc::channel::<ControllerMessage>();
        Ok(Controller {
            rx: rx,
            ui: Ui::new(tx.clone()),
        })
    }
    /// Run the controller
    pub fn run(&mut self) {
        while self.ui.step() {
            while let Some(message) = self.rx.try_iter().next() {
                // Handle messages arriving from the UI.
                match message {
                    ControllerMessage::UpdatedInputAvailable(text) => {
                        self.ui
                            .ui_tx
                            .send(UiMessage::UpdateOutput(text))
                            .unwrap();
                    }
                };
            }
        }
    }
}

fn main() {
    // Launch the controller and UI
    let controller = Controller::new();
    match controller {
        Ok(mut controller) => controller.run(),
        Err(e) => println!("Error: {}", e),
    };
}

Miscellaneous notes

  • Cursive is very much a work in progress and there are still some rough edges to be worked out. However, Alexandre Bury is lightning fast at responding to bug reports and fixing issues. One recent issue I filed went from report to patch to commit in 14 minutes.
  • It's unclear how you would develop a lightweight single-threaded program that uses reactor-style asynchronous I/O dispatch. For example, a central select() loop which dispatches stdin/stdout events to Cursive, network socket events to other code, and so on. (I'm not even sure if backends such as ncurses would support this.)
  • I'm also not sure how I would go about structuring a multi-threaded application where the UI needs to process events from other threads. Cursive does provide a Cursive::set_fps() method which, in conjunction with Cursive::cb_sink(), can poll for new events at specified time intervals. But I've always preferred a purely event-driven design for such things instead of needlessly burning cycles periodically while waiting. (Again, there may be complications at the ncurses layer.)
  • Cursive wants callback closures to have a 'static lifetime, which can lead to some Rust puzzles if you'd like to access non-static, non-owned items from within closures. This may be inevitable, and the issue mostly goes away with the MVC decoupling technique mentioned above. (See the sketch just below this list.)
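
To illustrate that last point, here's a minimal sketch (plain Rust, no Cursive) of the usual workaround: move an owned handle, such as a cloned Rc or Sender, into the closure so it satisfies the 'static bound without borrowing a local:

use std::rc::Rc;

// A stand-in for an API (like Cursive's callback registration) that
// requires a 'static closure.
fn register<F: Fn() + 'static>(callback: F) {
    callback();
}

fn main() {
    let state = Rc::new(String::from("shared state"));

    // Borrowing `state` directly would not satisfy 'static; moving an
    // owned clone into the closure does.
    let state_clone = Rc::clone(&state);
    register(move || println!("callback sees: {}", state_clone));

    println!("main still owns: {}", state);
}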

As a learning exercise, I wrote a Cursive-based interface to UPM password manager databases. However, nobody should use it for reasons outlined in its README.

Migrating from Apache Roller to Hugo and Isso

After almost ten years of using Apache Roller to power this blog, I'm making the leap to the Hugo static site generator. Roller served me well, but after years of watching Roller+Tomcat use hundreds of megabytes of memory on my server, I decided that it was overkill for my needs.1 The only major feature which absolutely demands dynamically generated pages is the comments, and I've migrated that functionality to a distinct service using Isso.

My goals are:

  • Reduce the server resource usage.
  • Allow blog posts to be created, managed, and revisioned using the same tools I use to manage software projects — Vim, Git, etc.
  • Reduce the deployment effort of the server-side software. (Like many Go apps, Hugo is a single statically linked binary with no dependencies to worry about.)

For my own future benefit, I'm providing my notes on this migration below.

Migration steps

Most of the migration process was straightforward, although a bit time consuming. I unfortunately don't have a "roller-to-hugo" script that magically converts a blog, as several categories of content assets needed to be manually adapted to the Hugo way of doing things. I can outline the basic steps, though:

  1. Create a Hugo theme. I wanted the blog to look and feel the same in Hugo as it did in Roller, so I needed to create a custom Hugo theme. I was already using custom stylesheets and Velocity templates on Roller, so simply extracting these assets from the database's webpage table into files got me most of the way there. I then needed to touch up the files to convert Velocity markup into Go templates and adapt the pagination scheme.
  2. Port blog posts. Roller stores blog entries in the weblogentry table, and their associated tags in the roller_weblogentrytag table. I wrote a one-off Python script to create Hugo content files out of this data.
  3. Port static content. This was simply a matter of finding the Roller resources directory in the filesystem, and copying it to the Hugo static/resources directory.
  4. Port RSS and Atom feeds. The current version of Hugo does not have built-in support for Atom feeds, so I used a template-generated solution as described here. I also needed to update the <head><link ... /></head> references in my templates to point to the new feed URLs.
  5. Port comments. I used the Isso comment server to support comments. Isso works similarly to Disqus, except it is self-hosted.
    1. Install Isso in a Docker container for isolation and ease of management.
    2. Map the blog's /isso URLs to Isso.
    3. Add the client HTML bits to inject the Isso comments into pages.
    4. Configure Isso: Basic configuration (dbpath, host, [server].listen), logging, SMTP notifications, moderation, and guard settings (rate limits).
    5. Import comments. I wrote a one-off Python script to import comments, paying careful attention to properly initialize the voters Bloom filter bitmask in Isso's SQLite database.
  6. Map old Roller URLs to Hugo URLs. I configured some 301 (permanent) redirects on my web server so that existing links to blog posts and feed URLs will continue to work.

I have a few ugly Python scripts for migrating data from Roller to Hugo/Isso, but I'll hold off on posting them unless someone really wants to see them.

Pros and cons

Hugo pros:

  • When I first started working through the Hugo Quickstart Guide, it seemed like a lot of steps. However, after playing around with it for a while, everything seems really easy and straightforward.
  • As expected, the resource usage is low. Hugo is a single, self-contained ~6MB binary. Since static pages are generated, there is no persistent resource usage.
  • Blog posts can be composed completely offline, and tested using Hugo's built-in web server. When ready, I can push the Git commit(s) and rebuild the site on the server. Rebuilding my blog from scratch takes about 200 milliseconds.

Hugo cons:

  • No built-in support for Atom feeds. (But it's easy to add via a template.)
  • It's not obvious how trackbacks would be implemented with Hugo.
  • Dynamic web apps have the luxury of providing the correct MIME type with every document that is delivered. Since Hugo is generating static files, I now rely on the web server to determine the MIME type based on filename extensions. This may be an obstacle to preserving some URL schemes when migrating to Hugo.2 I ended up restructuring the web site to use Hugo-friendly URLs, and adding permanent redirects to map old Roller URLs to Hugo URLs.

Isso pros:

  • As a self-hosted solution, Isso avoids some of the privacy concerns that people have with third-party solutions such as Disqus.
  • Notification and moderation of comments via mail.

Isso cons:

  • Isso is a very simple, no-frills service. It accepts and regurgitates comments, but not much more.
  • There is no visible feedback when the guard rate-limits are hit, so the user doesn't receive any hint about why the comment is not being posted.
  • It doesn't seem practical to add the comment count below each entry on the main page, as I did with Roller.
  • I haven't figured out how to configure Isso to use my correct base URL in mail notifications, so I have to tweak the URLs when approving or deleting comments. (The host option seems to not be useful here.)
  • Isso seems to wake up every 500 milliseconds to do something, even when it is not being actively used:
    # strace -p 4890
    strace: Process 4890 attached
    select(5, [4], [], [], {0, 85673})      = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    select(5, [4], [], [], {0, 500000})     = 0 (Timeout)
    
    Perhaps this is a function of Werkzeug. Despite the 500ms wakeup, the CPU utilization seems to be negligible.

Footnotes

  1. To be fair, there are probably lots of opportunities to tweak the parameters of Tomcat and Roller to tune the resource usage. Perhaps the JVM heap size and/or the size of internal caches could be adjusted. Also, memory usage of specific services on Linux can be notoriously difficult to determine. (Top is currently showing that resident memory usage of my Tomcat server is 286MB.)
  2. I suppose someone could manually add web server configuration rules to match URL patterns to the right MIME types, but this seems needlessly manual and brittle.
  3. Isn't it fun reading through all the footnotes?

Reinventing software for security; or: The woodpeckers are coming.

The past couple of years have been tough for digital security. A few disasters and near-disasters include:

  • Heartbleed, a buffer over-read vulnerability in OpenSSL allowing unauthorized remote access to data which may contain private keys.
  • Shellshock, an issue with Bash allowing remote code execution in many varied scenarios.
  • A bug in Microsoft's SSL/TLS library (Schannel) allowing remote code execution.
  • POODLE, a flaw in the SSLv3 protocol that an attacker can leverage on many connections by forcing a protocol downgrade, or relying on certain flaws in TLS implementations.
  • Attackers' increasing boldness in targeting networks for financial gain (Target, Home Depot) or cybervandalism (Sony Pictures), resulting in hundreds of millions — or perhaps even billions — of dollars in damages.
  • A rising awareness of state-sponsored attacks, from actors such as the NSA (Regin malware), the UK's GCHQ (Belgacom attack), and North Korea (alleged perpetrator of the Sony Pictures attack).

How did our infrastructure become so fragile? How did the miracles of technology turn against us? Who is responsible for this? Regrettably, my fellow software engineers and I are largely responsible. Together, we have created this frightening new world where people's property, finances, and privacy are at risk.

“If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.” — Gerald Weinberg

Weinberg's famous quote about software quality points out a lack of rigor that has been evident in the software industry for decades. In the 1970's and 1980's, the nascent Internet was a more civilized place, similar to a small town where people felt comfortable leaving their front doors unlocked. Accordingly, we built software with little consideration of security. Unencrypted communication protocols like telnet would happily share your passwords with any eavesdropper, and lax security in other network services would eventually expose unexpected attack modes that were perhaps obvious only in hindsight. In the 1990's and 2000's, we wised up with better encryption, authentication, authorization, and recognition of security as an explicit engineering goal. (As far as I can tell, the first RFC with a dedicated “Security Considerations” section was RFC 1060 from March 1990.)

However, although we managed to lock the front door, we left our systems vulnerable in many other ways. Memory safety errors, unexpected consequences emerging from complexity, and numerous mundane code correctness issues provided attackers with a seemingly endless toolkit for compromising systems.

Many other engineering disciplines have the benefit of hundreds or thousands of years of accumulated wisdom that have resulted in highly refined tools and methods. Designing bridges or buildings, for example, is a well-understood process. We've only been developing software for about 60 years, and only been developing software at a large scale for maybe 30-40 years. Our field is very much still in its infancy: our tools are sorely lacking, our methods tend to be ad-hoc, and lack of experience leads us to be overconfident in our ability to produce correct code. Our products often fail to provide the basic functions expected by the user, much less withstand attacks by a thinking, creative adversary. It pains me that we've let down our employers, customers, and users by producing such flawed products.

Software development must be reinvented. We need better tools and methods to build more reliable software, and an environment that values security and rewards engineers and companies for producing such software. These things are easier said than done, and I don't have all the solutions. I do know that it's time to start working on solutions. The threat level is not going down any time soon. In fact, I expect it to rise with our increased reliance on software systems and as recent high-profile attacks show the world's miscreants just how vulnerable we are.

The woodpeckers are coming.

Limitations of defensive technology

The industry's solution is to double down on defensive technology — malware scanners, firewalls, intrusion detection appliances, and similar systems. While these play an important role, it is increasingly difficult for defensive systems to shoulder the entire burden of security while an army of software engineers continues to supply a never-ending fountain of vulnerabilities. Firewalls become less effective as more software integrates firewall-bypassing communication channels with cloud services, attackers seek to exploit flaws in such software, and malware is distributed out-of-band. Malware scanners especially face tough challenges as fully metamorphic viruses are already extremely difficult to detect, and likely have a lot more opportunities for improvement than the scanners have options for improving detection.

Ultimately, software engineers are able to create security problems much faster than producers of defensive products can figure out ways to contain them. We must stop thinking of security in terms of band-aids, and address the source of the problem by developing software that is secure by design.

Attacking attack vectors with better tools and methods

We can broadly divide the attack universe into two categories:

  • Software engineering attack vectors. This includes programming issues such as memory safety and code correctness, and system design issues dealing with authentication schemes, cryptosystems, protocols, complexity management, and the user experience.
  • Other attack vectors found in system administration, configuration, networking, wiring, physical side channel emissions, passwords, social engineering, operational security, and physical security.

As a software engineer interested in improving software engineering, I'm focused on the former category. Examining a few of the recent high-profile vulnerabilities is useful for thinking about how we can approach certain attack vector categories.

Heartbleed and memory safety

“Whenever I go to debian.org and look at the latest security fixes, the vast majority of them involve memory safety issues, which only appear in unsafe languages such as C and C++.”
user54609, Information Security Stack Exchange

Memory safety issues are behind a huge chunk of vulnerabilities, such as OpenSSL's Heartbleed. Much security-sensitive code is written in low-level languages because we seek performance, minimal memory footprint, minimal dependencies, interoperability, and sometimes fine-grain control over execution. This is especially true for cryptography, where we'd like the CPU overhead to be as close to zero as possible, and avoid potential timing attacks that could arise from high-level language execution. However, developing complex systems in C and C++ can require a superhuman level of attention to detail to avoid memory errors, and even the most capable programmers seem to let such errors slip through on occasion. Although techniques exist to help minimize such errors (e.g. C++ smart pointers), it may not be possible to develop a large, complex C/C++ program with a high assurance of correct memory usage.

Fortunately, there has been much interest lately in developing new low-level languages with memory safety assurances. My favorite of these is currently Rust, which promises zero-cost memory safety by requiring that the programmer adhere to a certain memory management discipline. Rust is the most promising step toward reinventing software that I see today. If our critical low-level infrastructure was written in Rust instead of C/C++, we would be far more secure. Heartbleed would not have happened if OpenSSL was written in Rust.
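
As a minimal illustration (not OpenSSL's actual code), here's the shape of a Heartbleed-style over-read in safe Rust: an attacker-supplied length simply cannot reach beyond the buffer, because slicing is bounds-checked.

fn main() {
    // The peer sends a 4-byte payload but claims it is 64KB long.
    let payload: &[u8] = b"ping";
    let claimed_len: usize = 64 * 1024;

    // A checked slice returns None instead of reading adjacent memory...
    assert!(payload.get(..claimed_len).is_none());

    // ...and an unchecked-looking index would panic immediately rather
    // than silently leaking heap contents:
    // let leaked = &payload[..claimed_len]; // panics: out of range
}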

Rust is still a work in progress, can be difficult to use, and even a fully mature Rust may not be the final solution. Other new languages also have merit. The Go programming language looks promising and is quite a bit more mature than Rust. However, Go's mandatory garbage collection may exclude it from certain applications, such as operating system kernels, real-time tasks, or possibly cryptography. (It's not clear to me if garbage collection can contribute to timing side channels in cipher implementations. I'd love to see some research on this.)

When it comes to memory safety bugs, the path ahead is refreshingly clear: new high-performance, low-level programming languages that prevent these bugs from happening. Unfortunately, general solutions for other classes of bugs remain murky.

Shellshock and emergent vulnerabilities

"So who's to blame? Everybody and nobody. The system is so complex that unwanted behaviours like these emerge by themselves, as a result of the way the components are connected and interact together. There is no single master architect that could've anticipated and guarded against this."
Senko Rasic on Shellshock

The Shellshock vulnerability in Bash is a great example for reminding us that some threats can be created even with the most logically consistent and memory-safe code. Writing Bash in a rigorous language such as Rust would not have prevented Shellshock from happening, nor would any amount of static analysis have revealed the problem. Shellshock arises from a feature added to Bash in 1992 for passing shell functions to child Bash processes using environment variables. The feature seems to be implemented by passing the environment variable's value directly to Bash's interpreter, as commands provided after the close of the function definition will be parsed and executed immediately. This probably seemed like a reasonable feature in 1992, but it became a devastating vulnerability when Bash became the glue tying network services to scripts (e.g. web servers to CGI scripts, or DHCP clients to hook scripts), and environment variables could suddenly contain hostile payloads, thus providing remote code execution to external parties.

It would have been nice if the troublesome feature halted interpretation at the end of the function definition, but even provisioning functions from environment variables was something that network service developers could not have anticipated. Indeed, they probably didn't anticipate the use of Bash at all — they were merely passing data to a child process in a generic fashion, and the use of Bash was often simply a result of how the system administrator or the distribution maintainer connected the pieces. Thus, Shellshock falls into an elusive category of emergent vulnerabilities that can arise in complex systems.

This class of vulnerability is particularly disturbing since most software is built around the idea of reusable modules of code, many of which may be supplied by external vendors, and connected in a vast number of combinations. We need engineering methods for dealing with this complexity, but I'm not sure exactly what these would be. Perhaps interface definitions between software components could make formal guarantees about how the passed data will be used.

Apple's “goto fail” bug and code correctness

Apple's goto fail bug, revealed in February 2014, prevented signature verification from happening properly in TLS handshakes, thus allowing man-in-the-middle attacks. The cause of the bug was a minor typo in the source code which led to unintended behavior. The program was incorrect — its behavior did not match its specification. Incorrect code can be produced by even the very best programmers, since these programmers are human beings and will occasionally make human mistakes.

Mike Bland believes that the “goto fail” bug could have been avoided by promoting a unit test culture, and Adam Langley suggests code reviews. These are both great ideas, especially for such critical code. However, I wonder if there are ways we can avoid creating these errors to begin with, instead of hoping to catch them later in a mop-up phase. Would use of functional languages like Haskell help us better express our intentions? Could formal methods and formal specifications be useful for catching such implementation errors?

POODLE and the trouble with cryptographic protocols and implementations

The POODLE attack revealed in September 2014 allows attackers to target secure connections protected with correct SSL 3.0 implementations, or TLS implementations with certain coding errors. (Although SSL 3.0 is 18 years old and seldom used in normal operation, this is still quite concerning as an attacker can use a forced downgrade attack to cause an SSL 3.0 session to be negotiated.) This reminds us that bugs can exist in protocols themselves, and cryptography can be enormously difficult to implement correctly. It's not good enough for cryptography implementations to properly encode and decode — to be secure, they must be mindful of a long list of small details involving parsing, padding, execution time (to avoid timing side channels), proper use of random number generators, and many more.

The best bits of advice I've heard about implementing cryptography are:

  • Practice extreme humility — overconfidence is the enemy of security. Know that no matter how good you are, your fresh cryptographic code is likely to have subtle problems.
  • Reuse existing cryptographic code modules whenever possible, preferably modules that have been audited, rigorously tested, and battle-hardened through their production use. As full of holes as OpenSSL is thought to be, it is probably more secure than whatever you would write to replace it. Better yet, consider opinionated toolkits such as the Sodium crypto library.
  • Seek expert assistance from professional cryptographers and security experts, when possible. There are people out there who have made it their life's work to study cryptography and its practical use, although they are probably not cheap.
  • Commission third-party security audits. When we programmers look at the same body of code for weeks at a time, we often lose the ability to view it critically. Fresh eyes can be invaluable.

The best engineering improvement I can think of is the use of domain-specific languages to specify protocols and algorithms, as this may help avoid the pitfalls of implementing cryptography in general purpose languages. I'm encouraged by projects such as Nick Mathewson's Trunnel, a binary parser generator for protocols.

Economics of secure software

“It's a valid business decision to accept the risk [of a security breach]... I will not invest $10 million to avoid a possible $1 million loss.”
— Jason Spaltro, senior vice president of information security, Sony Pictures, in a 2007 interview with CIO.

From individual consumers to the largest companies, security often seems to be valued rather low. Mr. Spaltro's unfortunate cost-benefit analysis has been mentioned often in the days since the devastating Sony Pictures attack was made public. However, I doubt his thinking was too far out of line with others at the time. In most organizations, information technology is a cost center that does not directly contribute to the bottom line, so it's understandable that companies would seek to minimize its expense. There is probably considerable temptation to underestimate the cost of breaches. This is regrettable, as even with improved engineering tools and methods, the financial investment needed to develop, audit, and deploy improved software may be quite large. I suspect companies such as Sony, Target, and Home Depot now have a better understanding of risks and may be willing to invest more money into security. Hopefully some of their security budget will include software better engineered for security, whether supplied by external vendors or developed in-house. In the end, it may take hundreds of billions or even trillions of dollars to rebuild our software foundations.

One great puzzle is figuring out how to fund the development and auditing of open-source software. Much of the technology we use every day relies on various open-source software modules under the hood, and our security relies on these modules being reliable. Additionally, the inherent auditability of open-source software makes it important for resisting attempts by governments to weaken security by coercing companies to include intentional flaws in their software. Of course, simply being open-source does not automatically make software more trustworthy. Being open-source is necessary but not sufficient. There is not an army of bored software engineers browsing through GitHub projects looking for flaws because they think it's a fun way to spend a Saturday night. With the right funding, though, we can pay qualified experts to conduct thorough audits.

I'm highly encouraged by the efforts of several groups to help fund audits and other security investigations, whether their motivations arise from their reliance on the security of the targeted software, positive public relations, self-promotion, or something else entirely. For example, the Open Crypto Audit Project is funding the necessary auditing of critical open-source projects. Although their visible efforts to date have been limited to a crowdfunded audit of TrueCrypt, Kenneth White spoke at last summer's DEFCON about their intention to begin an audit of OpenSSL funded by the Linux Foundation's Core Infrastructure Initiative, which itself is funded by a long list of big names such as Google, Intel, Microsoft, and Amazon. Such investment from stakeholders to fund security audits seems like a very reasonable approach. Likewise, Google's Project Zero is a team of security researchers tasked with improving the security of all commonly used software. Even some security consultancies are finding the time for pro bono investigations, such as with the Cryptography Services effort.

I'm optimistic about the improvement of many classes of software being driven by increased demand from businesses. Selling end users on the idea of paying for security may be a much tougher challenge in a market dominated by free advertiser-sponsored software and services (e.g. mobile apps, popular web sites, etc.). We have much more work ahead of us to construct a workable value proposition for this market.

Conclusion

Looking at the current state of software security and the harm of recent attacks can be a bit of a downer, but I remain optimistic that we can fix many of the problems with better engineering and better funding. What can we do to push the state of software engineering forward and create a more secure world?

  • Study new programming languages built for better memory safety without sacrificing high performance. Think about which critical software modules might be best suited for implementation in these languages, and which can be implemented in high-level languages. If you must use C++, learn the latest techniques for helping improve memory safety.
  • Develop new abstractions that may improve software reliability, such as generators for protocol handlers and cryptography algorithms.
  • Think about engineering methods that may improve code correctness, and how they can be applied to existing software development processes.
  • Develop funding mechanisms and more compelling end-user value propositions so that software engineers working on better security can be rewarded by those who value it.

I'd love to hear about any ideas you may have about making the world's software infrastructure more resilient to attack.

Battery cost of periodic mobile network use

As the consumer electronics revolution brings more and more of the digital world to handheld devices, the chief constraint developers often face is not bandwidth or CPU cycles, but rather battery life. Since many next-generation applications require creative use of the network, I decided to run a few tests to discover the true battery cost of network use in certain scenarios.

I have an interest in mesh overlay networks, and I'm curious about the cost of mesh maintenance on power-constrained devices. Therefore, these tests explore the use of relatively small network transactions performed at regular intervals. Mobile devices are known to be optimized for aggregated time-adjacent traffic, with the radio wake-up cost leading to a low ROI for small (e.g. one IP packet) transmissions. This could unfortunately be bad news for mesh maintenance, where lots of small transmissions are spread out in time.

Ilya Grigorik's High Performance Browser Networking (O'Reilly Media, 2013) goes into detail about these issues in Chapter 7 "Mobile Networks" and Chapter 8 "Optimizing for Mobile Networks". Some key insights from this work include:

  • "The "energy tails" generated by the timer-driven state transitions make periodic transfers a very inefficient network access pattern on mobile networks."
  • "Radio use has a nonlinear energy profile with respect to data transferred."
  • "Intermittent network access is a performance anti-pattern on mobile networks..."

Test methodology

Earlier this year when I was switching carriers, I found myself with a spare Android handset with LTE service enabled. Seizing the opportunity of having an activated handset that's not saddled with my usual array of chatty apps (mail, Twitter, etc.), I ran a series of tests measuring battery drain in various controlled network conditions.

The handset under test was a Samsung Galaxy Nexus running Android 4.3, equipped with the factory supplied 1850mAh battery. The network connection was provided by Sprint's LTE network. To reduce the amount of unintentional background network traffic, the device was reset to factory defaults and not associated with any Google account.

I developed an app to perform network traffic at a specified interval and record the battery level and network counters each minute. This app sends a 1400-byte UDP packet as an echo request to a cloud server, where a small Python script verifies the authenticity of the request and returns a 1400-byte UDP echo response packet. In this fashion, network traffic should be roughly balanced between upload and download, minus any occasional packet loss.
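
The actual test client was an Android app and the server was a small Python script, but the shape of the exchange is easy to sketch. Here's a hypothetical, self-contained Rust version (loopback addresses, no authentication check) purely to illustrate the 1400-byte periodic echo pattern:

use std::net::UdpSocket;
use std::thread;
use std::time::Duration;

const PAYLOAD_LEN: usize = 1400;

fn main() -> std::io::Result<()> {
    // Echo "server": return each datagram to its sender unchanged.
    let server = UdpSocket::bind("127.0.0.1:9999")?;
    thread::spawn(move || {
        let mut buf = [0u8; PAYLOAD_LEN];
        loop {
            if let Ok((len, peer)) = server.recv_from(&mut buf) {
                let _ = server.send_to(&buf[..len], peer);
            }
        }
    });

    // Client: send one 1400-byte echo request per interval and wait
    // for the matching 1400-byte response.
    let client = UdpSocket::bind("127.0.0.1:0")?;
    client.connect("127.0.0.1:9999")?;
    let request = [0u8; PAYLOAD_LEN];
    let mut response = [0u8; PAYLOAD_LEN];
    for _ in 0..3 {
        client.send(&request)?;
        let len = client.recv(&mut response)?;
        println!("received {}-byte echo response", len);
        thread::sleep(Duration::from_secs(8)); // e.g. an 8-second interval
    }
    Ok(())
}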

To judge the overall battery usage for an individual test with a specific network transaction frequency, I measured the time elapsed while the battery drained from 90% to 30%. (Battery usage was seen to have some aberrations above 90% and below 30%, so such data was discarded for the purpose of calculating drainage times.)

Caveats

This is not a fully controlled laboratory test or representative of a broad range of devices and networks, but rather a "best-effort only" test using the equipment at hand. Thus, it's important to keep in mind a number of caveats:

  • The LTE signal strength is not guaranteed to be constant throughout the test. I tried to minimize the variation by always performing tests with the handset in the same physical location and orientation, but there are many factors out of my control. A lower signal strength requires the radio to transmit with higher power to reach the tower, so this could add noise to the data.
  • LTE is something of a black box to me, so any peculiarities of the physical and link layers are not taken into account. For example, are there conditions that may prompt the connection to shift to a different band with different transmit power requirements?
  • Other wireless providers may use LTE in different frequencies or configurations which may affect the battery usage in different ways.
  • Android's background network traffic could not be 100% silenced, and I did not go to extraordinary lengths to track down every last built-in app that occasionally uses the network. However, this unintentional traffic should be fairly negligible.
  • This test only considers one specific mobile device with one operating system. Other models will have radios with different power usage characteristics.
  • Wi-Fi use is not tested.

Results

Note that the echo frequency above is in millihertz (mHz), not megahertz (MHz) — 1000mHz is 1 echo request/response per second.

Conclusion

One surprising result was the battery longevity in the control test. While most of us have grown accustomed to charging our mobile devices every day, it turns out that with minimal network activity, they can last quite a long time indeed. In this case, the Galaxy Nexus lasts almost three days while associated with an LTE tower.

As expected, the relationship between periodic network use and battery drainage is non-linear. For example, doubling the transaction frequency — say, from 128-second intervals to 64-second intervals — doesn't halve the 90-30% drain time; it only lowers it by 32%. Additionally, there seems to be a leveling out around 8-second (and shorter) intervals. Perhaps certain radio components never power down with such frequent transmissions.

Overall, the situation looks pretty grim for mobile devices being full, continuous participants in mesh overlay networks. The modest bandwidth needs of such applications are overshadowed by the battery impact of using the network in little sips throughout the day. Perhaps a system where all participants agreed to a synchronized schedule for mesh maintenance activities could mitigate the problem, but the benefits are not clear when combined with real-time mesh events instigated by remote users (say, a Kademlia node lookup).

It might be interesting to evaluate the impact of periodic network use with Wi-Fi, or investigate the techniques used by platform push systems such as Google Cloud Messaging and the Apple Push Notification service.

Highlights of DEFCON 22

The twenty-second DEFCON took over Las Vegas last week, and brought many interesting and notable speakers. I took a few notes from the talks that stood out to me, and I'm passing them along here.

Paul Vixie, Internet pioneer and DNS expert. Vixie spoke about his DNSDB project for accumulating global DNS resource records in a passive fashion, and making this information available to researchers and security product vendors. He also spoke about his DNS firewall for shielding users from malicious throwaway domain names.

Phil Zimmerman, creator of PGP and president of Silent Circle. Zimmerman spoke about wiretap overcompliance in the telecommunications industry, trust in cryptographic techniques, and his new endeavors at Silent Circle. Reading about Zimmerman's PGP efforts and the resulting drama (PGP: Pretty Good Privacy, Simson Garfinkel) is what got me interested in cryptography many years ago, so it was great to see a living legend on the stage. I did take issue with a few of his comments, though. When asked about trusting binary executables, Zimmerman mentioned the problem of distributing a binary which is identical to one that might be produced from source, due to differences in timestamps — and failed to discuss recent progress in reproducible build techniques which are meant to solve that problem. He also painted a somewhat rosy picture of the legislative attitude towards cryptography and privacy: we won the Crypto Wars in the 1990's, and cryptographic freedom can't be rolled back again now that everyone relies on it. This does not seem to be the case — last year, Congress and the administration were pushing a proposal which would effectively outlaw peer-to-peer communication systems that might be problematic to wiretap. (Thankfully, the Snowden revelations made the proposal politically toxic for now, and it has been shelved.)

Kenneth White, security researcher. White spoke about the Open Crypto Audit project which he launched along with cryptographer Matthew Green, and the drama caused by their first audit subject, TrueCrypt, being suddenly discontinued under mysterious circumstances. I've followed the progress of the Open Crypto Audit project and the ongoing news about the TrueCrypt disappearance, so there wasn't much in the talk that was new to me. It was interesting to hear that some of the biggest challenges of Open Crypto Audit were the community aspects of audit fundraising. White reported that they will finish the TrueCrypt audit in spite of the shutdown, and then move on to OpenSSL.

Dan Kaminsky, security researcher. Kaminsky scored a coveted two-hour slot in the Penn and Teller theater, which he fully used to discuss a variety of topics:

  • Secure random by default. Kaminsky argued that most vulnerabilities resulting from random number generation are not due to exotic attacks on complex algorithms, but rather gross missteps in the use and generation of randomness. For instance, some software has been observed to only effectively use 32 bits of entropy, while others employ the use of linear feedback shift registers (LFSRs) in spite of their easy cryptanalysis. Kaminsky proposes a new Liburandy library which wraps /dev/urandom when appropriate.
  • Storybits. Kaminsky invited Ryan Castellucci onto the stage to demonstrate Storybits 0.1, a new cryptomnemonic scheme for people to remember binary strings such as keys, fingerprints, secrets, etc. The system encodes the data as adjective-noun-verb tuples to make the data easier to remember, and provide error correction by way of spellcheck auto-correct.
  • Memory hardening. Convinced that improper memory usage is a major cause of vulnerabilities, Kaminsky outlined several strategies for memory-hardening applications. These include use of a typed heap (as Google does in Chrome), the use of nondeterministic freeing (as Microsoft does in Internet Explorer), and a novel approach called IronHeap where 64-bit virtual memory addresses are simply never freed (although pages may be returned for MMU reuse). He also announced the formation of a team to memory-harden Firefox, to provide added security for the Tor Browser Bundle.
  • Distributed Denial of Service (DDoS) mitigation. Kaminsky considers the rise of DDoS attacks using techniques such as datagram amplification to be an existential threat to the Internet. He proposes a new scheme of sending tracer packets within data flows to indicate when source address spoofing may be happening.
  • NSA. Kaminsky is concerned that the NSA backlash may lead to a balkanization of the Internet, as various nations opt to develop their own internal systems for core Internet services.
  • Apple bug bounties. Finally, Kaminsky is quite happy that Apple is offering bug bounties relating to Safari autoredirection.

Kaminsky's slides are available.

Ladar Levison, founder of Lavabit. Levison spoke about his proposed Dark Mail Alliance, a new electronic mail system designed to preserve the privacy of users. He began by announcing a new name for the project: DIME, the Dark Internet Mail Environment. I was a bit disappointed in the new name — "Dark" can have a sinister connotation for some people, and privacy preserving technologies should be marketed to the public with positive names reflecting the true value they provide. He should have renamed the project TIME, the Trustworthy Internet Mail Environment. Levison outlined the basic components of the system, including a server called Magma and a modified Thunderbird client called Volcano. DIME unfortunately does not provide forward secrecy for messages, although Levison pointed out that there was forward secrecy at the TLS1.2 line level. There was also talk of a pseudo-onion scheme to shield metadata and provide some small measure of anonymity, but it wasn't clear to me how this was implemented.

Adam Caudill, software developer and security researcher. In DEFCON's new Crypto Village, Caudill proposed a new secure electronic mail system called Simple Messaging and Identity Management Protocol (SMIMP). This scheme shares some of the same goals as Levison's DIME, but provides an alternative design intended to be developed in the open among the greater Internet engineering community. The most interesting thing to me was a Hashcash-like proof-of-work requirement for reducing spam.

Recent Android "Package file is invalid" errors

In the past day or so, I've been noticing these "Package file is invalid" errors on my Android devices while trying to upgrade or install certain packages from the Play Store. A bit of searching revealed that many others are having this problem, and various home remedies abound for trying to fix it, such as clearing the Play Store's app cache. Unfortunately, while these remedies may have worked for past problems that led to this error message being displayed, they are useless when trying to fix the issue people are experiencing this weekend.

I decided to do a bit of digging, and I found that Google's web servers are actually sending corrupted packages to the Play Store app. Therefore, no amount of tweaking your device will fix the problem. (Unless such tweaking happens to result in pulling packages from a different web server that doesn't have corrupted files, I suppose.)

UPDATE 2013-08-12: It appears that this problem is isolated to one or more specific servers on Google Play's content distribution network -- if your closest server has corruption, you'll always see this issue unless you move to a different network and a different server is selected. I see the problem here in Colorado, and a brief Twitter survey shows a high concentration of complaints from the U.S. Midwest and Great Lakes region. Suggestions to use a VPN have some merit -- when I VPN into Dallas, I can successfully update/install these problematic packages, because a non-corrupted server is chosen in that case. (Obviously this isn't a great solution.)

UPDATE 2013-08-13: I heard from a Google Play engineer today. It sounds like they're in the process of rolling out a fix, so our package updates and installs should be back to normal very soon!

I've observed this problem on the following devices:

  • Galaxy Nexus (Android 4.2)
  • Nexus 10 (Android 4.3)

To investigate the problem, I tried downloading the recently released Twitter 4.1.4 package, and compared the downloaded package file (temporarily stored in /data/data/com.android.providers.downloads/cache/downloadfile.apk) to a known good version.

A hex dump of an uncorrupted Twitter 4.1.4 package looks like this around offset 0x0200000:

01fffc0: 6e69 2067 6fcc 8872 6d65 6b2e 0028 2b42  ni go..rmek..(+B
01fffd0: 6972 2069 6e73 616e 206d c4b1 73c4 b16e  ir insan m..s..n
01fffe0: 2079 6f6b 7361 2062 6972 2062 696c 6769   yoksa bir bilgi
01ffff0: 7361 7961 7220 6dc4 b13f 000c 0c42 6f79  sayar m..?...Boy
0200000: 7574 3a20 252e 3166 6b00 0f11 4b6f 6e75  ut: %.1fk...Konu
0200010: 6d75 2064 65c4 9f69 c59f 7469 7200 0303  mu de..i..tir...
0200020: 5369 6c00 2122 2225 3124 7322 2022 2532  Sil.!""%1$s" "%2
0200030: 2473 2220 6c69 7374 6573 696e 6920 6f6c  $s" listesini ol

A hex dump of the corrupted Twitter apk looks like this around offset 0x0200000:

01fffc0: 6e69 2067 6fcc 8872 6d65 6b2e 0028 2b42  ni go..rmek..(+B
01fffd0: 6972 2069 6e73 616e 206d c4b1 73c4 b16e  ir insan m..s..n
01fffe0: 2079 6f6b 7361 2062 6972 2062 696c 6769   yoksa bir bilgi
01ffff0: 504b 0304 1400 0800 0800 e27c 0543 2d70  PK.........|.C-p
0200000: 8d5b c420 0100 986f 0200 1d00 0400 6173  .[. ...o......as
0200010: 7365 7473 2f66 6f6e 7473 2f52 6f62 6f74  sets/fonts/Robot
0200020: 6f2d 4c69 6768 742e 7474 66fe ca00 00ec  o-Light.ttf.....
0200030: 9d07 7c54 55fa f74f 994c 0a21 bd00 8190  ..|TU..O.L.!....

At 16 bytes before the 2-megabyte mark, the corrupted file begins repeating the contents of the beginning of the file, including the ZIP header. It looks like a common programming error when dealing with buffered I/O streams. I first suspected that the Play Store app or the Android framework on my devices had such an error, but then I used tcpdump to examine the actual HTTP traffic as seen from my router:
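
To illustrate the class of bug I have in mind -- and to be clear, this is a hypothetical sketch, not Google's actual server code -- imagine a sender that streams a file in 2 MB segments but refills its buffer from the segment-relative offset instead of the absolute file offset. Everything after the first segment then re-sends the beginning of the file:

// Hypothetical sketch of a buffered-send bug that would re-send the start of
// a file after the 2 MB mark. Not Google's actual code; sizes are assumptions.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static const std::size_t kSegment = 2 * 1024 * 1024;  // 2 MB segment size
static const std::size_t kChunk = 64 * 1024;           // read buffer size

void send_file_buggy(std::istream& in, std::ostream& out, std::size_t file_size) {
    std::vector<char> buf(kChunk);
    for (std::size_t sent = 0; sent < file_size;) {
        std::size_t seg_pos = sent % kSegment;  // position *within* the segment
        // BUG: should seek to 'sent' (the absolute offset), not 'seg_pos', so
        // after the first 2 MB we start reading from the top of the file again.
        in.seekg(static_cast<std::streamoff>(seg_pos));
        std::size_t want = std::min(kChunk, file_size - sent);
        in.read(buf.data(), static_cast<std::streamsize>(want));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        out.write(buf.data(), static_cast<std::streamsize>(got));
        sent += got;
    }
}

int main() {
    // 4 MB pseudo-file: 'A' bytes in the first 2 MB, 'B' bytes in the second.
    std::string data(2 * kSegment, 'A');
    for (std::size_t i = kSegment; i < data.size(); ++i) data[i] = 'B';

    std::istringstream in(data);
    std::ostringstream out;
    send_file_buggy(in, out, data.size());

    // Past the 2 MB mark the output wraps back to 'A' instead of 'B'.
    std::cout << "source byte at 2 MB: " << data[kSegment]
              << ", sent byte at 2 MB: " << out.str()[kSegment] << "\n";
    return 0;
}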

GET http://r15---sn-qxo7sn7s.c.android.clients.google.com/market/GetBinary/com.twitter.android/420?...
22:01:25.861259 IP 74.125.x.x.80 > 192.168.x.x.39431: Flags [.], seq 2097056:2098516, ack 527, win 245, length 1460
...
0x0230:  2073 cca7 6966 7265 6e69 2067 6fcc 8872  .s..ifreni.go..r
0x0240:  6d65 6b2e 0028 2b42 6972 2069 6e73 616e  mek..(+Bir.insan
0x0250:  206d c4b1 73c4 b16e 2079 6f6b 7361 2062  .m..s..n.yoksa.b
0x0260:  6972 2062 696c 6769 504b 0304 1400 0800  ir.bilgiPK......
0x0270:  0800 e27c 0543 2d70 8d5b c420 0100 986f  ...|.C-p.[.....o
0x0280:  0200 1d00 0400 6173 7365 7473 2f66 6f6e  ......assets/fon
0x0290:  7473 2f52 6f62 6f74 6f2d 4c69 6768 742e  ts/Roboto-Light.
0x02a0:  7474 66fe ca00 00ec 9d07 7c54 55fa f74f  ttf.......|TU..O

Sure enough, the corruption was present in the stream as sent from Google's web server. I assume that the bug is in Google's web server code, or in some intermediate package processing step at the Play Store. Either way, we'll just have to wait for Google to fix the glitch.

Google Fiber Tourism: Plugging into the glass at the Kansas City Hacker House

While finishing up my holiday travel, I decided to stop in for a couple of days at the Kansas City Hacker House, a place for aspiring technology entrepreneurs to live and work on their projects while connected to the Google Fiber gigabit network. Unlike my previous Google Fiber experience, I had an opportunity to plug my laptop directly into the network via gigabit ethernet and run some more tests.

Legacy Tests

I first ran a few tests of legacy network usage -- uploading large files to, and downloading them from, various services.

test                                       file size     time           effective bitrate
Google Drive - upload                      256 MB        400 seconds    5.3687 Mbps
Google Drive - download                    256 MB        289 seconds    7.4307 Mbps
Dropbox - upload                           256 MB        31.7 seconds   67.655 Mbps
Dropbox - download                         256 MB        67.6 seconds   31.779 Mbps
Ubuntu 12.10 (mirror.anl.gov)              753.293 MB    61.4 seconds   102.86 Mbps
Ubuntu 12.10 (bittorrent)                  753.293 MB    342 seconds    18.477 Mbps (peak 31.932)
Linux Mint 12 (bittorrent; 72/325 seeds)   1027.468 MB   283 seconds    30.456 Mbps
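
For reference, the "effective bitrate" column is just file size over elapsed time. A minimal check, assuming MB means 2^20 bytes and Mbps means 10^6 bits per second (assumptions that reproduce the Google Drive rows above):

// Back-of-the-envelope check of the "effective bitrate" column, assuming
// MB = 2^20 bytes and Mbps = 10^6 bits/second.
#include <cstdio>

double effective_mbps(double megabytes, double seconds) {
    double bits = megabytes * 1048576.0 * 8.0;   // MB -> bits
    return bits / seconds / 1e6;                 // bits/s -> Mbps
}

int main() {
    std::printf("Google Drive upload:   %.4f Mbps\n", effective_mbps(256.0, 400.0));
    std::printf("Google Drive download: %.4f Mbps\n", effective_mbps(256.0, 289.0));
    return 0;
}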

It looks like Google Drive wasn't having a good day. Dropbox, on the other hand, really screamed. (Although not as much as you might expect on a gigabit connection.) It was nice to be able to download Ubuntu in 61 seconds from a well-connected server. BitTorrent didn't perform well, though -- I suspect you'd need to download a much larger file from many more seeds before BitTorrent had time to ramp up its connections and compare favorably.

All tests were performed to and from a local ramdisk, to avoid any hard drive I/O bottlenecks. However, the remote servers are likely using spinning disks that are contending with many other users.

Speedtest.net tests

The Speedtest.net tests aren't very useful for Google Fiber, since those servers aren't set up to measure such high-bandwidth connections. You end up measuring the server's capabilities and the throughput of various intermediate networks instead. Nevertheless, here are a couple of tests:

I tested with several other Speedtest.net servers, and all the results varied too much to be useful.

Google Fiber Speed Test

To provide users with a reliable way of measuring the bandwidth to their home, Google provides a Google Fiber Speed Test for testing the connection from the home to a server on the Google Fiber network. (Google Fiber customers can access the server, but it doesn't appear to be accessible from the outside.)

The primary differences between Google's speed tests and the other speed tests seem to be:

  1. Google's server is located on the Google Fiber network in Kansas City, a mere 4 hops and 1.337ms of latency away from Google Fiber customers. This means that the Google Fiber Speed Test can more accurately measure the capability of a customer's last-mile link. (This also means it's perhaps less useful as a test for measuring access to resources outside of Kansas City.)
  2. The server is presumably provisioned well enough to handle tests from gigabit customers.
  3. Google's test opens a large number of simultaneous connections -- as many as 64 from my tcpdump observations. This may help with issues related to TCP window size, and possibly mitigate the negative effects of TCP congestion control should one of the connections miss a packet. (A rough sketch of the window-size arithmetic follows this list.)
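
On the window-size point: a single TCP connection can move at most one receive window per round trip, so its ceiling is roughly window × 8 / RTT. The sketch below uses an assumed 64 KB window (no window scaling) and an assumed 50 ms RTT for a distant server -- illustrative figures, not measurements of Google's servers:

// Rough single-connection TCP throughput ceiling: one receive window per RTT.
// The 64 KB window and 50 ms distant-server RTT are illustrative assumptions.
#include <cstdio>

double ceiling_mbps(double window_bytes, double rtt_seconds) {
    return window_bytes * 8.0 / rtt_seconds / 1e6;
}

int main() {
    const double window = 64.0 * 1024.0;  // 64 KB receive window (assumption)
    // 1.337 ms to the Google Fiber test server vs. ~50 ms to a distant server.
    std::printf("RTT 1.337 ms: %.0f Mbps per connection\n", ceiling_mbps(window, 0.001337));
    std::printf("RTT 50 ms:    %.0f Mbps per connection\n", ceiling_mbps(window, 0.050));
    std::printf("RTT 50 ms, 64 connections: %.0f Mbps aggregate\n",
                64.0 * ceiling_mbps(window, 0.050));
    return 0;
}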

Network topology

Google Fiber has considerably expanded its peering arrangements since my last visit. These are the networks where I noticed good peering:

  • Level 3 - Chicago
  • XO - Chicago
  • Facebook (Rackspace ORD1) - Chicago
  • Inteliquent - Chicago
  • Kansas Research and Education Network (KanREN) - Kansas City
  • Level 3 - Dallas
  • Level 3 - Denver
  • Comcast (Equinix Great Oaks) - San Jose
  • Level 3 - San Jose
  • Amazon - San Francisco
  • Google - various

(Who knew that Facebook even had their own nationwide network? If you see tfbnw.net addresses in your traceroutes, the tfbnw stands for "the facebook network".)

IPv6 seems to be functioning properly, according to various online testers. (I did have some issues reaching my 6to4-connected home network via IPv6, for some reason.)

Conclusions

The file transfer tests -- old-fashioned "move this big file from one hard drive on the network to some other hard drive" -- are probably not the best tests of a next-generation gigabit service such as Google Fiber. Nor are most other "download" applications. (What's the point of being able to download four seasons of Breaking Bad in 3 minutes, when it takes 30 hours to watch?) Ultimately, unlocking the true potential of home gigabit connections will rely on the development of new and interesting applications. I predict a lot of live media, immersive telepresence, and rich collaboration applications will arise from this experiment.

Thanks to Ben Barreth and the residents of the Hacker House for having me over!

Hanging out on the job: Using Google Hangouts for collaborative telepresence
My work and telepresence setup.

As a work-from-home software engineer, I'm always looking for ways to improve communication with co-workers and clients to help bridge the distance gap. At the beginning of October, a colleague and I decided to devote the month to an extreme collaboration experiment we called Maker's Month. We had been using Google Hangouts for meetings with great effectiveness, so we asked ourselves: Why not leave a hangout running all day, to provide the illusion of working in the same room? To that end, we decided to take our two offices -- separated spatially by 1,000 miles -- and merge them into one with the miracle of modern telecommunications.

We began by establishing some work parameters: We would have a meeting every morning to discuss the goals of the day, then mute our microphones for most of the next 6 to 7 "core office hours" while the hangout was left running. During the day we could see each other working, ask questions, engage in impromptu integration sessions, and generally pretend like we were working under the same roof. At the end of the day, we would have another meeting to discuss our accomplishments, adjust the project schedule, and set goals for the following day. We would then adjourn the hangout and work independently in "offline" mode.

There were a handful of questions we were hoping to answer during the course of this experiment:

  • How much bandwidth would this telepresence cost, in terms of both instantaneous bitrate and total data usage?
  • What audio/video gear would give us the best experience, and help avoid the usual trouble areas? (Ad-hoc conferencing setups are notorious for annoying glitches such as remote echo.)
  • Would Google even allow us to keep such long-duration hangouts running, or to use such a large number of hangout-hours in a month? (Unlike peer-to-peer protocols such as RTP/WebRTC/etc., hangout media streams are actually switched in the cloud and consume the CPU/bandwidth resources of Google.)
  • Do extended telepresence sessions provide real value to software development teams?

While Google Hangouts supports up to nine people in a hangout, our experiment only involved two people. (Our initial plans to bring a third team member into the hangout never materialized.)

Technical details

This wouldn't be a proper Caffeinated Bitstream post without some graphs and figures, so here are some charts showing the overall bandwidth usage:

The first chart shows the bandwidth usage of a typical two-person hangout session, which uses about 750-1000 kbps in each direction (when the connection settings are configured for "fast connection"). The aberrations in the chart are due to changing hangout parameters (i.e. screen sharing instead of video, or the remote party dropping off.) The second chart shows the bandwidth usage for my house during the month of October. The hangout sessions are likely the bulk of this usage, but it also includes occasional movie streaming, Ubuntu downloads, software updates, and such. I sometimes hear people comment that the bandwidth caps imposed by some internet service providers can't be exceeded by legitimate use of the network, but I can easily imagine many telepresence scenarios that would quite legitimately push users over the limit. Fortunately, our usage is fairly modest, and my provider doesn't impose caps, anyway.
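
To put those rates in perspective, here is a rough monthly total. The 7 hours/day and 22 days/month figures are assumptions chosen to match our "core office hours" schedule; the 1,000 kbps rate is the upper end of what I observed for a two-person hangout:

// Rough monthly hangout data usage under assumed hours (7 h/day, 22 days/month)
// at the upper end of the observed two-person rate (1,000 kbps each direction).
#include <cstdio>

int main() {
    const double kbps_each_way = 1000.0;   // observed upper end
    const double hours_per_day = 7.0;      // assumption
    const double days_per_month = 22.0;    // assumption

    double seconds = hours_per_day * 3600.0 * days_per_month;
    double gb_each_way = kbps_each_way * 1000.0 / 8.0 * seconds / 1e9;  // decimal GB
    std::printf("per direction: %.0f GB/month, both directions: %.0f GB/month\n",
                gb_each_way, 2.0 * gb_each_way);
    return 0;
}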

My hangout hardware consists of:

  • A desktop computer with a quad-core Core i7 920 (2.67 GHz) processor and 8 GB of RAM, running Ubuntu Linux
  • A dedicated LCD monitor
  • A Logitech HD Pro Webcam C910
  • A Blue Yeti microphone
  • A stereo system with good speakers, for audio output.

I've occasionally run Google Hangouts on my mid-2010 MacBook Pro, but the high CPU usage eventually revs up the fan to an annoying degree. The desktop computer doesn't seem to noticeably increase its fan noise, although I do have it tucked away in a corner. I've found that having a dedicated screen for the hangout really helps the telepresence illusion. The Yeti microphone is awesome, but the C910's built-in microphone is also surprisingly great. In fact, my colleague can't tell much of a difference between the two. I've noticed that the use of some other (perhaps sub-standard) microphones seems to thwart the echo cancellation built into Google Hangouts, resulting in echo that makes it almost impossible to carry on a conversation.

In addition to its thirst for bandwidth, Google Hangouts also demands a hefty chunk of processor time (and thus, power usage) on my equipment:

system                                           CPU usage   quiescent power   hangout power   hangout power increase
4-core Core i7 920 (2.67 GHz) desktop            62%         75 W              80 W            5 W
2-core Core i7 (2.66 GHz) mid-2010 MacBook Pro   77%         13 W              38 W            25 W

(Note: CPU usage is measured such that full usage of a single core is 100%. The usage is the sum of various processes related to delivering the hangout experience. On Linux: GoogleTalkPlugin, pulseaudio, chrome, compiz, Xorg. On Mac: GoogleTalkPlugin, Google Chrome Helper, Google Chrome, WindowServer, VDCAssistant. Power was measured with an inline Kill A Watt meter.)

I figure that using my desktop machine for daily hangouts has a marginal electrical cost of around $0.06/month. (Although keeping this desktop running without suspending it is probably costing me around $4.74/month.) Changing the hangout settings to "slow connection" roughly reduces the CPU usage by half.
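
The arithmetic behind those dollar figures, under one set of assumptions that lands close to them (about 7 hangout hours per day, 22 working days per month, and the electricity rate implied by the $4.74 always-on estimate):

// One set of assumptions that lands near the dollar figures above. The hours
// per month are assumed; the rate is derived from the $4.74 always-on figure.
#include <cstdio>

int main() {
    const double always_on_watts = 75.0;
    const double always_on_cost = 4.74;                                  // USD/month
    const double kwh_always_on = always_on_watts * 24.0 * 30.0 / 1000.0; // 54 kWh
    const double rate = always_on_cost / kwh_always_on;                  // ~USD 0.088/kWh

    const double marginal_watts = 5.0;                // hangout power increase
    const double hangout_hours = 7.0 * 22.0;          // assumption
    const double kwh_hangouts = marginal_watts * hangout_hours / 1000.0;

    std::printf("implied rate: $%.3f/kWh\n", rate);
    std::printf("marginal hangout cost: $%.2f/month\n", kwh_hangouts * rate);
    return 0;
}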

Why does Google Hangouts use so much CPU and bandwidth? I think it all comes down to the use of H.264 Scalable Video Coding (SVC), a bitrate peeling scheme where the video encoder actually produces multiple compressed video streams at different bitrates. The higher-bitrate streams are encoded relative to information in the lower-bitrate streams, so the total required bitrate is fortunately much less than the sum of otherwise independent streams, but it is higher than a single stream. The "video switch in the cloud" operated by Google (or perhaps Vidyo, the provider of the underlying video technology) can determine the bandwidth capacity of the other parties and peel away the high-bitrate layers if necessary. Unfortunately, not only does SVC somewhat increase the bandwidth requirements, but it also means that the Google Talk Plugin cannot leverage any standard H.264 hardware encoders that may be present on the user's computer. Thus, a software encoder is used and the CPU usage is high. The design decision to use SVC probably pays off when three people or more are using a hangout.
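
To make the bitrate-peeling idea concrete, here is a toy sketch. The layer bitrates and receiver budgets are made-up numbers, and the selection logic is only conceptual -- it is not Vidyo's or Google's actual switching code:

// Toy illustration of SVC-style bitrate peeling. Layer bitrates and receiver
// budgets are made-up numbers; this is a concept sketch, not real switch code.
#include <cstdio>
#include <vector>

int main() {
    // Base layer plus two enhancement layers, each encoded relative to the
    // layers below it. The sender uploads all of them.
    const std::vector<int> layer_kbps = {300, 300, 400};   // total upload: 1000 kbps

    // The cloud switch "peels" layers per receiver: forward layers from the
    // base layer upward until the receiver's downlink budget is exhausted.
    const std::vector<int> receiver_budget_kbps = {1200, 700, 350};
    for (int budget : receiver_budget_kbps) {
        int used = 0, layers = 0;
        for (int kbps : layer_kbps) {
            if (used + kbps > budget) break;
            used += kbps;
            ++layers;
        }
        std::printf("receiver budget %4d kbps -> forward %d layer(s), %d kbps\n",
                    budget, layers, used);
    }
    return 0;
}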

One downside to using Google Hangouts for extended telepresence sessions is the periodic "Are you still there?" prompt, which seems to appear roughly every 2.5 hours. If you don't answer in the affirmative, you will be dropped from the hangout after a few minutes. Sometimes when I've stepped out of the office for coffee, I'll miss the prompt and get disconnected. I understand why Google does this, though, and reconnecting to the same hangout is pretty easy. Even with our excessive use of Google Hangouts, we haven't encountered any other limits to the service.

Telepresence effectiveness

Video conferencing has always offered some obvious communication advantages, and Google Hangouts is no exception. The experience is much better than talking on the phone, as body language can really help convey meaning. In many ways, it does help close the distance gap and simulate being in the same room: team members can show artifacts (such as devices and mobile phone apps) and see at a glance if other team members are present, absent, working hard on a problem, or perhaps available for interruption. We made heavy use of the screen sharing feature, and even took advantage of the shared YouTube viewing on several occasions. We didn't engage in pair programming in this experiment, although remote pair programming is not unheard of. The biggest benefit of telepresence for geographically distributed teams seems to be keeping team members focused and engaged, as being able to see other team members working can be a source of motivation.

For me, the biggest downside to frequent use of Google Hangouts is the "stream litter" problem: Every hangout event appears in your Google+ stream forever, unless you manually delete it. While it's only visible to the hangout participants, it's really annoying to have to sift through a hundred hangout events while I'm looking for an unrelated post in my Google+ stream. Also, it's sometimes awkward when I want to share the screen from my work computer while using a different computer for the hangout. I end up joining the hangout a second time from my work computer, only to have nasty audio feedback ensue until I mute the microphone and speaker.

Conclusions

I think that using Google Hangouts for extended work sessions adds a lot of value, and I'll continue to use it. It would be interesting to try other video conferencing solutions to see how they compare.

For the impatient people who just scrolled down to "Conclusions" right away, here's the tl;dr:

Pros:
  • Continuous visual of other team members increases the opportunities for impromptu discussions and helps motivation.
  • The "same room" illusion helps close the distance gap associated with telework.
  • Good quality audio and video.
  • Easily accessible from GMail or Google+.
  • Screen sharing.
  • Shared YouTube viewing.
Cons:
  • Relatively high (but manageable) bandwidth and CPU requirements.
  • Google+ stream littered with hangout events.
  • 2.5-hour "Are you still there?" prompt.
  • When eating doughnuts in front of team members, can't offer some for everyone.

A quick survey of C++11 feature support

I recently conducted a quick-and-dirty survey of C++11 (formerly known as C++0x) features available on various platforms and compilers that I had lying around. My testing was neither authoritative nor rigorous. (For example, g++ without -std=c++0x actually compiles lambdas without throwing an error, so I marked lambdas as supported even though it gives a stern warning.) I'm posting the results here, mostly for my own future reference.
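
For a sense of what the individual tests look like, here is a representative probe -- not the actual test suite linked below -- that only compiles where lambdas, auto, nullptr, uniform initialization, and range-based for are available, and reports the macro values from the first two rows of the table:

// Representative C++11 probe, not the actual test suite: compiles only where
// lambdas, auto, nullptr, uniform init, and range-for are supported, and
// reports what the compiler claims about itself.
#include <cstdio>
#include <vector>

int main() {
#ifdef __GXX_EXPERIMENTAL_CXX0X__
    std::printf("__GXX_EXPERIMENTAL_CXX0X__ is defined\n");
#endif
    std::printf("__cplusplus = %ld\n", static_cast<long>(__cplusplus));

    auto square = [](int x) { return x * x; };   // lambda + auto
    int* p = nullptr;                            // nullptr
    std::vector<int> v{1, 2, 3};                 // uniform initialization
    int sum = 0;
    for (int x : v) sum += square(x);            // range-based for
    std::printf("sum of squares = %d (p is %s)\n", sum, p ? "set" : "null");
    return 0;
}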

Mac OS 10.6 / Xcode 4.2
gcc version 4.2.1
Apple clang version 3.0
Ubuntu 12.04
gcc version 4.6.3
Ubuntu clang version 3.0-6ubuntu3
Windows 7
MSVC++ 2010
g++ clang++ clang++ -std=c++0x g++ g++ -std=c++0x clang++ clang++ -std=c++0x cl.exe /clr
__cplusplus 1L 1L 201103L 1L 1L 1L 201103L 199711L
__GXX_EXPERIMENTAL_CXX0X__ undef undef 1 undef 1 undef 1 undef
omit space in nested template ">>" X X X X
std::tr1::shared_ptr X X X X X X X
std::shared_ptr X X X
nullptr X X X X
auto X X X X X
uniform initialization X
for range (foreach) X X X X X
move semantics (std::move) X X
raw string literals X X
encoded string literals X X
noexcept X X X
constexpr X X X
variadic templates X X X X X X
lambdas X X X
decltype X X X X
new function declaration style X X X X
scoped enums X X X X
std::function X X X
std::tr1::function X X X X X X X
can autodetect need for std::tr1 X X X X X X X X

Other, probably more thorough information about C++11 feature support:

My quick-and-dirty test suite is available for download.

UPDATE 2013-05-27: More recent platforms and compilers, below...

Mac OS 10.8 / Xcode 4.6.2
gcc version 4.2.1
Apple clang version 3.3
Ubuntu 13.04
gcc version 4.7.3
Ubuntu clang version 3.2-1~exp9ubuntu1
clang++ clang++ -std=c++11 g++ g++ -std=c++11 clang++ clang++ -std=c++11
__cplusplus 199711L 201103L 199711L 201103L 199711L 201103L
__GXX_EXPERIMENTAL_CXX0X__ undef 1 undef 1 undef 1
omit space in nested template ">>" X X X
std::tr1::shared_ptr X X X X X X
std::shared_ptr X X
nullptr X X X
auto X X X X X
uniform initialization X X
for range (foreach) X X X X X
move semantics (std::move) X X
raw string literals X X X
encoded string literals X X X
noexcept X X X
constexpr X X X
variadic templates X X X X X X
lambdas X X X X
decltype X X X
new function declaration style X X X
scoped enums X X X
std::function X X
std::tr1::function X X X X X X
can autodetect need for std::tr1 X X X X X

Nest Learning Thermostat: Installation, battery issues, and the importance of the "C" wire

My furnace's control board. The "C" terminal has no connection to the thermostat in this picture. (The white wire on the C terminal goes to the A/C.) I connected the unused blue wire (bottom center) to the C terminal.

The Nest now confirms the active "C" wire.

I recently bought and installed a Nest Learning Thermostat to replace my old non-networked thermostat. I show the installation, demonstrate control from mobile devices, and provide a general review in the above video.

It's been about a month since I installed the device, and I found one important issue yesterday. My Nest dropped off the network for 7 hours, and upon investigation I discovered that the battery was low and the Nest had turned off its Wi-Fi radio to save power. Many other people have reported problems with the battery, which is scary because your thermostat is one device that you absolutely want to work 24/7 -- you don't want your pipes freezing when you leave town and the Nest decides to run out of juice!

It turns out that my thermostat wiring, like that in many homes, does not provide a "C" wire (common 24VAC) for completing a circuit that provides constant power to the unit. This sort of wiring worked great for old-fashioned mercury thermostats -- it provides a red 24VAC power wire, and "call" wires for turning on the fan, heat, and air conditioning. When the thermostat needs to turn on one of those appliances, it simply closes the circuit between the red wire and the relevant call wire. Smart thermostats rely on batteries to power their smartness when no circuit is closed. When an appliance is running (i.e. one of those three circuits is closed), the thermostat can perform "power stealing" to sap power from the closed circuit, both to operate and to recharge its battery. For simple programmable thermostats, power stealing is probably sufficient. However, for a power-hungry device like the Nest that needs to operate a Wi-Fi radio, this mode of operation can be problematic for several reasons:

  1. If you live in a nice place like Colorado where you can open the windows and go days without using the heater or air conditioner, the control circuits are never closed and the Nest's battery doesn't have an opportunity to recharge.
  2. Power stealing is an imperfect backwards compatibility hack, and can't necessarily provide enough current to recharge the battery even when the appliances are operating. This is because the current may be limited by resistance in your furnace's control board.
  3. When the HVAC appliances are not running and the battery needs to be charged, the Nest performs an even worse hack than power stealing: it pulses the heater call circuit on and off very quickly to steal some power, and hopes that the pulses are short enough to keep the furnace from activating. I haven't noticed any problem with this, but at least one person has found that it wreaks havoc on their heater.
  4. The Nest uses a "Power Saving Mode" of Wi-Fi to reduce the power consumption of the radio and prolong the battery life. (And hopefully require less overall power than it can steal from the call circuits.) Nest indicates that some non-conformant wireless access points may not fully support this mode, thus causing the Nest to consume more power. (Perhaps more quickly than it can be replenished.)

I was lucky that my thermostat wiring contained an extra, unused (blue) wire, and my furnace's control board provided a 24VAC common terminal for a "C" wire. After hooking up the blue wire at the furnace and the Nest's base, I now seem to have successfully provided a 24VAC "C" wire to the Nest, and hopefully my battery issues are behind me.

I do think that Nest is perhaps overly optimistic about their power stealing and circuit pulsing being able to provide adequate power to the device. There's certainly no warning about this potential issue when you provide your wiring information to their online compatibility tool.
