Go Speed Racer Go

November 11, 2013 | clypd Blog, Technology

On a brief holiday from my regular activities, I recently spent a few days in Sidetrackistan attempting to answer a simple question: how far can we push bare-metal hardware using the simplest of web services and a standard deployment configuration? I quickly developed an experiment and collected data that exposed some interesting – though somewhat expected – limits of parallelism in some popular web platforms. I also identified a newer platform that seems better poised to exploit multicore to its theoretical limit.

This paragraph is a standard disclaimer for any published benchmarks on the Internet: these results are not intended to predict all workloads, this methodology is almost certainly flawed, and this experiment certainly doesn’t reflect an actual real-world scenario.

For this simple test my colleagues and I created web applications in Rails, Java, and Go that handled a single URL by responding with a fixed JSON string. To drive the testing of these simple web services we used the venerable Gatling project, which seemed to easily scale to support the load and – as a bonus – provided pretty charts of the results. All tests were run on CentOS 6.4.

The server under test was a 6-core Intel Xeon X5670, exposing 12 virtual CPUs to the operating system, with 24 GB of RAM and dual Gigabit Ethernet NICs. We employed two identical systems on the same LAN, one to host Gatling and the other for the test services. The Rails app was generated with the standard Rails application generator, with a single controller added to handle one URL:

def handle
  render :text => '{"id": 1}'
end

The Java application embedded a similar handler inside a Jetty server:

public void handle(String target,
                   Request baseRequest,
                   HttpServletRequest request,
                   HttpServletResponse response)
    throws IOException, ServletException
{
    response.setContentType("text/html;charset=utf-8");
    response.setStatus(HttpServletResponse.SC_OK);
    baseRequest.setHandled(true);
    response.getWriter().println("{\"id\": 1}");
}

Lastly, the Go version:

func handler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("{\"id\": 1}"))
}

Clearly no one will ever accuse Java of being too succinct.
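For completeness, here is roughly how that Go handler gets wired up and served using only the standard library. This is a minimal sketch; the route path and port are illustrative, not our exact deployment:

package main

import (
    "log"
    "net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("{\"id\": 1}"))
}

func main() {
    // Register the single test URL and start Go's built-in HTTP server.
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}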

Here are our results, all crammed onto a single chart.

[Chart: requests per second achieved by Rails, Java, and Go under load]

Those tiny dots in the lower left corner illustrate the best we achieved with Rails. As expected, MRI fully loaded a single core. What surprised us was that JRuby (Puma) with 16 threads was also unable to leverage more than a single core. Only one thread ever did any observable work, and the JVM was unable to match the speed of MRI. One possible theory for this poor showing is the synchronization cost imposed by the 15 idle threads on the one active thread, but clearly something was way off. Thanks to a suggestion from the folks in #jruby on IRC, I switched from Puma to TorqueBox-lite and our app suddenly sprang to life across all 12 cores. Unfortunately, JRuby (TorqueBox-lite) still crawled along at about 2,200 requests per second, better than MRI but far less than we had hoped for.

While I’m sure there are ways to push Ruby and Rails higher, we were unable to find an easy path forward. Besides, we still hoped to observe a huge jump from Java and Go, since both of these platforms parallelize fully across native OS threads.

Native Java reached a limit of about 43,000 requests per second, an order of magnitude higher than Rails. Hooray! Using top to inspect the OS threads showed that Jetty had spawned 275 of them. Wowzers. During our test one thread topped out at ~30% CPU while the other 274 in the pool hovered around ~1%. This clearly exposed an asymmetry in the parallelization strategy used by Jetty in its default configuration, but there’s probably room for improvement:
http://wiki.eclipse.org/Jetty/Howto/High_Load

Last but certainly not least, Go topped our trial at 52,500 requests per second. Following a tip from the golang-nuts mailing list, I set GOMAXPROCS to 24, but otherwise there was no special build or deployment configuration. Inspecting the OS threads showed each of Go’s 24 threads hovering at ~30% activity. In a mostly default deployment configuration, Go was clearly the most adept at parallelizing the network workload symmetrically across all available cores.
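For anyone reproducing the setup: GOMAXPROCS can be set from the shell at launch (GOMAXPROCS=24 ./server) or in code before the server starts accepting traffic. Extending the main function sketched earlier (and adding "runtime" to its imports), the in-code version would look something like this; take it as a sketch rather than our exact invocation:

func main() {
    // Equivalent to launching with GOMAXPROCS=24 in the environment:
    // let the runtime execute Go code on up to 24 OS threads at once.
    runtime.GOMAXPROCS(24)

    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}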

At this point we were comfortable declaring Go a winner, but it’s interesting to speculate how much higher we could have gone, since we hadn’t even come close to pegging the CPU.

I’d like to explore pushing the hardware even higher by tweaking network stack and operating system settings, by examining and reducing memory usage, and by avoiding other bottlenecks like cache misses and unnecessary context switches. But that will have to wait for another trip.

When not on the Go, Brian spends his time as Principal Software Engineer at clypd.
