A Small Case Study in Threading

A recent project required me to process hundreds of CSV (comma-separated values) files and import them into a database. It's boring work, but it offers a few opportunities for me to utilize some of my favorite Ruby techniques and idioms.

The first version of the script was very simple.

  1. Copy the zip file containing the CSVs to a working directory
  2. Unzip the file
  3. Delete unwanted files (e.g. README)
  4. Loop through each filename
  5. Convert the CSV lines where necessary (i.e. numbers should be Integer or Float instead of defaulting to String, do the same for dates)
  6. Insert the converted data into the database
  7. Delete the files as each completes, then go back to 1 for the next zip
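Step 5's type conversion can be sketched like this (a minimal illustration; the conversion rules and the `convert_field` helper name are my assumptions, not the production script's):

```ruby
require 'csv'
require 'date'

# Hypothetical converter for Step 5: promote numeric-looking and
# date-looking strings to Integer, Float, or Date; leave the rest as String.
def convert_field(value)
  case value
  when /\A-?\d+\z/             then Integer(value, 10)
  when /\A-?\d+\.\d+\z/        then Float(value)
  when /\A\d{4}-\d{2}-\d{2}\z/ then Date.parse(value)
  else value
  end
end

row = CSV.parse_line("42,3.14,2015-09-24,ACME")
row.map { |field| convert_field(field) }
# yields an Integer, a Float, a Date, and a String
```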

I originally ran this under MRI and was getting approximately 3000 inserts per second on average. I did a quick back-of-the-envelope calculation and determined that the entire import would require about 15 days. Uh oh.

Ultimately I decided to write two scripts. The first script would handle Steps 1-4 and 7. The second script would handle Steps 5-6. Script One would launch Script Two as a subprocess via IO.popen and send filenames to Script Two's STDIN.
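The handoff between the two scripts can be sketched like this. The child here is an inline one-liner purely for illustration; in the real setup it would be the Script Two file, which parses and inserts each file it is told about.

```ruby
# Script One (sketch): stream filenames to a subprocess's STDIN via IO.popen.
# The inline child simply counts the filenames it receives on STDIN.
counter = 'puts $stdin.read.split.length'

received = IO.popen(["ruby", "-e", counter], "r+") do |pipe|
  %w[a.csv b.csv c.csv].each { |f| pipe.puts(f) } # what Script One sends
  pipe.close_write                                # EOF tells the child we're done
  pipe.read
end

received.strip  # => "3"
```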

The real fun was in Script Two. Upon receiving a filename, it divides the file into multiple subfiles of equal size and opens an IO handle for each temp file. Each IO handle is handed to its own thread, which does the processing described in Step 5. These threads batch up 10_000 converted lines and pass each batch to a dedicated database thread that handles the bulk inserts. All inter-thread communication occurs through a SizedQueue, which I chose to provide backpressure on the parsing threads if they get too far ahead of the database thread.
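Scaled down for illustration, the thread layout looks roughly like this (the batch size, worker counts, and names here are mine, not the production script's):

```ruby
BATCH_SIZE = 50            # the real script batches 10_000 lines
queue = SizedQueue.new(4)  # parsers block once 4 batches wait (backpressure)

# Parser threads: convert lines and push completed batches onto the queue.
parsers = 2.times.map do |worker|
  Thread.new do
    batch = []
    100.times do |line_number|
      batch << [worker, line_number]  # stand-in for a converted CSV row
      if batch.size == BATCH_SIZE
        queue << batch                # blocks if the DB thread falls behind
        batch = []
      end
    end
    queue << batch unless batch.empty?
  end
end

# Dedicated database thread: pop batches and "bulk insert" them.
inserted = 0
db_thread = Thread.new do
  while (batch = queue.pop)
    break if batch == :done
    inserted += batch.size            # stand-in for a bulk INSERT
  end
end

parsers.each(&:join)
queue << :done                        # signal the DB thread to finish
db_thread.join
inserted  # => 200
```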

Before spending any more time optimizing this work, I decided to benchmark the CSV parsing of a small file under MRI and Rubinius (JRuby is a whole different story, worthy of its own post on the JRuby blog). The results are below.

GuestOSX:options_database cremes$ ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV                7.910000   0.010000   7.920000 (  7.924954)
------------------------------------------------- total: 7.920000sec

                             user     system      total        real
parse CSV                8.040000   0.020000   8.060000 (  8.053098)

GuestOSX:options_database cremes$ chruby rbx
GuestOSX:options_database cremes$ ruby -v
rubinius 2.5.8 (2.1.0 bef51ae3 2015-09-24 3.5.1 JI) [x86_64-darwin14.5.0]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV               16.264571   0.161624  16.426195 ( 10.562584)
------------------------------------------------ total: 16.426195sec

                             user     system      total        real
parse CSV                9.084859   0.033108   9.117967 (  9.010402)

The test was single-threaded. Looking at the "real" column, MRI is fastest, taking 8 seconds to parse 50_000 lines of my test data. Rubinius came in second at 9 seconds.
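The single-threaded benchmark amounts to the following (a minimal reconstruction with synthetic, scaled-down data; `Benchmark.bmbm` is what produces the "Rehearsal" block seen in the output above):

```ruby
require 'benchmark'
require 'csv'

# Synthetic stand-in for the real test data (the actual run used 50_000 lines).
data = 5_000.times.map { |i| "#{i},name#{i},#{i * 1.5}" }.join("\n")

Benchmark.bmbm do |x|
  x.report("parse CSV") { CSV.parse(data) }
end
```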

Running the test again with 4 threads each (on an 8-core machine) was enlightening.

GuestOSX:options_database cremes$ ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
GuestOSX:options_database cremes$ ruby benchmarks_multi.rb 
Rehearsal -------------------------------------------------------------
parse CSV - multithreaded   7.870000   0.100000   7.970000 (  7.962155)
---------------------------------------------------- total: 7.970000sec

                                user     system      total        real
parse CSV - multithreaded   7.930000   0.090000   8.020000 (  8.012822)

GuestOSX:options_database cremes$ chruby rbx
GuestOSX:options_database cremes$ ruby -v
rubinius 2.5.8 (2.1.0 bef51ae3 2015-09-24 3.5.1 JI) [x86_64-darwin14.5.0]
GuestOSX:options_database cremes$ ruby benchmarks_multi.rb 
Rehearsal -------------------------------------------------------------
parse CSV - multithreaded  20.991651   0.720063  21.711714 (  4.663789)
--------------------------------------------------- total: 21.711714sec

                                user     system      total        real
parse CSV - multithreaded  14.130549   0.136832  14.267381 (  2.984456)

MRI ran the 4-thread benchmark in the same 8 seconds as before! We are often reminded that MRI now maps its threads to native threads, but there is still a global interpreter lock (GIL) that prevents MRI from truly running code in parallel. Rubinius eliminated its GIL years ago, so all threads can run in parallel, producing a finishing time of about 3 seconds.
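For reference, the multithreaded variant amounts to splitting the input and parsing each slice in its own thread (again a minimal reconstruction with synthetic, scaled-down data):

```ruby
require 'benchmark'
require 'csv'

lines  = 4_000.times.map { |i| "#{i},name#{i},#{i * 1.5}" }
chunks = lines.each_slice(lines.size / 4).map { |slice| slice.join("\n") }

Benchmark.bmbm do |x|
  x.report("parse CSV - multithreaded") do
    chunks.map { |chunk| Thread.new { CSV.parse(chunk) } }.each(&:join)
  end
end
```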

With these improvements, Rubinius can finish my production job in about 5.5 days (versus the original 15). The CSV parsing now runs faster than the database can accept bulk inserts, so unless I want to spend a bunch of time tuning the database configuration, my work is done. Thanks to Rubinius, I am saving over 9 days on my import.

To reproduce these numbers on your own system, the benchmarks and test data can be found here.

Code Climate vs Rubinius

It is difficult to understand the behavior of a program written in a dynamic language like Ruby without running the program. While static analysis tools like Code Climate can tell us a fair amount about the code, there's still a lot more they can't tell us.

Wouldn't it be nice if the system running our program could tell us about what the code is doing while it's running? Rubinius can do this.

While a program is running, there are two graphs interacting. The first is the graph of functions (or methods) as they call one another. The second is the graph of data objects that the functions create or operate on.

In Rubinius, these two graphs intersect at the inline cache objects. Wherever your Ruby program makes a method call, Rubinius creates a special object at that call site that records the type of object the method is called on and which method is called. These simple Ruby objects record the graph of methods called in your program, and from that graph we can analyze all kinds of actual behavior of your code.
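As a conceptual model only (these are not Rubinius's actual classes or internals), an inline cache can be pictured as a small per-call-site recorder:

```ruby
# Conceptual model of an inline cache: one object per call site, recording
# which receiver classes that call has seen. All names here are illustrative.
class CallSiteCache
  attr_reader :method_name, :seen

  def initialize(method_name)
    @method_name = method_name
    @seen = Hash.new(0)   # receiver class => number of calls observed
  end

  def record(receiver)
    @seen[receiver.class] += 1
  end
end

# Imagine one of these attached to every `obj.to_s` call site:
cache = CallSiteCache.new(:to_s)
[1, :a, "b", 2].each { |obj| cache.record(obj) }
cache.seen  # e.g. {Integer=>2, Symbol=>1, String=>1}
```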

This sounds awesome, doesn't it? When was the last time you wondered how a bit of Ruby code in your program was interacting with other parts of the code? If you're like me, that's every time I'm writing Ruby code. So, how can Rubinius help? That's the problem: we don't yet have the tools you need.

But I want to fix that. The question is, how? I need your help to decide.

Recently, Code Climate announced they were releasing their platform as open source, "the first open and extensible platform for static analysis". One way to leverage Rubinius's ability to help you understand your code would be to integrate with the Code Climate platform.

Another possibility is to create a stand-alone Rubinius service that would start with some simple runtime analysis of your code, but could expand in many different dimensions, showing data flow, security analysis, performance, and many other facets of application analysis beyond what is possible with simple static analysis.

I want to emphasize that these two options are only superficially similar. The facilities that we have in Rubinius, and continue to expand and improve, can provide far greater depth of analysis than is possible with static analysis. So, the question is really, where do we start?

We'd love to have your input. Please take this short survey and let us know what you think.

Survey: Code Climate vs Rubinius

If you have more you'd like to share, write us community@rubini.us.

Distributed Coding, Distributed Releases

In the last post, I talked about the new Rubinius versioning scheme. A version doesn't mean much to you if there's no release that goes with it. In this post, I'll describe the new release process we've been using.

What is a release, fundamentally? For Rubinius, it's a function from a version number to a commit SHA. A release is a git tag on master, and from this we can automatically derive the version number, the date of the release, and the commit SHA.

What a release isn't is a tarball, binary, or any other artifact. As such, I make a distinction between releasing and building a release artifact (or deploying, to borrow a word we typically use with SaaS). First we release, then we build a release artifact. This two-step process gives us the ability to correct errors early: we can easily delete a git tag and push a new one. Once a release is done, we build one or more artifacts and, in a sense, freeze the release in time.

This process solves a major coordination issue we had with our previous process. Previously, we had to commit code to the repository to make a release. That meant someone else could change what would be included in the release by racing on committing, or could push code that conflicted with your release commit. With a git tag, synchronization is automatic, and deleting a tag and pushing a new one when necessary is a very lightweight operation.

This process also gives anyone with commit rights to the Rubinius repository the ability to make a release. Making a release requires a human; we are not going to automate it. The second part, building the release artifacts, needs to be automated and we are working on that. For example, we'll build the binary version that you use on Travis to test your code with Rubinius as part of our Travis CI job. We'll also upload the release source tarball directly from Travis.

The new release process and automation of the release artifacts, in concert with the new version scheme, should help us accelerate getting Rubinius enhancements and features to you with a minimum amount of work. Additionally, making releasing something that any contributor can do will broaden participation in the project. We're really excited about these developments. If you have questions, let us know community@rubini.us.

MAJOR.MINOR: Maximize Delivering Features, Minimize Trouble

There are a lot of versioning schemes out there, from lax to strict to weird and everything in between. It's the stuff of endless debates and plenty of disappointment and unhappiness all around. That's a bit sad since a versioning scheme is about getting new stuff. Who doesn't want new stuff? Rubinius is switching to a MAJOR.MINOR version number scheme and I'd like to tell you how it works and why we're doing it.

At its core, a versioning scheme is part communication and part contract. The communication part is a signal from developers to users that an update is available. The contract part is an agreement about the impact of the update on the user.

Additionally, the versioning scheme exists in the context of two competing concerns: 1. the developer wants to deliver features and improvements; and 2. the user wants the highest stability for the least amount of effort on their part.

It seems to me that the conflict between these two is not well appreciated. Every user's needs are unique (even allowing for some equivalence classes), and trying to devise a scheme that meets the union of those is impossible. This is especially true as the scope of the software covered by a single version number increases (and hence why the small modules approach can look very attractive). Consequently, there's a tension between batching things up in big enough chunks to accommodate the users who update slowly and providing new features quickly to those that update often.

To complicate both these aspects, the software development industry seems to operate as if it's still putting bits on physical media, putting those into boxes, into trucks, and onto store shelves. Very slowly, we are moving away from this model and to one where a mostly invisible stream of improvements finds its way automatically to our devices and applications. Thank goodness.

So, what's in a Rubinius version number under this MAJOR.MINOR scheme? One part communication and one part contract:

  1. Communication: The MAJOR number designates an epoch, a period of time in the project's history typically marked by notable events or particular characteristics (adapted from the Apple dictionary). The MINOR number is a monotonically increasing number. If you are using Rubinius version A.N and there exists a version A.M, where M > N, you should upgrade.
  2. Contract: If you are using Rubinius version A.N, version A.N+1 will only: 1. remove a previously deprecated feature; or 2. add a deprecation warning. In other words, if you have no deprecation warnings, you can update to A.N+1 and expect no breaking changes.
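The contract can be stated as a small predicate (a hypothetical sketch; `safe_update?` is my name for it, not a Rubinius API):

```ruby
# Hypothetical sketch of the MAJOR.MINOR contract: within the same epoch,
# stepping from A.N to A.N+1 is expected to be safe when your code emits
# no deprecation warnings.
def safe_update?(current, target, deprecation_warnings: 0)
  cur_major, cur_minor = current.split(".").map { |part| Integer(part, 10) }
  tgt_major, tgt_minor = target.split(".").map { |part| Integer(part, 10) }
  cur_major == tgt_major &&
    tgt_minor == cur_minor + 1 &&
    deprecation_warnings.zero?
end

safe_update?("2.6", "2.7")                           # => true
safe_update?("2.6", "2.7", deprecation_warnings: 3)  # => false
```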

This versioning scheme helps us deliver features faster, which is one of my main goals. And it explicitly decouples Rubinius changes from your decision about when to upgrade. I provide a clear signal and a clear expectation about the impact of a new version. If you decide to update every 15 or 20 versions, you can do so in whatever way works best for you. You can jump ahead to the newest version and if that doesn't work, bisect version numbers just like you might bisect git commits to find an issue.

Finally, this scheme ties in well with providing landmarks for Rubinius development. At any point in time, there are the things we have learned but haven't yet turned into features, and the things we are still learning. I dislike roadmaps because they are mostly predictions, and we are collectively terrible at predictions. Focusing on landmarks gives us the opportunity to discuss direction without necessarily deciding on the path there. As we learn, we deliver features and improvements and learn more. Of course, this approach is possible with other versioning schemes, but I think the one described here is especially good for it.

As for landmarks, here are a couple: the last version of Rubinius 2.x will be 2.71828182 and the last version of Rubinius 3.x will be 3.14159265. That should be plenty of numbers to do some interesting stuff.

The new versioning scheme will start with the next release, version 2.6. We'll be working the issues out of the deprecations process, so please be patient with that. As always, we love to hear about your experiences, so drop us a note community@rubini.us.

Rubinius :heart: Gitter IM

Full disclosure: Gitter neither solicited nor reviewed this post, but they donate the Rubinius organization's Gitter account at no cost, so we have full access to private chat and history. We thank them for this sponsorship.

Gitter IM is the sweet spot of open source, GitHub-centered, near-realtime team communication. We've been using it for months on the Rubinius project, and I've been meaning to write this post for months because it was immediately obvious how good it is. Before we started using it, though, I didn't get it. I'm hoping this post will help you see why it is such a great tool.

The Need To Communicate

One of the central, critical aspects of an open source project is communication. We ask a question, share an idea, talk over some code, answer a question, coordinate work, and share general life experiences. Since we are distributed in time zones across the world, we need a balance point between communication that is synchronous (I ask a question and you immediately respond) and asynchronous (I ask a question and at some, undetermined, future time you respond).

Historically, IRC has been the medium for this sort of communication. Many people defend its use and some people are adamant that it's the only valid option for open source projects. Unfortunately, it has significant drawbacks for everyone involved in a project, experienced and beginner alike.

I'm comfortable in IRC and have been for years. But I pay for a VPS to maintain my IRC bouncer so I have history for all the channels I'm in even when I'm not physically online. I also need to maintain that server with security patches and upgrades.

There are alternatives to rolling your own IRC bouncer but the point is that there's a significant cost here. And most importantly, as a newcomer, you have to know something about IRC before you can ask your first question. Unless you are contributing to IRC software, I'm certain your question has nothing to do with IRC.

And that's why the limitations of IRC affect everyone on the project. Numerous times I've logged in to find a question from someone and when I go to reply, I notice they are no longer in the channel. Missed connection, missed opportunity. Bummer. That doesn't only happen with newcomers either. The number of experienced developers that I've interacted with for years on IRC who don't use a bouncer or service is high.

So, we have a vital need to communicate but just getting to the point of asking a question is a big hurdle. However, that's not where the difficulty ends. Even when someone has gotten on IRC, there are more challenges waiting. This is where Gitter really shines. It eliminates common barriers to communication and improves the context for the communication.

Barriers To Communication

On Gitter, your nick or handle is your GitHub user name. So simple; so useful. It may not seem like a big thing: you have a GitHub user name and you have an IRC user name and you created them at different times so they are often different. Big deal; who can't remember two names?

Well, me for one. I need to keep track of a lot of people, and not having to wonder what a person's GitHub user name is when I'm chatting with them on Gitter is a huge help. It's also a huge help for people who are new to the project and just getting oriented.

Think about this: when you go from one thing to two things, you've increased complexity by 100%. But as programmers, with our fancy loop constructs, we typically don't think twice (no pun intended) about multiplying by two. If you think that's not a huge deal, perhaps try asking for double your salary. In the "real world", multiplying by two is a big deal.

With Gitter, everyone is who they are on GitHub. And since we're working with code hosted on GitHub, that's an obvious, useful simplification.

Communicating In Context

Communication happens in a context. It's inseparable from that context. This is the key thing that Gitter gets right, and the thing that I didn't get until I created an account and started poking around.

Your GitHub organization and repositories have corresponding Gitter rooms. Boom! Did you see that? Where do you go to talk about Rubinius? gitter.im/rubinius/rubinius. Simple, direct, and (in retrospect) obvious.

These are some of the contexts we are already using:

  1. The Rubinius organization room (gitter.im/rubinius): This room is semi-private. It's available to everyone who is a member of Rubinius (over 100 people), but it's not visible to people outside the organization. This is a perfect place to hold organization-wide discussions.
  2. The Rubinius team room: This is a private channel accessible only to the Rubinius team members where we can have a safe space to discuss and share.
  3. The main Rubinius room (gitter.im/rubinius/rubinius): This is a public room where anyone with a GitHub account can stop by to ask questions, leave comments, or interact. We have integrations enabled so we can see when issues are opened, comments are made on issues, code is updated, and Travis CI results are posted.
  4. Individual private discussions: These are rooms where two people can hold a conversation in private.
  5. Other project rooms: These are public rooms corresponding to other Rubinius repositories. Rubinius is a complex project with many parts. Having a dedicated room to discuss a specific project (for instance, the Rubinius bytecode compiler) provides a good way to put communication in context and cut down on the cross-talk that happens when a project has a single IRC channel. (Most projects I've been involved with had at most two IRC channels.)
  6. Other private group rooms: These can be created as-needed by people leading specific parts of the Rubinius project. This is not being used extensively yet, but with the work we're doing to define project roles, this will be a very useful tool.

Of course, all of this is possible with IRC. But it requires specific, extra work. It's built-in with Gitter, and that is the best part of the service. It's not all of why it's good, but it's an essential aspect.

Another aspect of context is the consistency of the experience. The Gitter app and the website function equivalently (I'm explicitly excluding the Gitter IRC bridge here). With IRC, there is such a variety of clients that you don't necessarily know what the other person is seeing. With newcomers, this can hurt the ability to communicate well. With Gitter, I can better judge whether the way I'm communicating is helpful, because I can be reasonably sure they are seeing what I'm seeing.

Communicating About Code

Finally, since we're working on an open source project, a lot of the communication is about code. With GitHub flavo(u)red Markdown, communicating about code is really nice.

It seems like a simple thing, but it's not. For example, HipChat gets this horribly wrong. Communicating about code in HipChat is like using Notepad.exe (sorry, Notepad diehards) to program. It's possible, but only by stretching the meaning of the word beyond recognition.

We often make the assumption that code is concrete, but it's usually quite complex and being able to communicate about it well and visually is tremendously useful. And even when not communicating about code, the formatting available with Markdown makes communicating better and more enjoyable.

So, those are the reasons I think Gitter is awesome. Have you used it? Do you like or dislike it? What do you use instead and why do you like that? We'd love to hear from you: community@rubini.us.

Who's Using Ruby (or Not), for What, and Why?

If the only constant is change, as Heraclitus said, then we should constantly be trying to understand what's changing, why, and how it affects us.

The way we build, test, deploy, maintain, and support applications has changed a lot in the past five years. Containers, microservices, the growing number of connected devices (IoT, if you must), and the sorts of applications people are building with Ember, Angular, React and Meteor are big, not small, changes.

Containers, in particular, are intriguing because of how central they are becoming to much of what is changing about the application development landscape. At his recent keynote at MesosCon, Adrian Cockcroft showed the following slide when talking about adoption of Docker.

Adoption of Docker

How do Ruby and Rails fit into this rapidly changing landscape? If you search Google for "what companies use Ruby or Rails", you won't find much specific detail about what people are doing with Rails. How does Rails fit into a containerized microservices environment? How are people building services with Rails, or are they? I'm curious about this, and I think a lot of the new developers learning Ruby in this changing environment would be as well.

So, I've put together a really brief survey; I'd love to hear more about why people are using Ruby, or why they are not.

Survey: Who's Using Ruby (or Not), for What, and Why?

Based on the responses I get, I'd also like to do a few Google hangouts with people on both sides of the question. It's one thing to go to a conference and hear an advocate for Ruby tell you why it's a good choice. It's perhaps more informative and entertaining to watch a couple of knowledgeable people debate its merits, drawing from specific experience working with Ruby or competing technologies.

Ruby's support for rapid development and Rails' emphasis on convention over configuration both seem well-suited to the world of containers and microservices. At the same time, we've seen a lot of public examples of companies switching away from them to scale their applications and infrastructure. I'm curious what the people we haven't heard from are doing. Let's look at some data on that.
