Rubinius Compute - Programming for the Internet

Today, I'm announcing Rubinius Compute, a platform for computation inspired by AWS Lambda that builds on the Rubinius language platform.

We see two major trends converging: First, more devices are being connected and those devices need to communicate, driving a constant rise in network use. Second, more data is being created, processed, and stored.

The problem is, there's a massive asymmetry in the network. A relatively tiny number of "smart" nodes do all the work and receive most of the communication, while an enormous number of nodes mostly just add load to the network.

The solution requires changing the way we build apps, moving towards distributed, resilient networks of collaborating nodes where computation and data co-exist.

IPFS is working on distributing content. We need to pair this with distributing computation. To do so, we need to move beyond the dominant abstraction of "the cloud" as a remote "desktop computer", complete with an operating system, libraries and packages, disk drives, and so on.

Rubinius Compute is a foundation for building apps in a familiar way, leveraging Rubinius Analyst to understand and evolve them, and distributing them to the network without messing with irrelevant details from leaky "system" abstractions.

None of these ideas are new: distributed content, raw compute nodes without operating system abstractions, and apps as distributed networks of collaborating agents. But combined, they represent the most important shift yet seen in how we use computation.

Let's build the future.

Rubinius Analyst - Know Your App

Today, I'm announcing Analyst, the first product from Rubinius, Inc, building on the technology in the Rubinius language platform.

Analyst is a tool for anyone who has struggled to understand, evolve, transform, and scale an application.

It is an alternative to paying the significant cost to rewrite an application that has grown too large and complex to confidently update as quickly as business innovation now requires.

During the past 5-10 years, the world of programming has changed significantly. Languages like Clojure, Go, Rust, and Scala are commanding ever greater mind-share, and JavaScript has pushed into new territory, both within the browser and outside it. Programmers are not just switching languages, they are also changing the way they build applications.

At the same time, continuous integration, continuous deployment, and infrastructure automation are radically changing the way businesses deliver value to customers. These changes are also changing the way we build applications.

The next 5-10 years will make these past years seem tame. Tens of billions more devices are expected to be connected to the Internet in the next few years. The most important fact about all these devices is that they will be communicating with other devices. Distributed applications and microservices will be the default architecture. This evolution is already well underway.

Ruby faces some challenges if it is to remain useful in this changing environment. I'll post results soon from the recent surveys Who's Using Ruby? and What's your biggest pain point with Ruby? Some companies have elected to meet these challenges by rewriting all or significant portions of their apps in different languages.

One major goal of Analyst is to preserve the investment in your existing apps while helping you focus on the most important areas to improve as business requirements change.

Analyst enters a well-established market where companies like NewRelic, Sentry, Code Climate, and Coveralls already exist. This represents both a challenge and a great opportunity.

I'm announcing Analyst now to invite you along on a collaborative and participatory journey to explore and tackle the biggest problems people have with existing apps, and build the tools that are essential for a new generation of apps, where developers must respond in seconds and minutes instead of days and weeks.

As Analyst is a tool that builds on Rubinius features, expanding adoption of Rubinius will be an ongoing focus. As mentioned in the Rubinius, Inc announcement, I'm excited to be able to work directly with customers to identify and solve problems they have with apps and processes.

Please take a few minutes to let us know which Analyst features would help you the most:

Rubinius, Inc - A Benefit Company

I'm excited to share with you that I've formed Rubinius, Inc, an Oregon corporation designated as a "benefit company", to focus on building excellent programming language tools and sustaining the development of the Rubinius platform.

You may be wondering, What is a benefit company?

Generally, it is a statutory designation that allows the business to consider the general social benefit and the environment when making decisions, rather than only focusing on revenue impact. The designation does not affect the tax structure for the business, and typically requires the business to publish an annual transparency report about the company's impact on society and the environment.

I elected to form a benefit company, rather than a foundation or non-profit, because I believe it provides the lowest administrative overhead while building a sustainable economic model to support Rubinius as an open source project. A portion of the shares of Rubinius, Inc are dedicated solely to sustaining Rubinius.

Open source is the future, but we need to innovate around the economics of open source. Many very profitable companies enjoy tremendous cost savings and generate significant revenue by building on the free contributions of many, many people. At the same time, many software-related businesses have chosen a predatory, adversarial relationship with their customers, locking them into proprietary technologies and milking them with support contracts.

But we are not limited to these existing structures. We can imagine a more collaborative future where companies have an array of opportunities to contribute financially and customers enjoy the flexibility to pursue the value they need within their own individual circumstances. This is not far-fetched. We are already seeing this evolution in the rapidly developing "cloud" technologies. Exploring this new, better, world of business is an important focus of Rubinius, Inc.

Rubinius, Inc will enable me to focus on two things that have become very important to me over the past nine years while working on Rubinius: helping to develop the Rubinius community, and working directly with customers to help them solve problems as they build and grow.

There is much more to come, but I'll leave you with two things for now:

  1. I have a new email address and I'd love to hear from you:
  2. Please take two minutes to tell me: What is your biggest pain point with Ruby?

Thank you!

Where You Get Your Ruby News: The Top Five

Recently, I was curious where people get news about Ruby these days. Lacking funds for a proper scientific poll, I sent a quick email to the Rubinius mailing list. A little more than two hundred people responded and I'm sharing some results below. Many thanks to everyone who took the time to respond!

(Note that I'm not making any assertions regarding whether these results are representative, accurate, or useful. I found them interesting and thought you may, too.)

If you're familiar with Ruby, it may come as no surprise that Peter Cooper's mighty media empire, and Ruby Weekly in particular, soundly takes first place. What surprised me is that Twitter scored so highly, above Hacker News and Reddit among the responders. Here is a graph of the top five sources cited by percentage of people (the percentages sum to greater than one hundred because many people listed more than one).

Top 5 sources for Ruby news

The other aspect I looked at was how many sources people listed. Again, it was a surprise to me that so many listed a single source. Granted, something like Ruby Weekly, Hacker News, or Twitter aggregates a lot of content, but it was still interesting. Also note that this was difficult to calculate because the response field was completely unformatted. I've made some assumptions in parsing it but I think the results are mostly accurate. Here's a graph of the number of sources people cited, by percentage.

Number of sources cited

Beginners and seasoned Ruby programmers alike need a constant stream of news and discussion to help them learn and grow as developers. It was sad to see one person answer with the following:

At this point ruby feels dead I don't actively track it.

That was definitely the bleakest response, but a few people cited not being as involved with Ruby. There's a natural ebb and flow of interest, but if a few people are willing to say it, there are certainly many more who feel it. It's a reminder that sharing the interesting things we're doing with Ruby helps everyone, including people we don't even know.

If you're interested in adding your voice, take two minutes to fill out the survey:

Survey: Where do you get your Ruby news

A Small Case Study in Threading

A recent project required me to process hundreds of CSV (comma-separated values) files and import them into a database. It's boring work, but it offers a few opportunities for me to utilize some of my favorite Ruby techniques and idioms.

The first version of the script was very simple.

  1. Copy the zip file containing the CSVs to a working directory
  2. Unzip the file
  3. Delete unwanted files (e.g. README)
  4. Loop through each filename
  5. Convert the CSV lines where necessary (i.e. numbers should be Integer or Float instead of defaulting to String, do the same for dates)
  6. Insert the converted data into the database
  7. Delete the files as each completes, then go back to Step 1 for the next zip
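A minimal sketch of that first version might look like the following. The working directory, conversion rules, and `db` object are placeholders standing in for the real production script:

```ruby
require "csv"
require "fileutils"

# Hypothetical sketch of the single-process import loop described in
# Steps 1-7. The zip handling and db object are stand-ins.
WORK_DIR = "work"

def convert(row)
  # Step 5: coerce numeric-looking fields; everything else stays a String.
  row.map do |field|
    case field
    when /\A-?\d+\z/      then Integer(field)
    when /\A-?\d+\.\d+\z/ then Float(field)
    else field
    end
  end
end

def import(zip_path, db)
  FileUtils.mkdir_p(WORK_DIR)
  FileUtils.cp(zip_path, WORK_DIR)                                  # Step 1
  system("unzip", "-o", File.join(WORK_DIR, File.basename(zip_path)),
         "-d", WORK_DIR)                                            # Step 2
  Dir[File.join(WORK_DIR, "README*")].each { |f| File.delete(f) }   # Step 3

  Dir[File.join(WORK_DIR, "*.csv")].each do |name|                  # Step 4
    CSV.foreach(name) do |row|
      db.insert(convert(row))                                       # Steps 5-6
    end
    File.delete(name)                                               # Step 7
  end
end
```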

I originally ran this under MRI and was getting approximately 3000 inserts per second on average. I did a quick back-of-the-envelope calculation and determined that the entire import would require about 15 days. Uh oh.

Ultimately I decided to write two scripts. The first script would handle Steps 1-4 and 7. The second script would handle Steps 5-6. Script One would launch Script Two as a subprocess via IO.popen and send filenames to Script Two's STDIN.
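The plumbing between the two scripts is straightforward. Here is a rough sketch of the IO.popen handoff; Script Two is shown inline via `ruby -e` so the example is self-contained, but in the real setup it would be a separate file reading filenames from its STDIN:

```ruby
# Sketch of the Script One / Script Two split. The inline script and the
# sample filenames are illustrative stand-ins.
script_two = 'STDIN.each_line { |line| puts "processing #{line.chomp}" }'

filenames = %w[2015-01.csv 2015-02.csv] # normally produced by Steps 1-4

output = IO.popen(["ruby", "-e", script_two], "r+") do |pipe|
  filenames.each { |name| pipe.puts(name) }
  pipe.close_write # EOF tells Script Two that no more filenames are coming
  pipe.read
end

puts output
```

Closing the write end of the pipe is what lets Script Two's `each_line` loop terminate cleanly once the last filename is consumed.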

The real fun was in Script Two. Upon receiving a filename, it divides the file into multiple subfiles of equal size and generates an IO handle for each temp file. These IO handles are then passed to their own threads which do the processing described in Step 5. These threads batch up 10_000 lines and pass this bulk array to a dedicated database insertion thread which handles the bulk inserts. All inter-thread communication occurs through a SizedQueue. I chose a SizedQueue to provide backpressure to the parsing threads if they get too far ahead of the database thread.
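The producer/consumer structure can be sketched like this. The IO handles and database are stand-ins (StringIO chunks and an array), and the batch size and queue depth are scaled down for illustration; the real script batches 10_000 lines:

```ruby
require "csv"
require "stringio"

BATCH_SIZE = 3   # 10_000 in the real script; tiny here for illustration
DONE = :done

# Stand-ins for the real subfile IO handles and the database.
io_handles = [
  StringIO.new("1,a\n2,b\n3,c\n4,d\n"),
  StringIO.new("5,e\n6,f\n")
]
inserted = []    # plays the role of the database

# The queue depth is an assumption; a full queue blocks the parser
# threads, providing backpressure when they outrun the database thread.
queue = SizedQueue.new(4)

# Parser threads: one per subfile IO handle (Step 5).
parsers = io_handles.map do |io|
  Thread.new do
    batch = []
    io.each_line do |line|
      batch << CSV.parse_line(line)
      if batch.size >= BATCH_SIZE
        queue.push(batch)          # blocks when the queue is full
        batch = []
      end
    end
    queue.push(batch) unless batch.empty?
  end
end

# Dedicated insertion thread: the only writer to the database (Step 6).
inserter = Thread.new do
  while (batch = queue.pop) != DONE
    inserted.concat(batch)         # db.bulk_insert(batch) in the real script
  end
end

parsers.each(&:join)
queue.push(DONE)                   # sentinel: no more batches coming
inserter.join

puts inserted.size                 # prints 6
```

Funneling all inserts through a single thread also keeps the database connection out of the parsing threads entirely, which sidesteps any connection thread-safety questions.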

Before spending any more time to optimize this work, I decided to benchmark the CSV parsing of a small file under MRI and Rubinius (JRuby is a whole different story worthy of its own post at the JRuby blog). The results are below.

GuestOSX:options_database cremes$ ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV                7.910000   0.010000   7.920000 (  7.924954)
------------------------------------------------- total: 7.920000sec

                             user     system      total        real
parse CSV                8.040000   0.020000   8.060000 (  8.053098)

GuestOSX:options_database cremes$ chruby rbx
GuestOSX:options_database cremes$ ruby -v
rubinius 2.5.8 (2.1.0 bef51ae3 2015-09-24 3.5.1 JI) [x86_64-darwin14.5.0]
GuestOSX:options_database cremes$ ruby benchmarks.rb 
Rehearsal ----------------------------------------------------------
parse CSV               16.264571   0.161624  16.426195 ( 10.562584)
------------------------------------------------ total: 16.426195sec

                             user     system      total        real
parse CSV                9.084859   0.033108   9.117967 (  9.010402)

The test was single threaded. Looking at the "real" column, MRI is fastest, taking about 8 seconds to parse 50_000 lines of my test data. Rubinius came in second place at about 9 seconds for the same.
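The "Rehearsal" blocks in the output come from Benchmark.bmbm, which runs each report once to warm up before timing it for real. The benchmark has roughly this shape; the generated lines here are a stand-in for my actual test data:

```ruby
require "benchmark"
require "csv"

# Stand-in data: the real benchmark parses 50_000 lines of
# production-like CSV.
LINES = 50_000
data = LINES.times.map { |i| "#{i},#{i * 1.5},2015-09-#{(i % 28) + 1}" }.join("\n")

Benchmark.bmbm do |bm|
  bm.report("parse CSV") do
    CSV.parse(data)
  end
end
```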

Running the test again with 4 threads each (on an 8-core machine) was enlightening.

GuestOSX:options_database cremes$ ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
GuestOSX:options_database cremes$ ruby benchmarks_multi.rb 
Rehearsal -------------------------------------------------------------
parse CSV - multithreaded   7.870000   0.100000   7.970000 (  7.962155)
---------------------------------------------------- total: 7.970000sec

                                user     system      total        real
parse CSV - multithreaded   7.930000   0.090000   8.020000 (  8.012822)

GuestOSX:options_database cremes$ chruby rbx
GuestOSX:options_database cremes$ ruby -v
rubinius 2.5.8 (2.1.0 bef51ae3 2015-09-24 3.5.1 JI) [x86_64-darwin14.5.0]
GuestOSX:options_database cremes$ ruby benchmarks_multi.rb 
Rehearsal -------------------------------------------------------------
parse CSV - multithreaded  20.991651   0.720063  21.711714 (  4.663789)
--------------------------------------------------- total: 21.711714sec

                                user     system      total        real
parse CSV - multithreaded  14.130549   0.136832  14.267381 (  2.984456)

MRI ran the 4-thread benchmark in the same 8 seconds as before! We are often reminded that MRI now maps its threads to native threads, but there is still a global interpreter lock (GIL) that prevents MRI from truly running code in parallel. Rubinius eliminated its GIL years ago, so all threads can run in parallel and produce a finishing time of just over 3 seconds.
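The multithreaded variant simply splits the data into one chunk per thread and parses the chunks concurrently, roughly like this (again with generated stand-in data). Under MRI the GIL serializes the parsing; under Rubinius the four threads actually run in parallel:

```ruby
require "benchmark"
require "csv"

THREADS = 4
LINES = 50_000  # stand-in data, as in the single-threaded sketch
rows = LINES.times.map { |i| "#{i},#{i * 1.5}" }

# Split the lines into equal chunks, one per thread.
chunks = rows.each_slice(LINES / THREADS).map { |slice| slice.join("\n") }

Benchmark.bmbm do |bm|
  bm.report("parse CSV - multithreaded") do
    chunks.map { |chunk| Thread.new { CSV.parse(chunk) } }.each(&:join)
  end
end
```

Note that a runtime with a GIL still benefits from threads when the work is IO-bound; it is CPU-bound work like this parsing that exposes the difference.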

With these improvements, Rubinius can finish my production job in about 5.5 days (versus the original 15 days). The CSV parsing work runs faster than the database can accept bulk inserts so, unless I want to spend a bunch of time optimizing the database configuration, my work is done. Thanks to Rubinius, I am saving about 9.5 days on my import.

To reproduce these numbers on your own system, the benchmarks and test data can be found here.

Code Climate vs Rubinius

It is difficult to understand the behavior of a program written in a dynamic language, like Ruby, without running the program. While static analysis tools, like Code Climate, can tell us a fair amount about the code, there's still a lot more they can't tell us.

Wouldn't it be nice if the system running our program could tell us about what the code is doing while it's running? Rubinius can do this.

While a program is running, there are two graphs interacting. The first is the graph of functions (or methods) as they call one another. The second is the graph of data objects that the functions create or operate on.

In Rubinius, these two graphs intersect at the inline cache objects. Wherever there is a method call site in your Ruby program, Rubinius creates a special object at that site that records both the class of the receiver and the method that was invoked. These simple Ruby objects record the graph of methods called in your program. From this graph, we can analyze all kinds of actual behavior of your code.
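To make the idea concrete, here is a toy model, in plain Ruby, of what such a call-site record captures. This is purely illustrative; Rubinius's real inline caches live inside the VM and look nothing like this class:

```ruby
# Illustrative model only: a toy "inline cache" that records, per call
# site, which (receiver class, method) pairs were actually seen at runtime.
class CallSiteCache
  attr_reader :seen

  def initialize(method_name)
    @method_name = method_name
    @seen = Hash.new(0)
  end

  # Record the receiver's class, then forward the call.
  def invoke(receiver, *args)
    @seen[[receiver.class, @method_name]] += 1
    receiver.public_send(@method_name, *args)
  end
end

cache = CallSiteCache.new(:to_s)
cache.invoke(42)
cache.invoke(:sym)
cache.invoke(42)

p cache.seen  # the (class, method) pairs this call site actually saw
```

Aggregating these records across every call site yields the runtime call graph: not what the code could do, but what it actually did.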

This sounds awesome, doesn't it? When was the last time you wondered how a bit of Ruby code in your program was interacting with other parts of the code? If you're like me, that's every time I'm writing Ruby code. So, how can Rubinius help? That's the problem: we don't have the tools you need right now.

But I want to fix that. The question is, how? I need your help to decide.

Recently, Code Climate announced they were releasing their platform as open source, "the first open and extensible platform for static analysis". One possibility to leverage the ability that Rubinius has to help you understand your code is to integrate with the Code Climate platform.

Another possibility is to create a stand-alone Rubinius service that would start with some simple, runtime analysis of your code, but could expand in many different dimensions, showing data-flow, security analysis, performance, and many facets of application analysis beyond what is possible with simple static analysis.

I want to emphasize that these two options are only superficially similar. The facilities that we have in Rubinius, and continue to expand and improve, can provide far greater depth of analysis than that possible with static analysis. So, the question is really, where do we start?

We'd love to have your input. Please take this short survey and let us know what you think.

Survey: Code Climate vs Rubinius

If you have more you'd like to share, write us.