Better access to runtime information from Ruby

Rubinius has always tried to provide a lot of information to you as a developer about runtime information. Things like the variable scope and constant scope are already available. In this post I'd like to describe a small new API that is available for even more information.

Inline caching

One of the important mechanisms in making your Ruby code fast is doing inline caching. Inline caching means that we store the result of a method lookup a specific call site.

So what is a call site you might ask? Well, that's every place in your code where a method is called. So take for example the following snippet.

1 class Foo
2   def bar
3     @foo.quz
4   end
5 end

Here @foo.quz is a call site, so the place where quz is called. What we do here is cache the type of @foo and store which method to execute for this type.

The fact that inline caching works well is based on the idea that at runtime for most call sites the receiver type never changes. So in this case it means that @foo is always of the same type. This is called a monomorphic call site.

Of course there are cases where this reasoning doesn't apply. If we see multiple types at a certain call site, it's called a polymorphic call site. For this we store things in a different object and Rubinius currently stores up to 3 different types in that case.

It might be surprising to some, but these cache objects are also just Ruby objects. This means that they can be used just like any other Ruby object and be inspected for example. So there are objects for an empty call site, for monomorphic and also polymorphic sites.

There is also a specialized site for places where we call respond_to? since that happens often enough to warrant a special case. In the future there might be new types introduced such as a specific type for send calls.

Accessing inline caches stored at call sites

So now that we have described all this, how do we get access to this information? It is actually pretty straightforward and builds on other things already available.

Under the hood, all Ruby code in Rubinius gets translated into a CompiledCode object, which also provides the API for getting access to inline caches. One way to get access to the CompiledCode object is to grab it through Ruby's Method API.

 1 class Foo
 2   def bar
 3     @foo.quz
 4   end
 5 end
 6 
 7 f = Foo.new
 8 compiled_code = f.method(:bar).executable
 9 compiled_code.call_sites.each do |cs|
10   p cs # #<Rubinius::CallSite:0xXX script.rb:3#quz(0)>
11 end

What you can see here is that we have one call site located at script.rb on line 3 (which is the class definition). Currently it's of the type CallSite, which means that it has a basic structure in place with no additional information because this method was never called. This can also be inferred from the (0) at the end which indicates the number of calls through this object.

So if we make sure we call this at least once, we should be able to see this information.

 1 class Foo
 2   def bar
 3     @foo.quz
 4   end
 5 end
 6 
 7 f = Foo.new
 8 begin
 9   f.bar
10 rescue
11 end
12 
13 compiled_code = f.method(:bar).executable
14 compiled_code.call_sites.each do |cs|
15   p cs # #<Rubinius::MonoInlineCache:0xXX script.rb:3#quz(1)
16        #   receiver_class=NilClass stored_module=Kernel
17        #   method=#<Rubinius::CompiledCode method_missing file=kernel/delta/kernel.rb>>
18 end

What we do here is actually run this inside a begin / rescue clause because the call will fail because @foo being nil. But this will also show that we can actually cache that this will run the method_missing method and that we cache that.

What you can see here is that we have a MonoInlineCache, because we've only seen one type for @foo, which is nil so the receiver class is NilClass. The stored_module is the module the saved method is in, which in this case is Kernel. The actual method stored is the method_missing method here. We can also see it has been called 1 time so far.

Caveats

Currently you get these objects back and you can actually modify them. Modifying them at the moment would actually result in undefined behavior. You can't do this easily though since they only provide accessor readers, but since it's Ruby you can of course hack around with instance_variable_set for example.

You can probably do really crazy stuff with this, if it crashes the vm it's probably a bug, otherwise you're pretty much on your own when hacking this.

Another thing is that the counters are not guaranteed to be accurate. This can mainly happen when the JIT kicks in. The JIT doens't update counters if it inlines methods so that way the method could be called more often than you'd think based on the counter. If it's zero though, you can be sure it has been never called since it will always go through the interpreter path first.

Future

So besides for fun and crazy, what is this actually useful for? Well, first of all we already use it to actually test in Ruby whether our inline caches work properly. This made testing this a lot easier and we can easily verify that an empty call sites turns into a monomorphic cache and subsequently in a polymorphic one if needed.

Another reason this can be very useful is for building better tools. With this information, it would be possible to create better refactoring tools for example. The information provided here gives you insight into which methods are bound where and could help perhaps in renaming things.

It would also be interesting to gather this information from your runtime application, so you could see which code does and doesn't get executed in your production environment. All these things are just ideas, but API's like the one discussed here make these future developments possible and allow for more insight in your Ruby code.

So play with and let us know what you've done with it!

Profiling JIT-ted Ruby code with OProfile

You can now profile JIT-ted Ruby code!

Before:

samples  %        linenr info                 app name                 symbol name
497      28.7117  instructions.cpp:82         rbx                      rubinius::MachineCode::interpreter(rubinius::State*, rubinius::MachineCode*, rubinius::InterpreterCallFrame*)
460      26.5742  (no location information)   anon (tgid:4118 range:0xb581c000-0xb589c000) anon (tgid:4118 range:0xb581c000-0xb589c000)
  # => The dark world of JIT-ted Ruby code
158       9.1277  jit_util.cpp:787            rbx                      rbx_set_local_depth
113       6.5280  block_environment.cpp:365   rbx                      rubinius::BlockEnvironment::call(rubinius::State*, rubinius::CallFrame*, rubinius::Arguments&, int)
102       5.8925  (no location information)   rbx                      __x86.get_pc_thunk.bx
93        5.3726  jit_util.cpp:848            rbx                      rbx_push_local_depth
71        4.1017  variable_scope.cpp:108      rbx                      rubinius::VariableScope::set_local(int, rubinius::Object*)
68        3.9284  block_environment.cpp:62    rbx                      rubinius::BlockEnvironment::invoke(rubinius::State*, rubinius::CallFrame*, rubinius::BlockEnvironment*, rubinius::Arguments&, rubinius::BlockInvocation&)
52        3.0040  variable_scope.cpp:126      rbx                      rubinius::VariableScope::get_local(int)
47        2.7152  variable_scope.cpp:100      rbx                      rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
45        2.5997  kind_of.hpp:153             rbx                      bool rubinius::kind_of<rubinius::Fiber>(rubinius::Object const*)
24        1.3865  shared_state.cpp:295        rbx                      rubinius::SharedState::checkpoint(rubinius::ManagedThread*)

After:

samples  %        linenr info                 app name                 symbol name
3468     27.0388  instructions.cpp:82         rbx                      rubinius::MachineCode::interpreter(rubinius::State*, rubinius::MachineCode*, rubinius::InterpreterCallFrame*)
3048     23.7642  loop.rb:7                   5060.jo                  _X_Object#forever$block@1
  # => Wow, the name of JIT-ted Ruby code is shown along side C++ function symbols!
1044      8.1397  jit_util.cpp:848            rbx                      rbx_push_local_depth
947       7.3834  (no location information)   rbx                      __x86.get_pc_thunk.bx
916       7.1417  jit_util.cpp:787            rbx                      rbx_set_local_depth
781       6.0892  block_environment.cpp:365   rbx                      rubinius::BlockEnvironment::call(rubinius::State*, rubinius::CallFrame*, rubinius::Arguments&, int)
639       4.9821  block_environment.cpp:62    rbx                      rubinius::BlockEnvironment::invoke(rubinius::State*, rubinius::CallFrame*, rubinius::BlockEnvironment*, rubinius::Arguments&, rubinius::BlockInvocation&)
421       3.2824  variable_scope.cpp:108      rbx                      rubinius::VariableScope::set_local(int, rubinius::Object*)
389       3.0329  variable_scope.cpp:126      rbx                      rubinius::VariableScope::get_local(int)
326       2.5417  variable_scope.cpp:100      rbx                      rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
288       2.2454  shared_state.cpp:295        rbx                      rubinius::SharedState::checkpoint(rubinius::ManagedThread*)
283       2.2065  kind_of.hpp:153             rbx                      bool rubinius::kind_of<rubinius::Fiber>(rubinius::Object const*)

In short, you can tell how the machine code that the Rubinius JIT generates performs.

Not satisfied yet? Even annotated profile is supported! That means you can even know how much time it spends on each line of Ruby code or even on each CPU instruction:

/* 
 * Total samples for file : "/tmp/loop.rb"
 * 
 *   5658 19.9436
 */


   718  2.5308 :def increment(i)
   482  1.6990 :  i + 1
               :end
               :
               :def forever
               :  i = 0
  2933 10.3384 :  loop do /* _X_Object#forever$block@1 total:   6990 24.6387 */
  1525  5.3754 :    i = increment(i)
               :  end
               :end
               :
               :forever

We're using OProfile, a profiling software. In this blog post, I'll show you how to profile using it!

What's OProfile?

It's a very useful profiling tool available on Linux. It's a sampling-based one. That means there is absolutely no change to Rubinius and your Ruby code to profile. Just run it as you normally do. Also, the overhead of profiling is minimal.

This is contrasted to measuring-based profiling. Rubinius's built-in profiler and ruby-prof are both examples of measuring profilers. This means the overhead is much bigger and can skew results because of that.

Basically, it works by reporting how many individual profiled items are sampled compared to the overall total samples. It doesn't measure elapsed time. It's much like top command's individual processes' CPU % usage with far greater flexibility. The actual profiled items can be any of C/C++ libraries, C/C++ functions, C/C++ source code lines, or CPU instructions.

So, OProfile can't usually profile Ruby code because it works on CPU instruction level. From OProfile, it can't tell which Ruby source code line Rubinius currently executing from the CPU instructions. However, it can profile JIT-ted Ruby code because Rubinius compiles it very down into the CPU instructions by definition.

OProfile works as a Linux kernel module. So, it's supported only for Linux. Sadly, Ubuntu's OProfile and LLVM have bugs relating to this feature. Apparently, there is no one using this. So you can say, we are really on the cutting edge. ;)

Anyway, we must work around these problems. But how? There are a few options. :)

Setup (PPA: the super simple way; Ubuntu 12.10 only)

I prepared a PPA just for you. Add it to your system. To be specific, run this:

$ sudo add-apt-repository ppa:ryoqun/ppa
$ sudo apt-get update
$ sudo apt-get install oprofile llvm-3.1
$ sudo apt-get dist-upgrade # Upgrade preinstalled libllvm3.1 to the PPA

By default, Rubinius doesn't use system-provided LLVM, so re-configure Rubinius to use it and re-build Rubinius:

$ cd /path/to/rubinius-git-repository
$ rake clean
$ ./configure --llvm-config llvm-config-3.1
$ rake

Done!

Setup (manual build: the hard way)

If you really want to build LLVM and OProfile manually or if you're using an other Linux distribution and the distribution doesn't provide OProfile-enabled LLVM packages, manually build LLVM and OProfile like this:

(I tested this on Ubuntu 12.10. Minor adjustments may be needed to build on your environment)

Build and Install OProfile:

$ sudo apt-get build-dep oprofile # do equivalent thing on your distro.
$ cd /path/to/working-dir-to-build-things
$ wget http://prdownloads.sourceforge.net/oprofile/oprofile-0.9.8.tar.gz
$ tar -xf oprofile-0.9.8.tar.gz
$ cd oprofile-0.9.8
$ ./autogen.sh
$ ./configure --prefix /usr # LLVM has a build issue. So install to /usr
$ make
$ sudo make install
$ adduser oprofile # this is needed for profiling JITted code
$ opreport --version
  # => opreport: oprofile 0.9.8 compiled on Mar  8 2013 00:57:08

Build LLVM with OProfile support enabled and rebuild Rubinius:

$ sudo apt-get build-dep llvm # do equivalent thing on your distro.
$ cd /path/to/working-dir-to-build-things
$ wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
$ tar -xf llvm-3.2.src.tar.gz
$ cd llvm-3.2.src
$ ./configure --enable-optimized --disable-assertions --with-oprofile
$ make
$ sudo make install

If the compilation of OProfileWrapper.cpp fails like this:

OProfileWrapper.cpp: In member function ‘bool llvm::OProfileWrapper::checkForOProfileProcEntry()’:
OProfileWrapper.cpp:141:62: error: ‘read’ was not declared in this scope
OProfileWrapper.cpp:142:24: error: ‘close’ was not declared in this scope

Apply this patch, then, re-make and continue:

diff --git a/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp b/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp
index d67f537..7c0d395 100644
--- a/llvm-3.1-3.1/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp
+++ b/llvm-3.1-3.1/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp
@@ -29,6 +29,7 @@
 #include <dirent.h>
 #include <sys/stat.h>
 #include <fcntl.h>
+#include <unistd.h>

 namespace {

Phew, finally rebuild Rubinius!:

$ cd /path/to/rubinius-git-repository
$ rake clean
$ rm -rf vendor/llvm # If you build Rubinius with vendorized LLVM.
$ ./configure
$ rake

Start OProfile

OProfile can be configured by a command called opcontrol, not by a configuration file, which are common.

$ sudo opcontrol --deinit
$ sudo modprobe oprofile timer=1      # Needed only inside VirtualBox VMs
$ echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
$ sudo opcontrol --no-vmlinux
$ sudo opcontrol --separete all
$ sudo opcontrol --start
$ sudo opcontrol --dump               # Flushes buffered raw profile data
$ opreport --merge all --threshold 1  # Prints the report of profile

Setup is complete if you see an output from opreport like this:

CPU: CPU with timer interrupt, speed 1858.39 MHz (estimated)
Profiling through timer interrupt
          TIMER:0|
  samples|      %|
------------------
    92127 67.6157 no-vmlinux
    21920 16.0880 rbx
     7950  5.8348 libc-2.15.so
     4068  2.9857 runner
     3429  2.5167 libstdc++.so.6.0.17
     2139  1.5699 cc1
     1416  1.0393 vm

Congratulations!

Run Ruby code!

To annotate Ruby code correctly, your current directory must be the top directory of the Rubinius git repository:

$ cd /path/to/rubinius-git-repository
$ ./bin/benchmark ./benchmark/core/hash/bench_access.rb

Generate profile report

Let's check the profile report of the above benchmark.

NOTE: try to run opcontrol --dump at least once while running Rubinius if JIT-ted Ruby code doesn't show in the profile report. OProfile automatically dumps periodically, but if your Ruby process terminates too quickly, you should manually run opcontrol --dump while it's still alive.

$ sudo opcontrol --dump && sudo opjitconv /var/lib/oprofile/ 0 0
$ opreport --merge all --threshold 1 image:./bin/rbx --symbols --debug-info \
    > /tmp/hash-access-symbols
$ opannotate --merge all --threshold 0.5 image:./bin/rbx --source \
    > /tmp/hash-access-source

I'll omit but you can generate profile report of annotated assembly by passing --assembly to opannotate instead of --source.

Reset current profile

By default, OProfile keeps the profile data indefinitely. To reset it, run this:

$ sudo opcontrol --reset

How to read the profile result

Open /tmp/hash-access-symbols, the content should be like this:

CPU: CPU with timer interrupt, speed 1860.42 MHz (estimated)
Profiling through timer interrupt
samples  %        linenr info                 app name                 symbol name
3571     31.5990  (no location information)   no-vmlinux               /no-vmlinux
619       5.4774  bench_access.rb:16          8295.jo                  _X_Object#__block__$block@7
599       5.3004  bench_access.rb:24          8295.jo                  _X_Object#__block__$block@11
531       4.6987  bench_access.rb:8           8295.jo                  _X_Object#__block__$block@1
457       4.0439  (no location information)   rbx                      __x86.get_pc_thunk.bx
318       2.8139  jit_util.cpp:787            rbx                      rbx_set_local_depth
298       2.6369  bench_access.rb:42          8295.jo                  _X_Object#__block__$block@20
277       2.4511  object.cpp:362              rbx                      rubinius::Object::hash(rubinius::State*)
236       2.0883  jit_primitives.cpp:4494     rbx                      jit_stub_object_hash
236       2.0883  tuple.cpp:36                rbx                      rubinius::Tuple::put(rubinius::State*, int, rubinius::Object*)
222       1.9644  inline_cache.cpp:696        rbx                      rubinius::InlineCache::check_cache_poly(rubinius::State*, rubinius::InlineCache*, rubinius::CallFrame*, rubinius::Arguments&)
197       1.7432  jit_util.cpp:848            rbx                      rbx_push_local_depth
187       1.6547  variable_scope.cpp:126      rbx                      rubinius::VariableScope::get_local(int)
179       1.5839  compiledcode.cpp:212        rbx                      rubinius::CompiledCode::specialized_executor(rubinius::State*, rubinius::CallFrame*, rubinius::Executable*, rubinius::Module*, rubinius::Arguments&)
162       1.4335  hash.rb:220                 8295.jo                  _X_Hash#key_index@15
147       1.3008  hash.rb:325                 8295.jo                  _X_Hash#values_at$block@13
143       1.2654  object.cpp:400              rbx                      rubinius::Object::hash_prim(rubinius::State*)
137       1.2123  (no location information)   libc-2.15.so             /lib/i386-linux-gnu/libc-2.15.so
127       1.1238  variable_scope.cpp:100      rbx                      rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
111       0.9822  inline_cache.cpp:576        rbx                      rubinius::InlineCache::check_cache_reference(rubinius::State*, rubinius::InlineCache*, rubinius::CallFrame*, rubinius::Arguments&)
108       0.9557  vm.cpp:175                  rbx                      rubinius::VM::new_object_typed_dirty(rubinius::Class*, unsigned int, rubinius::object_type)
99        0.8760  vm.cpp:207                  rbx                      rubinius::VM::new_young_tuple_dirty(unsigned int)
97        0.8583  array18.rb:6                8295.jo                  _X_Array#map@12
93        0.8229  integer.cpp:84              rbx                      rubinius::Integer::from(rubinius::State*, int)
90        0.7964  hash.rb:137                 8295.jo                  _X_Hash#[]@23
76        0.6725  array.cpp:205               rbx                      rubinius::Array::set(rubinius::State*, int, rubinius::Object*)
72        0.6371  linkedlist.cpp:27           rbx                      LinkedList::remove(LinkedList::Node*)
68        0.6017  tuple.cpp:64                rbx                      rubinius::Tuple::create(rubinius::State*, int)
57        0.5044  method_primitives.cpp:12327 rbx                      rubinius::Primitives::object_hash(rubinius::State*, rubinius::CallFrame*, rubinius::Executable*, rubinius::Module*, rubinius::Arguments&)

As you can guess, symbol names beginning with _X_ is the JIT-ted Ruby code. Many Ruby benchmark blocks are JIT-ted (like _X_Object#__block__$block@7 at bench_access.rb:16) and listed near the top. Also, _X_Array#map@12 is a JIT-ted code of Array#map. I'll explain the format of the report using it as an example:

97        0.8583  array18.rb:6                8295.jo                  _X_Array#map@12

97 is the number of counts OProfile found what Rubinius was executing, while periodically sampling it.

0.8583 is the percentage to the total number of samples OProfile collected. Note that summing all entries up doesn't equal to 100%, because opreport only reported the top part of whole profile result (by --threshold 0.5).

array18.rb:6 is the source location where this method is defined (this is same as Method#source_location).

8295.jo is a special app name for JIT-ted code. Usually app name is the name of file C/C++ functions reside in (shared libraries or executables). So, for normal Rubinius' C++ functions, app name is just rbx because they reside in ./bin/rbx. For JIT-ted Ruby code, there can be no meaningful app name, because JIT-ted Ruby code is generated at runtime and they reside in no file, but only in memory. So, OProfile uses *.jo as app name in such cases. And, 8295 means this is profiled when running the process of PID 8295.

_X_Array#map@12 is the name of symbol for this entry of profile report (duh!).

Open /tmp/hash-access-source, the content should be like this (I'll omit some unimportant part, because this file is big):

/* 
 * Total samples for file : "kernel/common/hash18.rb"
 * 
 *   4870 12.8904
 */

<credited to line zero>     19  0.0503 :
               :# -*- encoding: us-ascii -*-
               :
               :class Hash
               :
               :  include Enumerable
               :
...
               :
               :  class Bucket
               :    attr_accessor :key
               :
   633  1.6755 :    def match?(key, key_hash) /* _X_Hash::Bucket#match?@16     75  0.1985, _X_Hash::Bucket#match?@16     45  0.1191, total:    120  0.3176 */
    40  0.1059 :      case key
   218  0.5770 :      when Symbol, Fixnum
   319  0.8444 :        return key.equal?(@key)
               :      end
               :
               :      @key_hash == key_hash and (Rubinius::Type::object_equal(key, @key) or key.eql?(@key))
               :    end
               :  end
...
               :end
...

You might wonder why Hash::Bucket#match? has so many counts while this didn't appear in the report for symbols. The reason is that the method is inlined to each its callers (maybe many top-appearing JIT-ted Ruby blocks). How cool this is! You can really know which Ruby code is actually taking too much time.

My last favor

I wish the official Ubuntu packages are fixed and there is no need to add any PPA. I want it-just-works. I reported affecting bugs to Ubuntu's bug tracking system. So, please vote it up for fixing (vote them with "affects you?", please!!):

Further profiling

There is new profiling tool on Linux: perf and operf (OProfile's new command). I don't know them well, but certainly they will be useful for profiling Rubinius.

As a last note, this is partially based on Unladen Swallow's information of profiling LLVM-based JIT-ted code by OProfile.

Happy profiling!

PDX Summit Recap

Last week, Dirkjan spent the week with me at the Engine Yard office in Portland, OR. We wanted to discuss a variety of topics related to Rubinius technology. We started the week by making the following list:

We spent the week mostly writing a ton of things on the whiteboard, discussing them, erasing them, and writing a bunch more. Ultimately, we touched on all the topics but we did not get to discussing the extension API in depth, so we'll skip that one for now. I've previously blogged about our inline cache discussion. Here I'll summarize the rest of our discussions.

Just-in-time Compiler

The just-in-time (JIT) compiler that converts our virtual machine bytecode to machine code is a fundamental part of making Ruby fast. It is also one of the most complex components of Rubinius. It requires interfacing with the virtual machine to get runtime data and coordinating with the garbage collector to keep other runtime data structures consistent.

We are able to leverage LLVM for especially difficult parts of machine code generation, like instruction selection, instruction scheduling and register allocation. However, the semantics of Ruby are so removed from those of the underlying CPU that we have a lot of work we still need to do to improve the quality of the machine code.

One important component of this work is the ability to represent operations that have Ruby semantics (but are not part of Ruby itself) in a way that we can perform transformations on those operations. The "not part of Ruby" is actually a bit fuzzy. The concept of object allocation is certainly a part of Ruby. But the specific operation of reserving a region in the heap is invisible to Ruby.

There are a couple operations in particular that are critical to Ruby performance, especially when emitting machine code from the JIT. Those operations are type guards and object allocation. If our assumptions for the particular type of a value hold, we want to emit the fewest number of guards as possible and we want the largest contiguous sequence of machine instructions that are valid under a specific guard.

Likewise, if an object is allocated but does not get passed outside of the code sequence we are compiling to machine code, and we can see there are no other side effects, we may be able to avoid allocating the object. A good example of this in Rubinius, where the core library is implemented in Ruby, is the ByteArray object that is used by String. A ByteArray is a fixed vector of 8-bit bytes. If an intermediate operation on a String creates a ByteArray and that object isn't used outside the JIT'd method, the JIT should be able to see through the object creation and recognize that just some machine bytes are being moved around or transformed. The ByteArray allocation can be elided.

These two examples, type guards and object allocation, are just a couple things we expect the JIT to do well. There are many others.

Right now, our JIT converts the virtual machine bytecode directly into LLVM IR. To support more sophisticated JIT transformations, we need a representation in between bytecode and LLVM IR, which is one of the things we will be working on.

Another aspect of the JIT that needs an overhauls is the infrastructure around running the JIT. It runs on a separate native thread with few synchronization points. Since Rubinius has no global lock, and since we need a variety of auxiliary threads for system operations (e.g. handling process signals, running object finalizers, handling debugger events), we have some infrastructure for handling the lifetimes of these auxiliary threads. This is a particularly complicated aspect of Rubinius because when Ruby code calls fork() or exec(), the Ruby code knows nothing about these auxiliary threads. We have to be careful that state is maintained across these calls. This component will be enhanced to better handle auxiliary threads that are actually proper Ruby threads (all auxiliary threads are fundamentally Ruby threads but not necessarily visible in Ruby). This will permit us to have multiple, potentially completely different, JIT threads running simultaneously.

Performance

For us as people, when our assumptions fail we often just blunder on making fools of ourselves. Too bad we don't have a "check your assumptions" mode as useful as the one we build into the JIT. In the JIT, if our assumptions fail and we don't catch it, we're going to compute the wrong values or cause the wrong actions to be performed.

So, in the JIT, when our assumptions are wrong, we must change course. This course change may be to go down a different path in the machine code or to exit the machine code entirely and go back to the interpreter. Both paths have significant performance implications.

A reason to exit the machine code is that one significant aspect of generating fast machine code is proper use of the type guards. If our assumptions about the type of a value are wrong, simply exiting back to the interpreter means that we don't pollute the LLVM IR with types that can inhibit LLVM optimizations from being as effective. On the other hand, there are runtime costs associated with both exiting to the interpreter and running the interpreter instead of machine code. Also, if we exit the generate machine code too often, we may trigger a de-opt, throwing away that version of the machine code and requiring the JIT to create a new one.

Another aspect of the performance of the machine code we emit is how we handle virtual machine operations. Some aspects of running Ruby require very low-level operations that we call primitives. These are coded in C++ and interact with internal virtual machine facilities. When compiling machine code, we want as few conditionals and as few function calls as possible to get the highest performance. Typically, we call a function that performs the primitive operation and then returns to the JIT'd machine code. As we move more and more things into Ruby itself, it is possible for the JIT to inline operations and get the greatest possible optimizations. This is another area we will be working on as we improve the JIT.

Concurrency

One of the goals for Rubinius since the very beginning is to provide excellent support for concurrency and parallelism in Ruby. We have already removed the global interpreter lock (GIL/GVL) from Rubinius. However, we have to ensure that we don't replace it with an effective global lock when we implement components of the system.

The key to improving lock-freedom throughout the system is to use less locks. That may seem obvious, but it is not necessarily that simple. As I mentioned above, the virtual machine actually uses a variety of threads internally. And with no global lock, any Ruby thread could invoke any virtual machine operation, like fork(), at any time. Multiple Ruby threads could be calling the same method on the same object at the same time. Considering my previous post about inline caches, this implies that all sorts of data races and concurrency bugs could arise just from calling Ruby methods.

In the virtual machine, we must be careful to ensure correctness under concurrent execution. The classical way of doing so is to use locking. However, there is tons of research from the '90s about lock-free and wait-free algorithms. This academic research is slowly trickling down to industry in various domains. We will be using lock-free implementations of various virtual machine data structures to improve support for concurrency and parallelism.

Garbage Collection

One particularly challenging area that requires careful control of concurrency is the garbage collector. Rubinius provides a generational garbage collector with very short pause times when collecting the young generation. The generational architecture allows us to only collect a small region of the heap instead of the whole heap every time. We also support object allocation using thread-local buffers to avoid locking and contention when Ruby code is running in parallel.

However, the garbage collector architecture pauses all threads to perform a collection. While the pause for a young generation collection is on the order of a few milliseconds (5-15 in many cases), there is the overhead of forcing all the various threads to pause.

There is only one way to address this issue and that is to make the garbage collector concurrent. Fortunately, we can achieve this with some modifications to our existing garbage collector. When adding concurrency to the garbage collector, we want to avoid slowing down non-garbage-collector operations, typically referred to as mutator performance. (Operations in the system can be viewed as partitioned into two sets: the heap management--allocation and garbage collection, and the operations that mutate the heap--mutator operations.)

With modifications to our mature generation collector (based on the Immix collector), we can perform partial heap collections and preserve the generational benefits while also limiting stop-the-world pauses to specific regions of the heap and bounding the pause times.

Semantic Versioning

Semantic Versioning is a methodology for versioning software to support correct dependency management. MRI has never had particularly strong support for versioning, with semantic changes being added in patch levels (not even visible in the RUBY_VERSION constant) and major version increments on special dates completely devoid of semantic relevance.

In Rubinius, we would like to provide a more sound versioning discipline, in particular, semantic versioning. However, this raises some significant challenges. Let's consider some of the components of Rubinius:

The key component on which semantic versioning is based is the public API. But what is included in that? Is it just Ruby as covered in RubySpec? Is it also our supporting Ruby classes or Ruby facilities like the parser and bytecode compiler? Since Rubinius is a language platform used to implement other programming languages besides Ruby, resolving these questions correctly is important.

Along with a sound versioning discipline, we also want to promote continuous delivery of enhancements to Rubinius. Technically, these are compatible goals, but require that we consider how to organize our development and source branches when we create backward-incompatible API changes.

As we work toward the Rubinius 2.0 release, we will do our best to implement semantic versioning. You can help us by testing your applications, gems, or libraries on Rubinius. One great place to do this, of course, is Travis CI.

All Work and No Play

Portland is a beautiful city. It's not very big but has enough city things to feel like a proper city. I didn't want to keep Dirkjan caged in the office all day every day, so when the weather looked promising on Friday, we packed our concurrency and garbage collection conversation up and headed out. First we went to Portland State University to attend a talk at a programming language club. The talk was on continuation-passing style, defunctionalization, and transformations. It was somewhat related to our compiler work and it was certainly interesting. But my real motive was to introduce Dirkjan to my former coworker Larry Diehl, who is a PhD student at PSU doing fascinating work in dependently typed programming language theory.

After the talk at PSU, we hopped on the MAX light rail and headed toward the Japanese Garden. The bus that would take us the rest of the way wasn't due for 25 minutes so we decided to walk. The garden is located in Portland's Washington Park, which is located in the west hills of Portland. So, we walked, uphill, for quite a while. We walked past the Lewis and Clark monument, through the international rose garden, past the tennis courts and finally up another steep hill to the Japanese Garden.

The Japanese Garden is one of the most beautiful and authentic outside of Japan. We are very lucky to have it here and my wife and I visit it frequently to enjoy the beautiful views and bask in the tranquility. Since it is early spring, not many plants were blossoming yet, but we saw some huge koi, some water falls, beautiful trees, and a stunning view of Portland. Unfortunately, Mt Hood was hiding in the clouds that day.

I'm not much of a photographer, but here are some pics from the trip.

Dirkjan photographing from a bridge Koi! Plants!

Welcome Dirkjan!

I am excited to announce that Engine Yard is sponsoring Dirkjan Bussink to work on Rubinius!

A long time ago (around 2008) when Dirkjan first showed up in the Rubinius project he was still a university student finishing his Masters degree. He let me read a draft of his thesis on brain-computer interfaces. Dirkjan has been helping out on Rubinius since then. Recently, he's been fixing concurrency, garbage collection, and JIT issues.

This week, Dirkjan is working from the Engine Yard office in Portland, OR. We've got a list of topics that we're delving into to get ready for several technology improvements in Rubinius. We started with a list of five things, one for each day. Unfortunately, time always goes faster than we expect so we've dropped one topic. I'll blog a summary of our discussions so you get an idea of some of the big stuff coming in Rubinius.

Our first discussion was about our inline cache mechanisms. An inline cache is used at a method call location to speed up calling a method. The first time it is called, the method is looked up. Subsequently, as long as the receiver is the correct type, the same method can safely be used. This is a typical execution-time versus memory size trade-off. We cache the method we found, which saves time looking it up again at the expense of some additional memory to store the method. As with all caches, we have to carefully consider cache invalidation. Let's use a quick code sample to illustrate the inline cache.

1 class A
2   def a
3   end
4 end
5 
6 class B < A
7 end

In the example above, the class B inherits from A, which provides the method a. If we call a on an instance of B, we look up the method and find it in A.

1 b = B.new
2 b.a

Here, in the code representation for the call b.a, we store an inline cache object. As long as the type of object we see in b is the same, we can safely call the cached method a. However, if the type of b changes, for example, because we define a method a on class B itself, we must invalidate the cache and look up the method again.

The inline cache can be very important for the speed of the virtual machine executing Ruby code. However, the inline cache is also very important for our just-in-time (JIT) compiler that generates machine code, completely avoiding the VM interpreter. Since the inline cache records the types of the Ruby objects seen at the call site, we can use that information in the JIT to generate fast code as long as our assumptions about the object type are correct.

Since inline caches are so important, Dirkjan and I did a complete review of our existing inline caching and analyzed how to improve it. The most important thing for improving execution speed is to do less. And the best way to do less is to reduce redundancy.

We started our analysis with an enumeration of all the kinds of calls that we see in Ruby. This is the list that we made:

This list is all kinds of method sends that you can see directly in Ruby code. We added to this list a few other things that are not in Ruby code:

The next thing we did was look at the method call sites from the perspective of the virtual machine. Certain method calls, like respond_to? imply certain VM operations. This perspective is very important. Sometimes people complain that Ruby is too dynamic because you could, for example, alias respond_to? or eval to some other name. Now any attempt to statically analyze Ruby fails because you can't know ahead of time what the alias may be. This is actually not an issue if you consider the perspective from the VM. No matter what the method name is, the ultimate VM operation for looking up a method will execute. By associating method call sites with a VM operation, we get more insight into how best to structure the inline cache.

The next part of our analysis involved looking at what values are stored in the inline cache for the various types we enumerated, and what values are used for cache invalidation. The purpose of this analysis is two-fold: correctness and performance. We want to ensure that our cache invalidation is correct but also not generating false positives (i.e. invalidating cases that are not actually invalid). We also want to ensure that storing and updating the data in the cache can be done efficiently.

The final part of our analysis concerned the cardinality of the cache. In object-oriented code, a method call site may be monomorphic (only one type is seen), polymorphic (a small number of different types are seen), or megamorphic (a huge number of different types are seen). This perspective can be very important for performance and for memory overhead. We don't want to waste memory for sites that are monomorphic. Nor do we want the cache to be thrashing when many different types are seen.

In summary, we looked at these aspects of inline caches: method call site types, VM operations that result from the method call, the kind of data cached and cache invalidation, and the cardinality of the method call site. We'll be taking the results of our discussion and implementing changes to our inline caching over the next few weeks.

If you get a chance, pop into the #rubinius IRC channel on freenode.net and say hello to Dirkjan!

Heroku (Cedar) + Rubinius 2 + Puma 2 in Three Easy Steps

tl;dr

It's now very easy to get a Rails app on Heroku's Cedar Stack running on Rubinius 2.0 and Puma 2.0.

The Backstory

The fine folks at Heroku have been adding support for using Bundler to specify which version of Ruby your app uses and even to specify which Ruby it uses.

I've checked in on the progress of using Rubinius with this same method every couple few months for the past year or so. I was always met with the same results... "not quite yet, we're working on it".

Last night I was getting mentally packed for Waza later this week and decided to check in on the state of things so I could ask the right people the right questions in person when we're all together. Turns out that sometime between last time and this time, they've got things working. Boom diggity! And it's pretty ding-dang easy too! Double boom diggity!

The Build Up

I'm going to assume that if you want to use Rubinius 2.0 that you also have the good sense to want to use Puma 2.0. There are three very simple steps to get up and running with Rubinius 2.0 + Puma 2.0.

Step 1

Add the Puma gem to your Gemfile.

Step 2

Tell Heroku to use Puma as your web server in your Procfile.

Step 3

Specify Rubinius as your Ruby engine in your Gemfile.

The Breakdown

Once you've done those things, your workflow is the same as before. bundle update && git commit -am "Double boom diggity!" && git push heroku master

If you're changing an existing app from Heroku's default Ruby (MRI 1.9.x) to Rubinius, you'll see a message like this when you git push heroku master.

The important lines are Old: ruby 1.9.3 and New: rubinius 2.0.0.rc1. After that, everything the same as before. The bundling starts, etc.

The Outro

That's it. You should now be up and running with Rubinius and Puma!

If you have problems, say something in the comments with a link to a gist with your problem output. If you succeed and get up and running in production, let us in the comments too. Tell the world that you love Rubinius/Puma.

I threw together this quick and dirty Rails app from scratch deployed to Heroku to show all this in action. http://heroku-rbx-puma-rails-app.herokuapp.com. I also migrated an existing MRI 1.9.3 app ( The Farmhouse site ) to Rubinius 2.0 to show the migration is easy too.

The Thank Yous

None of this would be possible without the hard work of Carl Lerche, Yehuda Katz, André Arko, Terrence Lee and the other contributors to Bundler. And a special thanks to Terrence Lee for his dutiful stewardship of the Heroku Ruby Buildpack. Once again, the awesome powers of Open Source make great things possible.

The Footnotes

As of this writing "Rubinius 2.0" actually means "Rubinius 2.0.0.rc1" and "Puma 2.0" actually means "Puma 2.0.0b7". The net result is the same though.

Debugging scary crashes of Rubinius

Bugs, we all are in the infinite fight of getting rid of them. Sometimes they are trivial, other times a bit harder. In this video I'm going to show how I have been debugging an issue that caused memory corruption of your dear Ruby objects in Rubinius.

Background

It's a pretty long video clocking at 55 minutes but I'm trying to be detailed in the discovery process. I'm explaining the steps I'm taking along the way and explaining why I'm taking those.

The real debugging process if of course littered with taking wrong turns so please don't think I'm brilliant and able to zoom into the issue immediately all the time. The video is not recorded as I was going through it the first time, so I exactly knew where I was going and which steps to take.

If you want to look up some more information on GDB that I'm using in the video, here are some links that explain some of the most used commands in the video:

GDB - Examining data GDB - Setting watchpoints

During the video there's a part from 8:18 until 11:28 where it's just watching a rebuild of Rubinius scrolling by. I'd suggest skipping it since it's not very useful. I tried to remove it but after having to reencode the video it only grew larger. If anyone has more skills to fix that, please let us know!

Download

You can find the video on Vimeo:

Watch on Vimeo

It's probably best to download the video and play it locally on your machine. It's 453 MB so it might take a while.

Please let us know if you find this useful, we might do more of them in the future if there's interest!