Rubinius has always tried to provide a lot of information to you as a
developer about runtime information. Things like the variable scope and
constant scope are already available. In this post I'd like to describe
a small new API that is available for even more information.
One of the important mechanisms for making your Ruby code fast is inline
caching. Inline caching means that we store the result of a method
lookup at a specific call site.
So what is a call site, you might ask? Well, that's every place in your
code where a method is called. Take for example the following:

1 class Foo
2   def bar
3     @foo.quz
4   end
5 end

Here @foo.quz is a call site: the place where quz is called. What we
do is cache the type of @foo and store which method to execute for
that type.
The fact that inline caching works well is based on the idea that at
runtime, for most call sites, the receiver type never changes. In this
case it means that @foo is always of the same type. This is called a
monomorphic call site.
Of course there are cases where this reasoning doesn't apply. If we see
multiple types at a certain call site, it's called a polymorphic call site.
For these cases we store the cache in a different kind of object, and
Rubinius currently stores up to 3 different types there.
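To make the transition concrete, here is a plain-Ruby sketch (purely illustrative; the class and method names are invented and this is not Rubinius's actual implementation) of a call site going from empty to monomorphic to polymorphic as it sees receiver types:

```ruby
# Simulated call site: caches the looked-up method per receiver class,
# holding at most 3 entries, mirroring Rubinius's polymorphic limit.
class SimulatedCallSite
  MAX_POLY = 3

  def initialize(name)
    @name = name
    @cache = {} # receiver class => UnboundMethod
  end

  def call(receiver, *args)
    klass = receiver.class
    meth = @cache[klass]
    unless meth
      meth = klass.instance_method(@name) # full method lookup on a miss
      @cache[klass] = meth if @cache.size < MAX_POLY
    end
    meth.bind(receiver).call(*args)
  end

  def state
    case @cache.size
    when 0 then :empty
    when 1 then :monomorphic
    else        :polymorphic
    end
  end
end

site = SimulatedCallSite.new(:to_s)
site.state    # => :empty
site.call(1)  # first receiver type seen: Integer
site.state    # => :monomorphic
site.call(:a) # second receiver type seen: Symbol
site.state    # => :polymorphic
```

A real cache also has to handle invalidation (for example when a method is redefined), which this sketch ignores.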
It might be surprising to some, but these cache objects are also just
Ruby objects. This means that they can be used just like any other Ruby
object and be inspected for example. So there are objects for an empty
call site, for monomorphic and also polymorphic sites.
There is also a specialized site for places where we call
respond_to?, since that happens often enough to warrant a special case.
In the future, new cache types might be introduced for other common
patterns.
Accessing inline caches stored at call sites
So now that we have described all this, how do we get access to this
information? It is actually pretty straightforward and builds on other
things already available.
Under the hood, all Ruby code in Rubinius gets translated into a
CompiledCode object, which also provides the API for getting access to
inline caches. One way to get access to the
CompiledCode object is to
grab it through Ruby's Method API.
1 class Foo
2   def bar
3     @foo.quz
4   end
5 end
6
7 f = Foo.new
8 compiled_code = f.method(:bar).executable
9 compiled_code.call_sites.each do |cs|
10   p cs # #<Rubinius::CallSite:0xXX script.rb:3#quz(0)>
11 end
What you can see here is that we have one call site, located in script.rb
on line 3 (inside the class definition). Currently it's of the basic type
CallSite, which means that it has a basic structure in place with no
additional information, because this method was never called. This can
also be inferred from the (0) at the end, which indicates the number of
calls made through this call site.
So if we make sure we call this at least once, we should be able to see
a difference:
1 class Foo
2   def bar
3     @foo.quz
4   end
5 end
6
7 f = Foo.new
8
9 begin
10   f.bar
11 rescue NoMethodError
12 end
13 compiled_code = f.method(:bar).executable
14 compiled_code.call_sites.each do |cs|
15   p cs # #<Rubinius::MonoInlineCache:0xXX script.rb:3#quz(1)
16        #   receiver_class=NilClass stored_module=Kernel
17        #   method=#<Rubinius::CompiledCode method_missing file=kernel/delta/kernel.rb>>
18 end
We run the call inside a begin / rescue clause here, because the call
will fail, since @foo is nil. Even so, the failing call shows that the
lookup still gets cached: the call resolves to method_missing, and that
is what ends up in the cache.
What you can see here is that we now have a MonoInlineCache, because
we've only seen one type for @foo, which is nil, so the receiver class
is NilClass. stored_module is the module the saved method is in, which
in this case is Kernel. The actual method stored is method_missing. We
can also see it has been called 1 time so far.
Currently you get these objects back and you can actually modify them,
although modifying them at the moment results in undefined behavior. You
can't do this easily, since they only provide attribute readers, but
since it's Ruby you can of course hack around that with
instance_variable_set, for example. You can probably do really crazy
stuff with this; if it crashes the VM it's probably a bug, but otherwise
you're pretty much on your own when doing so.
Another thing is that the counters are not guaranteed to be accurate.
This mainly happens when the JIT kicks in: the JIT doesn't update
counters when it inlines methods, so a method could be called more often
than the counter suggests. If the counter is zero, though, you can be
sure the method has never been called, since it always goes through the
interpreter path first.
So besides fun and crazy experiments, what is this actually useful for?
Well, first of all, we already use it to test in Ruby whether our inline
caches work properly. This made testing a lot easier: we can easily
verify that an empty call site turns into a monomorphic cache, and
subsequently into a polymorphic one if needed.
Another reason this can be very useful is for building better tools.
With this information, it would be possible to create better refactoring
tools, for example. The information provided here gives you insight into
which methods are bound where, which could perhaps make such tools more
reliable.
It would also be interesting to gather this information from your
runtime application, so you could see which code does and doesn't get
executed in your production environment. All these things are just
ideas, but APIs like the one discussed here make these future
developments possible and allow for more insight into your Ruby code.
So play with it and let us know what you've done with it!
You can now profile JIT-ted Ruby code!
samples % linenr info app name symbol name
497 28.7117 instructions.cpp:82 rbx rubinius::MachineCode::interpreter(rubinius::State*, rubinius::MachineCode*, rubinius::InterpreterCallFrame*)
460 26.5742 (no location information) anon (tgid:4118 range:0xb581c000-0xb589c000) anon (tgid:4118 range:0xb581c000-0xb589c000)
# => The dark world of JIT-ted Ruby code
158 9.1277 jit_util.cpp:787 rbx rbx_set_local_depth
113 6.5280 block_environment.cpp:365 rbx rubinius::BlockEnvironment::call(rubinius::State*, rubinius::CallFrame*, rubinius::Arguments&, int)
102 5.8925 (no location information) rbx __x86.get_pc_thunk.bx
93 5.3726 jit_util.cpp:848 rbx rbx_push_local_depth
71 4.1017 variable_scope.cpp:108 rbx rubinius::VariableScope::set_local(int, rubinius::Object*)
68 3.9284 block_environment.cpp:62 rbx rubinius::BlockEnvironment::invoke(rubinius::State*, rubinius::CallFrame*, rubinius::BlockEnvironment*, rubinius::Arguments&, rubinius::BlockInvocation&)
52 3.0040 variable_scope.cpp:126 rbx rubinius::VariableScope::get_local(int)
47 2.7152 variable_scope.cpp:100 rbx rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
45 2.5997 kind_of.hpp:153 rbx bool rubinius::kind_of<rubinius::Fiber>(rubinius::Object const*)
24 1.3865 shared_state.cpp:295 rbx rubinius::SharedState::checkpoint(rubinius::ManagedThread*)
samples % linenr info app name symbol name
3468 27.0388 instructions.cpp:82 rbx rubinius::MachineCode::interpreter(rubinius::State*, rubinius::MachineCode*, rubinius::InterpreterCallFrame*)
3048 23.7642 loop.rb:7 5060.jo _X_Object#forever$block@1
# => Wow, the name of JIT-ted Ruby code is shown along side C++ function symbols!
1044 8.1397 jit_util.cpp:848 rbx rbx_push_local_depth
947 7.3834 (no location information) rbx __x86.get_pc_thunk.bx
916 7.1417 jit_util.cpp:787 rbx rbx_set_local_depth
781 6.0892 block_environment.cpp:365 rbx rubinius::BlockEnvironment::call(rubinius::State*, rubinius::CallFrame*, rubinius::Arguments&, int)
639 4.9821 block_environment.cpp:62 rbx rubinius::BlockEnvironment::invoke(rubinius::State*, rubinius::CallFrame*, rubinius::BlockEnvironment*, rubinius::Arguments&, rubinius::BlockInvocation&)
421 3.2824 variable_scope.cpp:108 rbx rubinius::VariableScope::set_local(int, rubinius::Object*)
389 3.0329 variable_scope.cpp:126 rbx rubinius::VariableScope::get_local(int)
326 2.5417 variable_scope.cpp:100 rbx rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
288 2.2454 shared_state.cpp:295 rbx rubinius::SharedState::checkpoint(rubinius::ManagedThread*)
283 2.2065 kind_of.hpp:153 rbx bool rubinius::kind_of<rubinius::Fiber>(rubinius::Object const*)
In short, you can tell how much time is spent in the machine code that
the Rubinius JIT generates. Not satisfied yet? Even annotated profiles
are supported! That means you can even see how much time is spent on
each line of Ruby code, or even on each CPU instruction.
* Total samples for file : "/tmp/loop.rb"
* 5658 19.9436
718 2.5308 :def increment(i)
482 1.6990 : i + 1
: i = 0
2933 10.3384 : loop do /* _X_Object#forever$block@1 total: 6990 24.6387 */
1525 5.3754 : i = increment(i)
We're using OProfile, a profiling tool. In this blog post, I'll show you
how to profile using it! It's a very useful profiling tool available on
Linux, and it's sampling-based. That means absolutely no change to
Rubinius or to your Ruby code is needed in order to profile: just run it
as you normally do. Also, the overhead of profiling is very small.
This contrasts with measurement-based profiling. Rubinius's built-in
profiler and ruby-prof are both examples of measuring profilers. Their
overhead is much bigger and can skew results because of that.
Basically, it works by reporting how often individual profiled items are
sampled relative to the overall total of samples. It doesn't measure
elapsed time. It's much like the top command's per-process CPU % usage,
with far greater flexibility. The profiled items can be C/C++ libraries,
C/C++ functions, C/C++ source code lines, or CPU instructions.
Normally, OProfile can't profile Ruby code, because it works at the
CPU-instruction level: from the CPU instructions alone, it can't tell
which Ruby source line Rubinius is currently executing. However, it can
profile JIT-ted Ruby code, because Rubinius compiles that all the way
down to CPU instructions by definition.
OProfile works as a Linux kernel module. So, it's supported only for Linux.
Sadly, Ubuntu's OProfile and LLVM have bugs relating to this feature.
Apparently, no one had been using this, so you could say we are really
on the cutting edge. ;)
Anyway, we must work around these problems. But how? There are a few options.
Setup (PPA: the super simple way; Ubuntu 12.10 only)
I prepared a PPA just for you.
Add it to your system. To be specific, run this:
$ sudo add-apt-repository ppa:ryoqun/ppa
$ sudo apt-get update
$ sudo apt-get install oprofile llvm-3.1
$ sudo apt-get dist-upgrade # Upgrade preinstalled libllvm3.1 to the PPA version
By default, Rubinius doesn't use system-provided LLVM, so re-configure Rubinius
to use it and re-build Rubinius:
$ cd /path/to/rubinius-git-repository
$ rake clean
$ ./configure --llvm-config llvm-config-3.1
Setup (manual build: the hard way)
If you really want to build LLVM and OProfile manually, or if you're
using another Linux distribution that doesn't provide OProfile-enabled
LLVM packages, build LLVM and OProfile manually like this (I tested this
on Ubuntu 12.10; minor adjustments may be needed on other distributions):
Build and Install OProfile:
$ sudo apt-get build-dep oprofile # do equivalent thing on your distro.
$ cd /path/to/working-dir-to-build-things
$ wget http://prdownloads.sourceforge.net/oprofile/oprofile-0.9.8.tar.gz
$ tar -xf oprofile-0.9.8.tar.gz
$ cd oprofile-0.9.8
$ ./configure --prefix /usr # LLVM has a build issue. So install to /usr
$ sudo make install
$ sudo adduser oprofile # this user is needed for profiling JIT-ted code
$ opreport --version
# => opreport: oprofile 0.9.8 compiled on Mar 8 2013 00:57:08
Build LLVM with OProfile support enabled and rebuild Rubinius:
$ sudo apt-get build-dep llvm # do equivalent thing on your distro.
$ cd /path/to/working-dir-to-build-things
$ wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
$ tar -xf llvm-3.2.src.tar.gz
$ cd llvm-3.2.src
$ ./configure --enable-optimized --disable-assertions --with-oprofile
$ sudo make install
If the compilation of
OProfileWrapper.cpp fails like this:
OProfileWrapper.cpp: In member function ‘bool llvm::OProfileWrapper::checkForOProfileProcEntry()’:
OProfileWrapper.cpp:141:62: error: ‘read’ was not declared in this scope
OProfileWrapper.cpp:142:24: error: ‘close’ was not declared in this scope
Apply this patch, then re-run make and continue:
diff --git a/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp b/lib/ExecutionEngine/OProfileJIT/OProfileWrapper.cpp
index d67f537..7c0d395 100644
@@ -29,6 +29,7 @@
Phew, finally rebuild Rubinius!:
$ cd /path/to/rubinius-git-repository
$ rake clean
$ rm -rf vendor/llvm # If you build Rubinius with vendorized LLVM.
OProfile is configured with a command called opcontrol, not with a
configuration file, as is more common.
$ sudo opcontrol --deinit
$ sudo modprobe oprofile timer=1 # Needed only inside VirtualBox VMs
$ echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
$ sudo opcontrol --no-vmlinux
$ sudo opcontrol --separate all
$ sudo opcontrol --start
$ sudo opcontrol --dump # Flushes buffered raw profile data
$ opreport --merge all --threshold 1 # Prints the report of profile
Setup is complete if you see output from opreport like this:
CPU: CPU with timer interrupt, speed 1858.39 MHz (estimated)
Profiling through timer interrupt
92127 67.6157 no-vmlinux
21920 16.0880 rbx
7950 5.8348 libc-2.15.so
4068 2.9857 runner
3429 2.5167 libstdc++.so.6.0.17
2139 1.5699 cc1
1416 1.0393 vm
Run Ruby code!
To annotate Ruby code correctly, your current directory must be the top
directory of the Rubinius git repository:
$ cd /path/to/rubinius-git-repository
$ ./bin/benchmark ./benchmark/core/hash/bench_access.rb
Generate profile report
Let's check the profile report of the above benchmark.
NOTE: if JIT-ted Ruby code doesn't show up in the profile report, try
running opcontrol --dump at least once while Rubinius is running.
OProfile dumps automatically and periodically, but if your Ruby process
terminates too quickly, you should manually run opcontrol --dump while
it's still alive.
$ sudo opcontrol --dump && sudo opjitconv /var/lib/oprofile/ 0 0
$ opreport --merge all --threshold 1 image:./bin/rbx --symbols --debug-info \
    > /tmp/hash-access-symbols
$ opannotate --merge all --threshold 0.5 image:./bin/rbx --source \
    > /tmp/hash-access-source
I'll omit it here, but you can also generate a profile report of
annotated assembly by passing --assembly to opannotate instead of
--source.
Reset current profile
By default, OProfile keeps the profile data indefinitely. To reset it, run
$ sudo opcontrol --reset
How to read the profile result
If you open /tmp/hash-access-symbols, the content should look like this:
CPU: CPU with timer interrupt, speed 1860.42 MHz (estimated)
Profiling through timer interrupt
samples % linenr info app name symbol name
3571 31.5990 (no location information) no-vmlinux /no-vmlinux
619 5.4774 bench_access.rb:16 8295.jo _X_Object#__block__$block@7
599 5.3004 bench_access.rb:24 8295.jo _X_Object#__block__$block@11
531 4.6987 bench_access.rb:8 8295.jo _X_Object#__block__$block@1
457 4.0439 (no location information) rbx __x86.get_pc_thunk.bx
318 2.8139 jit_util.cpp:787 rbx rbx_set_local_depth
298 2.6369 bench_access.rb:42 8295.jo _X_Object#__block__$block@20
277 2.4511 object.cpp:362 rbx rubinius::Object::hash(rubinius::State*)
236 2.0883 jit_primitives.cpp:4494 rbx jit_stub_object_hash
236 2.0883 tuple.cpp:36 rbx rubinius::Tuple::put(rubinius::State*, int, rubinius::Object*)
222 1.9644 inline_cache.cpp:696 rbx rubinius::InlineCache::check_cache_poly(rubinius::State*, rubinius::InlineCache*, rubinius::CallFrame*, rubinius::Arguments&)
197 1.7432 jit_util.cpp:848 rbx rbx_push_local_depth
187 1.6547 variable_scope.cpp:126 rbx rubinius::VariableScope::get_local(int)
179 1.5839 compiledcode.cpp:212 rbx rubinius::CompiledCode::specialized_executor(rubinius::State*, rubinius::CallFrame*, rubinius::Executable*, rubinius::Module*, rubinius::Arguments&)
162 1.4335 hash.rb:220 8295.jo _X_Hash#key_index@15
147 1.3008 hash.rb:325 8295.jo _X_Hash#values_at$block@13
143 1.2654 object.cpp:400 rbx rubinius::Object::hash_prim(rubinius::State*)
137 1.2123 (no location information) libc-2.15.so /lib/i386-linux-gnu/libc-2.15.so
127 1.1238 variable_scope.cpp:100 rbx rubinius::VariableScope::set_local(rubinius::State*, int, rubinius::Object*)
111 0.9822 inline_cache.cpp:576 rbx rubinius::InlineCache::check_cache_reference(rubinius::State*, rubinius::InlineCache*, rubinius::CallFrame*, rubinius::Arguments&)
108 0.9557 vm.cpp:175 rbx rubinius::VM::new_object_typed_dirty(rubinius::Class*, unsigned int, rubinius::object_type)
99 0.8760 vm.cpp:207 rbx rubinius::VM::new_young_tuple_dirty(unsigned int)
97 0.8583 array18.rb:6 8295.jo _X_Array#map@12
93 0.8229 integer.cpp:84 rbx rubinius::Integer::from(rubinius::State*, int)
90 0.7964 hash.rb:137 8295.jo _X_Hash#@23
76 0.6725 array.cpp:205 rbx rubinius::Array::set(rubinius::State*, int, rubinius::Object*)
72 0.6371 linkedlist.cpp:27 rbx LinkedList::remove(LinkedList::Node*)
68 0.6017 tuple.cpp:64 rbx rubinius::Tuple::create(rubinius::State*, int)
57 0.5044 method_primitives.cpp:12327 rbx rubinius::Primitives::object_hash(rubinius::State*, rubinius::CallFrame*, rubinius::Executable*, rubinius::Module*, rubinius::Arguments&)
As you can guess, symbol names beginning with _X_ are JIT-ted Ruby code.
Many Ruby benchmark blocks are JIT-ted (like bench_access.rb:16) and
listed near the top. _X_Array#map@12 is the JIT-ted code of Array#map.
I'll explain the format of the report using it as an example:
97 0.8583 array18.rb:6 8295.jo _X_Array#map@12
97 is the number of times OProfile caught Rubinius executing this symbol
while periodically sampling it.
0.8583 is the percentage of the total number of samples OProfile collected.
Note that summing all entries up doesn't equal 100%, because only the top
part of the whole profile result is reported (due to --threshold).
array18.rb:6 is the source location where this method is defined (this
plays the same role as the linenr info for C++ functions).
8295.jo is a special app name for JIT-ted code. Usually the app name is
the name of the file the C/C++ functions reside in (a shared library or
an executable). So for normal Rubinius C++ functions, the app name is
just rbx, because they reside in ./bin/rbx. For JIT-ted Ruby code there
can be no meaningful file name, because it is generated at runtime and
resides only in memory, not in any file. So OProfile uses this special
<pid>.jo app name in such cases.
8295 means this was profiled while running the process with PID 8295.
_X_Array#map@12 is the symbol name for this entry of the profile report.
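As a quick sanity check on the report arithmetic (using the numbers from the example row above): the % column is samples divided by the overall sample total, so any single row lets you estimate how many samples OProfile collected in total:

```ruby
samples = 97
percent = 0.8583
# percent = samples / total * 100, so solving for total:
total = (samples / (percent / 100.0)).round
# total comes out to roughly 11,300 samples for this run
```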
If you open /tmp/hash-access-source, the content should look like this
(I'll omit some unimportant parts, because this file is big):
* Total samples for file : "kernel/common/hash18.rb"
* 4870 12.8904
<credited to line zero> 19 0.0503 :
:# -*- encoding: us-ascii -*-
: include Enumerable
: class Bucket
: attr_accessor :key
633 1.6755 : def match?(key, key_hash) /* _X_Hash::Bucket#match?@16 75 0.1985, _X_Hash::Bucket#match?@16 45 0.1191, total: 120 0.3176 */
40 0.1059 : case key
218 0.5770 : when Symbol, Fixnum
319 0.8444 : return key.equal?(@key)
: @key_hash == key_hash and (Rubinius::Type::object_equal(key, @key) or key.eql?(@key))
You might wonder why Hash::Bucket#match? has so many counts here while
it didn't appear in the report for symbols. The reason is that the
method is inlined into each of its callers (probably many of the
top-appearing JIT-ted Ruby blocks). How cool is this! You can really
tell which Ruby code is actually taking too much time.
One last favor
I wish the official Ubuntu packages were fixed so there were no need to
add any PPA. I want it to just work. I reported the affecting bugs to
Ubuntu's bug tracking system, so please vote them up for fixing (mark
them with "affects you?"):
- Bug 1154025 Install PIC version of libbfd.a
- Bug 1148682 symbol lookup error: /usr/lib/libopagent.so: undefined symbol: bfd_init
- Bug 1148529 OProfile support is disabled
There is a new profiling tool on Linux: operf (OProfile's new command).
I don't know it well yet, but it will certainly be useful for profiling
too. As a last note, this is partially based on Unladen Swallow's
approach to profiling LLVM-based JIT-ted code with OProfile.
Last week, Dirkjan spent the week with me at the Engine Yard office in
Portland, OR. We wanted to discuss a variety of topics related to Rubinius
technology. We started the week by making the following list:
- Inline cache
- JIT IR
- Performance of JIT'd code
- Extension API
- Concurrency and garbage collection
We spent the week mostly writing a ton of things on the whiteboard, discussing
them, erasing them, and writing a bunch more. Ultimately, we touched on all
the topics but we did not get to discussing the extension API in depth, so
we'll skip that one for now. I've previously
blogged about our inline cache
discussion. Here I'll summarize the rest of our discussions.
The just-in-time (JIT) compiler that converts our virtual machine bytecode to
machine code is a fundamental part of making Ruby fast. It is also one of the
most complex components of Rubinius. It requires interfacing with the virtual
machine to get runtime data and coordinating with the garbage collector to
keep other runtime data structures consistent.
We are able to leverage LLVM for especially difficult parts
of machine code generation, like instruction selection, instruction scheduling
and register allocation. However, the semantics of Ruby are so removed from
those of the underlying CPU that we have a lot of work we still need to do to
improve the quality of the machine code.
One important component of this work is the ability to represent operations
that have Ruby semantics (but are not part of Ruby itself) in a way that we
can perform transformations on those operations. The "not part of Ruby" is
actually a bit fuzzy. The concept of object allocation is certainly a part of
Ruby. But the specific operation of reserving a region in the heap is
invisible to Ruby.
There are a couple operations in particular that are critical to Ruby
performance, especially when emitting machine code from the JIT. Those
operations are type guards and object allocation. If our assumptions
about the particular type of a value hold, we want to emit as few guards
as possible, and we want the largest contiguous sequence of machine
instructions that is valid under a specific guard.
Likewise, if an object is allocated but does not get passed outside of the
code sequence we are compiling to machine code, and we can see there are no
other side effects, we may be able to avoid allocating the object. A good
example of this in Rubinius, where the core library is implemented in Ruby, is
the ByteArray object that is used by String. A ByteArray is a fixed vector of
8-bit bytes. If an intermediate operation on a String creates a ByteArray and
that object isn't used outside the JIT'd method, the JIT should be able to see
through the object creation and recognize that just some machine bytes are
being moved around or transformed. The ByteArray allocation can be elided.
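The same idea can be illustrated at the Ruby level (hypothetical method names; the real optimization happens inside the JIT on its intermediate representation, not on Ruby source). Both methods below return the same result, but the second avoids the intermediate String that the first allocates and immediately discards, which is roughly the transformation allocation elision performs automatically:

```ruby
# Naive version: s.upcase allocates an intermediate String (with its own
# backing byte storage) that exists only to build the final result.
def shout_naive(s)
  t = s.upcase
  t + "!"
end

# "Fused" version: one buffer is allocated up front and mutated in place,
# so no intermediate object escapes, or even exists.
def shout_fused(s)
  out = s.dup
  out.upcase!
  out << "!"
end

shout_naive("hi") # => "HI!"
shout_fused("hi") # => "HI!"
```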
These two examples, type guards and object allocation, are just a couple
things we expect the JIT to do well. There are many others.
Right now, our JIT converts the virtual machine bytecode directly into LLVM
IR. To support more sophisticated JIT transformations, we need a
representation in between bytecode and LLVM IR, which is one of the things we
will be working on.
Another aspect of the JIT that needs an overhaul is the infrastructure
around running the JIT. It runs on a separate native thread with few
synchronization points.
points. Since Rubinius has no global lock, and since we need a variety of
auxiliary threads for system operations (e.g. handling process signals,
running object finalizers, handling debugger events), we have some
infrastructure for handling the lifetimes of these auxiliary threads. This is
a particularly complicated aspect of Rubinius, because when Ruby code
calls something like fork() or exec(), the Ruby code knows nothing about
these auxiliary threads. We have to be careful that state is maintained
across these calls.
This component will be enhanced to better handle auxiliary threads that are
actually proper Ruby threads (all auxiliary threads are fundamentally Ruby
threads but not necessarily visible in Ruby). This will permit us to have
multiple, potentially completely different, JIT threads running
concurrently.
For us as people, when our assumptions fail we often just blunder on making
fools of ourselves. Too bad we don't have a "check your assumptions" mode as
useful as the one we build into the JIT. In the JIT, if our assumptions fail
and we don't catch it, we're going to compute the wrong values or cause the
wrong actions to be performed.
So, in the JIT, when our assumptions are wrong, we must change course. This
course change may be to go down a different path in the machine code or to
exit the machine code entirely and go back to the interpreter. Both paths have
significant performance implications.
A reason to exit the machine code is that one significant aspect of generating
fast machine code is proper use of the type guards. If our assumptions about
the type of a value are wrong, simply exiting back to the interpreter means
that we don't pollute the LLVM IR with types that can inhibit LLVM
optimizations from being as effective. On the other hand, there are runtime
costs associated with both exiting to the interpreter and running the
interpreter instead of machine code. Also, if we exit the generated machine code
too often, we may trigger a de-opt, throwing away that version of the machine
code and requiring the JIT to create a new one.
Another aspect of the performance of the machine code we emit is how we handle
virtual machine operations. Some aspects of running Ruby require very
low-level operations that we call primitives. These are coded in C++ and
interact with internal virtual machine facilities. When compiling machine
code, we want as few conditionals and as few function calls as possible to get
the highest performance. Typically, we call a function that performs the
primitive operation and then returns to the JIT'd machine code. As we move
more and more things into Ruby itself, it is possible for the JIT to inline
operations and get the greatest possible optimizations. This is another area
we will be working on as we improve the JIT.
One of the goals for Rubinius since the very beginning is to provide excellent
support for concurrency and parallelism in Ruby. We have already removed the
global interpreter lock (GIL/GVL) from Rubinius. However, we have to ensure
that we don't replace it with an effective global lock when we implement
components of the system.
The key to improving lock-freedom throughout the system is to use fewer locks.
That may seem obvious, but it is not necessarily that simple. As I mentioned
above, the virtual machine actually uses a variety of threads internally. And
with no global lock, any Ruby thread could invoke any virtual machine
operation, like fork(), at any time. Multiple Ruby threads could be calling
the same method on the same object at the same time. Considering my previous
post about inline caches, this implies that all sorts of data races and
concurrency bugs could arise just from calling Ruby methods.
In the virtual machine, we must be careful to ensure correctness under
concurrent execution. The classical way of doing so is to use locking.
However, there is tons of research from the '90s about lock-free and wait-free
algorithms. This academic research is slowly trickling down to industry in
various domains. We will be using lock-free implementations of various virtual
machine data structures to improve support for concurrency and parallelism.
One particularly challenging area that requires careful control of concurrency
is the garbage collector. Rubinius provides a generational garbage collector
with very short pause times when collecting the young generation. The
generational architecture allows us to only collect a small region of the heap
instead of the whole heap every time. We also support object allocation using
thread-local buffers to avoid locking and contention while Ruby code is
running.
However, the garbage collector architecture pauses all threads to perform a
collection. While the pause for a young generation collection is on the order
of a few milliseconds (5-15 in many cases), there is the overhead of forcing
all the various threads to pause.
There is only one way to address this issue and that is to make the garbage
collector concurrent. Fortunately, we can achieve this with some modifications
to our existing garbage collector. When adding concurrency to the garbage
collector, we want to avoid slowing down non-garbage-collector operations,
typically referred to as mutator performance. (Operations in the system
can be viewed as partitioned into two sets: heap management, that is,
allocation and garbage collection, and the operations that mutate the
heap, that is, the mutator.)
With modifications to our mature generation collector (based on the Immix
collector), we can perform partial heap collections and preserve the
generational benefits while also limiting stop-the-world pauses to specific
regions of the heap and bounding the pause times.
Semantic Versioning is a methodology for versioning
software to support correct dependency management. MRI has never had
particularly strong support for versioning, with semantic changes being added
in patch levels (not even visible in the RUBY_VERSION constant) and major
version increments on special dates completely devoid of semantic relevance.
In Rubinius, we would like to provide a more sound versioning discipline, in
particular, semantic versioning. However, this raises some significant
challenges. Let's consider some of the components of Rubinius:
- Ruby language itself in one of three language modes: 1.8, 1.9, and 2.0
- Ruby core classes
- Ruby standard library
- MRI compatibility "C-API" for supporting C-extensions
- Supporting core classes like ByteArray and Tuple
- Bytecode compiler
- Virtual machine instruction set
- Internal APIs for JIT, garbage collection, concurrency management
The key component on which semantic versioning is based is the public API. But
what is included in that? Is it just Ruby as covered in RubySpec? Is it also
our supporting Ruby classes or Ruby facilities like the parser and bytecode
compiler? Since Rubinius is a language platform used to implement other
programming languages besides Ruby, resolving these questions correctly
is important.
Along with a sound versioning discipline, we also want to promote continuous
delivery of enhancements to Rubinius. Technically, these are compatible goals,
but require that we consider how to organize our development and source
branches when we create backward-incompatible API changes.
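As a concrete illustration of the contract involved, RubyGems' pessimistic version constraint encodes exactly the semantic-versioning promise of "compatible until the next major release":

```ruby
require 'rubygems' # provides Gem::Version and Gem::Requirement

# "~> 2.1" accepts any release >= 2.1 and < 3.0: minor and patch bumps
# are assumed backward-compatible, a major bump is assumed breaking.
req = Gem::Requirement.new('~> 2.1')

req.satisfied_by?(Gem::Version.new('2.1.0')) # => true
req.satisfied_by?(Gem::Version.new('2.9.3')) # => true  (compatible minor bump)
req.satisfied_by?(Gem::Version.new('3.0.0')) # => false (breaking major bump)
```

A dependency declared this way keeps working only as long as the project actually honors semantic versioning, which is why the public-API question above matters so much.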
As we work toward the Rubinius 2.0 release, we will do our best to implement
semantic versioning. You can help us by testing your applications, gems, or
libraries on Rubinius. One great place to do this, of course, is Travis CI.
All Work and No Play
Portland is a beautiful city. It's not very big but has enough city things to
feel like a proper city. I didn't want to keep Dirkjan caged in the office all
day every day, so when the weather looked promising on Friday, we packed our
concurrency and garbage collection conversation up and headed out. First we
went to Portland State University to attend a talk at a
programming language club. The talk was on continuation-passing style,
defunctionalization, and transformations. It was somewhat related to our
compiler work and it was certainly interesting. But my real motive was to
introduce Dirkjan to my former coworker Larry
Diehl, who is a PhD student at PSU doing
fascinating work in dependently typed programming language theory.
After the talk at PSU, we hopped on the MAX light rail and headed toward the
Japanese Garden. The bus that would take us the
rest of the way wasn't due for 25 minutes so we decided to walk. The garden is
located in Portland's Washington Park, which
is located in the west hills of Portland. So, we walked, uphill, for quite a
while. We walked past the Lewis and Clark monument, through the international
rose garden, past the tennis courts and
finally up another steep hill to the Japanese Garden.
The Japanese Garden is one of the most beautiful and authentic Japanese
gardens outside of Japan. We are very lucky to have it here and my wife and I
visit it frequently
to enjoy the beautiful views and bask in the tranquility. Since it is early
spring, not many plants were blossoming yet, but we saw some huge koi, some
waterfalls, beautiful trees, and a stunning view of Portland. Unfortunately,
Mt Hood was hiding in the clouds that day.
I'm not much of a photographer, but here are some pics from the trip.
I am excited to announce that Engine Yard is
sponsoring Dirkjan Bussink to work on Rubinius!
A long time ago (around 2008) when Dirkjan first showed up in the Rubinius
project he was still a university student finishing his Masters degree. He let
me read a draft of his thesis on brain-computer interfaces. Dirkjan has been
helping out on Rubinius since then. Recently, he's been fixing concurrency,
garbage collection, and JIT issues.
This week, Dirkjan is working from the Engine Yard office in Portland, OR.
We've got a list of topics that we're delving into to get ready for several
technology improvements in Rubinius. We started with a list of five things,
one for each day. Unfortunately, time always goes faster than we expect so
we've dropped one topic. I'll blog a summary of our discussions so you get an
idea of some of the big stuff coming in Rubinius.
Our first discussion was about our inline cache mechanisms. An inline cache is
used at a method call location to speed up calling a method. The first time it
is called, the method is looked up. Subsequently, as long as the receiver is
the correct type, the same method can safely be used. This is a typical
execution-time versus memory size trade-off. We cache the method we found,
which saves time looking it up again at the expense of some additional memory
to store the method. As with all caches, we have to carefully consider cache
invalidation. Let's use a quick code sample to illustrate the inline cache.
class A
  def a
  end
end

class B < A
end
In the example above, the class B inherits from A, which provides the method
a. If we call a on an instance of B, we look up the method and find it in A.
Here, in the code representation for the call b.a, we store an inline cache
object. As long as the type of the object we see in b is the same, we can
safely call the cached method a. However, if the type of b changes, for
example because we define a method a on class B itself, we must invalidate
the cache and look up the method again.
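The invalidation scenario can be sketched in runnable form. This is illustrative only: the class and method names follow the example above, and the method bodies are made up so the lookup result is visible.

```ruby
# b.a caches A#a after the first call; defining B#a later forces the
# cache at that call site to be invalidated and the lookup to run again.
class A
  def a
    "A#a"
  end
end

class B < A
end

b = B.new
b.a  # first call: method lookup finds A#a and caches it at this call site

class B
  def a  # redefining a on B invalidates caches that resolved to A#a
    "B#a"
  end
end

b.a  # the cache misses, lookup runs again and now finds B#a
```

Before the redefinition `b.a` returns "A#a"; afterwards it returns "B#a", which is exactly the change the cache must notice.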
The inline cache can be very important for the speed of the virtual machine
executing Ruby code. However, the inline cache is also very important for our
just-in-time (JIT) compiler that generates machine code, completely avoiding
the VM interpreter. Since the inline cache records the types of the Ruby
objects seen at the call site, we can use that information in the JIT to
generate fast code as long as our assumptions about the object type remain
valid.
Since inline caches are so important, Dirkjan and I did a complete review of
our existing inline caching and analyzed how to improve it. The most important
thing for improving execution speed is to do less. And the best way to do less
is to reduce redundancy.
We started our analysis with an enumeration of all the kinds of calls that we
see in Ruby. This is the list that we made:
- fixed-arity method send
- variadic method send
- super with explicit arguments
- super with implicit arguments
This list is all kinds of method sends that you can see directly in Ruby code.
We added to this list a few other things that are not in Ruby code:
- extension API calls
- method combinations like A.new, which reduces to allocate plus initialize
- call_custom (a special Rubinius instruction)
- primitives (parts of Ruby functionality that must be implemented in the VM)
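The send kinds visible in Ruby code can be rendered as a small sketch; the class and method names here are made up for illustration.

```ruby
# Each commented call below is one of the send kinds enumerated above.
class Base
  def pair(x, y)
    [x, y]
  end

  def many(*args)
    args
  end
end

class Child < Base
  def pair(x, y)
    super(x, y)  # super with explicit arguments
  end

  def many(*args)
    super        # super with implicit arguments
  end
end

c = Child.new
c.pair(1, 2)   # fixed-arity method send: argument count fixed at the call site
args = [3, 4, 5]
c.many(*args)  # variadic method send: the splat makes the count dynamic
```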
The next thing we did was look at the method call sites from the perspective
of the virtual machine. Certain method calls, like respond_to?, imply certain
VM operations. This perspective is very important. Sometimes people complain
VM operations. This perspective is very important. Sometimes people complain
that Ruby is too dynamic because you could, for example, alias
eval to some other name. Now any attempt to statically analyze Ruby fails
because you can't know ahead of time what the alias may be. This is actually
not an issue if you consider the perspective from the VM. No matter what the
method name is, the ultimate VM operation for looking up a method will
execute. By associating method call sites with a VM operation, we get more
insight into how best to structure the inline cache.
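The aliasing point can be made concrete. The alias name below is hypothetical; the point is that a static scan for the literal name eval misses the call, while the VM operation performed at the call site is unchanged.

```ruby
# A static analyzer looking for the token "eval" never sees the call
# below, but at runtime the method lookup at the call site still
# resolves to Kernel#eval, so the VM operation is the same.
module Kernel
  alias_method :evaluate, :eval
end

evaluate("1 + 2")  # => 3, via the alias
```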
The next part of our analysis involved looking at what values are stored in
the inline cache for the various types we enumerated, and what values are used
for cache invalidation. The purpose of this analysis is two-fold: correctness
and performance. We want to ensure that our cache invalidation is correct,
but also that it does not generate false positives (i.e. invalidate cases
that are not actually invalid). We also want to ensure that storing and
updating the data
in the cache can be done efficiently.
The final part of our analysis concerned the cardinality of the cache. In
object-oriented code, a method call site may be monomorphic (only one type
is seen), polymorphic (a small number of different types are seen), or
megamorphic (a huge number of different types are seen). This perspective
can be very important for performance and for memory overhead. We don't want
to waste memory for sites that are monomorphic. Nor do we want the cache to be
thrashing when many different types are seen.
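Cardinality is a property of the call site, not the method, which a tiny sketch makes clear; the method and the receiver types here are made up for illustration.

```ruby
# A single call site whose cardinality depends only on what we pass in.
def label(obj)
  obj.to_s  # one call site, seeing as many receiver types as we feed it
end

label(42)     # only Integers so far: the site is monomorphic
label(:sym)   # a second receiver type makes it polymorphic
label("str")  # with many more distinct types it would become megamorphic
```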
In summary, we looked at these aspects of inline caches: method call site
types, VM operations that result from the method call, the kind of data cached
and cache invalidation, and the cardinality of the method call site. We'll be
taking the results of our discussion and implementing changes to our inline
caching over the next few weeks.
If you get a chance, pop into the #rubinius IRC channel on freenode.net and
say hello to Dirkjan!
It's now very easy to get a Rails app on Heroku's Cedar Stack running on Rubinius 2.0 and Puma 2.0.
The fine folks at Heroku have been adding support for using Bundler to specify which version of Ruby your app uses and even which Ruby implementation it uses.
I've checked in on the progress of using Rubinius with this same method every few months for the past year or so. I was always met with the same results...
"not quite yet, we're working on it".
Last night I was getting mentally packed for Waza later this week and decided to check in on the state of things so I could ask the right people the right questions in person when we're all together. Turns out that sometime between last time and this time, they've got things working. Boom diggity! And it's pretty ding-dang easy too! Double boom diggity!
The Build Up
I'm going to assume that if you want to use Rubinius 2.0 that you also have the good sense to want to use Puma 2.0. There are three very simple steps to get up and running with Rubinius 2.0 + Puma 2.0.
- Add the Puma gem to your Gemfile
- Tell Heroku to use Puma as your web server in your Procfile
- Specify Rubinius as your Ruby engine in your Gemfile
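Concretely, the files might look like the sketch below. The versions match this post and the Bundler engine syntax of the time; check the current Bundler and Heroku documentation for your setup.

```ruby
# Gemfile -- gem versions and the :engine options are assumptions based
# on this post; adjust them for your app.
source "https://rubygems.org"

ruby "1.9.3", :engine => "rbx", :engine_version => "2.0.0.rc1"

gem "rails"
gem "puma"

# Procfile (a separate plain-text file, shown here as a comment):
# web: bundle exec puma -p $PORT
```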
Once you've done those things, your workflow is the same as before.
bundle update && git commit -am "Double boom diggity!" && git push heroku master
If you're changing an existing app from Heroku's default Ruby (MRI 1.9.x) to Rubinius, you'll see a message like this when you
git push heroku master.
The important lines are Old: ruby 1.9.3 and New: rubinius 2.0.0.rc1. After that, everything is the same as before. The bundling starts, etc.
That's it. You should now be up and running with Rubinius and Puma!
If you have problems, say something in the comments with a link to a gist with your problem output. If you succeed and get up and running in production, let us know in the comments too. Tell the world that you love Rubinius/Puma.
I threw together this quick and dirty Rails app from scratch deployed to Heroku to show all this in action. http://heroku-rbx-puma-rails-app.herokuapp.com. I also migrated an existing MRI 1.9.3 app ( The Farmhouse site ) to Rubinius 2.0 to show the migration is easy too.
The Thank Yous
None of this would be possible without the hard work of Carl Lerche, Yehuda Katz, André Arko, Terrence Lee and the other contributors to Bundler. And a special thanks to Terrence Lee for his dutiful stewardship of the Heroku Ruby Buildpack. Once again, the awesome powers of Open Source make great things possible.
As of this writing "Rubinius 2.0" actually means "Rubinius 2.0.0.rc1" and "Puma 2.0" actually means "Puma 2.0.0b7". The net result is the same though.
Bugs: we are all in a never-ending fight to get rid of them. Sometimes
they are trivial, other times a bit harder. In this video I'm going to
show how I have been debugging an issue that caused memory corruption of
your dear Ruby objects in Rubinius.
It's a pretty long video, clocking in at 55 minutes, but I try to be
detailed about the discovery process, explaining the steps I'm taking
along the way and why I'm taking them.
The real debugging process is of course littered with wrong turns,
so please don't think I'm brilliant and able to zoom in on the issue
immediately every time. The video was not recorded as I was going
through it the first time, so I knew exactly where I was going and
which steps to take.
If you want more information on GDB, which I use in the video, here
are some links that explain the commands used most often:
GDB - Examining data
GDB - Setting watchpoints
During the video there's a part from 8:18 until 11:28 where it's just
watching a rebuild of Rubinius scrolling by. I'd suggest skipping it
since it's not very useful. I tried to remove it, but after re-encoding
the video it only grew larger. If anyone has the skills
to fix that, please let us know!
You can find the video on Vimeo:
Watch on Vimeo
It's probably best to download the video and play it locally on your
machine. It's 453 MB so it might take a while.
Please let us know if you find this useful; we might do more of these
in the future if there's interest!