I recently thought I would be able to use Unix signals to solve a problem in a Ruby program I’m writing. It turned out not to be workable, but it was a fun journey into Unix signal handling and how signals work (or don’t) with MRI.
I’m not a signals expert – corrections and opinions are very welcome! Also, I started to feel a bit out of my depth when patching the longjmp() calls. I’d love to find out more about this stuff.
Brief signals primer
Signals are used to alert processes or threads about a particular event. Synchronous signals are usually the result of errors in executing some instruction (such as an illegal address reference) and are delivered to the thread that caused the error. Asynchronous signals are external to the execution context and are probably the ones you’re more familiar with – they can be sent between processes using things like kill or delivered when needed by the kernel.
When a signal is generated it is immediately put into the “pending” state. If the process has a thread that has not blocked signals of that type, it is delivered straight away. If that type of signal is blocked by all threads in the process, it remains pending until one of the threads unblocks it, at which point it is delivered immediately. Delivered signals can be ignored, left to their default action (which for most signals, including SIGUSR1, terminates the process) or processed by a signal handler. In Ruby, we define a SIGUSR1 handler like this:
Signal.trap("USR1") do
  puts "USR1 caught"
end
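To see a handler like this actually fire, here is a minimal sketch that sends SIGUSR1 to its own process. It assumes a POSIX platform and a reasonably modern MRI; the RECEIVED constant and the two-second deadline are my own choices for illustration, not anything from the interpreter.

```ruby
# Collects signal names as they are handled. A plain Array append is
# fine inside a trap block (taking a Mutex there is not allowed).
RECEIVED = []

Signal.trap("USR1") { RECEIVED << "USR1" }

# Send SIGUSR1 to ourselves. MRI runs the Ruby-level handler at its next
# safe point, so poll briefly instead of assuming it has already run.
Process.kill("USR1", Process.pid)
deadline = Time.now + 2
sleep 0.01 until RECEIVED.any? || Time.now > deadline

puts "handled: #{RECEIVED.inspect}"
```

The handler only records the signal and leaves the printing to the main flow, which is a common way to keep trap blocks small.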
Why might we want to block signals?
Blocking signals is often used when we have a section of code that must not be interrupted. To enable this, each thread maintains a signal mask. This is the list of signal types that the thread is blocking, which we can examine and change using pthread_sigmask() (sigprocmask() in single-threaded programs). A new thread inherits the signal mask of the thread that created it. However, each thread does not have its own set of signal handlers – these are shared throughout the process. Asynchronous signals that are delivered to the process can be processed by any thread that has not blocked those signals.
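The rules above (per-thread masks, inherited masks, shared handlers, pending signals) can be sketched as a toy model in plain Ruby. Nothing here talks to the kernel: ToyThread, ToyProcess and all their methods are hypothetical names invented for this illustration, and the model ignores details such as real-time signal queuing.

```ruby
# Toy model only: these classes do not correspond to any real kernel or Ruby API.
class ToyThread
  attr_reader :name, :mask

  def initialize(name, mask = [])
    @name = name
    @mask = mask.dup              # signal names this thread is blocking
  end

  # A new thread starts with a copy of its creator's mask.
  def create_child(name)
    ToyThread.new(name, @mask)
  end
end

class ToyProcess
  attr_reader :threads, :pending, :log

  def initialize
    @threads  = [ToyThread.new("main")]
    @handlers = {}                # shared by every thread in the process
    @pending  = []
    @log      = []
  end

  def handle(sig, &handler)
    @handlers[sig] = handler
  end

  def add_thread(parent, name)
    parent.create_child(name).tap { |t| @threads << t }
  end

  # A generated signal is pending until some thread is able to take it.
  # Standard signals do not queue, so a second copy is simply merged.
  def kill(sig)
    @pending << sig unless @pending.include?(sig)
    deliver_pending
  end

  def block(thread, sig)
    thread.mask << sig unless thread.mask.include?(sig)
  end

  def unblock(thread, sig)
    thread.mask.delete(sig)
    deliver_pending
  end

  private

  # Hand each pending signal to the first thread that is not blocking it.
  def deliver_pending
    @pending.reject! do |sig|
      target = @threads.find { |t| !t.mask.include?(sig) }
      next false unless target
      @log << "#{sig} handled on #{target.name}"
      @handlers[sig]&.call
      true
    end
  end
end

process = ToyProcess.new
main    = process.threads.first
process.handle("USR1") { }
process.block(main, "USR1")
worker = process.add_thread(main, "worker")   # inherits the USR1 block
process.kill("USR1")
p process.pending                             # blocked everywhere, so still pending
process.unblock(worker, "USR1")
p process.log                                 # delivered as soon as one thread unblocks
```

In the usage at the bottom, the signal is only delivered once the worker unblocks it, even though the main thread never does.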
Signals in MRI
Unfortunately, MRI isn’t really very friendly to Unix programmers wanting to play with the signal mask, as we’ll see.
MRI defines the Signal module, which contains only two methods: Signal.trap and Signal.list, the latter providing the mapping of signal names to numbers for your platform. Since none of the other libc signal handling functions are exposed, I created a library, syscalls, to provide them (and, in time, some other system calls). It is built using the lovely FFI library and mirrors the libc functions closely, with a couple of Ruby-style shortcuts added.
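As a quick illustration of Signal.list, the sketch below looks up a signal number by name. The reverse lookup uses Signal.signame, which as far as I know only exists in Ruby 2.0 and later, so it would not be available on the 1.8 and 1.9 interpreters discussed here.

```ruby
# Signal.list returns a Hash of signal name (without the "SIG" prefix)
# to the number used on the current platform.
usr1 = Signal.list["USR1"]
puts "SIGUSR1 is signal number #{usr1} here"

# Reverse lookup; assumes Ruby 2.0 or later.
puts "Signal #{usr1} is called #{Signal.signame(usr1)}"
```

The number printed for SIGUSR1 differs between platforms, which is exactly why the mapping is worth looking up rather than hard-coding.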
Ruby 1.8
As Joe Damato found, MRI 1.8 with pthreads enabled is rather rt_sigprocmask() happy. It seemed obvious that all that mucking about with the signal mask would cause strange behaviour when blocking signals, but let’s see how:
require "syscalls/signal"
mask = Syscalls::Sigset_t.new.to_ptr
Syscalls.sigemptyset(mask)
Syscalls.sigaddset(mask, "USR1")
puts "Block and roll!"
Syscalls.sigprocmask(Syscalls::SIG_SETMASK, mask, nil)
puts "Looks fine so far - let's raise an exception..."
begin
  raise
rescue
end
puts "Aw-naw!"
What’s going on?
Using strace with ruby 1.8.6 (2009-08-04 patchlevel 383) [x86_64-linux] gives us:
write(1, "Block and roll!\n", 16Block and roll!
) = 16
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [USR1], NULL, 8) = 0
write(1, "Looks fine so far - let's raise "..., 48Looks fine so far - let's raise an exception...
) = 48
rt_sigprocmask(SIG_BLOCK, NULL, [USR1], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [USR1], NULL, 8) = 0
write(1, "Aw-naw!\n", 8Aw-naw!
) = 8
Line 3 is from Ruby calling getcontext – it is pretty harmless since it is passing SIG_BLOCK along with an empty set of signals, which adds nothing to the existing signal mask. Line 4 is our own call to sigprocmask – note we’re using SIG_SETMASK, which replaces the existing mask. So far, so good. However, on lines 7-9 Ruby stores the old mask (our SIGUSR1), replaces it with an empty mask and then immediately replaces that with our SIGUSR1 mask again.
But, the mask is only empty for a fraction of a second – I think I’ll be alright!
Think again! It’s tempting to think that this wouldn’t be a problem in most real-world situations, but you may recall that when a signal cannot be delivered because it’s blocked it is put into a pending state. When that signal type is unblocked, the signal is immediately delivered. So the signal does not have to arrive during the tiny window: one sent at any point while the mask was in place is still waiting, and it is delivered the moment the mask is emptied.
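To make that concrete, here is a toy replay of the three masks installed by the SIG_SETMASK calls in the 1.8.6 trace. It is a simulation only: delivered_at and TRACE_MASKS are hypothetical names, and the signal is assumed to have been sent some time after the first mask was installed.

```ruby
# Returns the index of the first mask in the sequence that lets the
# pending signal through, or nil if every mask blocks it.
def delivered_at(mask_sequence, pending_signal)
  mask_sequence.index { |mask| !mask.include?(pending_signal) }
end

# Masks set, in order: our own call, then the two calls made by Ruby
# while handling the exception.
TRACE_MASKS = [["USR1"], [], ["USR1"]].freeze

step = delivered_at(TRACE_MASKS, "USR1")
puts "pending USR1 gets through at step #{step.inspect}"
```

However long the signal has been pending, it gets through at the second step, the momentarily empty mask.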
The REE stuff below all applies to the other MRI 1.8 flavours I tested too – that’s 1.8.{6,7} on 64-bit Linux.
REE 1.8.7-2010.01
REE has Joe’s --disable-ucontext patch applied, which meant a lot fewer sigprocmask()s to wade through! In fact, it nearly worked – just our old SIG_SETMASK friend set during the exception handling:
write(1, "Block and roll!\n", 16Block and roll!
) = 16
rt_sigprocmask(SIG_SETMASK, [USR1], NULL, 8) = 0
write(1, "Looks fine so far - let's raise "..., 48Looks fine so far - let's raise an exception...
) = 48
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(1, "Aw-naw!\n", 8Aw-naw!
) = 8
Time to dig around and see why Ruby is doing that.
Calling raise results in a call to rb_longjmp(), which appears to be a reimplementation of siglongjmp() (or _longjmp() – I don’t know which). This in turn calls rb_trap_restore_mask(), which sets the signal mask back to the mask that was stored when Ruby started up or when Signal.trap was last called.
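My reading of that behaviour can be modelled in a few lines of Ruby. This is a deliberately simplified sketch, not MRI’s code: ToyInterpreter and its methods are hypothetical, and only the instance variable name is borrowed from the trap_last_mask variable in the C source.

```ruby
# Toy model of the stale-snapshot problem; not MRI's implementation.
class ToyInterpreter
  attr_reader :mask

  def initialize
    @mask = []
    @trap_last_mask = []            # snapshot taken at start-up
  end

  # Models Signal.trap refreshing the stored snapshot.
  def trap(_sig)
    @trap_last_mask = @mask.dup
  end

  # Models a direct sigprocmask(SIG_SETMASK, ...) call from user code:
  # the live mask changes but the snapshot is NOT updated.
  def sigprocmask(new_mask)
    @mask = new_mask.dup
  end

  # Models raise -> rb_longjmp() -> rb_trap_restore_mask().
  def raise_and_rescue
    @mask = @trap_last_mask.dup
  end
end

interp = ToyInterpreter.new
interp.sigprocmask(["USR1"])
interp.raise_and_rescue
p interp.mask                       # the USR1 block has been thrown away
```

Because the snapshot knows nothing about the mask we set ourselves, restoring it silently unblocks SIGUSR1, which matches the lone SIG_SETMASK with an empty set in the REE trace.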
Patching Ruby 1.8 – including 1.8.6, 1.8.7 and REE
This might be dangerous or not even sensible. Let me know if you find out!
As far as I can tell, simply removing the call to rb_trap_restore_mask() shouldn’t break anything, since the places where the trap_last_mask variable is set are very limited. It may not be the best place for the fix (if rb_longjmp() is actually siglongjmp(), this might break the reimplementation), but it does at least appear to work.
Here’s the truly tiny patch.
Ruby 1.9
Ruby 1.9.1-p376 is a slightly more tricky case. As you’ll be aware, MRI 1.9 maps each Ruby thread to a native C thread and uses the GIL to ensure only one runs at any one time. The interpreter uses a separate timer thread to trigger an interrupt in order to schedule threads. This thread is created on initialisation of the interpreter and means that even very simple programs have two native threads running.
As we can see in the abbreviated strace below, the signal mask is empty when the timer thread is created, and the timer thread inherits that empty mask. This means that if we block a signal type in our main Ruby thread, signals of that type can still be delivered to and handled by the timer thread.
Below that we can see rb_trap_restore_mask() emptying the mask when it is called from rb_longjmp(). The sigaltstack() and following sigaction() call on lines 4-5 tell Ruby to handle segfaults on a different stack.
rt_sigaction(SIGHUP, {0x48b9f0, [], SA_RESTORER|SA_SIGINFO, 0x3deca0f0f0}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGUSR1, {0x48b9f0, [], SA_RESTORER|SA_SIGINFO, 0x3deca0f0f0}, {SIG_DFL, [], 0}, 8) = 0
...
sigaltstack({ss_sp=0x1b0baf0, ss_flags=0, ss_size=16384}, {ss_sp=0, ss_flags=SS_DISABLE, ss_size=0}) = 0
rt_sigaction(SIGSEGV, {0x48bce0, [], SA_RESTORER|SA_STACK|SA_SIGINFO, 0x3deca0f0f0}, {SIG_DFL, [], 0}, 8)
...
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
...
clone(Process 23020 attached
child_stack=0x7f3b2c188ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f3b2c1899e0, tls=0x7f3b2c189710, child_tidptr=0x7f3b2c1899e0) = 23020
...
[pid 23019] write(1, "Block and roll!", 15Block and roll!) = 15
[pid 23019] write(1, "\n", 1
) = 1
[pid 23019] rt_sigprocmask(SIG_SETMASK, [USR1], NULL, 8) = 0
[pid 23019] write(1, "Looks fine so far - let's raise "..., 47Looks fine so far - let's raise an exception...) = 47
[pid 23019] write(1, "\n", 1
) = 1
[pid 23019] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 23019] write(1, "Aw-naw!", 7Aw-naw!) = 7
[pid 23019] write(1, "\n", 1
) = 1
Patching Ruby 1.9.1
First off, we need to block all signals in the timer thread as soon as it is created. This makes sure that signals can only be delivered to the main Ruby thread (until we create some more). I suppose this will actually slow down signal delivery by some small amount.
Ruby itself blocks some signals for a time to allow sections of code to run without interrupts, using rb_disable_interrupt() and rb_enable_interrupt(). These functions mask and unmask all signals. We need to make Ruby save the existing signal mask while it blocks all signals, then restore the old mask, rather than unblocking everything.
Here’s the patch. Again, I don’t think this has any negative side-effects, but it also might not be a good idea. Let me know if you find out!
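The difference between unblocking everything and restoring the previous mask can be modelled in plain Ruby. This is only a sketch of the idea, not the C patch: both method names and ALL_SIGNALS are hypothetical, and a mask is represented as a simple Array of signal names.

```ruby
# Stand-in for a full signal set.
ALL_SIGNALS = Signal.list.keys.freeze

# Behaviour as described before patching: block everything for the
# critical section, then unblock everything afterwards.
def critical_section_unblocking_all(mask)
  mask.replace(ALL_SIGNALS)
  yield
ensure
  mask.clear
end

# Patched idea: remember the caller's mask and put it back afterwards.
def critical_section_restoring(mask)
  saved = mask.dup
  mask.replace(ALL_SIGNALS)
  yield
ensure
  mask.replace(saved)
end

mask = ["USR1"]
critical_section_unblocking_all(mask) { }
p mask                              # the caller's USR1 block is gone

mask = ["USR1"]
critical_section_restoring(mask) { }
p mask                              # the caller's USR1 block survives
```

Both versions block every signal while the block runs; they only differ in what the mask looks like afterwards.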
Further reading