Using rr and Visual/VM Debugger to debug Smalltalk/X crash

Introduction

This year on ESUG 2018 conference in Cagliary, Sardinia, I gave a talk showing how to debug a faulty smalltalk runtime using GNU debugger, Mozilla rr and Visual / VM Debugger I have shown how I have debugged a real bug found in Smalltalk/X - issue #200.

Following post is pretty much the ESUG 2018 talk in a form of an article. And, by the way, the conference was great and with some very interesting talks.

For those who want to try it all themselves

If you want to try all what will follow yourself, you need to get few things ready.

  1. First, you need a Linux system on x86_64 CPU. I use Debian but any reasonably modern Linux distro.

  2. Then you need to build your own Smalltalk/X to exhibit the problem, since it has been long time fixed. To build Smalltalk/X, follow build instructions. However, after rake checkout and before commencing rake compile, you need to revert revert commit 574962856f04

    pushd build/stx/libbasic hg backout –no-commit 574962856f04
    popd

  3. To get all features of Visual / VM Debugger (VDB for short), you’d also need to build most recent GNU Debugger (GDB for short) from master branch. Once built, configure VDB to use it:

    mkdir -p ~/.smalltalk echo “UserPreferences current vdbGdbExecutable:’/path/to/built/gdb/gdb’.” » ~/.smalltalk/settings.stx

  4. Finally, you would need rr. The easiest is to install it using package manager, e.g.:

    apt-get install rr

Preparation

The very first step - and the most tricky if you ask me - is to prepare a reproducible and minimal test case. In this case the bug report contains such a test case, so all we need to do is copy-paste-hack it into .st file so we can easily run our test case from a command line. So an issue_200.st would be like:

| t | 
t := Unicode16String new: 46. 
[ 
    1 to: 30 do:[:i | 
        Stderr nextPutLine: i printString , ' - ', (t size*2) printString. 
        Stderr flush.            
        t := t , t.
    ] 
] on: Error do:[:ex | 
    Stdout nextPutLine: ex description.
    Smalltalk exit: 1.
]. 
Stdout nextPutLine: 'Passed'.
Smalltalk exit:0"

Now let’s run the code above - if there’s a problem, it should crash:

cd build\projects\smalltalk
./smalltalk --record --abortOnSIGSEGV --execute issue_200.st

The --execute tells Smalltalk/X to run the code in given file. The --abortOnSIGSEGV tells the VM to abort() when segmentation violation. Normally, the VM just raises a smalltalk exception and lets smalltalk code to handle it. Finally, --record tells the launcher script to record and save whole execution (including the crash) so we can later use this record for debugging purposes.

The output should look something like:

rr: Saving execution to trace directory `/home/jv/.local/share/rr/stx-33'.
1 - 92
2 - 184
3 - 368
4 - 736
5 - 1472
6 - 2944
7 - 5888
8 - 11776
9 - 23552
10 - 47104
11 - 94208
12 - 188416
13 - 376832
14 - 753664
15 - 1507328
16 - 3014656
17 - 6029312
18 - 12058624
19 - 24117248
20 - 48234496
21 - 96468992
22 - 192937984
23 - 385875968
24 - 771751936
25 - 1543503872
VM [sigsegv]: sigNo 11
VM [error] (18-08-04 13:11:45): sigsegv
 new: 7f7d47a55000 .. 7f7d47c55000  old: 55d85d667000 .. 55d9cf44f000
 surv: 7f7d47c57000 .. 7f7d47c57000 .. 7f7d47e57000
VM [error] (18-08-04 13:11:45): in process 5
last selector: replaceFrom:to:with:startingAt:
last class: ArrayedCollection
sigsegv 4631
sigsegv 4639
os signal interrupt

Debugging

Once we have recorded the crash, we’re ready for debugging. So, start a “Visual / VM Debugger” (VDB for short) and connect to a replay server. Smalltalk/X launcher script can automate all this so it’s as easy as:

./smalltalk --replay --vdb

The --replay options tell the launcher script to replay last recorded execution, --vdb opens VDB and connects it to replay server. Not the (simulated) execution state is at process entry point (even before main() is executed).

Let just run it so we can see where it crashed. That’d be a good starting point anyway. So issue a c CLI command or use menu Run / Resume (or press F9, whatever you prefer). This would take a while but then it finally crashes:

Screenshot of VDB (1)

Here’s an excerpt from stack trace (from the screenshot):

/home/jv/Priv.../smalltalk/stx [stopped]
  Thread 15903.15903 [stopped]
     0 0x00007FD44066D025 [RUN] __lookup_lookup (runtime.h:311)
     1 0x00007FD44066D025 [RUN] __externalLooupCheck (send.c:3092)
     2 0x00007FD44066D025 [RUN] _SEND1 (send.c:3329)
     3 0x00007FD44605997E [S-C] SequenceableCollection >> #replaceFrom:to:with:startingAt: (SequenceableCollection.st:6587)

…and from the source (from the screenshot)…

...
if (lookup == nil) {
    int cnt = 0;
    while (current_cls != nil) {
>       lookup = __ClassInstPtr(current_cls)->c_lookupObject;
        if (lookup != nil) break;
        if (++cnt > 10000) goto cycle;
        current_cls =__ClassInstPtr(current_cls)->c_superclass;
    }
}  
...

Now lets talk about what we see:

The __lookup_lookup() (see above stack trace) is an internal VM routine that looks up a lookup object. This object is used to lookup a method to execute. Sounds complicated but for now it does not matter much what it does. If you’re interested, you may find details in 2. The important thing is that it crashes when dereferencing current_cls. If you look at variables (top-right pane in VDB window), you can immediately see that current_cls is somewhat wrong:

current_cls '0x0020002000200020 (invalid object: MemoryError: Cannot access memory at address 0x20002000200020)'

If you look closer at the object reference value, you see it’s quite funny. Let’s walk down the stack to see from where this funny value came from. Frame 2 __SEND1() is the place to look. When you select the frame in VDB window, again you immediately see that value of self is somewhat wrong:

Screenshot of VDB (2)

To stress the fact that something is wrong with the value of self, VDB shows it in red. See the value of a class? (TODO: Add screenshot) The self reference itself is kind of OK. I mean it points to an object in object heap, but its class is wrong. From the looks of it something is overwriting the contents of the object (self) with some nonsense.

Now comes the interesting part. The memory has already been overwritten, possibly many many cycles in the past, so how to find out what code exactly overwrites it? This is where all the recording and replaying becomes invaluable. The replay server allows one to go backward in time.

So, we set up a watchpoint on selfs class slot. In Smalltalk/X, object’s class is stored in first word of the object header. You can have a look in stc.h to see the C definition or, easier, use ptype 3 command:

(vdb) ptype selftype = struct __instance {    
    OBJ o_class;
    unsigned int o_size;    
    unsigned char o_space;    
    char o_flags;    
    unsigned char o_age;    
    unsigned char o_hashLow;    
    OBJ i_instvars[1];
} *

To set the watchpoint:

(vdb) p self
$2 = 0x0000564ff0482260 (invalid class: 0x0020002000200020)
(vdb) watch *0x0000564ff0482260
Hardware watchpoint 1: *0x0000564ff0482260

We must use raw object reference value rather than symbolic one (like watch &(self->o_class)) because the watchpoint would become invalid as soon as execution leaves _SEND4() since self is defined only in its scope. Now we just let the program to run in “reverse”. We are going back in time until the watchpoint hits and then check why it hit. Use reverse-continue command 4 or use menu Run -> Reverse.

In this case, this will take a long time (mainly due to the large heap). Eventually, the watchpoint will hit in Object >> #basicAt:put:

Screenshot of VDB (3)

Again an excerpt from the stack trace:

/home/jv/Priv.../smalltalk/stx [stopped]
  Thread 15903.15903 [stopped]
     0 0x00007FD445BC82F5 [S-C] Object >> #basicAt:put: (Object.st:1175)
     1 0x00007FD44651B0E4 [S-C] TwoByteString >> #basicAt:put: (TwoByteString.st:108)
     2 0x00007FD445B997DE [S-C] Object >> #at:put: (Object.st:742)
     3 0x00007FD44605729C [S-C] SequenceableCollection >> #from:to:put: (SequenceableCollection.st:6140)
     4 0x00007FD446056FFF [S-C] SequenceableCollection >> #atAllPut: (SequenceableCollection.st:6114)
     5 0x00007FD44651AAE1 [S-C] TwoByteString >> #basicNew: (TwoByteString.st:72)
     6 0x00007FD445C2D636 [S-C] Behavior >> #new: (Behavior.st:2929)
     7 0x00007FD4460A4727 [S-C] ArrayedCollection >> #copyEmptyAndGrow: (ArrayedCollection.st:391)
     8 0x00007FD446042BB8 [S-C] SequenceableCollection >> #, (SequenceableCollection.st:3712)
     9 0x00007FD446374686 [S-C] CharacterArray >> #, (CharacterArray.st:4114)
     ...

…and from the source:

...
if (__isSmallInteger(anObject)) {
    val = __intVal(anObject);
    if ((unsigned)val <= 0xFFFF) {
        if ((unsigned)indx < (nbytes>>1)) {
            unsigned short *sp;
 
            sp = (unsigned short *)(pFirst + (indx<<1));
>           *sp = val;
            RETURN ( anObject );
        }
    }
}
break;
...

The code above is little tricky and needs an explanation, at least what’s its intent. You may want to have a look at full source. Variable:

  • indx is “unboxed” value of argument index (given as SmallInteger, which is tagged),
  • nbytes is number of bytes in variable part of the receiver (essentially, a size of an object in bytes minus size of an object header and named instance variables, if any)
  • and finally, pFirst is a pointer of first indexed byte of the receiver (in this case, the receiver is an instance of Unicode16String so pFirst is a pointer to first character).

What the code above does is that after some type- and bound-checks it computes the address (sp) where to store the value (val) and then stores it:

sp = (unsigned short *)(pFirst + (indx<<1));
*sp = val;

The code is hard to read, I admit. The pFirst is actually a char * so we have to multiply the indx by 2 when using C pointer arithmetic. This is done by left-shift by one (indx<<1) rather than more readable multiplication, presumably because it was faster than multiplication in ancient time and compilers were not smart enough to optimize this themselves.

Anyways, let’s see what has happened. The watchpoint stops right after memory is modified, so let’s see how the corrupt object looks like now:

(vdb) p (OBJ)0x0000564ff0482260
$6 = 0x0000564ff0482260 (invalid class: 0x0000564f926e0020)

As you may see, the class reference looks “nearly” all right, only lower 2 bytes are these funny 0020 (remember, at the time of crash, the value was 0x0020002000200020). So, let’s go backward once more (use reverse-continue command or Reverse menu item). After a while, it hits again at the same place. Let’s see the corrupt object:

p (OBJ)0x0000564ff0482260
$8 = 0x0000564ff0482260 (a Unicode16String)

Here we go! Now the object is all right (references Unicode16String). So now we know that it is Object >> #basicAt:put: that smashes object’s class reference. But why? Let’s see what are the values of indx, pFirst and sp:

(vdb) p /x indx
$3 = 0x51fffff0
(vdb) p /x pFirst
$4 = 0x56504c482280
(vdb) p /x sp
$5 = 0x564ff0482260

As you can see, the index is rather big (nearly 2G). However, more interesting fact is that sp is smaller than pFirst. This is clearly wrong - pFirst is pointer to first indexed variable so sp cannot be smaller than that. Looking at the code that computes sp, the problem is probably clear to every C programmer:

sp = (unsigned short *)(pFirst + (indx<<1));

This is very likely an integer overflow. What is the type of indx?

(vdb) ptype indx
type = int

Oh, an int, a signed 32 bit integer. Since value of indx is little below 2G, indx<<1 would be negative:

(vdb) p indx << 1
$12 = -1543503904

Here we go! Job done.

Fun with ZFS, part 2: Creating Debian ZFS Rescue USB Image

How to generate TSL/SSL certificate and get it signed

STX:LIBJAVA used for real

Unmanaged delegates in CLR

Debugging mixed native-CLR application in WinDBG

A taste of .NET in Bee Smalltalk

Fun with ZFS, part 1: Installing Debian Jessie on ZFS Root

First Post