Monday, December 05, 2016

When C uninitialised variables and misleading whitespace combine

Recently, LiveCode Builder has gained a namespace resolution operator .. It allows LCB modules to declare functions, constants, and variables which have the same name, by providing a way for modules that import them to distinguish between them.

During this work, we ran into a problem: the modified LCB compiler (lc-compile) worked correctly in "Debug" builds, but reliably crashed in "Release" builds. More peculiarly, we found that lc-compile crashes depended on which compiler was used: some builds using certain versions of GCC crashed reliably, while some builds using clang worked fine. We spent a lot of time staring at output from gdb and Valgrind, and came to the conclusion that maybe it was a compiler bug in GCC.

It turned out that we were wrong. When we switched to using clang to build full LiveCode releases, the mysterious crashes popped up again. Since this had now become a problem that was breaking the build, I decided to dig into it again. Originally, we'd not been able to duplicate the crash in very recent versions of GCC and clang, so my first step was to try and make lc-compile crash when compiled with GCC 6.

The problem seemed to revolve around some code in the following form:

class T;
typedef T* TPtr;

// (1) function returning true iff r_value was set
bool maybeFetch(TPtr& r_value);

void f()
{
    TPtr t_value;
    if (maybeFetch(t_value))
    {
        // (2) dereference t_value
    }
}

lc-compile was sometimes, but not reliably, crashing at point (2).

Initially, when I compiled with GCC 6, I was not able to induce a crash. However, I did receive a warning that t_value might be used without being initialised. I therefore modified the implementation of f() to initialise t_value at its declaration:

void f()
{
    TPtr t_value = nullptr;
    // ...
}

With that modification, the crash became reliably reproducible in all build modes using all of the compilers I had available. This drew my suspicion to the maybeFetch() function (1). The function's API contract requires it to return true if (and only if) it sets its out parameter r_value, and return false otherwise.

So, I had a look at it, and it looked fine. What else could be going wrong?

Much of lc-compile is implemented using a domain-specific language called Gentle, which generates bison and flex grammars, which are in turn used to generate some megabytes of C code that's hard to read and harder to debug.

I disappeared into this code for quite a while, and couldn't find anything to suggest that the Gentle grammar was wrong, or that the generated code was the cause of the segfault. What I did find suggested that there were problems with the values being provided by the maybeFetch() function.

Because explicit initialisation made the crashes reliable and reproducible, I came to the conclusion that maybeFetch() was sometimes returning true without setting its out parameter. So, what was maybeFetch() doing?

A simplified form of maybeFetch() as I found it was:

bool maybeFetch(TPtr& r_value)
{
    for (TPtr t_loop_var = /* loop form ... */)
    {
        if (condition(t_loop_var))
            *r_value = t_loop_var;
            return true;
    }
    return false;
}

Needless to say, when I saw the problem it was moment of slightly bemused hilarity. This function had been reviewed several times by various team members, and all of us had missed the missing block braces { ... } hidden by misleading indentation.

if (condition(t_loop_var))
{ // missing open brace
    *r_value = t_loop_var;
    return true;
} // missing close brace

Once these braces had been inserted, all of the problems went away.

What lessons could be taken away from this?

  1. The bug itself eluded review because of misleading indentation. GCC 6 provides a "misleading indentation" warning which would have immediately flagged up this warning if it had been enabled. We do not use GCC 6 for LiveCode builds; even if we did, we wouldn't be able to enable the "misleading indentation" warning to good effect because the LiveCode C++ sources don't currently use a consistent indentation style. This problem could maybe be avoided if LiveCode builds enforced a specific indentation style (in which case the bug would have been obvious in review), or if we regularly did builds with GCC 6 and -Werror=misleading-indentation.
  2. The effect of the bug was an API contract violation, where the relationship between the return value and the value of an out parameter wasn't satisfied. The problem could have been avoided if the API contract was expressed in a way that the compiler could check. C++17 adds std::optional, which combines the idea of "is there a value or not" with returning the value itself. If the function took the form std::optional maybeFetch() then it would have been impossible for it to claim to return a value without actually returning one.
  3. Finally, the problem was obfuscated by failing to initialise stack variables. Although, if maybeFetch() was working correctly, the pointer on the stack would get initialised before use, in this case, it didn't. Diagnosing the problem may have been much easier if we routinely initialised stack variables to suitable uninformative values at the point of declaration, even if we think they _should_ get initialised via a function's out parameter before use.

This was a very easy mistake to make, and an easy issue to miss in a code review, but it was very costly to clean up. I hope that we'll be able to make some changes to our development processes and our coding style to try and avoid things like this happening in the future.

Update: My colleague points out another contributing factor to making the error hard to spot: the condition(t_loop_var) was a composite condition spread across multiple lines, i.e.

if (conditionA(t_loop_var) ||
    (conditionB(t_loop_var) &&
     conditionC(t_loop_var)))

    *r_value = t_loop_var;
    return true;

This code layout maybe makes it slightly less obvious where the body of the if lies.

Thursday, October 27, 2016

Playing with Bus1

David Herrmann and Tom Gundersen have been working on new, performant Linux interprocess communication (IPC) proposals for a few years now. First came their proposed kdbus system, which would have provided a DBus-compatible IPC system, but this didn't actually get merged because of several design issues that couldn't be worked around (mostly security-related).

So, they went back to the drawing board, and now have come back with a new IPC system called Bus1, which was described in a LWN article back in August. Yesterday, they posted draft patches to the Linux kernel mailing list, and the kernel module and userspace libraries are available on GitHub for your convenience.

I decided to find out what's involved in getting the experimental Bus1 code up and running on my system. I run Fedora Linux, but broadly similar steps can be used on other Linux distributions.

Installing tools

The first thing to do is to install some development tools and headers.

dnf install git kernel-devel
dnf builddep kernel

I'm going to need git for getting the source code, and the kernel-devel development headers for compiling the Bus1 kernel module. The special dnf builddep command automatically fetches all of the packages needed for compiling a particular package — in this case, we're compiling a kernel module, so just grabbing the tools needed for compiling the kernel should include everything necessary.

Building the kernel module

I need to get the Bus1 kernel module's source code using git:

mkdir ~/git
cd ~/git
git clone https://github.com/bus1/bus1.git
cd bus1

With all of the tools I need already installed, I can very simply run

make

to compile the Bus1 module.

Finally, the Bus1 Makefile provides an all-in-one solution for running the module's tests and loading it into the running kernel:

make tt

After several seconds of testing and benchmarking, I get some messages like:

[ 1555.889884] bus1: module verification failed: signature and/or required key missing - tainting kernel
[ 1555.891534] bus1: run selftests..
[ 1555.893530] bus1: loaded

Success! Now my Linux system has Bus1 loaded into its kernel! But what can be done with it? I need some userspace code that understands how to use Bus1 IPC.

Building the userspace library

The Bus1 authors have provided a basic userspace library for use when writing programs that use Bus1. How about building it and running its tests to check that Bus1 is actually usable?

Some additional tools are needed for compiling libbus1, because it uses GNU Autotools rather than the kernel build system:

sudo dnf install autoconf automake

As before, I need to checkout the source code:

cd ~/git
git clone https://github.com/bus1/libbus1.git

I can then set up its build system and configure the build by running:

./autogen.sh
./configure

But there's a problem! I need to install a couple of obscure dependencies: David Herrmann's c-sundry and c-rbtree libraries.

This is accomplished by something along the lines of:

cd ~/git
git clone https://github.com/c-util/c-sundry.git
git clone https://github.com/c-util/c-rbtree
# Install c-sundry
cd ~/git/c-sundry
./autogen.sh
./configure
make
sudo make install
# Install c-rbtree
cd ~/git/c-rbtree
./autogen.sh
./configure
make
sudo make install

So, with dependency libraries installed, it's now possible to build libbus1. Note that the configure script won't pick up the dependencies installed because on Fedora it doesn't scan the /usr/local/lib/pkgconfig directory by default, so I have to give it a bit of help.

cd ~/git/libbus1
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./configure
make

Amusingly, this failed the first time due to a bug for which I submitted a patch. However, with the patch applied to c-sundry, I've got a successful build of libbus1!

I also ended up having to add /usr/local/lib to /etc/ld.so.conf so that the c-rbtree library got detected properly when running the libbus1 test suite.

Even after that, unfortunately the test suite failed. Clearly the Bus1 userspace libraries aren't as well-developed as the kernel module! Maybe someone could do something about that...?