Monday, December 05, 2016

When C uninitialised variables and misleading whitespace combine

Recently, LiveCode Builder has gained a namespace resolution operator .. It allows LCB modules to declare functions, constants, and variables which have the same name, by providing a way for modules that import them to distinguish between them.

During this work, we ran into a problem: the modified LCB compiler (lc-compile) worked correctly in "Debug" builds, but reliably crashed in "Release" builds. More peculiarly, we found that lc-compile crashes depended on which compiler was used: some builds using certain versions of GCC crashed reliably, while some builds using clang worked fine. We spent a lot of time staring at output from gdb and Valgrind, and came to the conclusion that maybe it was a compiler bug in GCC.

It turned out that we were wrong. When we switched to using clang to build full LiveCode releases, the mysterious crashes popped up again. Since this had now become a problem that was breaking the build, I decided to dig into it again. Originally, we'd not been able to duplicate the crash in very recent versions of GCC and clang, so my first step was to try and make lc-compile crash when compiled with GCC 6.

The problem seemed to revolve around some code in the following form:

class T;
typedef T* TPtr;

// (1) function returning true iff r_value was set
bool maybeFetch(TPtr& r_value);

void f()
{
    TPtr t_value;
    if (maybeFetch(t_value))
    {
        // (2) dereference t_value
    }
}

lc-compile was sometimes, but not reliably, crashing at point (2).

Initially, when I compiled with GCC 6, I was not able to induce a crash. However, I did receive a warning that t_value might be used without being initialised. I therefore modified the implementation of f() to initialise t_value at its declaration:

void f()
{
    TPtr t_value = nullptr;
    // ...
}

With that modification, the crash became reliably reproducible in all build modes using all of the compilers I had available. This drew my suspicion to the maybeFetch() function (1). The function's API contract requires it to return true if (and only if) it sets its out parameter r_value, and return false otherwise.

So, I had a look at it, and it looked fine. What else could be going wrong?

Much of lc-compile is implemented using a domain-specific language called Gentle, which generates bison and flex grammars, which are in turn used to generate some megabytes of C code that's hard to read and harder to debug.

I disappeared into this code for quite a while, and couldn't find anything to suggest that the Gentle grammar was wrong, or that the generated code was the cause of the segfault. What I did find suggested that there were problems with the values being provided by the maybeFetch() function.

Because explicit initialisation made the crashes reliable and reproducible, I came to the conclusion that maybeFetch() was sometimes returning true without setting its out parameter. So, what was maybeFetch() doing?

A simplified form of maybeFetch() as I found it was:

bool maybeFetch(TPtr& r_value)
{
    for (TPtr t_loop_var = /* loop form ... */)
    {
        if (condition(t_loop_var))
            *r_value = t_loop_var;
            return true;
    }
    return false;
}

Needless to say, when I saw the problem it was moment of slightly bemused hilarity. This function had been reviewed several times by various team members, and all of us had missed the missing block braces { ... } hidden by misleading indentation.

if (condition(t_loop_var))
{ // missing open brace
    *r_value = t_loop_var;
    return true;
} // missing close brace

Once these braces had been inserted, all of the problems went away.

What lessons could be taken away from this?

  1. The bug itself eluded review because of misleading indentation. GCC 6 provides a "misleading indentation" warning which would have immediately flagged up this warning if it had been enabled. We do not use GCC 6 for LiveCode builds; even if we did, we wouldn't be able to enable the "misleading indentation" warning to good effect because the LiveCode C++ sources don't currently use a consistent indentation style. This problem could maybe be avoided if LiveCode builds enforced a specific indentation style (in which case the bug would have been obvious in review), or if we regularly did builds with GCC 6 and -Werror=misleading-indentation.
  2. The effect of the bug was an API contract violation, where the relationship between the return value and the value of an out parameter wasn't satisfied. The problem could have been avoided if the API contract was expressed in a way that the compiler could check. C++17 adds std::optional, which combines the idea of "is there a value or not" with returning the value itself. If the function took the form std::optional maybeFetch() then it would have been impossible for it to claim to return a value without actually returning one.
  3. Finally, the problem was obfuscated by failing to initialise stack variables. Although, if maybeFetch() was working correctly, the pointer on the stack would get initialised before use, in this case, it didn't. Diagnosing the problem may have been much easier if we routinely initialised stack variables to suitable uninformative values at the point of declaration, even if we think they _should_ get initialised via a function's out parameter before use.

This was a very easy mistake to make, and an easy issue to miss in a code review, but it was very costly to clean up. I hope that we'll be able to make some changes to our development processes and our coding style to try and avoid things like this happening in the future.

Update: My colleague points out another contributing factor to making the error hard to spot: the condition(t_loop_var) was a composite condition spread across multiple lines, i.e.

if (conditionA(t_loop_var) ||
    (conditionB(t_loop_var) &&
     conditionC(t_loop_var)))

    *r_value = t_loop_var;
    return true;

This code layout maybe makes it slightly less obvious where the body of the if lies.