Recently, LiveCode Builder has gained a namespace resolution operator .. It allows LCB modules to declare functions, constants, and variables which have the same name, by providing a way for modules that import them to distinguish between them.
During this work, we ran into a problem: the modified LCB compiler (lc-compile) worked correctly in "Debug" builds, but reliably crashed in "Release" builds. More peculiarly, we found that lc-compile crashes depended on which compiler was used: some builds using certain versions of GCC crashed reliably, while some builds using clang worked fine. We spent a lot of time staring at output from gdb and Valgrind, and came to the conclusion that maybe it was a compiler bug in GCC.
It turned out that we were wrong. When we switched to using clang to build full LiveCode releases, the mysterious crashes popped up again. Since this had now become a problem that was breaking the build, I decided to dig into it again. Originally, we'd not been able to duplicate the crash in very recent versions of GCC and clang, so my first step was to try and make lc-compile crash when compiled with GCC 6.
The problem seemed to revolve around some code in the following form:
class T; typedef T* TPtr; // (1) function returning true iff r_value was set bool maybeFetch(TPtr& r_value); void f() { TPtr t_value; if (maybeFetch(t_value)) { // (2) dereference t_value } }
lc-compile was sometimes, but not reliably, crashing at point (2).
Initially, when I compiled with GCC 6, I was not able to induce a crash. However, I did receive a warning that t_value might be used without being initialised. I therefore modified the implementation of f() to initialise t_value at its declaration:
void f() { TPtr t_value = nullptr; // ... }
With that modification, the crash became reliably reproducible in all build modes using all of the compilers I had available. This drew my suspicion to the maybeFetch() function (1). The function's API contract requires it to return true if (and only if) it sets its out parameter r_value, and return false otherwise.
So, I had a look at it, and it looked fine. What else could be going wrong?
Much of lc-compile is implemented using a domain-specific language called Gentle, which generates bison and flex grammars, which are in turn used to generate some megabytes of C code that's hard to read and harder to debug.
I disappeared into this code for quite a while, and couldn't find anything to suggest that the Gentle grammar was wrong, or that the generated code was the cause of the segfault. What I Because explicit initialisation made the crashes reliable and reproducible, I came to the conclusion that maybeFetch() was sometimes returning true A simplified form of maybeFetch() as I found it was: Needless to say, when I saw the problem it was moment of slightly bemused hilarity. This function had been reviewed several times by various team members, and all of us had missed the missing block braces { ... } hidden by misleading indentation. Once these braces had been inserted, all of the problems went away. What lessons could be taken away from this? This was a very easy mistake to make, and an easy issue to miss in a code review, but it was very costly to clean up. I hope that we'll be able to make some changes to our development processes and our coding style to try and avoid things like this happening in the future. Update: My colleague points out another contributing factor to making the error hard to spot: the condition(t_loop_var) was a composite condition spread across multiple lines, i.e. This code layout maybe makes it slightly less obvious where the body of the if lies.
bool maybeFetch(TPtr& r_value)
{
for (TPtr t_loop_var = /* loop form ... */)
{
if (condition(t_loop_var))
*r_value = t_loop_var;
return true;
}
return false;
}
if (condition(t_loop_var))
{ // missing open brace
*r_value = t_loop_var;
return true;
} // missing close brace
if (conditionA(t_loop_var) ||
(conditionB(t_loop_var) &&
conditionC(t_loop_var)))
*r_value = t_loop_var;
return true;
No comments:
Post a Comment