Tuesday, February 21, 2017

How to stop mspdbsrv from breaking your continuous integration system

Over the last month, I've been working on getting the LiveCode build cluster to do Windows builds using Visual Studio 2015. We've been using Visual Studio 2010 since I originally set up the build service in mid-2015. This upgrade was prompted by needing support for some C++ language features used by the latest version of libskia.

Once the new Windows Server buildbot workers had their tools installed and were connected to the build service, I noticed a couple of pretty weird things going on:

  • after one initial build, the build workers were repeated failing to clean the build tree in preparation for for the next build
  • builds were getting "stuck" after completing successfully, and were then being detected as timed out and forcibly killed

Blocked build tree cleanup

The first problem was easy to track down. I guessed that the clean step was failing because some process still had an open file handle to one of the files or directories that the worker was trying to delete. I used the Windows 'Resource Monitor' application (resmon.exe), which can be launched from the 'Start' menu or from 'Task Manager', to find the offending process. The 'CPU' tab lets you search all open file handles on the system by filename, and I quickly discovered that mspdbsrv.exe was holding a file handle to one of the build directories.

What is mspdbsrv?

mspdbsrv is a helper service used by the Visual Studio C and C++ compiler, cl.exe; it collects debugging information for code that's being compiled and writes out .pdb databases. CL automatically spawns mspdbsrv if debug info is being generated and it connect to an existing instance. When the build completes, CL doesn't clean up any mspdbsrv that it spawned; it just leaves it running. There's no way to prevent CL from doing this.

So, it looked like the abandoned mspdbsrv instance had its current working directory set to one of the directories that the build worker was trying to delete, and on Windows you can't delete a directory if there's a process running there. So much for the first problem.

Build step timeouts

The second issue was more subtle -- but it also appeared to be due to the lingering mspdbsrv process! I noticed that mspdbsrv was actually holding a file handle to one of the buildbot worker's internal log files. It appears that buildbot doesn't close file handles when starting build processes, and these handles were being inherited by mspbdsrv, which was holding them open. As result, the buildbot worker (correctly) inferred that there were still unfinished build job processes running, and didn't report the build as completed.

Mismatched MSVC versions

When I thought through this a bit further, I realised there was another problem being caused by lingering mspdbsrv instances. Some of the builds being handled by the Windows build workers need to use MSVC 2015, and some still need to use MSVC 2010. Each type of build should use the corresponding version of mspdbsrv, but by default CL always connects to any available service process.

Steps towards a fix

So, what was the solution?

  1. Run mspdbsrv explicitly as part of the build setup, and keep a handle to the process so that it can be terminated once the build has finished.

  2. Launch mspdbsrv with a current working directory outside the build tree.
  3. Force CL to use a specific mspdbsrv instance rather than just picking any available one.

LiveCode CI builds are now performed using a Python helper script. Here's a snippet that implements all of these requirements (note that it hardcodes the path to the MSVC 2010 mspbdsrv.exe:

import os
import subprocess
import uuid

# Find the 32-bit program files directory
def get_program_files_x86():
    return os.environ.get('ProgramFiles(x86)',
                          os.environ.get('ProgramFiles',
                                         'C:\\Program Files\\'))

# mspdbsrv is the service used by Visual Studio to collect debug
# data during compilation.  One instance is shared by all C++
# compiler instances and threads.  It poses a unique challenge in
# several ways:
#
# - If not running when the build job starts, the build job will
#   automatically spawn it as soon as it needs to emit debug symbols.
#   There's no way to prevent this from happening.
#
# - The build job _doesn't_ automatically clean it up when it finishes
#
# - By default, mspdbsrv inherits its parent process' file handles,
#   including (unfortunately) some log handles owned by Buildbot.  This
#   can prevent Buildbot from detecting that the compile job is finished
#
# - If a compile job starts and detects an instance of mspdbsrv already
#   running, by default it will reuse it.  So, if you have a compile
#   job A running, and start a second job B, job B will use job A's
#   instance of mspdbsrv.  If you kill mspdbsrv when job A finishes,
#   job B will die horribly.  To make matters worse, the version of
#   mspdbsrv should match the version of Visual Studio being used.
#
# This class works around these problems:
#
# - It sets the _MSPDBSRV_ENDPOINT_ to a value that's probably unique to
#   the build, to prevent other builds on the same machine from sharing
#   the same mspdbsrv endpoint
#
# - It launches mspdbsrv with _all_ file handles closed, so that it
#   can't block the build from being detected as finished.
#
# - It explicitly kills mspdbsrv after the build job has finished.
#
# - It wraps all of this into a context manager, so mspdbsrv gets killed
#   even if a Python exception causes a non-local exit.
class UniqueMspdbsrv(object):
    def __enter__(self):
        os.environ['_MSPDBSRV_ENDPOINT_'] = str(uuid.uuid4())

        mspdbsrv_exe = os.path.join(get_program_files_x86(),
            'Microsoft Visual Studio 10.0\\Common7\\IDE\\mspdbsrv.exe')
        args = [mspdbsrv_exe, '-start', '-shutdowntime', '-1']
        print(' '.join(args))
        self.proc = subprocess.Popen(args, cwd='\\', close_fds=True)
        return self

    def __exit__(self, type, value, traceback):
        self.proc.terminate()
        return False
You can then use this when implementing a build step:
with UniqueMspdbsrv() as mspdbsrv:
    # Do your build steps here (e.g. msbuild invocation)
    pass

# mspdbsrv automatically cleaned up by context manager

It took me a couple of days to figure out what was going on and to find an adequate solution. A lot of very tedious trawling through obscure bits of the Internet were required to find all of the required pieces; for example, Microsoft do not document arguments to mspdbsrv or the environment variables that it understands anywhere on MSDN.

Hopefully, if you are running into problems with your Jenkins or buildbot workers interacting weirdly with Microsoft Visual Studio C or C++ builds, this will save you some time!

2 comments:

Unknown said...

Thanks Peter for your findings and publishing them.
I'm trying to setup a CI-build using Visual Studio 2015.

In short I'm starting mspdbsrv with option "-endpoint ABC" and then
running cl.exe with the env-variable "_MSPDBSRV_ENDPOINT_=ABC".

However cl.exe throws this error:
... fatal error C1902: Program database manager mismatch; please check your installation

Did you stumble over this or do you have any hints?

Peter Brett said...

Hi Felix, I'm glad you found this useful. Unfortunately I haven't come across that error, so I don't think I'll be able to help you out. The main thing I recommend is double checking that the build works properly when run outside buildbot, as that can help narrow down problems that are caused by the buildbot environment as opposed to other configuration issues (e.g. an incompletely installed Visual Studio toolchain).