Wednesday, February 22, 2017

Deploying buildbot workers on Windows Server 2016

At LiveCode, we use a buildbot system to perform our continuous integration and release builds. Recently, we moved from building our Windows binaries in a Linux container using Wine to building on a native Windows system running in an Azure virtual machine.

Deploying buildbot on Windows is not totally straightforward, and the documentation for installing it is quite hard to follow. It's quite important to us that our build infrastructure is reproducible, so we wanted to have a procedure that could bring up a buildbot worker on a newly-allocated server quickly and with as little manual intervention as possible.

This blog post provides step-by-step instructions for installing buildbot 0.8.12 on Windows Server 2016 Datacenter Edition, with explanations of what's going on at each step. The target configuration is a buildbot worker that runs as an unprivileged user and communicates with the buildbot master over an SSL tunnel. All of the commands are written using PowerShell. It's recommended to run them via the 'PowerShell ISE' application, running as a user in the 'Administrators' group. The full script is available as a GitHub Gist.

Although this describes installing buildbot 0.8.12, there's no reason it shouldn't work for buildbot 0.9.x. If you try it, please let me know how you get on in the comments.

Note: Don't run these commands unless you've checked them very carefully first. They're adapted from the scripts used for our buildbot deployment, and may not work as you expect. You should use them as the basis of your own installation script and test it thoroughly before using it in production.

Support functions

First, ensure that the script stops immediately if any error is thrown, and that "verbose" messages are displayed.

$VerbosePreference = 'Continue'
$ErrorActionPreference = 'Stop'

By default, PowerShell doesn't convert non-zero exit codes from subprocesses into errors, so define a helper function to do this. CheckLastExitCode throws an error on any non-zero exit code, but if there are other exit codes that should be considered successful, you can pass in an array of permitted exit codes, e.g. CheckLastExitCode -SuccessCodes @(0,10).

function CheckLastExitCode {
    param ([int[]]$SuccessCodes = @(0))
    if ($SuccessCodes -notcontains $LASTEXITCODE) {
        throw "Command failed (exit code $LASTEXITCODE)"
    }
}

The rest of the script relies on a Fetch-BuildbotResource function that obtains a named resource file and places it in a given output location, and you'll need to implement it yourself. Fill in the blanks (possibly with some sort of Invoke-WebRequest):

function Fetch-BuildbotResource {
    param([string]$Path,
          [string]$OutFile)
    # Your code goes here
}

It's also a good idea to activate Windows. The virtual machines provisioned by Azure may not have been activated; this command will do so automatically.

cscript.exe C:\Windows\System32\slmgr.vbs /ato

Finally, define variables with the root path for the buildbot installation and the IP or DNS address of the buildbot master, and create the buildbot worker's root directory:

$k_buildbot_root = 'C:\buildbot'
$k_buildbot_master = 'buildbot.example.org'

New-Item -Path $k_buildbot_root -ItemType Container -Force | Out-Null

Installing programs with Chocolatey

Chocolatey is a package manager for Windows that can automatically install a variety of applications and services in much the same way as the Linux `apt-get`, `dnf` or `yum` programs. Here, you can use it for installing Python (for running buildbot) and for installing the stunnel SSL tunnel service.

Install Chocolatey by the time-honoured process of "downloading a random script from the Internet and running it as a superuser".

$env:ChocolateyInstall = 'C:\ProgramData\chocolatey'

# Install Chocolatey, if not already present
if (!(Test-Path -LiteralPath $env:ChocolateyInstall -PathType Container)) {
    Invoke-WebRequest 'https://chocolatey.org/install.ps1' -UseBasicParsing | Invoke-Expression
}

Next, use Chocolatey to install stunnel and Python 2.7:

Write-Verbose 'Installing Python and stunnel'
choco install --yes stunnel python2
CheckLastExitCode

Installing Python modules and buildbot

It's easiest to install buildbot and its dependencies using the pip Python package manager.

Write-Verbose 'Installing Python modules'
$t_pip = 'C:\Python27\Scripts\pip.exe'
& $t_pip install pypiwin32 buildbot-slave==0.8.12
CheckLastExitCode

The pypiwin32 package installs some DLLs that are required for buildbot to run as a service, but when installed with pip, these DLLs are not automatically registered in the Windows registry. This caused me at least a day of wondering why my buildbot service was failing to start with the super informative message:

Luckily, pypiwin32 installs a script that will set everything up properly.

Write-Verbose 'Registering pywin32 DLLs'
$t_python = 'C:\Python27\python.exe'
& $t_python C:\Python27\Scripts\pywin32_postinstall.py -install

SSL tunnel service

You'll need to configure stunnel to run on your buildbot master, and listen on port 9988. I recommend configuring the buildbot master's stunnel with a certificate, and then making sure workers always fully authenticate the certificate when connecting to it. This will prevent people from obtaining your workers' login credentials by impersonating the buildbot master machine.

Write-Verbose 'Installing buildbot-stunnel service'
$t_stunnel = 'C:\Program Files (x86)\stunnel\bin\stunnel.exe'
$t_stunnel_conf = Join-Path $k_buildbot_root 'stunnel.conf'
$t_stunnel_crt  = Join-Path $k_buildbot_root 'buildbot.crt'

# Fetch the buildbot master's certificate, which the worker uses to
# authenticate the master
Fetch-BuildbotResource `
    -Path 'buildbot/stunnel/master.crt' -Outfile $t_stunnel_crt

# Create the stunnel configuration file
Set-Content -Path $t_stunnel_conf -Value @"
[buildbot]
client = yes
accept = 127.0.0.1:9989
cafile = $t_stunnel_crt
verify = 3
connect = ${k_buildbot_master}:9988
"@

# Register the stunnel service, if not already present.  The path to
# stunnel.exe contains spaces, so quote it in the service command line.
if (!(Get-Service buildbot-stunnel -ErrorAction Ignore)) {
    New-Service -Name buildbot-stunnel `
        -BinaryPathName "`"$t_stunnel`" -service `"$t_stunnel_conf`"" `
        -DisplayName 'Buildbot Secure Tunnel' `
        -StartupType Automatic
}

The buildbot worker instance

Creating and configuring the worker instance, and setting up buildbot to run as a Windows service, are the most complicated part of the installation process. Before dealing with the Windows service, instantiate a worker with the info it needs to connect to the buildbot master.

First, set up a bunch of values that will be needed later. The worker's name will just be the name of the server it's running on, and it will be configured to use a randomly-generated password.

Write-Verbose 'Initialising buildbot worker'

# Needed for password generation
Add-Type -AssemblyName System.Web

$t_buildbot_worker_script = 'C:\Python27\Scripts\buildslave'

$t_worker_dir = Join-Path $k_buildbot_root worker
$t_worker_name = $env:COMPUTERNAME
$t_worker_password = `
    [System.Web.Security.Membership]::GeneratePassword(12,0)
$t_worker_admin = 'Example Organisation'

Run buildbot to actually instantiate the worker. We have to manually check the contents of the standard output from the setup process, because the exit status isn't a reliable indicator of success.

$t_log = Join-Path $k_buildbot_root setup.log
Start-Process -Wait -NoNewWindow -FilePath $t_python `
    -ArgumentList @($t_buildbot_worker_script, 'create-slave', `
        $t_worker_dir, '127.0.0.1', $t_worker_name,
        $t_worker_password) `
    -RedirectStandardOutput $t_log

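# Check log file contents.  Start-Process doesn't set $LASTEXITCODE, so
# report the log location rather than an exit code.  The log is echoed
# with Write-Warning because Write-Error would stop the script at the
# first line while $ErrorActionPreference is 'Stop'.
$t_expected = "buildslave configured in $t_worker_dir"
if (@(Get-Content $t_log)[-1] -ne $t_expected) {
    Get-Content $t_log | Write-Warning
    throw "Build worker setup failed (see $t_log)"
}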

It's helpful to provide some information about the host and who administrates it.

Set-Content -Path (Join-Path $t_worker_dir 'info\admin') `
    -Value $t_worker_admin
Set-Content -Path (Join-Path $t_worker_dir 'info\host') `
    -Value (Get-WmiObject -Class Win32_OperatingSystem).Caption

While testing our Windows-based buildbot workers, I was getting "slave lost" errors during many build steps. Getting the workers to send very frequent "keep alive" messages to the build master prevented this from happening almost entirely. I used a 10 second period, but you might find that unnecessarily frequent.

$t_config = Join-Path $t_worker_dir buildbot.tac
Get-Content $t_config | `
    ForEach {$_ -replace '^keepalive\s*=\s*.*$', 'keepalive = 10'} | `
    Set-Content "$t_config.new"
Remove-Item $t_config
Move-Item "$t_config.new" $t_config
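
If you'd like to try the substitution out before pointing it at a real buildbot.tac, the same edit can be sketched in Python (which is what buildbot.tac is written in anyway). This is only an illustration: `set_keepalive` is a hypothetical helper name, and the sample text in the usage note assumes that the generated file contains a line of the form `keepalive = 600`.

```python
import re


def set_keepalive(tac_text, seconds):
    """Return tac_text with its 'keepalive = ...' line set to `seconds`.

    Mirrors the PowerShell -replace above.  [ \t]* is used instead of \s*
    so that the pattern can never match across a line break.
    """
    return re.sub(r'^keepalive[ \t]*=.*$',
                  'keepalive = %d' % seconds,
                  tac_text,
                  flags=re.MULTILINE)
```

For example, `set_keepalive("port = 9989\nkeepalive = 600\n", 10)` leaves the port line alone and rewrites only the keepalive line.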

Configuring the buildbot service

Now for the final part: getting buildbot to run as a Windows service. It's a bad idea to run the worker as a privileged user, so this will create a 'BuildBot' user with a randomly-generated password, configure the service to use that account, and make sure it has full access to the worker's working directory.

Some of the commands used in this section expect passwords to be handled in the form of "secure strings" and some expect them to be handled in the clear. There's a fair degree of shuttling between the two representations.

Once again, begin by setting up some variables to use during these steps.

Write-Verbose 'Installing buildbot service'

$t_buildbot_service_script = 'C:\Python27\Scripts\buildbot_service.py'
$t_service_name = 'BuildBot'
$t_user_name = $t_service_name
$t_full_user_name = "$env:COMPUTERNAME\$t_service_name"

$t_user_password_clear = `
    [System.Web.Security.Membership]::GeneratePassword(12,0)
$t_user_password = `
    ConvertTo-SecureString $t_user_password_clear -AsPlainText -Force

Create the 'BuildBot' user:

$t_user = New-LocalUser -AccountNeverExpires `
    -PasswordNeverExpires `
    -UserMayNotChangePassword `
    -Name $t_user_name `
    -Password $t_user_password

You need to create the buildbot service by running the installation script provided by buildbot. Although there's a New-Service command in PowerShell, the pywin32 support for services written in Python expects a variety of registry keys to be set up correctly, and it won't work properly if they're not.

& $t_python $t_buildbot_service_script `
    --username $t_full_user_name `
    --password $t_user_password_clear `
    --startup auto install
CheckLastExitCode

It's still necessary to tell the service where to find the worker directory. You can do this by creating a special registry value that the service checks on startup to discover its workers.

$t_parameters_key = "HKLM:\SYSTEM\CurrentControlSet\Services\$t_service_name\Parameters"
New-Item -Path $t_parameters_key -Force
Set-ItemProperty -Path $t_parameters_key -Name "directories" `
    -Value $t_worker_dir

Although the service is configured to start as the 'BuildBot' user, that user doesn't yet have the permissions required to read and write in the worker directory.

$t_acl = Get-Acl $t_worker_dir
$t_access_rule = New-Object `
    System.Security.AccessControl.FileSystemAccessRule `
    -ArgumentList @($t_full_user_name, 'FullControl', `
        'ContainerInherit,ObjectInherit', 'None', 'Allow')
$t_acl.SetAccessRule($t_access_rule)
Set-Acl $t_worker_dir $t_acl

Granting 'Log on as a service' rights

Your work is nearly done! However, there's one task that I have not yet worked out how to automate, and that still requires manual intervention: granting the 'BuildBot' user the right to log on as a service. Without this right, the buildbot service will fail to start with a permissions error.

  1. Open the 'Local Security Policy' tool
  2. Choose 'Local Policies' -> 'User Rights Assignment' in the tree
  3. Double-click on 'Log on as a service' in the details pane
  4. Click 'Add User or Group', and add 'BuildBot' to the list of accounts

Time to launch

Everything should now be correctly configured!

There's one final bit of work required: you need to add the worker's username and password to the buildbot master's list of authorised workers. If you need it, you can obtain the username and password for the worker using PowerShell:

Get-Content C:\buildbot\worker\buildbot.tac | `
    Where {$_ -match '^(slavename|passwd)' }
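
If you want to collect the credentials from several workers, it may be easier to parse them than to read them off the screen. Here's a minimal sketch in Python, assuming that the generated buildbot.tac assigns plain string literals to `slavename` and `passwd`, each on its own line. The function name `read_worker_credentials` is made up for this example.

```python
import ast
import re


def read_worker_credentials(tac_text):
    """Extract the slavename and passwd assignments from buildbot.tac text.

    Assumes each value is a simple Python string literal on one line.
    Returns a dict containing whichever of the two keys were found.
    """
    creds = {}
    for line in tac_text.splitlines():
        match = re.match(r'^(slavename|passwd)\s*=\s*(.+)$', line)
        if match:
            # literal_eval parses the quoted value without executing code
            creds[match.group(1)] = ast.literal_eval(match.group(2).strip())
    return creds
```

You'd call it with the contents of the file, e.g. `read_worker_credentials(open(r'C:\buildbot\worker\buildbot.tac').read())`.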

You can use the `Start-Service` command to start the stunnel and buildbot services:

Start-Service buildbot-stunnel
Start-Service buildbot
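
Once the services are running, one quick sanity check is to see whether anything is accepting connections on the worker's end of the tunnel (127.0.0.1:9989 in the stunnel configuration above). The sketch below is an assumption about how you might do that from Python; `tunnel_is_listening` is a hypothetical helper. It only shows that the local port is open, not that the master is reachable or that certificate verification succeeded.

```python
import socket


def tunnel_is_listening(host='127.0.0.1', port=9989, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable or timed out: nothing usable is listening
        return False
    conn.close()
    return True
```

If this returns False straight after starting the services, the stunnel service's own log is the first place I'd look.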

Conclusions

You can view the full script described in this blog post as a GitHub Gist.

On top of installing buildbot itself, you'll need to install the various toolchains that you require. If you're using Microsoft Visual Studio, the "build tools only" installers provided by Microsoft for MSVC 2010 and MSVC 2015 are really useful. Many other dependencies can be installed using Chocolatey.

Installing buildbot on Windows is currently a pain, and I hope that someone who knows more about Windows development than I do can help the buildbot team make it easier to get started.

Tuesday, February 21, 2017

How to stop mspdbsrv from breaking your continuous integration system

Over the last month, I've been working on getting the LiveCode build cluster to do Windows builds using Visual Studio 2015. We've been using Visual Studio 2010 since I originally set up the build service in mid-2015. This upgrade was prompted by needing support for some C++ language features used by the latest version of libskia.

Once the new Windows Server buildbot workers had their tools installed and were connected to the build service, I noticed a couple of pretty weird things going on:

  • after one initial build, the build workers were repeatedly failing to clean the build tree in preparation for the next build
  • builds were getting "stuck" after completing successfully, and were then being detected as timed out and forcibly killed

Blocked build tree cleanup

The first problem was easy to track down. I guessed that the clean step was failing because some process still had an open file handle to one of the files or directories that the worker was trying to delete. I used the Windows 'Resource Monitor' application (resmon.exe), which can be launched from the 'Start' menu or from 'Task Manager', to find the offending process. The 'CPU' tab lets you search all open file handles on the system by filename, and I quickly discovered that mspdbsrv.exe was holding a file handle to one of the build directories.

What is mspdbsrv?

mspdbsrv is a helper service used by the Visual Studio C and C++ compiler, cl.exe; it collects debugging information for code that's being compiled and writes out .pdb databases. CL automatically spawns mspdbsrv if debug info is being generated and it can't connect to an existing instance. When the build completes, CL doesn't clean up any mspdbsrv that it spawned; it just leaves it running. There's no way to prevent CL from doing this.

So, it looked like the abandoned mspdbsrv instance had its current working directory set to one of the directories that the build worker was trying to delete, and on Windows you can't delete a directory if there's a process running there. So much for the first problem.

Build step timeouts

The second issue was more subtle -- but it also appeared to be due to the lingering mspdbsrv process! I noticed that mspdbsrv was actually holding a file handle to one of the buildbot worker's internal log files. It appears that buildbot doesn't close file handles when starting build processes, and these handles were being inherited by mspdbsrv, which was holding them open. As a result, the buildbot worker (correctly) inferred that there were still unfinished build job processes running, and didn't report the build as completed.

Mismatched MSVC versions

When I thought through this a bit further, I realised there was another problem being caused by lingering mspdbsrv instances. Some of the builds being handled by the Windows build workers need to use MSVC 2015, and some still need to use MSVC 2010. Each type of build should use the corresponding version of mspdbsrv, but by default CL always connects to any available service process.

Steps towards a fix

So, what was the solution?

  1. Run mspdbsrv explicitly as part of the build setup, and keep a handle to the process so that it can be terminated once the build has finished.
  2. Launch mspdbsrv with a current working directory outside the build tree.
  3. Force CL to use a specific mspdbsrv instance rather than just picking any available one.

LiveCode CI builds are now performed using a Python helper script. Here's a snippet that implements all of these requirements (note that it hardcodes the path to the MSVC 2010 mspdbsrv.exe):

import os
import subprocess
import uuid

# Find the 32-bit program files directory
def get_program_files_x86():
    return os.environ.get('ProgramFiles(x86)',
                          os.environ.get('ProgramFiles',
                                         'C:\\Program Files\\'))

# mspdbsrv is the service used by Visual Studio to collect debug
# data during compilation.  One instance is shared by all C++
# compiler instances and threads.  It poses a unique challenge in
# several ways:
#
# - If not running when the build job starts, the build job will
#   automatically spawn it as soon as it needs to emit debug symbols.
#   There's no way to prevent this from happening.
#
# - The build job _doesn't_ automatically clean it up when it finishes
#
# - By default, mspdbsrv inherits its parent process' file handles,
#   including (unfortunately) some log handles owned by Buildbot.  This
#   can prevent Buildbot from detecting that the compile job is finished
#
# - If a compile job starts and detects an instance of mspdbsrv already
#   running, by default it will reuse it.  So, if you have a compile
#   job A running, and start a second job B, job B will use job A's
#   instance of mspdbsrv.  If you kill mspdbsrv when job A finishes,
#   job B will die horribly.  To make matters worse, the version of
#   mspdbsrv should match the version of Visual Studio being used.
#
# This class works around these problems:
#
# - It sets the _MSPDBSRV_ENDPOINT_ to a value that's probably unique to
#   the build, to prevent other builds on the same machine from sharing
#   the same mspdbsrv endpoint
#
# - It launches mspdbsrv with _all_ file handles closed, so that it
#   can't block the build from being detected as finished.
#
# - It explicitly kills mspdbsrv after the build job has finished.
#
# - It wraps all of this into a context manager, so mspdbsrv gets killed
#   even if a Python exception causes a non-local exit.
class UniqueMspdbsrv(object):
    def __enter__(self):
        os.environ['_MSPDBSRV_ENDPOINT_'] = str(uuid.uuid4())

        mspdbsrv_exe = os.path.join(get_program_files_x86(),
            'Microsoft Visual Studio 10.0\\Common7\\IDE\\mspdbsrv.exe')
        args = [mspdbsrv_exe, '-start', '-shutdowntime', '-1']
        print(' '.join(args))
        self.proc = subprocess.Popen(args, cwd='\\', close_fds=True)
        return self

    def __exit__(self, type, value, traceback):
        self.proc.terminate()
        return False

You can then use this when implementing a build step:

with UniqueMspdbsrv() as mspdbsrv:
    # Do your build steps here (e.g. msbuild invocation)
    pass

# mspdbsrv automatically cleaned up by context manager
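
If you adapt this for your own build scripts, there are a couple of refinements you might consider: waiting for the terminated process to actually exit, and putting the environment variable back afterwards so that one build's endpoint doesn't leak into the next. The following is a sketch of that idea rather than the code we run. `running_helper` is a hypothetical name, and it takes the full command line as a parameter so that nothing about the Visual Studio install location is assumed.

```python
import contextlib
import os
import subprocess
import sys
import uuid


@contextlib.contextmanager
def running_helper(args, env_var='_MSPDBSRV_ENDPOINT_'):
    """Run `args` as a helper process for the duration of a with-block.

    env_var is set to a fresh UUID before the helper starts, so both the
    helper and any later child processes inherit the same endpoint name.
    """
    previous = os.environ.get(env_var)
    os.environ[env_var] = str(uuid.uuid4())
    proc = None
    try:
        print(' '.join(args))
        sys.stdout.flush()
        # Start in the filesystem root, outside any build tree, and
        # without inheriting the parent's open file handles
        proc = subprocess.Popen(args, cwd=os.path.abspath(os.sep),
                                close_fds=True)
        yield proc
    finally:
        if proc is not None:
            if proc.poll() is None:
                proc.terminate()
            proc.wait()
        # Restore the environment to its earlier state
        if previous is None:
            os.environ.pop(env_var, None)
        else:
            os.environ[env_var] = previous
```

It's used in the same way as the class above, e.g. `with running_helper([mspdbsrv_exe, '-start', '-shutdowntime', '-1']):` around the build steps.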

It took me a couple of days to figure out what was going on and to find an adequate solution. A lot of very tedious trawling through obscure bits of the Internet was required to find all of the required pieces; for example, Microsoft do not document arguments to mspdbsrv or the environment variables that it understands anywhere on MSDN.

Hopefully, if you are running into problems with your Jenkins or buildbot workers interacting weirdly with Microsoft Visual Studio C or C++ builds, this will save you some time!