Wednesday 15 January 2020

Bootstrapping Windows for programmatic fun & profit

At work we put a lot of effort into ensuring our database product works as expected. You're familiar with unit tests and integration tests of course, but take that up a notch and consider the domain - a database product - and you just need a lot more engineering around the thing. A full-blown database is a different beast from your bog-standard mom & pop web shop, and a product is another level of complexity over common website-type work: we ship binary artifacts to customers, and they run them for years. I trust you are suitably impressed by now. The sharp end of the software industry, innit.

Anyway one thing we do to keep a high standard for our product is, we run soak tests regularly. Soak testing involves testing a system with a typical production load, over a continuous availability period, to validate system behaviour under production use. For us that means days or weeks of (synthetic) production workloads run on production grade hardware, with injected faults, backup events and whatnot. We have been doing it for years, and we have become exceedingly efficient at it. It is a very effective method for catching problems before customers do.

In practice it is a case of programmatically setting up things like a database cluster, a load generator, reporting etc. into a resource-oriented, hypermedia-driven set of servers, agents and client-side APIs. Programmatically because we need economics and automation and all those other good bits. We are at a point where starting a soak test is a case of running a single (albeit rather long) command line. Pretty sweet.

We use cloud machines to scale, and therefore one crux we had to solve was bootstrapping machines. On AWS that's done by injecting a script via user data, and that script unfolds into more scripts downloaded from S3, software installed via packages, and our custom services and workloads kicked off. Untouched by human hand and it really is beautiful - on Linux.

Windows!

As awesome as the above is, it is qualified by "on Linux" because the other platforms we support are just not as easy to work with. Until now, anyway. In the past when we looked - through the prism of non-Windows developers with other priorities - we never managed to crack this bootstrapping problem, and it was left to one side. But I looked again when I had to pick up Windows Server 2019 recently for other work, and I am happy to report it is mature enough for what we need. So what do we need?

Well we effectively just need a script to run automatically on a Windows box at launch that can install Java (we are a 4j company after all!), download an executable jar and launch it. Plus a few details. That initial jar blooms into all the other bits we need, and is well-known shared code that we run all the time on Linux (Write Once, Run Everywhere, remember?)

A minimal (PowerShell) user data script

Here is just enough PowerShell to get going. You surround it with <powershell>...</powershell> (more detail in the official documentation) and stick it in as user data when launching the machine:

# bootstrap-script.ps1

# find a home
$workspace = $env:LocalAppData + "\myworkspace"
New-Item -ItemType "directory" -Path $workspace

# download the business script
$scriptfile = $workspace + "\business-script.ps1"
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key "business-script.ps1" `
  -File $scriptfile

# execute it
$logfile = $workspace + "\bootstrap.log"
Push-Location $workspace
& $scriptfile *> $logfile

Your business script now runs as Administrator in its own workspace. The possibilities are endless.

Why/how does that work?

Well for starters, the Windows Server 2019 base image on AWS comes with certain AWS cmdlets pre-installed. First among them is Read-S3Object, which you can use to call home: specify an S3 bucket + key from your own account, and grant access using an IAM instance profile. That makes it a single line of code to fetch the business script you want. Here is an example IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::mycloudhome/*"
    }
  ]
}

Data in that bucket can be static or dynamic, meaning you can change the business script ahead of time. You could for example stick versioned scripts in there, think business-script-$UUID.ps1, but then you'd need to inject that UUID somehow. That is left as an exercise for the reader, and there are several options: a dynamic bootstrap script, or reading instance metadata and from there instance tags - be creative.
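One hedged sketch of the UUID-injection idea, in Java since that is where our client-side code lives anyway (the __SCRIPT_KEY__ placeholder and the bucket name are made up for illustration): render the bootstrap script from a template, splicing a fresh UUID into the versioned key before it goes out as user data.

```java
import java.util.UUID;

public class VersionedScriptKey {
    // Hypothetical bootstrap template; __SCRIPT_KEY__ is our own placeholder convention.
    static final String TEMPLATE =
            "Read-S3Object -BucketName \"mycloudhome\" -Key \"__SCRIPT_KEY__\" -File $scriptfile";

    // Build a versioned key like business-script-<uuid>.ps1 and splice it into the template.
    static String renderBootstrap(UUID runId) {
        String key = "business-script-" + runId + ".ps1";
        return TEMPLATE.replace("__SCRIPT_KEY__", key);
    }

    public static void main(String[] args) {
        UUID runId = UUID.randomUUID();
        // Upload business-script-<runId>.ps1 to the bucket, then launch with this user data.
        System.out.println(renderBootstrap(runId));
    }
}
```

You would upload the matching versioned script to the bucket in the same breath, so bootstrap and business script always agree on the key.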

You could in principle inline everything into one gigantic bootstrap script, but for robustness' sake I think it is better to separate concerns: one script focused on bootstrapping, which never changes and is small enough to stay under the cap on user data length (16K); and business scripts that are specialized for the various business scenarios.
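If you want to guard against drifting past that cap, a sanity check is easy to sketch - this assumes the 16K limit applies to the raw user data including the wrapper tags, which is worth verifying against the EC2 documentation for your case:

```java
import java.nio.charset.StandardCharsets;

public class UserDataCheck {
    static final int MAX_USER_DATA_BYTES = 16 * 1024; // EC2 cap on user data

    // Throws if the wrapped script would exceed the user data cap.
    static void checkSize(String script) {
        String wrapped = "<powershell>" + script + "</powershell>";
        int bytes = wrapped.getBytes(StandardCharsets.UTF_8).length;
        if (bytes > MAX_USER_DATA_BYTES) {
            throw new IllegalArgumentException(
                    "user data is " + bytes + " bytes, over the "
                            + MAX_USER_DATA_BYTES + " byte cap");
        }
    }

    public static void main(String[] args) {
        checkSize("# bootstrap-script.ps1 ..."); // a tiny script passes silently
    }
}
```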

In the end you launch your business script, pipe output to a logfile, and let that take over. Easy peasy.

A minimal script for starting a Java service

Coming back to our motivation, getting tools and services running for e.g. soak testing, there are a few more handy bits to consider:

  • We need Java, which does not come pre-installed
  • We need the business software - executable jars are nice here
  • We need a port opened so that our client side orchestration scripts can take over

You could do something like this. First, download and unzip a JDK from our magic bucket:

# business-script.ps1

# download JDK
$jdkzip = "jdk-11.0.5_windows-x64_bin.zip"
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key $jdkzip `
  -File $jdkzip
Expand-Archive -LiteralPath $jdkzip -DestinationPath jdk

Now let's get the JDK onto path. Here that's a case of getting the first sub-folder of a folder - simple and effective:

# "install" JDK
$jdkfoldername = (Get-ChildItem -Directory jdk)[0].name
$jdkfolderpath = Resolve-Path ("jdk\" + $jdkfoldername + "\bin")
$env:Path += (";" + $jdkfolderpath)

Time for the magic bucket to shine again. For robustness I would recommend injecting a single correlation (uu)id and isolating reference data from dynamic data in a bucket named after it. But the mechanics are in essence this:

# download business jar
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key "business.jar" `
  -File "business.jar"

Remember the Windows firewall: we need to open the port from the inside:

# open up firewall
New-NetFirewallRule `
  -DisplayName "Allow inbound port 20000" `
  -Direction Inbound `
  -LocalPort 20000 `
  -Protocol TCP `
  -Action Allow

And now it is just Java software which we know and love (and which we share across Linux too):

# go time!
& java -jar business.jar <parameters go here>

Now you sit down and poll for your service to be ready.
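That polling can be as simple as retrying a TCP connect until the port answers. A minimal sketch in Java - the hostname, attempt count and sleep interval are illustrative, not our actual values:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class WaitForPort {
    // Retry a TCP connect until the service answers or we give up.
    static boolean waitForPort(String host, int port, int attempts, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 1000);
                return true; // service is up
            } catch (IOException notYet) {
                Thread.sleep(sleepMillis);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Illustrative values: the instance's address and our example port 20000.
        boolean up = waitForPort("localhost", 20000, 3, 500);
        System.out.println(up ? "service ready" : "gave up");
    }
}
```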

The client end

Here's some example client code. It gets more involved once you start injecting more variability, and of course you need to create auxiliary bits like security groups too. We make use of CloudFormation to create isolated experiments:

public void createServiceOnWindows() throws Exception {
    // refresh business script in magic bucket
    PutObjectRequest putObjectRequest = new PutObjectRequest("mycloudhome", "business-script.ps1", /* the actual business-script.ps1 file */);
    AmazonS3ClientBuilder.defaultClient().putObject(putObjectRequest);

    // launch instance using bootstrap script
    String userData = /* <powershell> contents of bootstrap-script.ps1 </powershell>, base64 encoded */;
    RunInstancesRequest runInstancesRequest = new RunInstancesRequest()
            .withIamInstanceProfile(
                    new IamInstanceProfileSpecification()
                            .withArn(/* something to give access to magic bucket */)
            )
            .withImageId(/* tested with Windows Server 2019 AMI */)
            .withSecurityGroupIds(/* open port 20000 for our example, maybe RDP for debugging */)
            .withUserData(userData)
            /* [...] */;
    AmazonEC2ClientBuilder.defaultClient().runInstances(runInstancesRequest);
}
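The base64 step on the userData line can be sketched like this, wrapping the script in the <powershell> tags before encoding (the script content here is a placeholder):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UserDataEncoder {
    // Wrap a PowerShell script in the tags EC2 looks for, then Base64-encode
    // the result for RunInstancesRequest.withUserData(...).
    static String encodeUserData(String script) {
        String wrapped = "<powershell>\n" + script + "\n</powershell>";
        return Base64.getEncoder().encodeToString(wrapped.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String script = "# bootstrap-script.ps1 contents go here";
        System.out.println(encodeUserData(script));
    }
}
```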

Parting words

There are no technical blockers to extending our extensive Java-based tooling to also cover Windows. And with minimal scaffolding too.

I am not fond of PowerShell's model and error handling, which is a good reason to get out of PowerShell and into our Java tooling as quickly as possible - remember, our tooling is initially just an empty shell, and we need client-driven orchestration to achieve business goals like installing our database product, seeding data, changing config, yada yada. But PowerShell is good enough, and the AWS command line tools in general are rock solid, on Windows and elsewhere.
