Lasse Westh's Idle Thoughts: 2020

I had opportunity to use Squid recently at work and want to share - also because Google didn't just hand me this blueprint as I had expected.

Actually it was Julian Simpson, he of ThoughtWorks/ Neo4j/ DevOps New Zealand fame, who inspired me down the pub one night (John Scott's Lund, lovely place).

The problem we had was Maven Central's infrastructure not being good enough for our needs; we had too many failing builds because of connectivity issues - artifact downloads. Basically HTTP GET. Between Maven-the-client and Sonatype's CDN things just weren't working well enough. We tried bintray.com too, with similar disappointing results. Something had to be done. It is worth noting here that Maven-the-client does do retries by default, so this isn't entirely basic stuff.

Background

We run a bunch of AWS instances on demand as part of CI, and they in turn run Maven - hundreds of concurrent jobs (and therefore machines) during office hours. Machines come up blank and go away when not needed, so most of the time we have to bootstrap the local Maven repository from somewhere. In fact if a bunch of these come up at the same time they conduct a form of DOS attack on Maven Central's CDN - oops. Anyway, Julian outlined a solution where we build a proxy and fiddle with artifact caching headers, to form a shared service for these cloud build agents. Intra-AWS HTTP downloads FTW.

Maven-the-client insists on sending some vicious headers: Cache-store: no-store, Cache-control: no-cache, Expires: 0. This forces requests to go all the way to origin server, and I guess they do it to gather usage statistics - certainly Maven Central artifacts never change, or the internetz would break.

Therefore first item of business would be to stop respecting those headers and have Squid cache objects anyway, and for a long time. Yes, this breaks the HTTP contract. No, that isn't a problem.

Secondly, we do not want to replace one problem with another one so this needs to run as a 24/7/365 shared-nothing highly-available stateless 0.99999nines99 service, pretty much. You know, text book scavailable stuff. AWS can help us there, think an ELB in front of an auto-scaling group full of instances running Squid: one machine fails and auto-scaling replaces it. Data loss is no problem, the new instance is just a cold cache which will work, just more slowly.

One-page scavailability

The blueprint, such as it is; not pretty, but neat to not have more bits: a CloudFormation template, instances running Ubuntu 16.04, service sitting on port 1080, bit of security and auth.

{
  "Description": "Maven proxy service",
  "Parameters": {
    "keyname": { "Type": "AWS::EC2::KeyPair::KeyName" }
  },
  "Resources": {
    "loadbalancer": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "Listeners": [
          {
            "LoadBalancerPort": "1080",
            "InstancePort": "3128",
            "Protocol": "TCP"
          }
        ]
      }
    },
    "securitygroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "SSH and Squid",
        "SecurityGroupIngress": [
          {
            "CidrIp": "0.0.0.0/0",
            "FromPort": "22",
            "IpProtocol": "tcp",
            "ToPort": "22"
          },
          {
            "FromPort": "3128",
            "IpProtocol": "tcp",
            "SourceSecurityGroupOwnerId": {
              "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.OwnerAlias" ]
            },
            "SourceSecurityGroupName": {
              "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.GroupName" ]
            },
            "ToPort": "3128"
          }
        ]
      }
    },
    "waithandle": {
      "Type": "AWS::CloudFormation::WaitConditionHandle"
    },
    "waitcondition": {
      "Type": "AWS::CloudFormation::WaitCondition",
      "Properties": {
        "Count": "3",
        "Handle": {
          "Ref": "waithandle"
        },
        "Timeout": "500"
      }
    },
    "launchconfiguration": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-c1167eb8",
        "InstanceType": "m3.large",
        "KeyName": { "Ref": "keyname" },
        "SecurityGroups": [ { "Ref": "securitygroup" } ],
        "UserData": {
          "Fn::Base64": {
            "Fn::Join": [
              "",
              [
                "#!/bin/bash -eux", "\n",
                "\n",
                "apt update", "\n",
                "apt install --yes squid3 heat-cfntools apache2-utils", "\n",
                "\n",
                "echo '# Maven proxy service'                                                                  >  /etc/squid/squid.conf", "\n",
                "echo 'maximum_object_size_in_memory 30 MB'                                                    >> /etc/squid/squid.conf", "\n",
                "echo 'acl mavencentral dstdomain repo.maven.apache.org uk.maven.org'                          >> /etc/squid/squid.conf", "\n",
                "echo 'cache deny !mavencentral'                                                               >> /etc/squid/squid.conf", "\n",
                "echo 'refresh_pattern repo.maven.apache.org 288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
                "echo 'refresh_pattern uk.maven.org          288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
                "echo 'http_port 3128'                                                                         >> /etc/squid/squid.conf", "\n",
                "echo 'auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwd'              >> /etc/squid/squid.conf", "\n",
                "echo 'auth_param basic realm Maven proxy service'                                             >> /etc/squid/squid.conf", "\n",
                "echo 'acl ncsa_users proxy_auth REQUIRED'                                                     >> /etc/squid/squid.conf", "\n",
                "echo 'http_access allow ncsa_users'                                                           >> /etc/squid/squid.conf", "\n",
                "\n",
                "htpasswd -b -c /etc/squid/passwd someone something",
                "\n",
                "systemctl reload squid", "\n",
                "\n",
                "export SIGNAL_URL='", { "Ref" : "waithandle" }, "'\n",
                "echo 'signal url: ${SIGNAL_URL}'", "\n",
                "cfn-signal ${SIGNAL_URL}", "\n",
                "\n"
              ]
            ]
          }
        }
      }
    },
    "autoscalinggroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": ""  },
        "LaunchConfigurationName": { "Ref": "launchconfiguration" },
        "LoadBalancerNames": [ { "Ref": "loadbalancer" } ],
        "MinSize": "3",
        "MaxSize": "3"
      }
    }
  },
  "Outputs" : {
    "dnsname" : {
      "Description": "DNS name for load balancer",
      "Value" : { "Fn::GetAtt" : [ "loadbalancer", "DNSName" ] }
    }
  }
}

So how did it go?

Well actually, it didn't go particularly well. We did succeed in reducing the frequency of failures by a factor 100 or something, yay! But the business problem persisted, just too many broken builds and people kept complaining.

The next approach we took was to start caching artifacts on machines: mvn dependency:go-offline with a twist because that plugin is broken and abandoned (thanks Cosmin et al for fixing/ forking!). In fact seeding machines when we build AMIs, then patching them up just before running builds - all because Maven! That works well enough in conjunction with some system properties we divined, system.maven.wagon.http.retryHandler.class=default and system.maven.wagon.http.retryHandler.nonRetryableClasses=java.io.InterruptedIOException,java.net.UnknownHostException,java.net.ConnectException (don't ask). But what a mess! In future surely we will all come around to sticking dependencies in source control eh? Not sure why that has become a religious topic, it's not the 90s anymore, we have bandwidth and disk space.

Also, while working on this I kept thinking Mark Nottingham's stale-while-revalidate would be the perfect fit here, but it wasn't in Squid 3.5 (or I couldn't get it working). Meh. Something to solve if doing this again.

At work we do a bunch of work to ensure our database product works as expected. You're familiar with unit tests and integration tests of course, but take that up a notch and consider the domain - a database product - and you just need a lot more engineering around the thing. A full blown database is a different beast than your bog standard mom & pop web shop, and product is another level of complexity over common website type of stuff, in that we send out binary artifacts to customers, and they run them for years. I trust you are suitably impressed by now. The sharp end of the software industry innit.

Anyway one thing we do to keep a high standard for our product is, we run soak tests regularly. Soak testing involves testing a system with a typical production load, over a continuous availability period, to validate system behaviour under production use. For us that means days or weeks of (synthetic) production workloads run on production grade hardware, with injected faults, backup events and whatnot. We have been doing it for years, and we have become exceedingly efficient at it. It is a very effective method for catching problems before customers do.

In practice it is a case of programmatically setting up things like a database cluster, a load generator, reporting etc. into a resource-oriented, hypermedia-driven set of servers, agents and client-side APIs. Programmatically because we need economics and automation and all those other good bits. We are at a point where starting a soak test is a case of running a single (albeit rather long) command line. Pretty sweet.

We use cloud machines to scale, and therefore one crux we had to solve was bootstrapping machines. On AWS that's done by injecting a script via user data, and that script unfolds into more scripts downloaded from S3, software installed via packages, and our custom services and workloads kicked off. Untouched by human hand and it really is beautiful - on Linux.

Windows!

As awesome as the above is, it is qualified by "on Linux" because the other platforms we support are just not as easy to work with. Until now anyway. In past when we looked, through the prism of non-Windows developers/ having other priorities, we never managed to crack this problem of bootstrapping and it was left to one side. But I looked again when I had to pick up Windows Server 2019 recently for other work, and I am happy to report it is mature enough for what we need. So what do we need?

Well we effectively just need a script to run automatically on a Windows box at launch that can install Java (we are a 4j company after all!), download an executable jar and launch it. Plus a few details. That initial jar blooms into all the other bits we need, and is well-known shared code that we run all the time on Linux (Write Once, Run Everywhere, remember?)

A minimal (PowerShell) user data script

Here is just enough PowerShell to get going. You surround it with <powershell>...</powershell> (more detail in the official documentation) and stick it in as user data when launching the machine:

# bootstrap-script.ps1

# find a home
$workspace = $env:LocalAppData + "\myworkspace"
New-Item -ItemType "directory" -Path $workspace

# download the business script
$scriptfile = $workspace + "\business-script.ps1"
Read-S3Object
-BucketName "mycloudhome"
-Key "business-script.ps1"
-File $scriptfile

# execute it
$logfile = $workspace + "\bootstrap.log"
Push-Location $workspace
& $scriptfile *> $logfile

Your business script now runs as Administrator in its own workspace. The possibilities are endless.

Why/ how does that work?

Well for starters, the Windows Server 2019 base image on AWS comes with certain AWS cmdlets pre-installed. Firstly Read-S3Object which you can use to call home effectively, you can specify an S3 bucket + key from your own account and ensure access using IAM instance profiles. That makes it a single line of code to get at the business script you want. Here is an example IAM policy:

{
"Statement": [
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::mycloudhome/*"
}
]
}

Data in that bucket can be static or dynamic, meaning you can change the business script ahead of time. You could for example stick versioned scripts in there, think business-script-$UUID.ps1, but then you'd need to inject that UUID somehow. An exercise for the reader, and there are several options: dynamic bootstrap script or maybe reading instance metadata leading to reading instance tags - be creative.

You could inline everything into a gigantic bootstrap script in principle, but I think for robustness' sake better to separate into one script focused on bootstrap, which never changes, and which is small enough to get under the cap on user data length (16K); and business scripts that are specialized into various business scenarios.

In the end you launch your business script, pipe output to a logfile, and let that take over. Easy peasy.

A minimal script for starting a Java service

Coming back to our motivation, getting tools and services running for e.g. soak testing, there are a few more handy bits to consider:

We need Java, that does not come pre-installed
We need the business software - executable jars are nice here
We need a port opened so that our client side orchestration scripts can take over

You could do something like this. First, download and unzip a JDK from our magic bucket:

# business-script.ps1

# download JDK

$jdkzip = "jdk-11.0.5_windows-x64_bin.zip"

Read-S3Object

-BucketName "mycloudhome"

-Key "jdk-11.0.5_windows-x64_bin.zip"

-File $jdkzip

Expand-Archive -LiteralPath $jdkzip -DestinationPath jdk

Now let's get the JDK onto path. Here that's a case of getting the first sub-folder of a folder - simple and effective:

# "install" JDK

$jdkfoldername = (Get-ChildItem -Directory jdk)[0].name

$jdkfolderpath = Resolve-Path ("jdk\" + $jdkfoldername + "\bin")

$env:Path += (";" + $jdkfolderpath)

Time for magic bucket to shine again. I guess I'd recommend injecting a single correlation (uu)id and otherwise isolating reference data from dynamic data using a bucket named using that (uu)id, for robustness. But the mechanics are in essence this:

# download business jar

Read-S3Object

-BucketName "mycloudhome"

-Key "business.jar"

-File "business.jar"

Remember the Windows firewall, we need to open it from the inside:

# open up firewall

New-NetFirewallRule

-DisplayName "Allow inbound port 20000"

-LocalPort 20000

-Protocol TCP

-Action Allow

And now it is just Java software which we know and love (and which we share across Linux too):

# go time!

& java -jar business.jar <parameters go here>

Now you sit down and poll for your service to be ready.

The client end

Here's some example client code. It gets more advanced when you start injecting more variability, and of course you need to create auxiliary bits like security groups also. We make use of CloudFormation to create isolated experiments:

public void createServiceOnWindows() throws Exception {

// refresh business script in magic bucket

PutObjectRequest putObjectRequest = new PutObjectRequest("mycloudhome", "business-script.ps1", /* the actual business-script.ps1 file */);

AmazonS3ClientBuilder.defaultClient().putObject(putObjectRequest);

// launch instance using bootstrap script

String userData = /* <powershell> contents of bootstrap-script.ps1 <powershell>, base64 encoded */

RunInstancesRequest runInstancesRequest = new RunInstancesRequest()

.withIamInstanceProfile(

new IamInstanceProfileSpecification()

.withArn(/* something to give access to magic bucket */)

)

.withImageId(/* tested with Windows Server 2019 AMI */)

.withSecurityGroupIds(/* open port 20000 for our example, maybe RDP for debugging */)

.withUserData(userData)

[...]

);

AmazonEC2ClientBuilder.defaultClient().runInstances(runInstancesRequest);

}

Parting words

There are no technical blockers to extending our extensive Java-based tooling to also cover Windows. And with minimal scaffolding too.

I am not fond of PowerShell's model and error handling, which is a good reason for getting out from PowerShell and into our Java tooling as quickly as possible - remember, our tooling initially is just empty shell and we need client driven orchestration to achieve business goals like installing our database product, seeding data, changing config, yada yada. But I guess PowerShell is just good enough, and the AWS command line tools in general are just rock solid, on Windows or otherwise.

Lasse Westh's Idle Thoughts

Thursday 16 January 2020

A scavailable Squid proxy cache because Maven!