Thursday 16 January 2020

A scavailable Squid proxy cache because Maven!

I had the opportunity to use Squid recently at work and want to share - also because Google didn't just hand me this blueprint as I had expected.

Actually it was Julian Simpson, he of ThoughtWorks/ Neo4j/ DevOps New Zealand fame, who inspired me down the pub one night (John Scott's Lund, lovely place).

The problem we had was Maven Central's infrastructure not being good enough for our needs; we had too many failing builds because of connectivity issues during artifact downloads. Basically HTTP GET. Between Maven-the-client and Sonatype's CDN things just weren't working well enough. We tried bintray.com too, with similarly disappointing results. Something had to be done. It is worth noting here that Maven-the-client does do retries by default, so this isn't entirely basic stuff.

Background

We run a bunch of AWS instances on demand as part of CI, and they in turn run Maven - hundreds of concurrent jobs (and therefore machines) during office hours. Machines come up blank and go away when not needed, so most of the time we have to bootstrap the local Maven repository from somewhere. In fact if a bunch of these come up at the same time they conduct a form of DoS attack on Maven Central's CDN - oops. Anyway, Julian outlined a solution where we build a proxy and fiddle with artifact caching headers, to form a shared service for these cloud build agents. Intra-AWS HTTP downloads FTW.

Maven-the-client insists on sending some vicious headers: Cache-store: no-store, Cache-control: no-cache, Expires: 0. This forces requests to go all the way to origin server, and I guess they do it to gather usage statistics - certainly Maven Central artifacts never change, or the internetz would break.

Therefore the first item of business would be to stop respecting those headers and have Squid cache objects anyway, and for a long time. Yes, this breaks the HTTP contract. No, that isn't a problem.

Secondly, we do not want to replace one problem with another one so this needs to run as a 24/7/365 shared-nothing highly-available stateless 0.99999nines99 service, pretty much. You know, textbook scavailable stuff. AWS can help us there, think an ELB in front of an auto-scaling group full of instances running Squid: one machine fails and auto-scaling replaces it. Data loss is no problem, the new instance is just a cold cache which will work, just more slowly.

One-page scavailability

The blueprint, such as it is; not pretty, but neat to not have more bits: a CloudFormation template, instances running Ubuntu 16.04, service sitting on port 1080, bit of security and auth.

{
  "Description": "Maven proxy service",
  "Parameters": {
    "keyname": { "Type": "AWS::EC2::KeyPair::KeyName" }
  },
  "Resources": {
    "loadbalancer": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "Listeners": [
          {
            "LoadBalancerPort": "1080",
            "InstancePort": "3128",
            "Protocol": "TCP"
          }
        ]
      }
    },
    "securitygroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "SSH and Squid",
        "SecurityGroupIngress": [
          {
            "CidrIp": "0.0.0.0/0",
            "FromPort": "22",
            "IpProtocol": "tcp",
            "ToPort": "22"
          },
          {
            "FromPort": "3128",
            "IpProtocol": "tcp",
            "SourceSecurityGroupOwnerId": {
              "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.OwnerAlias" ]
            },
            "SourceSecurityGroupName": {
              "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.GroupName" ]
            },
            "ToPort": "3128"
          }
        ]
      }
    },
    "waithandle": {
      "Type": "AWS::CloudFormation::WaitConditionHandle"
    },
    "waitcondition": {
      "Type": "AWS::CloudFormation::WaitCondition",
      "Properties": {
        "Count": "3",
        "Handle": {
          "Ref": "waithandle"
        },
        "Timeout": "500"
      }
    },
    "launchconfiguration": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-c1167eb8",
        "InstanceType": "m3.large",
        "KeyName": { "Ref": "keyname" },
        "SecurityGroups": [ { "Ref": "securitygroup" } ],
        "UserData": {
          "Fn::Base64": {
            "Fn::Join": [
              "",
              [
                "#!/bin/bash -eux", "\n",
                "\n",
                "apt update", "\n",
                "apt install --yes squid3 heat-cfntools apache2-utils", "\n",
                "\n",
                "echo '# Maven proxy service'                                                                  >  /etc/squid/squid.conf", "\n",
                "echo 'maximum_object_size_in_memory 30 MB'                                                    >> /etc/squid/squid.conf", "\n",
                "echo 'acl mavencentral dstdomain repo.maven.apache.org uk.maven.org'                          >> /etc/squid/squid.conf", "\n",
                "echo 'cache deny !mavencentral'                                                               >> /etc/squid/squid.conf", "\n",
                "echo 'refresh_pattern repo.maven.apache.org 288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
                "echo 'refresh_pattern uk.maven.org          288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
                "echo 'http_port 3128'                                                                         >> /etc/squid/squid.conf", "\n",
                "echo 'auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwd'              >> /etc/squid/squid.conf", "\n",
                "echo 'auth_param basic realm Maven proxy service'                                             >> /etc/squid/squid.conf", "\n",
                "echo 'acl ncsa_users proxy_auth REQUIRED'                                                     >> /etc/squid/squid.conf", "\n",
                "echo 'http_access allow ncsa_users'                                                           >> /etc/squid/squid.conf", "\n",
                "\n",
                "htpasswd -b -c /etc/squid/passwd someone something",
                "\n",
                "systemctl reload squid", "\n",
                "\n",
                "export SIGNAL_URL='", { "Ref" : "waithandle" }, "'\n",
                "echo 'signal url: ${SIGNAL_URL}'", "\n",
                "cfn-signal ${SIGNAL_URL}", "\n",
                "\n"
              ]
            ]
          }
        }
      }
    },
    "autoscalinggroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": ""  },
        "LaunchConfigurationName": { "Ref": "launchconfiguration" },
        "LoadBalancerNames": [ { "Ref": "loadbalancer" } ],
        "MinSize": "3",
        "MaxSize": "3"
      }
    }
  },
  "Outputs" : {
    "dnsname" : {
      "Description": "DNS name for load balancer",
      "Value" : { "Fn::GetAtt" : [ "loadbalancer", "DNSName" ] }
    }
  }
}
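
A note on the numbers in the two refresh_pattern lines: Squid reads the min and max fields as minutes, so it may help to see what they come to in days. A tiny sketch in Java, where the class and method names are made up for illustration and are not part of the template:

```java
import java.time.Duration;

// Illustrative helper only: converts the refresh_pattern min/max values,
// which Squid interprets as minutes, into whole days.
final class RefreshPatternMath {

    static long minutesToDays(long minutes) {
        return Duration.ofMinutes(minutes).toDays();
    }

    public static void main(String[] args) {
        System.out.println("min: " + minutesToDays(288_000) + " days"); // prints "min: 200 days"
        System.out.println("max: " + minutesToDays(576_000) + " days"); // prints "max: 400 days"
    }
}
```

In other words, artifacts are treated as fresh for at least 200 days, which is what "for a long time" means in practice here.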

So how did it go?

Well actually, it didn't go particularly well. We did succeed in reducing the frequency of failures by a factor of 100 or so, yay! But the business problem persisted: there were still too many broken builds, and people kept complaining.

The next approach we took was to start caching artifacts on machines: mvn dependency:go-offline with a twist because that plugin is broken and abandoned (thanks Cosmin et al for fixing/ forking!). In fact seeding machines when we build AMIs, then patching them up just before running builds - all because Maven! That works well enough in conjunction with some system properties we divined, system.maven.wagon.http.retryHandler.class=default and system.maven.wagon.http.retryHandler.nonRetryableClasses=java.io.InterruptedIOException,java.net.UnknownHostException,java.net.ConnectException (don't ask). But what a mess! In future surely we will all come around to sticking dependencies in source control eh? Not sure why that has become a religious topic, it's not the 90s anymore, we have bandwidth and disk space.

Also, while working on this I kept thinking Mark Nottingham's stale-while-revalidate would be the perfect fit here, but it wasn't in Squid 3.5 (or I couldn't get it working). Meh. Something to solve if doing this again.

Wednesday 15 January 2020

Bootstrapping Windows for programmatic fun & profit

At work we do a bunch of work to ensure our database product works as expected. You're familiar with unit tests and integration tests of course, but take that up a notch and consider the domain - a database product - and you just need a lot more engineering around the thing. A full blown database is a different beast than your bog standard mom & pop web shop, and product is another level of complexity over common website type of stuff, in that we send out binary artifacts to customers, and they run them for years. I trust you are suitably impressed by now. The sharp end of the software industry innit.

Anyway one thing we do to keep a high standard for our product is, we run soak tests regularly. Soak testing involves testing a system with a typical production load, over a continuous availability period, to validate system behaviour under production use. For us that means days or weeks of (synthetic) production workloads run on production grade hardware, with injected faults, backup events and whatnot. We have been doing it for years, and we have become exceedingly efficient at it. It is a very effective method for catching problems before customers do.

In practice it is a case of programmatically setting up things like a database cluster, a load generator, reporting etc. into a resource-oriented, hypermedia-driven set of servers, agents and client-side APIs. Programmatically because we need economics and automation and all those other good bits. We are at a point where starting a soak test is a case of running a single (albeit rather long) command line. Pretty sweet.

We use cloud machines to scale, and therefore one crux we had to solve was bootstrapping machines. On AWS that's done by injecting a script via user data, and that script unfolds into more scripts downloaded from S3, software installed via packages, and our custom services and workloads kicked off. Untouched by human hand and it really is beautiful - on Linux.

Windows!

As awesome as the above is, it is qualified by "on Linux" because the other platforms we support are just not as easy to work with. Until now anyway. In the past when we looked, through the prism of non-Windows developers/ having other priorities, we never managed to crack this problem of bootstrapping and it was left to one side. But I looked again when I had to pick up Windows Server 2019 recently for other work, and I am happy to report it is mature enough for what we need. So what do we need?

Well we effectively just need a script to run automatically on a Windows box at launch that can install Java (we are a 4j company after all!), download an executable jar and launch it. Plus a few details. That initial jar blooms into all the other bits we need, and is well-known shared code that we run all the time on Linux (Write Once, Run Everywhere, remember?)

A minimal (PowerShell) user data script

Here is just enough PowerShell to get going. You surround it with <powershell>...</powershell> (more detail in the official documentation) and stick it in as user data when launching the machine:

# bootstrap-script.ps1

# find a home
$workspace = $env:LocalAppData + "\myworkspace"
New-Item -ItemType "directory" -Path $workspace

# download the business script
$scriptfile = $workspace + "\business-script.ps1"
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key "business-script.ps1" `
  -File $scriptfile

# execute it
$logfile = $workspace + "\bootstrap.log"
Push-Location $workspace
& $scriptfile *> $logfile

Your business script now runs as Administrator in its own workspace. The possibilities are endless.

Why/ how does that work?

Well for starters, the Windows Server 2019 base image on AWS comes with certain AWS cmdlets pre-installed. One of them is Read-S3Object, which you can use to call home, effectively: you specify an S3 bucket + key from your own account, and ensure access using an IAM instance profile. That makes it a single line of code to get at the business script you want. Here is an example IAM policy:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::mycloudhome/*"
    }
  ]
}

Data in that bucket can be static or dynamic, meaning you can change the business script ahead of time. You could for example stick versioned scripts in there, think business-script-$UUID.ps1, but then you'd need to inject that UUID somehow. An exercise for the reader, and there are several options: dynamic bootstrap script or maybe reading instance metadata leading to reading instance tags - be creative.

You could inline everything into a gigantic bootstrap script in principle, but I think for robustness' sake better to separate into one script focused on bootstrap, which never changes, and which is small enough to get under the cap on user data length (16K); and business scripts that are specialized into various business scenarios.

In the end you launch your business script, pipe output to a logfile, and let that take over. Easy peasy.

A minimal script for starting a Java service

Coming back to our motivation, getting tools and services running for e.g. soak testing, there are a few more handy bits to consider:

  • We need Java, that does not come pre-installed
  • We need the business software - executable jars are nice here
  • We need a port opened so that our client side orchestration scripts can take over

You could do something like this. First, download and unzip a JDK from our magic bucket:

# business-script.ps1

# download JDK
$jdkzip = "jdk-11.0.5_windows-x64_bin.zip"
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key "jdk-11.0.5_windows-x64_bin.zip" `
  -File $jdkzip
Expand-Archive -LiteralPath $jdkzip -DestinationPath jdk

Now let's get the JDK onto path. Here that's a case of getting the first sub-folder of a folder - simple and effective:

# "install" JDK
$jdkfoldername = (Get-ChildItem -Directory jdk)[0].name
$jdkfolderpath = Resolve-Path ("jdk\" + $jdkfoldername + "\bin")
$env:Path += (";" + $jdkfolderpath)

Time for magic bucket to shine again. I guess I'd recommend injecting a single correlation (uu)id and otherwise isolating reference data from dynamic data using a bucket named using that (uu)id, for robustness. But the mechanics are in essence this:

# download business jar
Read-S3Object `
  -BucketName "mycloudhome" `
  -Key "business.jar" `
  -File "business.jar"

Remember the Windows firewall, we need to open it from the inside:

# open up firewall
New-NetFirewallRule `
  -DisplayName "Allow inbound port 20000" `
  -LocalPort 20000 `
  -Protocol TCP `
  -Action Allow

And now it is just Java software which we know and love (and which we share across Linux too):

# go time!
& java -jar business.jar <parameters go here>

Now you sit down and poll for your service to be ready.
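
Polling can be as simple as trying to open a TCP connection until it works. Here is a minimal sketch, assuming the service counts as "ready" once its port accepts connections; the class and method names are hypothetical, and the demo in main uses a throw-away local listener instead of a real instance so it runs anywhere:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.time.Duration;
import java.time.Instant;

// Hypothetical client-side helper: waits for a TCP port to start accepting connections.
final class ServiceReadiness {

    /** Returns true as soon as host:port accepts a connection, false once the timeout is used up. */
    static boolean awaitPort(String host, int port, Duration timeout, Duration interval) {
        Instant deadline = Instant.now().plus(timeout);
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 2_000);
                return true; // something is listening
            } catch (IOException e) {
                // not ready yet: refused, timed out, or host not resolvable
            }
            if (Instant.now().plus(interval).isAfter(deadline)) {
                return false; // no time left for another attempt
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo against a local listener on an ephemeral port.
        try (ServerSocket listener = new ServerSocket(0)) {
            boolean ready = awaitPort("127.0.0.1", listener.getLocalPort(),
                    Duration.ofSeconds(5), Duration.ofMillis(200));
            System.out.println(ready ? "service is ready" : "gave up waiting");
        }
    }
}
```

Against a real machine you would pass the instance's DNS name and port 20000, with a timeout generous enough to cover Windows boot plus the JDK download.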

The client end

Here's some example client code. It gets more advanced when you start injecting more variability, and of course you need to create auxiliary bits like security groups also. We make use of CloudFormation to create isolated experiments:

public void createServiceOnWindows() throws Exception {
    // refresh business script in magic bucket
    PutObjectRequest putObjectRequest = new PutObjectRequest("mycloudhome", "business-script.ps1", /* the actual business-script.ps1 file */);
    AmazonS3ClientBuilder.defaultClient().putObject(putObjectRequest);

    // launch instance using bootstrap script
    String userData = /* <powershell> contents of bootstrap-script.ps1 </powershell>, base64 encoded */;
    RunInstancesRequest runInstancesRequest = new RunInstancesRequest()
            .withIamInstanceProfile(
                    new IamInstanceProfileSpecification()
                            .withArn(/* something to give access to magic bucket */)
            )
            .withImageId(/* tested with Windows Server 2019 AMI */)
            .withSecurityGroupIds(/* open port 20000 for our example, maybe RDP for debugging */)
            .withUserData(userData)
            // [...]
            ;
    AmazonEC2ClientBuilder.defaultClient().runInstances(runInstancesRequest);
}
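
The user data placeholder above hides a small step: wrapping the script in powershell tags and base64 encoding the result. A minimal sketch of that step, with a hypothetical helper name and a stand-in one-line script:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: prepares a PowerShell script for use as EC2 user data.
final class UserData {

    /** Wraps the script in <powershell> tags and base64 encodes the result. */
    static String forPowerShell(String script) {
        String wrapped = "<powershell>\n" + script + "\n</powershell>";
        return Base64.getEncoder().encodeToString(wrapped.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Stand-in for the real contents of bootstrap-script.ps1.
        String encoded = forPowerShell("Write-Host 'hello from user data'");
        System.out.println(encoded);
    }
}
```

In real use you would read bootstrap-script.ps1 from disk and pass its contents in, then hand the returned string to withUserData.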

Parting words

There are no technical blockers to extending our extensive Java-based tooling to also cover Windows. And with minimal scaffolding too.

I am not fond of PowerShell's model and error handling, which is a good reason for getting out from PowerShell and into our Java tooling as quickly as possible - remember, our tooling initially is just an empty shell and we need client driven orchestration to achieve business goals like installing our database product, seeding data, changing config, yada yada. But I guess PowerShell is just good enough, and the AWS command line tools in general are just rock solid, on Windows or otherwise.

Tuesday 11 June 2019

Installation Testing @ Neo4j

At work we produce a database product which we deliver to customers in a variety of formats. Tarballs and zips you can just extract and use on your home computer, be it Linux, Mac or Windows; A Docker image for the hipsters; and old school Debian and RPM packages for the slow movers^H^H^H^H^H^H^H^H^H^H^Hstable enterprises. Oh and shameless, out of context plug.

We had all those packages sans Docker when I joined years ago, and they weren't great. Testing was manual and not thorough at all. They have improved, just by having skilled people work on them - so far, so regular every day software. These are critical components actually, they are a core mechanism for getting our software to customers, and customers in turn rely on them for their critical business operations. So no pressure.

Anyway, one day I had an epiphany: why don't we just write automated tests around these packages, so that we can have confidence they work as expected when making changes and adding features? Duh!

We do that for everything else, and on multiple levels. When you pare it back this is deterministic stuff about file locations and permissions, really. Turns out we already had all the building blocks and blueprints, but I still feel this was a tiny local game changer.

Asking around the place it was clear this was not something anyone had seen before, and I do think we are quite a testing-forward place too. Some spitballing by the water cooler with my mate Steven Baker helped clarify things - he had the actual package knowledge, I was just the ideas man - and we were ready to go. And so the very first Installation Testing framework was birthed.

So what is it?


Well what is a package? It's a thing that runs on a computer, it puts files in places when you install, in our case it starts a service that you can poke with a few ancillary tools, and you can uninstall and even purge it when you lose interest.

So there are the contours of the framework: exercising the package installer is a case of running it and observing files appear where expected, with expected permissions, and verifying a service is started automatically and becomes available. That the right dependencies are installed too. Uninstalling and purging is the reverse. A bit of poking at the ancillary tools to see they are also working. Easy peasy!

Once you get into it, a first challenge is sandboxing and isolating: you want a clean slate for every run, so that you have reproducibility and an aura of science about it. We already do this in other areas, using throw-away AWS instances. Spot priced too because economics, and these days you are charged by the minute so very little waste. Indeed the catalogue of AWS instance OSes helps us reach different OSes (Debian, Ubuntu, RHEL, Amazon Linux, ...) at different versions, spanning the space our customer base lives in. So much winning there.

The tests are really a series of commands sent via SSH, there are calls and waits and assertions like you would use in similar system testing, classic stuff. It gets frameworky once you realise the same high level script applies to each platform, yay reuse. But really it is basic stuff when you think about it.
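
To make the "same high level script, different platforms" point concrete, here is a rough sketch. It is not the actual framework: every class and method name is hypothetical, the SSH layer is replaced by a fake that records commands, and the package manager commands are only examples of what differs per platform.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch only, not the framework described here: all names are hypothetical.
final class InstallationScriptSketch {

    /** Stand-in for "send this command over SSH and return its exit code". */
    interface RemoteShell {
        int run(String command);
    }

    /** The commands that differ per platform; the values used in main() are examples. */
    static final class Platform {
        final String installCommand;
        final String queryCommand;

        Platform(String installCommand, String queryCommand) {
            this.installCommand = installCommand;
            this.queryCommand = queryCommand;
        }
    }

    /** The platform-independent script: install, then check package and service. */
    static void installAndVerify(RemoteShell shell, Platform platform) {
        expectSuccess(shell, platform.installCommand);
        expectSuccess(shell, platform.queryCommand);
        expectSuccess(shell, "systemctl is-active neo4j");
    }

    /** Runs one command and fails loudly on a non-zero exit code. */
    static void expectSuccess(RemoteShell shell, String command) {
        int exitCode = shell.run(command);
        if (exitCode != 0) {
            throw new AssertionError("'" + command + "' exited with " + exitCode);
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        // Fake shell that records each command and pretends it succeeded.
        RemoteShell fake = command -> {
            sent.add(command);
            return 0;
        };

        installAndVerify(fake, new Platform("sudo apt-get install --yes neo4j", "dpkg -s neo4j"));
        installAndVerify(fake, new Platform("sudo yum install -y neo4j", "rpm -q neo4j"));
        sent.forEach(System.out::println);
    }
}
```

A real run would swap the fake for an SSH-backed implementation and add waits, file and permission checks, and the uninstall and purge steps.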

Now, there were stumbles. Internal ones like picking technology, where we tried and discarded Vagrant+VirtualBox as the container for the different platforms, we couldn't make it work reliably for us. We also discarded Cucumber for the high level scripting, because really, we don't have non-coders looking at this and Ruby isn't a core skill here (we're a 4j shop remember!), so a bunch of hassle with no payoff.

External stumbles currently include Zypper, which is a PITA to work with, and of course all the little niggles and bugs found+fixed. But external stumbles are exactly what we are hunting for here, so that's just grand.

Status


Current incarnation of Installation Testing was implemented by my good colleague Jenny Owen as Java + JUnit with Maven, using the standard AWS client libraries and JSch. It works great, we have readable code and running this in CI gives us so much clarity and confidence which we never had before. There are thoughts about open sourcing it as it might be useful to others. Do you think it is? I'd love to hear experience reports if you faced similar problems.

So, you launch AWS instances, commandeer them via SSH, exercise and evaluate the software, terminate the instance. In parallel across the 12 and growing platforms multiplied by currently 2 editions of Neo4j we are interested in, for that low latency feedback. Textbook. What's not to like?

Thursday 12 July 2018

The International E-road Network and Neo4j

I was having fun recently with some E-road data that I found. E-roads are highly intuitive for spatial/ graph-y stuff: you will be on one regularly, and they will lead you to Rome, eventually. Or Aarhus. And because it is graph-y and spatial at the same time, it is obvious to try some shortest path queries on it, which Neo4j has built-in. Think route planning.

This particular dataset has 895 reference places and 1250 sections of road between them. Roads have a distance attribute, which will come in handy. And there is some more metadata to play with like country codes and whether a road is a water crossing.

The data needed some massaging into CSV format before it could be imported into Neo4j. I ended up with one line of CSV per section of E-road, so some duplication of reference places, but meh - Neo4j can merge them back together.


Oh and please go ahead and use the dataset if you find it interesting.

Let's go from Århus to Rome:

Cypher has a feature called variable length pattern matching. Here is the simplest possible Cypher query for finding the path we want:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*]-(rome {name: "Roma"}))
RETURN p

Well, that didn't work, Spinner-of-Death™. The dataset is too large, or the query is too broad, or my laptop is too small. Aha! But I happen to know from playing with the dataset that there exists a path of length 28 between Aarhus and Rome, so we can tell the path finder exactly how many hops to look for:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*28]-(rome {name: "Roma"}))
RETURN p

Meh, Spinner-of-Death™ again...

Waypoints! Let's make it even a bit easier and constrain the query by inserting waypoints. The path I found while exploring goes via Stuttgart and Milan:

MATCH p=((aarhus {name: "Århus"})-[:EROAD*11]-(stuttgart {name: "Stuttgart"})-[:EROAD*11]-(milan {name: "Milano"})-[:EROAD*6]-(rome {name: "Roma"}))
RETURN p


Result!

We can even find the length of the paths:

MATCH p=(aarhus {name: "Århus"})-[:EROAD*11]-(stuttgart {name: "Stuttgart"})-[:EROAD*11]-(milan {name: "Milano"})-[:EROAD*6]-(rome {name: "Roma"})
RETURN REDUCE(s = 0, r IN relationships(p) | s + TOINT(r.distance)) AS total_distance ORDER BY total_distance ASC

It comes up to 2329 km for the shortest path, 3491 km for the longest, and there are 48 paths that fit the pattern.

We need a shortest path algorithm

Alright, the trouble with the above approach is, the waypoints I chose are probably sub-optimal, and therefore the path isn't as short as it could be. Also it is a jumble to look at, there are several paths between Stuttgart and Milan of length 11 for example, and it is hard to get an intuition of this intuitive spatial/ graph-y data.

Luckily, Cypher has a shortest path algorithm built in:

MATCH p=shortestPath((aarhus {name: "Århus"})-[rels:EROAD*]-(rome {name: "Roma"}))
RETURN p, length(p), REDUCE(s = 0, r IN rels | s + TOINT(r.distance)) AS total_distance


So there is the shortest path from Aarhus to Rome - in terms of hops. 22 hops and 2948 km, including sailing across the Adriatic Sea. Let's call it the scenic route.

We know that is suboptimal, but at least the query looks sane now without the waypoints. Oh and some graphics skills, we can edit the graph in Neo4j Browser, drag and arrange the nodes so they neatly overlay locations on a map.

Weighted shortest path FTW!

Right. Last refinement, promise. Weighted shortest path is supported in Neo4j using Dijkstra from the APOC procedure library plugin, and we need that so we can minimise distance instead of just hops:

MATCH (aarhus {name: 'Århus'}), (rome {name: 'Roma'})
CALL apoc.algo.dijkstra(aarhus, rome, 'EROAD', 'distance') YIELD path, weight
RETURN path, length(path), weight


Neat and simple query giving us a path with 26 hops, 2147 km, and quite straight-looking on the map. We have a winner.
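
If the difference between fewest hops and shortest distance feels abstract, here is a toy illustration in plain Java. It is a sketch of the general technique (breadth-first search versus Dijkstra), not of how Neo4j or APOC implement it, and the places and distances are made up rather than taken from the E-road dataset:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Toy illustration of hops versus distance; the graph is invented.
final class ShortestPathSketch {

    /** A road section to a neighbour, or (in the queue) a node with its distance so far. */
    static final class Edge {
        final String to;
        final int km;

        Edge(String to, int km) {
            this.to = to;
            this.km = km;
        }
    }

    /** Adds an undirected road section between a and b. */
    static void road(Map<String, List<Edge>> graph, String a, String b, int km) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(new Edge(b, km));
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(new Edge(a, km));
    }

    /** A to D is 2 hops via B (200 km) or 3 hops via C and E (90 km). */
    static Map<String, List<Edge>> toyGraph() {
        Map<String, List<Edge>> graph = new HashMap<>();
        road(graph, "A", "B", 100);
        road(graph, "B", "D", 100);
        road(graph, "A", "C", 30);
        road(graph, "C", "E", 30);
        road(graph, "E", "D", 30);
        return graph;
    }

    /** Breadth-first search: smallest number of hops, or -1 if unreachable. */
    static int fewestHops(Map<String, List<Edge>> graph, String from, String to) {
        Map<String, Integer> hops = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        hops.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(to)) {
                return hops.get(node);
            }
            for (Edge e : graph.getOrDefault(node, Collections.emptyList())) {
                if (!hops.containsKey(e.to)) {
                    hops.put(e.to, hops.get(node) + 1);
                    queue.add(e.to);
                }
            }
        }
        return -1;
    }

    /** Dijkstra: smallest total distance in km, or -1 if unreachable. */
    static int shortestKm(Map<String, List<Edge>> graph, String from, String to) {
        Map<String, Integer> best = new HashMap<>();
        PriorityQueue<Edge> queue = new PriorityQueue<>((a, b) -> Integer.compare(a.km, b.km));
        best.put(from, 0);
        queue.add(new Edge(from, 0));
        while (!queue.isEmpty()) {
            Edge current = queue.poll();
            if (current.km > best.getOrDefault(current.to, Integer.MAX_VALUE)) {
                continue; // stale queue entry, a shorter route was already found
            }
            if (current.to.equals(to)) {
                return current.km;
            }
            for (Edge e : graph.getOrDefault(current.to, Collections.emptyList())) {
                int candidate = current.km + e.km;
                if (candidate < best.getOrDefault(e.to, Integer.MAX_VALUE)) {
                    best.put(e.to, candidate);
                    queue.add(new Edge(e.to, candidate));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, List<Edge>> graph = toyGraph();
        System.out.println("fewest hops: " + fewestHops(graph, "A", "D")); // prints "fewest hops: 2"
        System.out.println("shortest km: " + shortestKm(graph, "A", "D")); // prints "shortest km: 90"
    }
}
```

Same story as Aarhus to Rome: the route with the fewest hops is not the shortest one once you weigh each section by its distance.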

Thursday 27 June 2013

Divide and conquer your merge hell with Git

I'm no Git wizard, and it is a scary powerful tool. But from today I absolutely love it!

I was doing the dreaded master merge, facing merge hell, and feeling kinda glum. But I found this as a way of breaking the problem up and merging only a manageable chunk at a time - say merge master as of 5 days ago:

$ git merge "master@{5 days ago}"

or master as it was 3 commits ago:

$ git merge master^^^

More work, sure, you have to potentially re-visit files, but it helps triangulate the hard spots. Indeed, you can do a backwards binary search for a manageable chunk.

More revision-vocabulary here: http://git-scm.com/docs/gitrevisions.html

Wednesday 4 July 2012

Versioning

In my new job as a product engineer - as opposed to being on the application side - this is a good, concise read about semantic versioning.

Wednesday 23 May 2012

The E in ELB

From http://aws.amazon.com/articles/1636185810492479:
There are a variety of load testing tools, and most of the tools are designed to address the question of how many servers a business must procure based on the amount of traffic the servers are able to handle. To test server load in this situation, it was logical to quickly ramp up the traffic to determine when the server became saturated, and then to try iterations of the tests based on request and response size to determine the factors affecting the saturation point.

When you create an elastic load balancer, a default level of capacity is allocated and configured. As Elastic Load Balancing sees changes in the traffic profile, it will scale up or down. The time required for Elastic Load Balancing to scale can range from 1 to 7 minutes, depending on the changes in the traffic profile. When Elastic Load Balancing scales, it updates the DNS record with the new list of IP addresses. To ensure that clients are taking advantage of the increased capacity, Elastic Load Balancing uses a TTL setting on the DNS record of 60 seconds. It is critical that you factor this changing DNS record into your tests. If you do not ensure that DNS is re-resolved or use multiple test clients to simulate increased load, the test may continue to hit a single IP address when Elastic Load Balancing has actually allocated many more IP addresses. Because your end users will not all be resolving to that single IP address, your test will not be a realistic sampling of real-world behavior.
Can you live with O(minutes) responsiveness?
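
One thing that can trip up a JVM-based load generator here: Java caches successful DNS lookups itself, controlled by the networkaddress.cache.ttl security property. A minimal sketch of capping that cache to match the 60 second TTL from the quote; the class and method names are made up, and the property has to be set before the JVM does its first lookup:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper for a load test client: keeps the JVM from holding on to old ELB addresses.
final class DnsRefreshSketch {

    /** Caps the JVM's cache for successful DNS lookups; call before the first lookup. */
    static void capDnsCache(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) throws UnknownHostException {
        capDnsCache(60);
        // Pass your load balancer's DNS name as the first argument; localhost is just a default.
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetAddress address : InetAddress.getAllByName(host)) {
            System.out.println(host + " -> " + address.getHostAddress());
        }
    }
}
```

Your operating system or resolver may cache too, so this is only one of the layers to check.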