Actually it was Julian Simpson (he of ThoughtWorks, Neo4j and DevOps New Zealand fame) who inspired me down the pub one night (John Scott's in Lund, lovely place).
The problem we had was that Maven Central's infrastructure wasn't good enough for our needs: too many failing builds because of connectivity issues during artifact downloads. Basically HTTP GETs. Between Maven-the-client and Sonatype's CDN, things just weren't working well enough. We tried bintray.com too, with similarly disappointing results. Something had to be done. It is worth noting here that Maven-the-client does retry by default, so this isn't entirely basic stuff.
Background
We run a bunch of AWS instances on demand as part of CI, and they in turn run Maven: hundreds of concurrent jobs (and therefore machines) during office hours. Machines come up blank and go away when not needed, so most of the time we have to bootstrap the local Maven repository from somewhere. In fact, if a bunch of these come up at the same time, they conduct a form of DoS attack on Maven Central's CDN. Oops. Anyway, Julian outlined a solution: build a proxy and fiddle with artifact caching headers, to form a shared service for these cloud build agents. Intra-AWS HTTP downloads FTW.

Maven-the-client insists on sending some vicious headers: Cache-store: no-store, Cache-control: no-cache, Expires: 0. These force requests to go all the way to the origin server, and I guess they do it to gather usage statistics; certainly Maven Central artifacts never change, or the internetz would break.
Therefore the first item of business was to stop respecting those headers and have Squid cache objects anyway, and for a long time. Yes, this breaks the HTTP contract. No, that isn't a problem.
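In Squid terms, ignoring the client's headers boils down to a couple of refresh_pattern options; a minimal sketch of the relevant squid.conf lines (the same values end up in the full template below):

```
# Only cache Maven Central traffic
acl mavencentral dstdomain repo.maven.apache.org
cache deny !mavencentral

# refresh_pattern <site> <min minutes> <percent> <max minutes> [options]
# ignore-reload drops the client's no-cache/reload request,
# ignore-no-store caches despite Cache-Control: no-store
refresh_pattern repo.maven.apache.org 288000 100% 576000 ignore-reload ignore-no-store
```

That is 200 days minimum freshness, 400 days maximum; for immutable artifacts the exact numbers barely matter as long as they are big.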
Secondly, we do not want to replace one problem with another, so this needs to run as a 24/7/365 shared-nothing, highly-available, stateless 0.99999nines99 service, pretty much. You know, textbook scavailable stuff. AWS can help us there: think an ELB in front of an auto-scaling group full of instances running Squid. One machine fails and auto-scaling replaces it. Data loss is no problem; the new instance just starts with a cold cache, which works, only more slowly.
One-page scavailability
The blueprint, such as it is; not pretty, but neat to not have more bits: a CloudFormation template, instances running Ubuntu 16.04, service sitting on port 1080, bit of security and auth.

{
  "Description": "Maven proxy service",
  "Parameters": {
    "keyname": { "Type": "AWS::EC2::KeyPair::KeyName" }
  },
  "Resources": {
    "loadbalancer": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "Listeners": [
          { "LoadBalancerPort": "1080", "InstancePort": "3128", "Protocol": "TCP" }
        ]
      }
    },
    "securitygroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "SSH and Squid",
        "SecurityGroupIngress": [
          { "CidrIp": "0.0.0.0/0", "FromPort": "22", "IpProtocol": "tcp", "ToPort": "22" },
          {
            "FromPort": "3128",
            "ToPort": "3128",
            "IpProtocol": "tcp",
            "SourceSecurityGroupOwnerId": { "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.OwnerAlias" ] },
            "SourceSecurityGroupName": { "Fn::GetAtt": [ "loadbalancer", "SourceSecurityGroup.GroupName" ] }
          }
        ]
      }
    },
    "waithandle": { "Type": "AWS::CloudFormation::WaitConditionHandle" },
    "waitcondition": {
      "Type": "AWS::CloudFormation::WaitCondition",
      "Properties": {
        "Count": "3",
        "Handle": { "Ref": "waithandle" },
        "Timeout": "500"
      }
    },
    "launchconfiguration": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-c1167eb8",
        "InstanceType": "m3.large",
        "KeyName": { "Ref": "keyname" },
        "SecurityGroups": [ { "Ref": "securitygroup" } ],
        "UserData": { "Fn::Base64": { "Fn::Join": [ "", [
          "#!/bin/bash -eux", "\n",
          "\n",
          "apt update", "\n",
          "apt install --yes squid3 heat-cfntools apache2-utils", "\n",
          "\n",
          "echo '# Maven proxy service' > /etc/squid/squid.conf", "\n",
          "echo 'maximum_object_size_in_memory 30 MB' >> /etc/squid/squid.conf", "\n",
          "echo 'acl mavencentral dstdomain repo.maven.apache.org uk.maven.org' >> /etc/squid/squid.conf", "\n",
          "echo 'cache deny !mavencentral' >> /etc/squid/squid.conf", "\n",
          "echo 'refresh_pattern repo.maven.apache.org 288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
          "echo 'refresh_pattern uk.maven.org 288000 100% 576000 ignore-reload ignore-no-store' >> /etc/squid/squid.conf", "\n",
          "echo 'http_port 3128' >> /etc/squid/squid.conf", "\n",
          "echo 'auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwd' >> /etc/squid/squid.conf", "\n",
          "echo 'auth_param basic realm Maven proxy service' >> /etc/squid/squid.conf", "\n",
          "echo 'acl ncsa_users proxy_auth REQUIRED' >> /etc/squid/squid.conf", "\n",
          "echo 'http_access allow ncsa_users' >> /etc/squid/squid.conf", "\n",
          "\n",
          "htpasswd -b -c /etc/squid/passwd someone something", "\n",
          "systemctl reload squid", "\n",
          "\n",
          "export SIGNAL_URL='", { "Ref": "waithandle" }, "'\n",
          "echo \"signal url: ${SIGNAL_URL}\"", "\n",
          "cfn-signal ${SIGNAL_URL}", "\n",
          "\n"
        ] ] } }
      }
    },
    "autoscalinggroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": { "Fn::GetAZs": "" },
        "LaunchConfigurationName": { "Ref": "launchconfiguration" },
        "LoadBalancerNames": [ { "Ref": "loadbalancer" } ],
        "MinSize": "3",
        "MaxSize": "3"
      }
    }
  },
  "Outputs": {
    "dnsname": {
      "Description": "DNS name for load balancer",
      "Value": { "Fn::GetAtt": [ "loadbalancer", "DNSName" ] }
    }
  }
}
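Pointing Maven at the service is then a matter of ordinary proxy settings. A sketch of the relevant settings.xml bit, assuming the host is the ELB DNS name from the stack's dnsname output and using the someone/something credentials from the template:

```xml
<settings>
  <proxies>
    <proxy>
      <id>maven-proxy-service</id>
      <active>true</active>
      <protocol>http</protocol>
      <!-- the dnsname output from the CloudFormation stack -->
      <host>ELB-DNS-NAME-HERE</host>
      <port>1080</port>
      <username>someone</username>
      <password>something</password>
    </proxy>
  </proxies>
</settings>
```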
So how did it go?
Well actually, it didn't go particularly well. We did succeed in reducing the frequency of failures by a factor of 100 or so, yay! But the business problem persisted: still too many broken builds, and people kept complaining.

The next approach we took was to start caching artifacts on the machines themselves: mvn dependency:go-offline, with a twist, because that plugin is broken and abandoned (thanks Cosmin et al for fixing/forking!). In fact we seed machines when we build the AMIs, then patch them up just before running builds - all because of Maven! That works well enough in conjunction with some system properties we divined, system.maven.wagon.http.retryHandler.class=default and system.maven.wagon.http.retryHandler.nonRetryableClasses=java.io.InterruptedIOException,java.net.UnknownHostException,java.net.ConnectException (don't ask). But what a mess! In future surely we will all come around to sticking dependencies in source control, eh? Not sure why that has become a religious topic; it's not the 90s anymore, we have bandwidth and disk space.
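For plain mvn invocations outside the CI server, the underlying Wagon retry properties can be passed straight on the command line. A sketch, assuming the system. prefix above is just the CI server's parameter convention, and with the retry count being my own addition:

```shell
mvn dependency:go-offline \
  -Dmaven.wagon.http.retryHandler.class=default \
  -Dmaven.wagon.http.retryHandler.count=5 \
  -Dmaven.wagon.http.retryHandler.nonRetryableClasses=java.io.InterruptedIOException,java.net.UnknownHostException,java.net.ConnectException
```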
Also, while working on this I kept thinking Mark Nottingham's stale-while-revalidate would be a perfect fit here, but it wasn't in Squid 3.5 (or I couldn't get it working). Meh. Something to solve if we do this again.
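For the record, the RFC 5861 mechanism is just a response header; a cache honouring it serves the stale copy immediately while refetching in the background, which is exactly the failure mode we wanted. A sketch of what a friendlier origin (or rewriting proxy) could emit:

```
Cache-Control: max-age=600, stale-while-revalidate=86400
```

Here the object is fresh for ten minutes, then usable-while-refreshing for a further day.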