Thoughtpiece

AWS ‘Hyperplane’

VPC networking improvements move our startup closer to lifting off like Saturn V

First published: 9 Dec 2019 | 6 min read
Author: Mark Broadmore

Huma’s backend is based on AppSync (GraphQL) and Lambdas, which are great because they scale so easily. There’s no need to guess how many requests we’ll have in any given second and spin EC2s up and down in response, which would be a waste of money and engineering effort. If we get a spike in user activity AWS will automatically handle the load for us and the cost of extra Lambda invocations is fractions of a penny.

It’s brilliant — in theory.

In practice, until AWS’s “Hyperplane”-based networking improvements became available in early September this year (2019) there was an under-publicised but severe constraint that made Lambdas risky for us to use. If a Lambda needed access to a VPC then its startup time from ‘cold’ could explode from a few hundred milliseconds to tens of seconds. This great post from Coinbase has a detailed yet easy-to-follow explanation.

At Huma, our backend database needs to be secured inside a VPC, so we were hit by this issue. Our users were regularly encountering the dreaded ‘buffering’ circle for ten or more(!) seconds.

This wasn’t exactly helping our users become the raving fans that a startup needs.

A further impact came via Cognito because we need to check our backend database while signing users up and logging them in. We have configured Cognito to call our Lambdas, but Cognito has a fixed 5-second timeout on these calls. Five seconds should be more than enough — except if your Lambda needs to access a VPC. Sometimes our users couldn’t even sign up because Lambda VPC cold starts caused Cognito timeout errors!

[Cognito is an AWS service that provides user sign-up and login as a service; it handles verifying email addresses, enforcing password complexity rules, and the security bits and pieces involved in ensuring we only accept requests from valid users.]
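
For anyone who hasn’t used Cognito triggers, here’s a minimal sketch of the kind of handler involved. The event/return shape follows Cognito’s pre-sign-up trigger contract; check_user_is_invited() is a hypothetical stand-in for our real database lookup, i.e. the call that needs VPC access and therefore used to sit behind the slow cold start:

def check_user_is_invited(email):
    # Hypothetical stand-in for the real lookup against the database
    # inside the VPC; this is the call that sits behind the cold start.
    return email is not None

def pre_sign_up_handler(event, context):
    # Cognito passes the prospective user's details in the event and
    # expects the (possibly modified) event back within its 5-second limit.
    email = event["request"]["userAttributes"].get("email")

    if not check_user_is_invited(email):
        # Raising an exception makes Cognito reject the sign-up attempt.
        raise Exception("User is not on the invite list")

    return event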

When we first saw the slow responses in our QA environment we were puzzled. Things were visibly slow to the human eye, yet in our Lambda logs they were starting and finishing quickly, generally sub-second. But then we saw it: the startup delays were hidden. The Lambda runtime needs to start with its VPC networking attached before it can log anything, so the slow startup was never logged and was therefore invisible in the Lambda logs.

AppSync logs cover the entire GraphQL transaction with a Begin Request line followed by many detailed lines, finishing with an Execution Summary line. Timestamps on Begin Request proved the requests were arriving at the server in good time but taking a long time to be fulfilled by our Lambdas.

Using AppSync turned out to be a good choice for debugging this because we were able to add the AppSync server timestamp into requests on their way to our Lambdas and thus infer and log the startup delay on every invocation. If we had added a timestamp into our web or mobile clients we would have had to worry about the delay between their phone or computer and the AWS datacentre. We added it into our VTL templates as follows, using $util.time.nowISO8601():

{
    "version": "2017-02-28",
    "operation": "Invoke",
    "payload": {
        "appSyncDateTime": "$util.time.nowISO8601()",
        "application_values": "…would go here as usual"
    }
}

Now our Lambdas could get the current time as soon as they started, and compare that with the AppSync timestamp in appSyncDateTime. Any difference would be due to Lambda’s startup time. This didn’t show whether the delays were caused by the VPC issue, general Lambda VM startup and Python import processing, or anything else, but as the latter two have an effect measured in milliseconds compared to the multi-second delay caused by the former, this was acceptable. A Python (3.6+) function to do this is shown here — calling it from the top of each Lambda’s handler entry point did the trick nicely:

import logging
from typing import Dict

import arrow

log = logging.getLogger(__name__)

_LAST_ACCESS_TIME = None

def log_invocation_delay(event: Dict):
    # event should be exactly as it arrives in the first parameter
    # of the Lambda’s handler function
    global _LAST_ACCESS_TIME

    now = arrow.utcnow()
    if "appSyncDateTime" in event:
        invoke_delay = now - arrow.get(event["appSyncDateTime"])
    else:
        invoke_delay = None

    if _LAST_ACCESS_TIME is None and invoke_delay is not None:
        log.info("Lambda instance cold started, latency %0.0fms",
                 invoke_delay.total_seconds() * 1000.0)
    elif invoke_delay is not None:
        log.info("Lambda instance re-used, latency %0.0fms",
                 invoke_delay.total_seconds() * 1000.0)
    _LAST_ACCESS_TIME = now

Now the logs in CloudWatch had something searchable to measure the impact of this issue, at least for Lambdas serving AppSync. Because global variables such as _LAST_ACCESS_TIME aren’t re-initialised between invocations on the same Lambda instance, it’s easy to use them to track whether or not a Lambda has been cold-started.
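
For completeness, wiring it in is a one-liner at the top of each handler (the handler name below is just a placeholder):

def example_resolver_handler(event, context):
    # With the VTL mapping above, the payload object arrives as the event,
    # so appSyncDateTime is available immediately.
    log_invocation_delay(event)

    # ...normal resolver logic continues here...
    return {"message": "hello"}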

Unfortunately, Cognito’s request messages do not include a timestamp so we couldn’t measure their invocation latency. In general, there is very little visibility into errors in Cognito configuration or operation, but that’s a topic for another blog post.

It’s also possible to infer startup time using X-Ray, but it’s not directly visible on traces (despite there being a bar labelled “Invocation” in the X-Ray console).
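
If you do want the inferred delay to show up alongside your traces, one option is to record it as an annotation on a subsegment of your own. A minimal sketch, assuming X-Ray active tracing is enabled and the aws_xray_sdk package is bundled with the Lambda (record_invoke_delay_in_xray() is a hypothetical helper, not part of the SDK):

from aws_xray_sdk.core import xray_recorder

def record_invoke_delay_in_xray(delay_ms: float):
    # Lambda's top-level segment can't be modified, so attach the
    # annotation to a subsegment; annotations are then searchable with
    # X-Ray filter expressions.
    subsegment = xray_recorder.begin_subsegment("invocation_delay")
    try:
        subsegment.put_annotation("invoke_delay_ms", delay_ms)
    finally:
        xray_recorder.end_subsegment()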

Mitigations and workarounds

Others have noticed this VPC problem and pointed out solutions, including this post from Jeremy Daly. Basically, if you can keep your Lambdas “warm” then you never suffer the cold start problem. So we created a Lambda warmer along the lines suggested by Jeremy.
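
The handler-side half of a warmer is simple enough; it looks roughly like this (the “warmer” field name is our illustration rather than the exact contract of Jeremy’s library):

import time

def handler(event, context):
    # A scheduled CloudWatch Events rule invokes the Lambda with a marker
    # field, and the handler short-circuits instead of doing real work.
    if isinstance(event, dict) and event.get("warmer"):
        # Linger briefly so that concurrent warming invocations land on
        # distinct Lambda instances rather than re-using this one.
        time.sleep(0.1)
        return {"warmed": True}

    # ...normal request handling continues here...
    return {"ok": True}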

But.

As the comments on that post indicate, API Gateway tends not to use already-warmed Lambda instances, preferring instead to start its own. This is especially true when a Lambda uses X-Ray, which performance-critical Lambdas tend to do. So, trying to warm most of our VPC Lambdas didn’t help — we tried it and were quite disappointed!

It did work for the few Lambdas that we had in our Cognito configuration as Cognito does appear to use existing Lambda instances. Once we had our warmer in place our users could at least complete sign-up and login reliably (thanks Jeremy).

Boarding the Hyperplane

When AWS released their networking improvements the impact was dramatic:

Before the improvements, our mean cold-start time was nearly ten seconds (yellow dot in the left-most stack)! With the improvements, we had only one anomalous cold start that took more than two seconds (second from right).

This makes for happy users and lets us focus on making more of them happier, rather than worrying about VPC performance.

To the moon!

Epilogue: Log Insights queries

The graphs above were produced in Google Sheets, using data extracted from CloudWatch logs with Log Insights queries. The query language is fairly powerful but the documentation is a bit short on examples.

Almost as a note to self, these are the queries I used to extract the data. The logging function we actually use is slightly different from the one listed above, because it also handles Lambda warming events from our warmer, so the parsing here won’t work if you simply copy and paste log_invocation_delay(). They should be fairly easy to modify as needed.

This one uses glob strings and was used for the warmed Lambdas:

fields @timestamp, @message
| filter @message like "Invocation latency"
| parse @message "warming_event_handler:*] *. Invocation latency *ms" as @lineno, @launchType, @latency
| stats count() as c, min(@latency), pct(@latency, 50) as med, avg(@latency), pct(@latency, 95) as pct95, pct(@latency, 99) as pct99, max(@latency) as maxLatency by @launchType
| sort maxLatency desc

And this one uses regexes, and was used for the non-warmed Lambdas:

fields @timestamp, @message
| filter @message like /warming_event_handler:.+ms/
| parse @message /warming_event_handler:(?<@lineno>\d+)\] (Lambda instance \d )?(?<@launchType>[^.,]*).* took about (?<@latency>\d+) ms/
| stats count(), min(@latency), pct(@latency, 50) as med, avg(@latency), pct(@latency, 95) as pct95, pct(@latency, 99) as pct99, max(@latency) as maxLatency by @launchType
| sort maxLatency desc
