Auth0, a provider of authentication, authorization, and single sign-on services, moved their infrastructure from multiple cloud providers (AWS, Azure, and Google Cloud) to AWS alone. An increasing dependency on AWS services necessitated the move, and today their systems are spread across 4 AWS regions, with services replicated across availability zones.
Auth0’s design goal was to be able to run either on-premises or in the cloud. In the last 4 years, their systems have scaled to serve more than 1.5 billion logins per month. The number of services has grown from 10 to 30, and the number of servers from a couple of dozen in a single AWS region to more than 1000 spread across 4 regions. Their architecture is composed of a routing layer, whose backends are auto-scaling groups of different services, and a data storage layer with MongoDB, Elasticsearch, Redis, and PostgreSQL, supported by Kinesis streams and message queues: RabbitMQ, SNS, and SQS.
Auth0’s initial architecture was spread across Azure and AWS, with some bits on Google Cloud. Although Azure was at first the primary region for their SaaS deployment, with AWS as a failover, they later reversed the roles. Failover between clouds was DNS-based, which meant the TTL had to be very low for clients to switch quickly when a failover happened. Dirceu Tiegs, Production Engineer at Auth0, writes that “as we began using more AWS resources like Kinesis and SQS, we started having trouble keeping the same feature set in both providers.” Azure did have a service similar to SQS at that time, called Azure Service Bus, and it is not mentioned which other AWS services lacked equivalents in Azure. There were likely also cases where the lack of a service in a specific AWS region led them to build the functionality with something else.
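The DNS-based failover described above hinges on a low TTL: clients cache the resolved record only briefly, so repointing the record redirects traffic within minutes. A hypothetical zone-file sketch (all hostnames are made up, not Auth0’s) might look like:

```
; Illustrative record with a 60-second TTL, so clients re-resolve
; quickly when the record is repointed during a cross-cloud failover.
login.example.com.   60   IN   CNAME   primary.azure-endpoint.example.net.

; On failover, the same record is rewritten to the AWS endpoint:
; login.example.com.   60   IN   CNAME   failover.aws-endpoint.example.net.
```

The trade-off is higher query load on the DNS servers, since the short TTL forces clients to re-resolve far more often.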
One of Auth0’s outages occurred in 2016, when their VPN endpoint in AWS started dropping network packets from Azure and GCE. The database clustering architecture at the time spanned all three cloud providers, and due to this issue the primary database node in AWS failed to receive heartbeat packets from Azure. In the subsequent recovery attempts, all the cluster nodes marked themselves as secondary, and service was affected. A DNS misconfiguration compounded the problem. The team ultimately decided to keep only a minimum working version of their auth service on Azure for use when AWS went down, and AWS became their primary cloud provider.
Auth0’s AWS architecture runs all their services, including databases, across 3 availability zones (AZs) in each region. If an AZ fails, services remain available from the other two. If an entire region fails, Route53 – AWS’s DNS service – can be updated to point their domains to another active region. Some services have higher availability guarantees than others: the Elasticsearch-based user search service, for example, might serve slightly stale data after a failover, but all core functionality continues to work. The database layer consists of a cross-region MongoDB cluster, RDS replication for PostgreSQL, and per-region Elasticsearch clusters.
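The regional failover logic above can be sketched as a small decision function. This is a hypothetical illustration, not Auth0’s actual code: region names, the preference order, and the function itself are assumptions.

```python
# Hypothetical sketch of region failover selection (not Auth0's code).
# Given per-region health-check results, choose the region the DNS
# records should point at: prefer the primary while it is healthy.

from typing import Dict, List


def pick_active_region(health: Dict[str, bool], preference: List[str]) -> str:
    """Return the first healthy region in preference order.

    `health` maps region name -> True if its health checks pass;
    `preference` lists regions from most to least preferred.
    """
    for region in preference:
        if health.get(region):
            return region
    raise RuntimeError("no healthy region available")


# Example: the primary us-west-2 is down, so traffic fails over
# to the next preferred healthy region, us-east-1.
preference = ["us-west-2", "us-east-1", "eu-central-1", "ap-southeast-1"]
active = pick_active_region(
    {"us-west-2": False, "us-east-1": True,
     "eu-central-1": True, "ap-southeast-1": True},
    preference,
)
```

In practice, `active` would then feed a Route53 record update (e.g. via the AWS API) to repoint the domains at that region’s endpoints, and the low-TTL records would let clients pick up the change quickly.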
Auth0 ran their own Content Delivery Network (CDN) until 2017, when they transitioned to CloudFront. Their home-grown CDN was backed by Amazon S3 and built using Varnish and nginx. The move to CloudFront has resulted in less maintenance and easier configuration.
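A home-grown CDN of this shape typically puts a caching proxy in front of an S3 origin. The fragment below is a minimal nginx sketch under assumed names (the hostname, bucket, and cache sizes are illustrative, not Auth0’s configuration):

```nginx
# Illustrative S3-backed edge cache, not Auth0's actual config.
proxy_cache_path /var/cache/nginx/cdn levels=1:2 keys_zone=cdn:10m
                 max_size=10g inactive=7d use_temp_path=off;

server {
    listen 80;
    server_name cdn.example.com;            # hypothetical CDN hostname

    location / {
        proxy_cache cdn;
        proxy_cache_valid 200 24h;          # keep successful responses a day
        proxy_set_header Host example-bucket.s3.amazonaws.com;
        proxy_pass https://example-bucket.s3.amazonaws.com;  # hypothetical bucket
        add_header X-Cache-Status $upstream_cache_status;    # HIT/MISS for debugging
    }
}
```

Running this at the edge means operating, patching, and capacity-planning the proxy fleet yourself, which is precisely the maintenance burden a managed CDN like CloudFront removes.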
Auth0 started with Pingdom for monitoring, then developed their own health-check system, which ran Node.js scripts and sent notifications to Slack. Their current stack comprises Datadog, CloudWatch, Pingdom, and Sentinel. Time-series metrics are collected by Datadog, with alerts sent to Slack and a few routed to PagerDuty. Slack is also used to automate tasks, in the spirit of the ChatOps collaboration model. The log processing pipeline uses Amazon Kinesis, Elasticsearch, and Kibana to collect application logs, while Sumo Logic records audit trails and AWS-generated logs.
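A health-check script that notifies Slack can be quite small. The sketch below is hypothetical and in Python rather than the Node.js Auth0 used; the endpoint names and webhook URL are made-up placeholders:

```python
# Hypothetical health-check-to-Slack sketch (not Auth0's scripts).
import json
import urllib.request


def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False


def slack_payload(results: dict) -> dict:
    """Format check results as a Slack incoming-webhook message body."""
    failing = sorted(name for name, ok in results.items() if not ok)
    if not failing:
        return {"text": ":white_check_mark: all health checks passing"}
    return {"text": ":rotating_light: failing checks: " + ", ".join(failing)}


# Posting would look like this (webhook URL is a placeholder):
# results = {"login": check("https://login.example.com/health")}
# req = urllib.request.Request(
#     "https://hooks.slack.com/services/T000/B000/XXXX",
#     data=json.dumps(slack_payload(results)).encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

Posting directly into a shared channel is what makes this ChatOps-friendly: the same channel that receives alerts can host the commands that act on them.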