One of Pantheon’s primary purposes is to safeguard the security, stability and scalability of our customers’ websites. This is the foundation of everything else we do, and this year one of the new challenges we’ve faced is an increasing number of sophisticated denial-of-service attacks, which called for a creative response from our engineers and product managers.
Understanding The Challenge
Historically, most DDoS attacks were brute-force attempts to overwhelm networking resources at Layers 3 and 4 of the network stack (TCP/IP). These are devastating when effective, since losing network access means game over, but they are also relatively trivial to prevent with good network operations. Services like Fastly and Cloudflare have made these somewhat obsolete, and we’ve been able to shield customers from this type of risk for years.
However, in 2023 we saw an uptick in Layer 7 attacks that use real web browsers to mimic visitors, specifically targeting weak points in web content management systems to do their damage. This is not something that can be mitigated by hardening the network configuration. Because the incoming requests aren’t immediately distinguishable from authentic traffic, sites without a Web Application Firewall, or WAF, can be overwhelmed by inauthentic traffic.
While we provide a tunable WAF as part of our Advanced Global CDN product, not every customer purchases this premium add-on. And while Pantheon’s serverless infrastructure scales horizontally, there are still limits, particularly if an attacker finds a weak point (e.g., flooding a database-intensive search feature with randomly generated queries).
We know site instability is incredibly frustrating, and we’ve seen instances this year where particularly large attacks — some into the thousands of requests per second — created wider stability issues for the platform. That’s unacceptable. Our customers depend on us to handle this. So we took a fresh look at the evolving threat landscape to ensure we could keep the Pantheon promise of resilience.
Customer Traffic and Rate Limiting Overview
Rate limiting is the process of setting thresholds for traffic, and returning HTTP 429 (Too Many Requests) error responses when that threshold is reached instead of allowing the traffic to reach the intended site. To achieve this, a counter in our Global CDN tracks the number of requests for a given identity (for example, an IP address). When the average number of requests per second exceeds the threshold for a given window (say, over a period of 10 seconds), requests from that identity will be served 429 errors from the edge.
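To make the counter-and-window idea concrete, here is a minimal sketch of a windowed rate limiter in Python. It is illustrative only and is not how our Global CDN or Fastly implements rate limiting: the class name `WindowRateLimiter`, its parameters, and the choice to keep per-identity timestamps in memory are all assumptions made for the example.

```python
from collections import defaultdict, deque


class WindowRateLimiter:
    """Hypothetical limiter: serve 429 once an identity's average
    requests per second over the window exceeds the threshold."""

    def __init__(self, max_rps: float, window_seconds: float = 10.0):
        self.max_rps = max_rps
        self.window_seconds = window_seconds
        # identity -> timestamps of that identity's recent requests
        self._hits = defaultdict(deque)

    def check(self, identity: str, now: float) -> int:
        """Record one request and return the status code to serve."""
        hits = self._hits[identity]
        hits.append(now)
        # Discard timestamps that have aged out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        average_rps = len(hits) / self.window_seconds
        # Rejected requests still count toward the window in this sketch.
        return 429 if average_rps > self.max_rps else 200


limiter = WindowRateLimiter(max_rps=2, window_seconds=10)
# 20 requests in under 2 seconds average exactly 2 rps over the window.
statuses = [limiter.check("203.0.113.7", now=i * 0.1) for i in range(20)]
# One more request pushes the average above the threshold.
over_limit = limiter.check("203.0.113.7", now=2.0)
```

Each identity is counted separately, so a second identity making its first request is unaffected, and a throttled identity is served normally again once its earlier requests age out of the window.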
Pantheon Product and Engineering squads analyzed billions of requests from our Global CDN logs, along with past traffic-related incidents, to determine what we would deem abusive. We were then able to leverage rate-limiting capabilities within Fastly to create a mitigation strategy.
After setting appropriate thresholds for acceptable use, we needed to decide upon a unique identifier for measuring incoming traffic. While a client’s IP address is a good general indicator of identity, there are scenarios like network address translation, where multiple clients share a single external IP address. We didn’t want to risk having “false positives” where authentic site visitors would have their experience impacted.
To target traffic even more specifically, we are using a form of identification known as a JA3 fingerprint. JA3 fingerprinting uses details from a client’s TLS (Transport Layer Security) configuration, including its version, accepted ciphers, and list of extensions, to create a more specific identifier for a client. By using the combination of client IP address and JA3 fingerprint, we can block abusive traffic without affecting other clients, even if they share the same IP address.
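As a rough illustration, the publicly documented JA3 method joins five fields from the TLS ClientHello (version, ciphers, extensions, elliptic curves, and elliptic curve point formats) into one string and takes its MD5 hash. The sketch below follows that public definition; the function names, the `identity_key` format, and the sample values are assumptions for the example, not a description of our production configuration.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3 input: fields separated by commas,
    values within each list separated by dashes."""
    lists = (ciphers, extensions, curves, point_formats)
    fields = [str(version)]
    fields += ["-".join(str(value) for value in values) for values in lists]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """Return the JA3 fingerprint: the MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


def identity_key(client_ip: str, fingerprint: str) -> str:
    """Hypothetical rate-limiting identity: IP plus JA3 fingerprint."""
    return f"{client_ip}|{fingerprint}"


# Two clients behind the same NAT address with different TLS stacks
# (the numeric values are made up for illustration).
browser = ja3_hash(771, [4865, 4866, 49195], [0, 10, 11, 43], [29, 23], [0])
script = ja3_hash(771, [49195, 49199], [0, 10, 11], [23, 24], [0])

key_browser = identity_key("203.0.113.7", browser)
key_script = identity_key("203.0.113.7", script)
```

Because the two keys differ, a counter keyed this way can throttle the abusive client while leaving the other visitor on the same IP address alone.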
This change not only helps us protect customer sites but also eliminates a concern some customers have expressed about inauthentic traffic leading to higher costs. Because Pantheon’s metering system for visits only counts successful requests for full web pages (meaning response code 200), rate-limited 429 responses are automatically filtered out of the metrics used for site billing.
As we work our way into 2024, Pantheon Engineering squads will continue to work on platform resilience and scalability to protect our customers. The ever-evolving nature of traffic-related threats and disruptions demands a proactive approach, and our engineering teams will be at the forefront, ensuring that customers continue to have a wholesome WebOps experience.