Scaling Your Monitoring Stack Horizontally: Infinite Growth

Vertical scaling has limits. Learn how to shard your monitoring across 100 nodes using Consistent Hashing.

Jesus Paz

You monitored 100 servers. It was easy. You monitored 1,000 servers. You bought a bigger CPU (Vertical Scaling). Now you need to monitor 100,000 servers, and no single CPU is big enough. You need Horizontal Scaling.

The Problem: “Who Checks What?”

If you have 100 Docker containers running the Cluster Uptime agent, how do you decide which container checks google.com?

  • If all of them check google.com, you are DDoSing Google.
  • If none of them check it, you miss the outage.

Solution 1: Static Sharding (The Bad Way)

Partition by ID.

  • Agent 1: Checks ID 1-1000.
  • Agent 2: Checks ID 1001-2000.

Failure Mode: If Agent 1 crashes, Monitors 1-1000 disappear. You have a blind spot.
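A minimal sketch of static sharding, assuming a hypothetical `agent_for_monitor` helper and fixed shard size of 1,000:

```python
# Static sharding: each agent owns a fixed, contiguous ID range.
SHARD_SIZE = 1000

def agent_for_monitor(monitor_id: int) -> int:
    """Map a monitor ID to the agent statically responsible for it."""
    return (monitor_id - 1) // SHARD_SIZE + 1

# Agent 1 owns IDs 1-1000, Agent 2 owns IDs 1001-2000, and so on.
# If Agent 1 crashes, nothing reassigns IDs 1-1000: a blind spot.
```

The mapping is trivial to compute, which is the appeal; the fatal flaw is that it never changes when the set of live agents does.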

Solution 2: Leader Election

Agents talk to each other (via etcd/Consul) to elect a “Leader.” The Leader assigns tasks to Workers. Pros: Robust. Cons: Complex, and the Leader becomes a bottleneck.
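A hedged sketch of the leader's side of this design. The election itself (winning a lease in etcd or Consul) is omitted; this only shows the assignment step the elected leader would run, using hypothetical worker and monitor names:

```python
from itertools import cycle

def assign(monitors: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Leader-side task assignment: spread monitors across workers round-robin."""
    plan: dict[str, list[str]] = {w: [] for w in workers}
    for monitor, worker in zip(monitors, cycle(workers)):
        plan[worker].append(monitor)
    return plan

plan = assign(["google.com", "github.com", "aws.com"], ["w1", "w2"])
# w1 is assigned google.com and aws.com; w2 is assigned github.com
```

Note that every assignment decision flows through one process, which is exactly why the leader becomes a bottleneck at scale.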

Solution 3: Consistent Hashing (The Cluster Uptime Way)

We use a shared-nothing architecture based on Consistent Hashing (Ring Hashing).

  1. All agents know about all other agents (Memberlist protocol).
  2. Hash the Monitor URL: hash('google.com') = 9348.
  3. Map 9348 to the closest Agent ID on the ring.

If an Agent dies, the ring automatically rebalances. The neighbors pick up the slack instantly. This allows us to scale from 1 node to 1,000 nodes linearly.
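The steps above can be sketched in a few lines. This is a simplified illustration, not the Cluster Uptime implementation: one point per agent (real rings use virtual nodes for smoother balance), MD5 as the stable hash, and made-up agent and site names. The demo at the bottom shows the rebalancing property: killing one agent moves only the URLs that agent owned.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash: every agent computes the same ring position for a key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring (one point per agent, no virtual nodes)."""

    def __init__(self, agents: list[str]):
        self._points = sorted((_hash(a), a) for a in agents)
        self._keys = [h for h, _ in self._points]

    def owner(self, url: str) -> str:
        # Clockwise walk: first agent at or after the URL's hash position.
        i = bisect.bisect(self._keys, _hash(url)) % len(self._keys)
        return self._points[i][1]

agents = [f"agent-{n}" for n in range(5)]
urls = [f"site-{n}.example.com" for n in range(100)]

before = {u: Ring(agents).owner(u) for u in urls}
# Kill agent-2: its URLs slide to the next agent clockwise; all others stay put.
after = {u: Ring([a for a in agents if a != "agent-2"]).owner(u) for u in urls}

moved = [u for u in urls if before[u] != after[u]]
assert all(before[u] == "agent-2" for u in moved)
```

Because only the dead agent's keys move, adding or removing a node touches roughly 1/N of the workload instead of reshuffling everything, which is what makes the linear scaling claim work.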

Conclusion

Horizontal scaling turns “Capacity Planning” into “Just add another node.” It is the only way to handle hyperscale monitoring.

Jesus Paz

Founder
