Scaling Your Monitoring Stack Horizontally: Infinite Growth

Vertical scaling has limits. Learn how to shard your monitoring across 100 nodes using Consistent Hashing.

Jesus Paz

You monitored 100 servers. It was easy. You monitored 1,000 servers. You bought a bigger CPU (Vertical Scaling). Now you need to monitor 100,000 servers, and no single CPU is big enough. You need Horizontal Scaling.

The Problem: “Who Checks What?”

If you have 100 Docker containers running the Cluster Uptime agent, how do you decide which container checks google.com?

  • If all of them check google.com, you are DDoSing Google.
  • If none of them check it, you miss the outage.

Solution 1: Static Sharding (The Bad Way)

Partition by ID.

  • Agent 1: Checks ID 1-1000.
  • Agent 2: Checks ID 1001-2000.

Failure Mode: If Agent 1 crashes, Monitors 1-1000 disappear. You have a blind spot.
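A minimal sketch of static sharding, assuming a hypothetical `agent_for_monitor` helper and fixed shard size of 1,000:

```python
# Static sharding: each agent owns a fixed, contiguous ID range.
SHARD_SIZE = 1000

def agent_for_monitor(monitor_id: int) -> int:
    """Map a monitor ID to the agent statically responsible for it."""
    return (monitor_id - 1) // SHARD_SIZE + 1

# Agent 1 owns IDs 1-1000, Agent 2 owns IDs 1001-2000, and so on.
# If Agent 1 crashes, nothing reassigns IDs 1-1000: a blind spot.
```

The mapping is trivial to compute, which is the appeal; the fatal flaw is that it never changes when the set of live agents does.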

Solution 2: Leader Election

Agents talk to each other (via etcd/Consul) to elect a “Leader.” The Leader assigns tasks to Workers. Pros: Robust. Cons: Complex, and the Leader becomes a bottleneck.
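A hedged sketch of the leader's side of this design. The election itself (winning a lease in etcd or Consul) is omitted; this only shows the assignment step the elected leader would run, using hypothetical worker and monitor names:

```python
from itertools import cycle

def assign(monitors: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Leader-side task assignment: spread monitors across workers round-robin."""
    plan: dict[str, list[str]] = {w: [] for w in workers}
    for monitor, worker in zip(monitors, cycle(workers)):
        plan[worker].append(monitor)
    return plan

plan = assign(["google.com", "github.com", "aws.com"], ["w1", "w2"])
# w1 is assigned google.com and aws.com; w2 is assigned github.com
```

Note that every assignment decision flows through one process, which is exactly why the leader becomes a bottleneck at scale.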

Solution 3: Consistent Hashing (The Cluster Uptime Way)

We use a shared-nothing architecture based on Consistent Hashing (Ring Hashing).

  1. All agents know about all other agents (Memberlist protocol).
  2. Hash the Monitor URL: hash('google.com') = 9348.
  3. Map 9348 to the closest Agent ID on the ring.

If an Agent dies, the ring automatically rebalances. The neighbors pick up the slack instantly. This allows us to scale from 1 node to 1,000 nodes linearly.
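The steps above can be sketched in a few lines. This is a simplified illustration, not the Cluster Uptime implementation: one point per agent (real rings use virtual nodes for smoother balance), MD5 as the stable hash, and made-up agent and site names. The demo at the bottom shows the rebalancing property: killing one agent moves only the URLs that agent owned.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash: every agent computes the same ring position for a key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring (one point per agent, no virtual nodes)."""

    def __init__(self, agents: list[str]):
        self._points = sorted((_hash(a), a) for a in agents)
        self._keys = [h for h, _ in self._points]

    def owner(self, url: str) -> str:
        # Clockwise walk: first agent at or after the URL's hash position.
        i = bisect.bisect(self._keys, _hash(url)) % len(self._keys)
        return self._points[i][1]

agents = [f"agent-{n}" for n in range(5)]
urls = [f"site-{n}.example.com" for n in range(100)]

before = {u: Ring(agents).owner(u) for u in urls}
# Kill agent-2: its URLs slide to the next agent clockwise; all others stay put.
after = {u: Ring([a for a in agents if a != "agent-2"]).owner(u) for u in urls}

moved = [u for u in urls if before[u] != after[u]]
assert all(before[u] == "agent-2" for u in moved)
```

Because only the dead agent's keys move, adding or removing a node touches roughly 1/N of the workload instead of reshuffling everything, which is what makes the linear scaling claim work.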

Conclusion

Horizontal scaling turns “Capacity Planning” into “Just add another node.” It is the only way to handle hyperscale monitoring.

Jesus Paz

Founder
