The architectural challenges of massive scale monitoring. Learn how we solved database bottlenecks, network limits, and alert fatigue.
Moving from monitoring a startup’s MVP to an enterprise-grade infrastructure is a huge leap. It is easy to write a script that checks 10 websites. It is a completely different engineering challenge to check 10,000 endpoints every minute with high reliability.
When we scaled Cluster Uptime to this level, we hit every bottleneck imaginable. Here are the hard lessons we learned, so you don’t have to learn them the hard way.
The Problem:
If you have 10,000 monitors checking every 60 seconds, that is 10,000 result rows written per minute, roughly 167 writes per second sustained around the clock, and over 14 million new rows a day. Most standard relational databases (Postgres/MySQL) on a single node will start to choke under this sustained write volume, especially once you add indexes.
The Solution: Time-Series Optimization & Buffering
We partition the metrics history into monthly tables (e.g., metrics_2025_12). This keeps the index size manageable and makes deleting old data (Retention Policy) instant: just drop the table.
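Partitioning covers the time-series half; the buffering half is to stop writing results row by row and instead batch them in memory for a few seconds before inserting. Here is a minimal sketch of that idea, assuming a psycopg2-style connection and an illustrative schema (the metrics table and its columns are assumptions, not our actual schema):

# A minimal sketch of write buffering, not production code. Assumes "metrics"
# is the parent of monthly partitions such as metrics_2025_12.
import time


class MetricBuffer:
    """Accumulate check results in memory and flush them as one batched INSERT."""

    def __init__(self, conn, max_rows=500, max_age_seconds=5.0):
        self.conn = conn                        # e.g. a psycopg2 connection
        self.max_rows = max_rows                # flush once this many rows are queued
        self.max_age_seconds = max_age_seconds  # ...or once the batch is this old
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, monitor_id, status_code, latency_ms, checked_at):
        self.rows.append((monitor_id, status_code, latency_ms, checked_at))
        too_many = len(self.rows) >= self.max_rows
        too_old = time.monotonic() - self.last_flush >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with self.conn.cursor() as cur:
            # One round trip for hundreds of rows instead of one per check result.
            cur.executemany(
                "INSERT INTO metrics (monitor_id, status_code, latency_ms, checked_at) "
                "VALUES (%s, %s, %s, %s)",
                self.rows,
            )
        self.conn.commit()
        self.rows.clear()
        self.last_flush = time.monotonic()

With a cap of 500 rows per batch, 10,000 results per minute become roughly 20 INSERT statements instead of 10,000 individual writes.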
The Problem:
When you make an outgoing TCP connection, the OS assigns a local port (e.g., 54321). There are only ~65,000 TCP ports in total, and by default Linux hands out ephemeral ports from an even smaller range (net.ipv4.ip_local_port_range defaults to 32768 through 60999, about 28,000 ports).
If you open 10,000 connections per minute and each one lingers in TIME_WAIT for 60 seconds, roughly 10,000 local ports are tied up at any given moment; burst a little higher and you run out. At that point your agent's outgoing connections start failing with EADDRNOTAVAIL.
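A quick way to see how close you are is to count sockets sitting in TIME_WAIT. This small diagnostic sketch (not part of our agent) reads the kernel's socket tables directly, roughly what ss -tan state time-wait | wc -l reports:

# Count sockets currently in TIME_WAIT by reading /proc/net/tcp and /proc/net/tcp6.
def count_time_wait():
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header row
                for line in f:
                    fields = line.split()
                    # Field 3 ("st") is the socket state in hex; 06 == TIME_WAIT.
                    if len(fields) > 3 and fields[3] == "06":
                        total += 1
        except FileNotFoundError:
            pass
    return total


if __name__ == "__main__":
    print("sockets in TIME_WAIT:", count_time_wait())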
The Solution: Tuning sysctl
We had to tune the Linux kernel networking stack on our monitoring nodes:
# Allow reusing sockets in TIME_WAIT state for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1

# Shorten how long orphaned connections are held after close (default 60s)
sysctl -w net.ipv4.tcp_fin_timeout=15

# Increase the range of available local ports
sysctl -w net.ipv4.ip_local_port_range="1024 65000"

The Problem:
At scale, the internet is flaky. Across 10,000 endpoint checks, some will fail purely because of random packet loss. If you send an alert for every failed check, you will send hundreds of false alerts a day, and your users will mute your pager.
The Solution: The “Confirmation” Architecture
We implemented a Multi-Region Consensus mechanism: when a probe in one region sees a failure, probes in other regions re-check the endpoint, and the alert only fires if a quorum of regions agrees it is down.
This reduced our false positive rate by 99.9%.
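Conceptually, the confirmation step looks something like the sketch below. The probe callables, region names, and quorum value are illustrative, not our production code:

# A minimal sketch of cross-region confirmation. "probes" maps a region name to
# a callable that re-runs the check from that region and returns True on success.
from concurrent.futures import ThreadPoolExecutor


def confirmed_down(url, probes, quorum=2):
    """Re-check url from other regions; treat it as down only if quorum probes agree."""
    with ThreadPoolExecutor(max_workers=len(probes)) as pool:
        results = list(pool.map(lambda probe: probe(url), probes.values()))
    failures = sum(1 for ok in results if not ok)
    return failures >= quorum


# Usage sketch: the primary probe in us-east saw a failure, so we ask two other
# regions before paging anyone (check_from_eu_west etc. are hypothetical helpers).
# probes = {"eu-west": check_from_eu_west, "ap-south": check_from_ap_south}
# if confirmed_down("https://example.com", probes, quorum=2):
#     page_on_call()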
The Problem:
If 5,000 checks are scheduled for “every minute,” and you’re not careful, your scheduler might fire all 5,000 at exactly 00:00:01. This causes a huge CPU spike, saturates the network card, and then the server sits idle for 59 seconds.
The Solution: Jitter and Spread
We implemented a randomized scheduling offset.
Instead of every check firing at 00:00:01, each monitor gets its own offset within the minute (00:00:01.350, 00:00:02.100, and so on). We spread the load evenly across the entire minute window. This constant, flat resource usage is much easier to manage than spiky loads.
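A simple way to get a stable, even spread is to hash each monitor's ID into an offset inside its 60-second window. The sketch below is illustrative; the hash choice and window handling are assumptions, not our exact scheduler:

# Deterministic jitter: hash each monitor ID to a stable offset inside its
# 60-second window so checks spread out instead of all firing at the top of the minute.
import hashlib
import time

INTERVAL = 60.0  # seconds between checks for a given monitor


def jitter_offset(monitor_id, interval=INTERVAL):
    digest = hashlib.sha256(monitor_id.encode()).digest()
    # Map the first 8 bytes of the hash onto [0, interval).
    return int.from_bytes(digest[:8], "big") / 2**64 * interval


def next_run(monitor_id, now=None):
    now = time.time() if now is None else now
    window_start = now - (now % INTERVAL)
    run_at = window_start + jitter_offset(monitor_id)
    # If this monitor's slot in the current window has already passed,
    # schedule it for the same slot in the next window.
    return run_at if run_at > now else run_at + INTERVAL

Because the offset is derived from the monitor ID rather than re-randomized on every tick, each check keeps a steady 60-second cadence; it just lands in its own sub-second slot instead of at :01.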
Scaling is about handling constraints. The database IOPS, the network ports, the CPU cycles—they are all finite resources. By understanding these limits and architecting around them, Cluster Uptime delivers enterprise scale on commodity hardware.