Monitoring 10,000 Endpoints: Lessons Learned Scaling Cluster Uptime

The architectural challenges of massive scale monitoring. Learn how we solved database bottlenecks, network limits, and alert fatigue.

Jesus Paz
3 min read

Moving from monitoring a startup’s MVP to an enterprise-grade infrastructure is a huge leap. It is easy to write a script that checks 10 websites. It is a completely different engineering challenge to check 10,000 endpoints every minute with high reliability.

When we scaled Cluster Uptime to this level, we hit every bottleneck imaginable. Here are the hard lessons we learned, so you don’t have to learn them the hard way.

1. The Database Write Bottleneck

The Problem: If you have 10,000 monitors checking every 60 seconds, that is 10,000 result rows per minute per probe region, roughly 170 writes per second before you multiply by regions, retries, and the latency metrics stored with each check. Most standard relational databases (Postgres/MySQL) on a single node will start to choke under that kind of relentless, sustained write volume, especially if you have indexes on the metrics tables.

The Solution: Time-Series Optimization & Buffering

  1. Don’t Write Everything: Do you really need a database row for every successful “200 OK”? No. We switched to an aggregation model: state-change events are written immediately, but “heartbeat” data is buffered in Redis and flushed to the DB in bulk batches every minute (see the sketch after this list).
  2. Partitioning: We partition our metrics tables by time (e.g., metrics_2025_12). This keeps the index size manageable and makes deleting old data (Retention Policy) instant—just drop the table.
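
To make the buffering model concrete, here is a minimal sketch in Python, assuming redis-py as the buffer and psycopg2 for the bulk flush. The Redis key, table layout, and column names are illustrative, not our production schema; state-change events would still be written to Postgres directly and skip this path.

# Buffer heartbeats in Redis, then bulk-insert once a minute into the
# current monthly partition (e.g. metrics_2025_12). Illustrative schema.
import json
import time

import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=uptime")   # hypothetical DSN

BUFFER_KEY = "heartbeat_buffer"          # hypothetical Redis list

def record_check(monitor_id: int, status: int, latency_ms: float) -> None:
    """Called by the checker: push the heartbeat into Redis, not Postgres."""
    r.rpush(BUFFER_KEY, json.dumps({
        "monitor_id": monitor_id,
        "status": status,
        "latency_ms": latency_ms,
        "ts": time.time(),
    }))

def flush_heartbeats() -> None:
    """Runs once a minute: drain the buffer and bulk-insert into the partition."""
    pipe = r.pipeline()                  # MULTI/EXEC, so read-and-clear is atomic
    pipe.lrange(BUFFER_KEY, 0, -1)
    pipe.delete(BUFFER_KEY)
    raw, _ = pipe.execute()
    if not raw:
        return
    rows = [json.loads(item) for item in raw]
    table = time.strftime("metrics_%Y_%m")   # generated name, never user input
    with pg, pg.cursor() as cur:
        cur.executemany(
            f"INSERT INTO {table} (monitor_id, status, latency_ms, ts) "
            "VALUES (%(monitor_id)s, %(status)s, %(latency_ms)s, to_timestamp(%(ts)s))",
            rows,
        )

Retention then becomes trivial: expiring last year’s metrics is a single DROP TABLE on the old partition instead of a long-running DELETE.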

2. Ephemeral Port Exhaustion

The Problem: When you make an outgoing TCP connection, the OS assigns a local (ephemeral) port, e.g. 54321. There are at most ~65,000 of them, and the default Linux ephemeral range (32768–60999) only gives you about 28,000. If you churn through 10,000+ connections a minute and each socket lingers in TIME_WAIT for 60 seconds, you will run out of ports, and your agent starts failing with EADDRNOTAVAIL.
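
You can see this coming before it takes the agent down. The sketch below counts sockets currently stuck in TIME_WAIT by reading /proc/net/tcp and /proc/net/tcp6 (state code 06). It is a Linux-only diagnostic illustration, roughly what ss -tan state time-wait would show, not part of our agent.

# Count TIME_WAIT sockets on a Linux host (state code 06 in /proc/net/tcp*).
TIME_WAIT = "06"

def count_time_wait() -> int:
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                total += sum(1 for line in f if line.split()[3] == TIME_WAIT)
        except FileNotFoundError:
            pass
    return total

if __name__ == "__main__":
    print(f"TIME_WAIT sockets: {count_time_wait()}")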

The Solution: Tuning sysctl

We had to tune the Linux kernel networking stack on our monitoring nodes:

# Allow reusing TIME_WAIT sockets for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# Shorten how long half-closed connections linger in FIN-WAIT-2 (default 60s)
sysctl -w net.ipv4.tcp_fin_timeout=15
# Widen the range of local ports available for outbound connections
sysctl -w net.ipv4.ip_local_port_range="1024 65000"

3. False Positives Kill Trust

The Problem: At scale, the internet is flaky. Spread 10,000 checks across the globe every minute and some of them will hit random packet loss or a transient network blip. If you send an alert for every failure, you will send hundreds of false alerts a day, and your users will mute your pager.

The Solution: The “Confirmation” Architecture

We implemented a Multi-Region Consensus mechanism:

  1. Agent A (NY) sees a failure. It does not alert.
  2. Agent A asks Agent B (London) and Agent C (Tokyo) to check the same URL immediately.
  3. Only if 2 out of 3 agents confirm the failure do we send the alert.

This reduced our false positive rate by 99.9%.
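
A minimal sketch of that 2-out-of-3 confirmation is below. It assumes each agent can be asked to re-check a URL on demand; here a plain local HTTP check stands in for that cross-region call, and the region names and quorum constant are illustrative.

# Only alert when a quorum of regions independently confirms the failure.
# check_from_region() is a local stand-in for the cross-region re-check RPC.
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["nyc", "london", "tokyo"]   # illustrative region names
QUORUM = 2                             # 2 of 3 must agree before we page anyone

def check_from_region(region: str, url: str) -> bool:
    """Return True if the URL looks DOWN from this region."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status >= 500
    except urllib.error.HTTPError as e:    # 4xx/5xx raised by urlopen
        return e.code >= 500
    except OSError:                        # DNS failure, refused, timeout, ...
        return True

def confirmed_down(url: str, first_failing_region: str) -> bool:
    """One agent saw a failure; ask the other regions before alerting."""
    others = [r for r in REGIONS if r != first_failing_region]
    with ThreadPoolExecutor(max_workers=len(others)) as pool:
        votes = 1 + sum(pool.map(lambda region: check_from_region(region, url), others))
    return votes >= QUORUM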

4. The “Thundering Herd”

The Problem: If 5,000 checks are scheduled for “every minute,” and you’re not careful, your scheduler might fire all 5,000 at exactly 00:00:01. This causes a huge CPU spike, saturates the network card, and then the server sits idle for 59 seconds.

The Solution: Jitter and Spread

We implemented a randomized scheduling offset:

  • Monitor A: Checks at 00:00:01
  • Monitor B: Checks at 00:00:01.350
  • Monitor C: Checks at 00:00:02.100

We spread the load evenly across the entire minute window. This constant, flat resource usage is much easier to manage than spiky loads.
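
One simple way to get that spread, sketched below, is to derive a deterministic offset inside the 60-second window from a hash of the monitor ID, so each monitor keeps the same slot from run to run. The scheduling loop around it is assumed, not shown; only the offset math is the point here.

# Spread per-minute checks across the 60-second window with a stable,
# per-monitor offset derived from a hash of the monitor ID.
import hashlib
import time

WINDOW_SECONDS = 60.0

def jitter_offset(monitor_id: str) -> float:
    """Map a monitor ID to a stable offset in [0, 60)."""
    digest = hashlib.sha256(monitor_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 * WINDOW_SECONDS

def next_run(monitor_id: str, now=None) -> float:
    """Unix timestamp of this monitor's next scheduled check."""
    now = time.time() if now is None else now
    window_start = now - (now % WINDOW_SECONDS)
    scheduled = window_start + jitter_offset(monitor_id)
    return scheduled if scheduled > now else scheduled + WINDOW_SECONDS

# Example: three monitors land at different points in the minute.
for mid in ("monitor-a", "monitor-b", "monitor-c"):
    print(mid, "runs", round(jitter_offset(mid), 3), "seconds into the minute")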

Summary

Scaling is about handling constraints. The database IOPS, the network ports, the CPU cycles—they are all finite resources. By understanding these limits and architecting around them, Cluster Uptime delivers enterprise scale on commodity hardware.
