Essential Metrics for High Availability Clusters: Beyond 'Up' or 'Down'

If you are only checking HTTP 200, you are missing the picture. A guide to the Golden Signals of monitoring for HA systems.

Jesus Paz

Is your service “Up”? If the API returns a 200 OK, but it takes 15 seconds to load, is it really up? If the API returns 200 OK, but 5% of users are getting 500 Errors, is it up?

For High Availability (HA) systems, binary monitoring (Up/Down) is insufficient. You need to measure the quality of the uptime. Google’s SRE book defines the “Four Golden Signals,” and every HA cluster should measure them.

1. Latency (The Speed)

Latency shapes the user experience. A slow app is an abandoned app. Do not measure average latency, because the average lies. If 99 requests take 1 ms and 1 request takes 10 seconds, the average is about 101 ms. That looks fine, but that one user is furious.

Measure Percentiles (P95, P99).

  • P50 (Median): The experience of the “typical” user.
  • P99: The experience of the “unlucky” user (often the one with the biggest data/shopping cart).
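To make the gap between the average and the percentiles concrete, here is a minimal sketch. The sample is hypothetical (1,000 requests, 2% of them slow) and the `percentile` helper is our own nearest-rank implementation, not anything built into a monitoring tool.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct% of all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 980 fast requests (1 ms) and 20 slow ones (10 s).
latencies_ms = [1.0] * 980 + [10_000.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.2f} ms  P50={p50} ms  P95={p95} ms  P99={p99} ms")
```

In this sample the mean is about 201 ms, P50 and P95 are both 1 ms, and P99 is 10,000 ms. Only the P99 shows what the slow 2% actually experienced. If just 1 request in 100 is slow, a nearest-rank P99 still reports the fast value, so it is worth tracking the maximum or a higher percentile for rarer outliers.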

Cluster Uptime automatically captures P95 and P99 latency for every check.

2. Traffic (The Demand)

How much demand is being placed on your system?

  • Web: Requests per second (RPS).
  • DB: Queries or transactions per second, or I/O operations per second (IOPS) at the storage layer.

Why it matters: A sudden drop in traffic is often a silent failure. If your RPS drops from 1000 to 0, but no errors are thrown, maybe your DNS is broken, or your load balancer is misconfigured. “Zero Traffic” should be a critical alert.
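A "zero traffic" rule can be sketched in a few lines. This is an illustration only: the function name, the baseline input, and the 50% drop ratio are assumptions you would tune for your own traffic pattern.

```python
def classify_traffic(current_rps, baseline_rps, drop_ratio=0.5):
    """Compare current requests per second against an expected baseline.

    Returns "critical" when traffic has vanished, "warning" when it has
    dropped by more than drop_ratio, and "ok" otherwise.
    The thresholds are hypothetical and should be tuned per service.
    """
    if baseline_rps <= 0:
        return "ok"  # no established baseline, nothing to compare against
    if current_rps == 0:
        return "critical"  # silent failure: no errors, but no users either
    if current_rps < baseline_rps * (1 - drop_ratio):
        return "warning"
    return "ok"


print(classify_traffic(0, 1000))    # traffic vanished
print(classify_traffic(400, 1000))  # dropped by more than half
print(classify_traffic(900, 1000))  # normal fluctuation
```

Comparing against a baseline rather than a fixed number keeps the rule from firing during hours that are naturally quiet.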

3. Errors (The Failures)

What fraction of requests are failing?

  • Explicit: HTTP 500s.
  • Implicit: HTTP 200s with empty bodies or “null” content.
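Both kinds of failure can be caught by one check. The sketch below assumes that an empty or "null" body is never a valid answer for the endpoint you are probing; the function name is hypothetical.

```python
def is_failed_response(status_code, body):
    """Treat explicit 5xx responses and "hollow" 200 responses as failures.

    Assumption: for this endpoint, a 200 with an empty or "null" body
    is never a legitimate answer.
    """
    if status_code >= 500:
        return True  # explicit failure
    if status_code == 200:
        stripped = body.strip()
        return stripped == "" or stripped == "null"  # implicit failure
    return False


print(is_failed_response(500, "Internal Server Error"))  # True
print(is_failed_response(200, "null"))                   # True
print(is_failed_response(200, '{"items": [1, 2]}'))      # False
```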

Alerting on Error Rate: Don’t alert on a single error. Set a threshold such as “Error Rate > 1% over 5 minutes.” Tie that threshold to your Error Budget, the amount of failure your SLO allows.
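Here is one way such a rule could look in code: a sliding window that keeps the last 5 minutes of results and alerts only when the failure fraction exceeds 1%. The class and its defaults are illustrative, and it assumes timestamps (in seconds) arrive in non-decreasing order.

```python
from collections import deque


class ErrorRateWindow:
    """Sliding-window error rate. Assumes non-decreasing timestamps."""

    def __init__(self, window_seconds=300, threshold=0.01):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def _evict(self, now):
        # Drop results that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, timestamp, failed):
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def error_rate(self, now):
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self, now):
        return self.error_rate(now) > self.threshold


window = ErrorRateWindow()
for second in range(100):
    window.record(second, failed=(second in (10, 20)))  # 2 failures in 100
print(window.error_rate(now=99), window.should_alert(now=99))
```

With 2 failures in 100 requests the rate is 2%, so the alert fires. A single failure in the same window (exactly 1%) would not.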

4. Saturation (The Capacity)

How “full” is your service?

  • CPU/RAM: obvious.
  • Disk usage: critical.
  • Thread Pools / Connection Pools: subtle killer.

If you have 100 database connections available and you are using 99, you are saturated. You aren’t failing yet, but the next tiny spike will kill you. Saturation is a leading indicator of downtime.
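The connection pool example translates directly into a ratio check. The 80% and 95% cut-offs below are assumptions for illustration, not recommended values.

```python
def pool_saturation(in_use, capacity):
    """Fraction of the pool currently in use, from 0.0 to 1.0."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return in_use / capacity


def saturation_level(in_use, capacity, warn_at=0.80, critical_at=0.95):
    """Classify saturation. The cut-offs are hypothetical defaults."""
    ratio = pool_saturation(in_use, capacity)
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


# The example from the text: 99 of 100 database connections in use.
print(pool_saturation(99, 100), saturation_level(99, 100))
```

Nothing has failed at 99 of 100, yet the check already reports "critical", which is the point of a leading indicator.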

Implementing This with Cluster Uptime

While Cluster Uptime focuses on Blackbox Monitoring (simulating a user from outside), you can measure the four signals via Synthetic Probes.

  1. Latency: Built-in.
  2. Errors: Built-in (Validation of status codes).
  3. Traffic/Saturation: Use our Push Monitor type (Heartbeat). Have your internal Prometheus Alertmanager “ping” a Cluster Uptime Push URL while saturation is within safe limits. If the pings stop, either because saturation crossed your threshold or because the system is too overloaded to send them, we alert you.
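The heartbeat idea from item 3 can be sketched as a small script that you run on a schedule. Everything specific here is an assumption: the push URL is a placeholder, we assume the endpoint accepts a plain GET, and the 80% limit is arbitrary. Check your monitor's settings for the real URL and method.

```python
import urllib.request

# Placeholder only: replace with the push URL shown for your monitor.
PUSH_URL = "https://example.invalid/push/your-token"


def should_send_heartbeat(saturation, max_saturation=0.80):
    """Send the heartbeat only while saturation is below the limit."""
    return saturation < max_saturation


def send_heartbeat(url, timeout=5.0):
    """Ping the push URL. Assumes a plain GET is accepted."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300


def heartbeat_once(saturation, send=send_heartbeat, url=PUSH_URL):
    """Withhold the ping when saturated so the missing heartbeat raises the alert."""
    if not should_send_heartbeat(saturation):
        return False
    return send(url)
```

Run `heartbeat_once(current_saturation)` from cron or a similar scheduler, at an interval shorter than the monitor's expected heartbeat period. The `send` parameter is injectable so the logic can be tested without touching the network.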

Conclusion

High Availability is a spectrum, not a boolean. By watching these four signals, you move from “reacting to crashes” to “managing performance.”

Jesus Paz

Founder
