The Power of P95 and P99

By Nick Schuch, 15th July 2024

We're proud to unveil our new dashboards, focusing on key application performance indicators!

So what led us here? Let's look back at our early assumptions, what we found out along the way, and the changes we've made to address those findings.

Our early assumptions

When we first developed our application dashboards, we strongly focused on cache HIT ratios. The idea was that if we had high cache HIT ratios then the application was scalable and highly performant.

We were so invested in cache HIT ratios that we put all the CDN (CloudFront) metrics at the top of the dashboard.

Diagram of the original dashboard listing services top to bottom of a request

High cache HIT ratios are a characteristic of highly performant applications but not the only characteristic.

Our findings

Our dashboards focused so much on caching that a high cache HIT ratio was perceived as the end of the quest towards performance.

In reality, our dashboards were hiding crucial indicators that needed to be addressed.

  • Response times
  • HTTP response codes (e.g. 2xx/4xx/5xx)

Diagram showing two key metrics in the middle of the dashboard

While caching is important, development teams should be focusing on:

  • How slow is my application?
  • Are there any errors?
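As a rough illustration of the second question, here is a minimal sketch that groups HTTP status codes into classes and derives a server error rate. The list of codes is made up for the example, and `status_classes` is a hypothetical helper, not part of our dashboards.

```python
from collections import Counter


def status_classes(codes):
    """Count responses per status class, e.g. 404 -> '4xx'."""
    return Counter(f"{code // 100}xx" for code in codes)


# Hypothetical status codes pulled from a request log.
codes = [200, 200, 301, 404, 200, 500, 502, 200]

counts = status_classes(codes)

# Share of all requests that ended in a server error (5xx).
error_rate = counts["5xx"] / len(codes)

print(dict(counts))
print(f"5xx error rate: {error_rate:.0%}")
```

Tracking the 5xx class separately from 4xx is a deliberate choice here: server errors usually point at the application, while 4xx responses are often caused by the client.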

It wasn’t until we started publishing P95 and P99 percentiles for our response times that we discovered the true value of these metrics.

What are P95 and P99 percentiles?

P95 and P99 percentiles are specific points in a dataset that help us understand its distribution:

  • P95 - This is the value below which 95% of the data points fall. It shows where most values lie, excluding the top 5% of the highest values.
  • P99 - This is the value below which 99% of the data points fall. It gives an even higher threshold, showing where almost all values lie, except for the top 1% highest values.

These percentiles are useful because they indicate how extreme or typical values are in a dataset, helping to understand its overall spread and potential outliers. The key takeaway here is potential outliers.
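To make the definitions concrete, here is a minimal sketch using the nearest-rank method. It is only one of several common ways to calculate a percentile, and not necessarily the one our metrics backend uses. The response times are invented for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds:
# 90 fast requests, 6 slightly slow ones and 4 very slow outliers.
response_times_ms = [100] * 90 + [400] * 6 + [2000, 3000, 4000, 5000]

mean = sum(response_times_ms) / len(response_times_ms)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)

print(f"mean={mean}ms p95={p95}ms p99={p99}ms")
# mean=254.0ms p95=400ms p99=4000ms
```

The mean sits close to the fast requests, while P99 lands squarely on the outliers.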

Below is a demonstration of our graphs as we turn on P95 and P99 percentiles.

Diagram demonstrating P95 and P99 percentiles vs average

We discovered that our average was hiding a lot of outliers. With P95 and P99 percentiles enabled, we saw that what looked like a little bump in the average was really a smoothed-out view of a much larger spike, one we should have been debugging.
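The same effect can be reproduced with a few lines of Python. This is a sketch with made-up per-minute samples, using `statistics.quantiles` from the standard library (Python 3.8+). A real metrics backend may calculate percentiles differently.

```python
from statistics import mean, quantiles

# Hypothetical response times (ms) grouped by minute.
# In the middle minute, 4 out of 100 requests are very slow.
windows = {
    "12:00": [120] * 100,
    "12:01": [120] * 96 + [4000] * 4,
    "12:02": [120] * 100,
}


def summarise(samples):
    """Return the average, P95 and P99 for one window of samples."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"avg": mean(samples), "p95": cuts[94], "p99": cuts[98]}


summary = {minute: summarise(samples) for minute, samples in windows.items()}

for minute, stats in summary.items():
    print(minute, stats)
```

In the 12:01 window the average only rises from 120ms to roughly 275ms, which is easy to dismiss on a graph, while P99 jumps to 4000ms. P95 does not move at all here because fewer than 5% of requests were slow, which is why we chart both percentiles.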

We needed a dashboard refresh.

What changes did we make to our dashboards?

The first update we made was the feng shui of the dashboard. Previously, the dashboard emphasised “the flow of a request”, displaying metrics from the edge at the top and then subsequent metrics as the request passed through the system, e.g. CloudFront to load balancer to application containers.

Diagram showing our dashboard layout before and after

We decided to shuffle these metrics, placing the response times and HTTP response codes at the top and calling them Key Application Performance Indicators. This is us planting our flag and declaring that these are the most important metrics development teams should be reviewing.

We also took this opportunity to add slow database queries to our dashboards. These slow queries show developers a list of MySQL queries that took longer than a specified threshold. These query events can then be correlated back to our response times to determine if a slow database call caused a specific spike.
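As a sketch of that correlation step, the snippet below picks out slow query events logged close to a response time spike. The event structure, the example queries and the 30 second window are all assumptions for illustration. `slow_queries_near` is a hypothetical helper, not something the dashboard exposes.

```python
from datetime import datetime, timedelta


def slow_queries_near(spike_at, slow_queries, window=timedelta(seconds=30)):
    """Return slow query events logged within `window` of a response time spike."""
    return [q for q in slow_queries if abs(q["at"] - spike_at) <= window]


# Hypothetical slow query events (timestamp, duration, statement).
slow_queries = [
    {"at": datetime(2024, 7, 15, 12, 0, 50), "seconds": 4.2,
     "sql": "SELECT * FROM orders WHERE status = 'open'"},
    {"at": datetime(2024, 7, 15, 12, 7, 10), "seconds": 2.1,
     "sql": "SELECT * FROM users ORDER BY created"},
]

# Hypothetical spike spotted on the P99 response time graph.
spike_at = datetime(2024, 7, 15, 12, 1, 0)

suspects = slow_queries_near(spike_at, slow_queries)
for query in suspects:
    print(f'{query["at"]:%H:%M:%S} {query["seconds"]}s {query["sql"]}')
```

Only the first query falls inside the window, so it is the one worth investigating for this spike.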

Finally, we also split out the cron logs from the application logs. This way, development teams can easily understand if a log is coming from a request or a long-running background task.

Summary

This hierarchy is the new default for our dashboards, which we plan to build upon, exposing further critical metrics.

Future improvements will include:

  • Anomaly detection for events that affect key performance indicators
  • OpenTelemetry traces
  • Application Performance Monitoring data

We’d love to hear your feedback, including how these new dashboards have helped you improve the performance of your applications!
