Database Monitoring Without the Yak Shaving: Prometheus, PMM, and Alerts Built In

· 7 min read
Filess Team
Database Experts

Setting up database monitoring from scratch is one of those tasks that feels like it should take an afternoon but ends up consuming a week.

You need Prometheus. Then an exporter for your specific database. Then Grafana (or another dashboard). Then Alertmanager. Then you need to write the alert rules, configure notification channels, test that the alerts actually fire, and figure out why they fire at 3am for things that aren't real emergencies.

And this is before you've handled certificate rotation for the exporter, tuned Prometheus scrape intervals, or figured out why your dashboard queries return "No data".

With Filess Dedicated databases, all of this ships by default. Here's what you get and how it's built.

Percona Monitoring and Management (PMM), Already Running

Every Dedicated database includes a PMM instance preconfigured and connected to your database. You don't install anything. You don't run any agents. It's just there.

PMM is Percona's open-source database observability platform. It gives you:

  • Query Analytics (QAN): Which queries are slowest? Which ones are running most frequently? Which ones have the highest I/O impact? PMM captures this at the query level — not just at the connection level.
  • Performance metrics: CPU, memory, disk I/O, and database-specific metrics (buffer pool hit rate, replication lag, connection count, lock waits, etc.) updated every few seconds.
  • Database Advisors: Automated checks that scan your database configuration and data patterns and surface recommendations. Things like "this index is never used" or "your innodb_buffer_pool_size is undersized for your working set."
  • Explain plans: Run EXPLAIN on any slow query directly from the PMM UI without opening a terminal.

This is the same tooling that DBAs at large-scale deployments use. It's available to you on a $20/month database.


Alert Rules: Three Signal Types

For automated alerting, we expose a structured alert rule system on top of Prometheus. Three signal types are supported today:

1. CPU Usage

Rule: cpu_usage > 80% for 5 minutes

The threshold is expressed as a percentage of the database's allocated CPU. Internally, CPU allocation is expressed in "units" (1 unit = 0.5 vCPU), so a 4-unit database has 2 vCPUs, and a CPU alert at 80% fires when sustained usage exceeds 1.6 vCPUs.

The forMinutes parameter controls how long the condition must be true before the alert fires. Setting it to 1 means the alert fires within a minute of the threshold being crossed. Setting it to 15 means sustained high CPU for 15 minutes — useful for workloads with legitimate traffic spikes.
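That units-to-vCPU arithmetic is easy to get wrong in your head at 3am. A minimal sketch of the conversion (the helper name is mine, not part of any Filess SDK):

```python
def cpu_alert_threshold_vcpus(cpu_units: int, threshold_percent: float) -> float:
    """Absolute vCPU usage at which a CPU alert fires.

    1 unit = 0.5 vCPU, so a 4-unit database has 2 vCPUs, and an
    80% threshold corresponds to 1.6 vCPUs of sustained usage.
    """
    allocated_vcpus = cpu_units * 0.5
    return allocated_vcpus * threshold_percent / 100

print(cpu_alert_threshold_vcpus(4, 80))  # → 1.6
```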

2. Memory Usage

Rule: memory_usage > 90% for 3 minutes

Memory allocation is also expressed in units (1 unit = 500 MiB). The alert fires when RSS memory usage exceeds the threshold percentage of total allocated memory.
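The same unit arithmetic applies to memory, just with a different unit size. A sketch (again, an illustrative helper, not an official API):

```python
def memory_alert_threshold_mib(memory_units: int, threshold_percent: float) -> float:
    """Absolute RSS (in MiB) at which a memory alert fires.

    1 unit = 500 MiB, so a 4-unit database has 2000 MiB allocated,
    and a 90% threshold fires once RSS exceeds 1800 MiB.
    """
    allocated_mib = memory_units * 500
    return allocated_mib * threshold_percent / 100

print(memory_alert_threshold_mib(4, 90))  # → 1800.0
```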

Memory alerts at 90%+ on a database are a serious signal. It typically means either:

  • Your working set has outgrown your plan (time to scale up).
  • A query is performing a full table scan without a covering index and pulling enormous amounts of data into the buffer pool.
  • A connection pool is leaking and connections are accumulating unreleased memory.

PMM's query analytics will usually tell you which one.

3. Downtime

Rule: database is unreachable

The downtime rule doesn't take a threshold — it's a binary check. If the database stops responding to health checks for the configured forMinutes window, the alert fires.

This is the most critical alert type. It fires on actual outages, not capacity pressure.
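Extrapolating from the API payload shown in the next section, a downtime rule body would carry only a name and the forMinutes window; the exact shape here is my assumption, not documented behavior:

```python
def downtime_rule_payload(name: str, for_minutes: int) -> dict:
    """Build an alert-rule payload for the binary downtime check.

    Downtime is a binary signal, so (by assumption) the payload
    omits thresholdPercent and carries only type, name, and window.
    """
    return {"type": "downtime", "name": name, "forMinutes": for_minutes}

payload = downtime_rule_payload("DB unreachable", 2)
```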


Creating Alert Rules via API

Alert rules are fully automatable:

curl -X POST https://api.filess.io/v1/organizations/my-org/namespaces/prod/databases/42/alert-rules \
  -H "Authorization: Bearer $FILESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "cpu_usage",
    "thresholdPercent": 80,
    "forMinutes": 5,
    "name": "High CPU"
  }'

Response:

{
  "id": 17,
  "name": "High CPU",
  "type": "cpu_usage",
  "thresholdPercent": 80,
  "forMinutes": 5,
  "isEnabled": true,
  "status": "ok",
  "createdAt": "2026-04-04T10:00:00Z"
}

You can also manage alert rules via the dashboard. Rules can be enabled/disabled without deleting them — useful when you're doing planned maintenance and want to silence alerts temporarily.
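Silencing for maintenance then amounts to flipping isEnabled and pushing the rule back, rather than deleting and recreating it. A small sketch of that pattern (the update endpoint itself isn't shown in this post, so the helper only prepares the body):

```python
def set_rule_enabled(rule: dict, enabled: bool) -> dict:
    """Return an update body that toggles a rule without deleting it.

    Keeping the rule object intact preserves its id, threshold, and
    event history; only the isEnabled flag changes.
    """
    updated = dict(rule)  # copy so the original rule is untouched
    updated["isEnabled"] = enabled
    return updated

rule = {"id": 17, "name": "High CPU", "isEnabled": True}
silenced = set_rule_enabled(rule, False)
```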


How Alerts Are Evaluated

The evaluation pipeline uses Prometheus as the underlying data source. Here's the simplified flow:

Database → Exporter → Prometheus scrape → PromQL evaluation → Alertmanager → Notification

Our backend has a direct integration with Prometheus. When you create or update an alert rule, we generate the corresponding PromQL expression and register it as an alerting rule with Prometheus; Alertmanager handles routing once it fires. The rule includes the for duration, so Prometheus won't fire the alert until the condition has been continuously true for that long.
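As a rough illustration of that rule generation (the metric names and the database_id label are my guesses; the real series names aren't published):

```python
def to_prometheus_rule(db_id: int, rule: dict) -> dict:
    """Render an alert rule as a Prometheus alerting-rule stanza.

    The "for" clause makes Prometheus wait until the condition has
    been continuously true for the whole window before firing.
    """
    # Hypothetical metric names; the real ones are internal.
    metric = {"cpu_usage": "db_cpu_usage_percent",
              "memory_usage": "db_memory_usage_percent"}[rule["type"]]
    return {
        "alert": rule["name"],
        "expr": f'{metric}{{database_id="{db_id}"}} > {rule["thresholdPercent"]}',
        "for": f'{rule["forMinutes"]}m',
    }

print(to_prometheus_rule(42, {"type": "cpu_usage", "name": "High CPU",
                              "thresholdPercent": 80, "forMinutes": 5}))
```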

When an alert fires, Alertmanager routes the notification to our backend, which then:

  1. Records an AlertEvent in the database with status firing.
  2. Determines which users in the namespace should be notified.
  3. Sends notifications via the configured channels.

When the condition resolves, another event is recorded with status resolved, and a resolution notification is sent.
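The firing/resolved bookkeeping can be pictured as a small handler over Alertmanager's webhook payload. The payload shape below follows Alertmanager's documented webhook format (a top-level "alerts" list, each alert carrying its own status); the AlertEvent field names are mine:

```python
from datetime import datetime, timezone

def record_alert_events(webhook_payload: dict) -> list[dict]:
    """Map an Alertmanager webhook payload to AlertEvent records.

    Each alert in the payload is recorded with its status, which is
    either "firing" or "resolved".
    """
    events = []
    for alert in webhook_payload["alerts"]:
        events.append({
            "ruleName": alert["labels"].get("alertname"),
            "status": alert["status"],
            "recordedAt": datetime.now(timezone.utc).isoformat(),
        })
    return events

payload = {"alerts": [{"status": "firing",
                       "labels": {"alertname": "High CPU"}}]}
events = record_alert_events(payload)
```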


Notification Channels

Currently supported notification channels:

  • Email: Delivered to all users in the namespace with notification access.
  • In-app notifications: Available in the dashboard notification center.

Webhook notifications and Slack integration are on the roadmap.


Alert Event History

Every alert firing and resolution is stored as an event. This gives you a timeline of incidents:

2026-03-12 03:14 UTC — cpu_usage fired: CPU at 94% for 5 minutes
2026-03-12 03:47 UTC — cpu_usage resolved: CPU back to 23%
2026-03-15 11:02 UTC — memory_usage fired: Memory at 91% for 3 minutes
2026-03-15 11:09 UTC — memory_usage resolved

This timeline is useful for:

  • Post-mortems: correlating an alert firing with a deployment, traffic spike, or schema change.
  • Capacity planning: how often are you hitting CPU or memory limits? Is it getting more frequent?
  • On-call handoffs: the incoming responder can see what's been happening without reading through a chat thread.
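For the capacity-planning angle, counting firings per signal type over the stored history is enough to spot a trend. A sketch over hypothetical event records:

```python
from collections import Counter

def firing_counts(events: list[dict]) -> Counter:
    """Count how often each alert type fired, for trend spotting."""
    return Counter(e["type"] for e in events if e["status"] == "firing")

history = [
    {"type": "cpu_usage", "status": "firing"},
    {"type": "cpu_usage", "status": "resolved"},
    {"type": "memory_usage", "status": "firing"},
    {"type": "memory_usage", "status": "resolved"},
]
print(firing_counts(history))  # each type fired once in this sample
```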

The Prometheus Architecture Behind It

Our monitoring stack is region-scoped. Each region runs a shared Prometheus instance that scrapes all databases in that region. The per-database metrics are namespaced by database ID to prevent cross-tenant data leakage.

The connection config is encrypted at rest:

HostingRegion {
  prometheusEndpoint: "https://prometheus.eu.filess.io"
  prometheusUsername: "filess-backend"
  prometheusPasswordEncrypted: "AES256GCM-encrypted-value"
}

Credentials are decrypted in memory at query time. They're never stored in plaintext in the database.

When you query database metrics from the dashboard, our backend:

  1. Looks up the region's Prometheus endpoint.
  2. Decrypts the credentials.
  3. Executes a scoped PromQL query against Prometheus.
  4. Returns the data to the frontend.

You never hit Prometheus directly. You don't manage credentials. You don't write PromQL.
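The tenant scoping in step 3 boils down to forcing a database-ID label into every query before it reaches Prometheus. A sketch (the label and metric names are illustrative, not the actual internal schema):

```python
def scoped_query(db_id: int, metric: str, window: str = "5m") -> str:
    """Build a PromQL query pinned to one database's ID label.

    Injecting the label server-side means a tenant can never widen
    the selector to another database's series.
    """
    return f'rate({metric}{{database_id="{db_id}"}}[{window}])'

print(scoped_query(42, "db_disk_read_bytes_total"))
```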


What You're Not Doing

Let's be explicit about the yak shaving you're avoiding:

Without Filess                                 With Filess
Install Prometheus on a server                 Not needed
Install mysqld_exporter / postgres_exporter    Not needed
Configure scrape intervals and labels          Not needed
Set up Alertmanager                            Not needed
Write PromQL alert expressions                 Click "Add Rule" in dashboard
Configure notification channels                Add email in dashboard
Set up Grafana dashboards                      PMM dashboards included
Renew exporter TLS certificates                Not needed
Monitor the monitoring infrastructure          Not needed

The goal is for you to spend your time understanding what your database is doing, not configuring the tools that watch it.


Getting Started

Alert rules and PMM access are included on all Dedicated plans.

To add your first alert rule:

  1. Navigate to your database in the Filess dashboard.
  2. Go to Monitoring → Alert Rules.
  3. Click Add Rule, choose the type, set the threshold and duration.
  4. Save. That's it.

The next time your database hits 80% CPU for 5 minutes, you'll know about it before your users do.

Explore Filess Dedicated Databases →