2026-06-15 • 6 min read

Right-Sizing a Production MongoDB Atlas Cluster: M50 → M40

You can't just click "downgrade" on a production database. Here's the log forensics, index surgery, and aggregation rewrites that shrank our working set enough to take MongoDB Atlas from M50 to M40 — ~$7,600/year saved, slow-query time down ~43%, latency improved.

mongodb
performance
cost-optimization
sre

The cluster was a tier too big — but you can't just downgrade

Our production MongoDB Atlas cluster was running on an M50. The metrics said it was oversized: CPU mostly idle, plenty of headroom. The obvious move was to drop it to an M40 and pocket the difference — about $634/month (~$7,600/year).

The problem is that "downgrade" isn't a billing toggle. An M40 caps out at 16 GB of RAM, and MongoDB lives and dies by whether its working set — the indexes and hot documents it touches constantly — fits in memory. The moment your working set spills past RAM, you fall off a cliff: cache eviction, disk reads on the hot path, and tail latency that goes from milliseconds to seconds.

So the real task wasn't "click downgrade." It was: shrink the working set and CPU until an M40 fits — then downgrade. That turned a cost ticket into a performance project. Here's how it went.

Step 1 — Forensics: rank query shapes by cumulative cost

You can't optimize what you can't see, and the Atlas Performance Advisor only shows you the recent surface. I wanted the whole picture, so I pulled the raw slow-query logs from all three shards — roughly 38 million log lines across three weeks — and wrote a parser to group them into 2,819 distinct query shapes, ranked by cumulative duration, not single-query latency.

That ranking is the whole game. One slow 2-second query that runs twice a day doesn't matter. A 50ms query that runs 460,000 times does. Sorting by total time spent surfaced the real offenders immediately:

The single worst shape: 459,801 executions, 31.5 hours of cumulative cluster time, 240 GB of disk reads.
A customerservices lookup that was 17.9% of all slow-query time — a collection scan that had cumulatively examined ~107 billion documents to return zero rows. A missing index, hiding in plain sight.
A dashboard aggregation over entities averaging 23.5 seconds (max 45.8s), another 13.5% of slow time.

Those three findings alone accounted for a third of the cluster's pain. None of them showed up as "high CPU" — they showed up as disk reads, which is exactly the pressure an M40 can't absorb.

Step 2 — Fix the queries before touching the hardware

A couple of anti-patterns explained most of the damage.

The $lookup that can't use an index. Several pipelines joined collections with a let / $expr sub-pipeline. It reads fine, but $expr equality can't use the foreign _id index — so Mongo collection-scans the entire foreign collection, once per input row. One governmentapplications list endpoint was examining ~2.98 million documents to return 10 per page.

Use a join that can actually use the index

// ❌ A let/$expr lookup can't use the foreign _id index —
//    it COLLSCANs the whole foreign collection, once per row.
{ $lookup: {
    from: "entities",
    let: { eid: "$entityId" },
    pipeline: [{ $match: { $expr: { $eq: ["$_id", "$$eid"] } } }],
    as: "entity",
} }
 
// ✅ localField/foreignField joins DO use the _id index.
{ $lookup: {
    from: "entities",
    localField: "entityId",
    foreignField: "_id",
    as: "entity",
} }

Enrich last, not first. The other recurring sin was paginating after the expensive joins. getAllCustomers ran a 6-month safety-net $match and $lookup-enriched ~100,000 entities per request — then kept 10. Reordering the pipeline to filter, sort, and page down to 10 rows first, and only then enrich, is the single highest-leverage change you can make to a list endpoint.

Paginate first, enrich second

// ❌ enrich 100k rows, then throw away all but 10
{ $match: filters },
{ $lookup: { /* heavy join across the whole set */ } },
{ $sort: { createdAt: -1 } }, { $skip: 0 }, { $limit: 10 }
 
// ✅ page down to 10 rows, THEN join
{ $match: filters },
{ $sort: { createdAt: -1 } }, { $skip: 0 }, { $limit: 10 },
{ $lookup: { /* same join, now over 10 rows */ } }

Same query, reordered: filter and page before the expensive join.

The heaviest dashboard aggregation went from ~23.5 seconds to under 100ms after being inverted this way, and the hottest of these I backed with a short Redis TTL so repeated dashboard loads never hit Mongo at all.

Step 3 — Index surgery (in both directions)

Indexes are the part people get backwards. You need the right ones, and you need to ruthlessly delete the rest — because every index is part of the working set you're trying to shrink, and every index is write amplification on every insert.

So I went both ways:

Added targeted compound indexes for the offenders from Step 1 (customerservices, documents on caseId/serviceId/customerServiceId, a sparse index on entities.taxDetails, servicestageinstances), and — critically — verified they were actually being used with $indexStats rather than assuming. An index nobody queries is pure cost.
Removed roughly 244 redundant single-field indexes across 16 collections — mostly prefixes already covered by a compound index, or fields nothing queried by.

Then I fixed the source of the sprawl. Mongoose builds an index for every index: true field on connect, which silently recreates indexes you just dropped — and every developer's laptop does the same against shared environments. The fix is one line:

config/db.js — stop the ORM from resurrecting indexes

// Manage indexes explicitly; don't let Mongoose auto-create them
// on every connect (including from dev machines).
mongoose.connect(uri, { autoIndex: false });

This round alone recovered ~43% of cumulative slow-query time and cut ~2 TB/month of unnecessary disk reads.

Step 4 — Prove the M40 fits before you click downgrade

This is the step that separates "right-sizing" from "gambling." I didn't want to downgrade and watch dashboards; I wanted evidence first. So I built a fitness analysis straight from the shard logs that quantified the four things an M40 actually constrains:

CPU peak (not average — the average was always fine; the question is the peaks).
Working-set disk-read % — how often reads were already missing cache, the leading indicator of the cliff.
Connection peaks against the lower M40 connection limit.
Lock yields, as a contention sanity check.

With the query and index work landed, the model showed the changes had freed ~17–20% of cluster CPU and ~3–5 GB of cache pressure — enough that the working set now sat comfortably under the M40's 16 GB ceiling with headroom at peak. Then I downgraded.

Results

The downgrade landed with latency improving, not regressing.

M50 → M40, a full tier down: ~~$634/month (~~$7,600/year) saved.
~43% less cumulative slow-query time and ~2 TB/month fewer disk reads.
The worst dashboard aggregation: ~23.5s → under 100ms.
Latency improved through the downgrade — the smaller box runs a leaner workload.

Takeaways

Right-sizing a database is an optimization problem, not a billing toggle. The savings are real, but you earn them by shrinking the workload first.
The working set is the currency. Indexes and hot aggregations spend it. Watch disk reads, not CPU averages — disk reads are what a smaller tier can't absorb.
Rank by cumulative cost. The worst offender is rarely the slowest single query; it's a cheap-looking query run a million times — often a missing index doing a scan that returns nothing.
Delete indexes, and stop your ORM from re-creating them. Index sprawl is invisible until it's your RAM bill.
Validate fitness with data before you downsize. Model the peaks, then pull the trigger.

← Back to blog