Loading...
Preparing article
Fetching the latest blog content.
Loading...
Fetching the latest blog content.
2026-06-15 • 6 min read
You can't just click "downgrade" on a production database. Here's the log forensics, index surgery, and aggregation rewrites that shrank our working set enough to take MongoDB Atlas from M50 to M40 — ~$7,600/year saved, slow-query time down ~43%, latency improved.
Our production MongoDB Atlas cluster was running on an M50. The metrics said it was oversized: CPU mostly idle, plenty of headroom. The obvious move was to drop it to an M40 and pocket the difference — about $634/month (~$7,600/year).
The problem is that "downgrade" isn't a billing toggle. An M40 caps out at 16 GB of RAM, and MongoDB lives and dies by whether its working set — the indexes and hot documents it touches constantly — fits in memory. The moment your working set spills past RAM, you fall off a cliff: cache eviction, disk reads on the hot path, and tail latency that goes from milliseconds to seconds.
So the real task wasn't "click downgrade." It was: shrink the working set and CPU until an M40 fits — then downgrade. That turned a cost ticket into a performance project. Here's how it went.
You can't optimize what you can't see, and the Atlas Performance Advisor only shows you the recent surface. I wanted the whole picture, so I pulled the raw slow-query logs from all three shards — roughly 38 million log lines across three weeks — and wrote a parser to group them into 2,819 distinct query shapes, ranked by cumulative duration, not single-query latency.
That ranking is the whole game. One slow 2-second query that runs twice a day doesn't matter. A 50ms query that runs 460,000 times does. Sorting by total time spent surfaced the real offenders immediately:
customerservices lookup that was 17.9% of all slow-query time — a collection scan that had cumulatively examined ~107 billion documents to return zero rows. A missing index, hiding in plain sight.entities averaging 23.5 seconds (max 45.8s), another 13.5% of slow time.Those three findings alone accounted for a third of the cluster's pain. None of them showed up as "high CPU" — they showed up as disk reads, which is exactly the pressure an M40 can't absorb.
A couple of anti-patterns explained most of the damage.
The $lookup that can't use an index. Several pipelines joined collections with a let / $expr sub-pipeline. It reads fine, but $expr equality can't use the foreign _id index — so Mongo collection-scans the entire foreign collection, once per input row. One governmentapplications list endpoint was examining ~2.98 million documents to return 10 per page.
// ❌ A let/$expr lookup can't use the foreign _id index —
// it COLLSCANs the whole foreign collection, once per row.
{ $lookup: {
from: "entities",
let: { eid: "$entityId" },
pipeline: [{ $match: { $expr: { $eq: ["$_id", "$$eid"] } } }],
as: "entity",
} }
// ✅ localField/foreignField joins DO use the _id index.
{ $lookup: {
from: "entities",
localField: "entityId",
foreignField: "_id",
as: "entity",
} }Enrich last, not first. The other recurring sin was paginating after the expensive joins. getAllCustomers ran a 6-month safety-net $match and $lookup-enriched ~100,000 entities per request — then kept 10. Reordering the pipeline to filter, sort, and page down to 10 rows first, and only then enrich, is the single highest-leverage change you can make to a list endpoint.
// ❌ enrich 100k rows, then throw away all but 10
{ $match: filters },
{ $lookup: { /* heavy join across the whole set */ } },
{ $sort: { createdAt: -1 } }, { $skip: 0 }, { $limit: 10 }
// ✅ page down to 10 rows, THEN join
{ $match: filters },
{ $sort: { createdAt: -1 } }, { $skip: 0 }, { $limit: 10 },
{ $lookup: { /* same join, now over 10 rows */ } }The heaviest dashboard aggregation went from ~23.5 seconds to under 100ms after being inverted this way, and the hottest of these I backed with a short Redis TTL so repeated dashboard loads never hit Mongo at all.
Indexes are the part people get backwards. You need the right ones, and you need to ruthlessly delete the rest — because every index is part of the working set you're trying to shrink, and every index is write amplification on every insert.
So I went both ways:
customerservices, documents on caseId/serviceId/customerServiceId, a sparse index on entities.taxDetails, servicestageinstances), and — critically — verified they were actually being used with $indexStats rather than assuming. An index nobody queries is pure cost.Then I fixed the source of the sprawl. Mongoose builds an index for every index: true field on connect, which silently recreates indexes you just dropped — and every developer's laptop does the same against shared environments. The fix is one line:
// Manage indexes explicitly; don't let Mongoose auto-create them
// on every connect (including from dev machines).
mongoose.connect(uri, { autoIndex: false });This round alone recovered ~43% of cumulative slow-query time and cut ~2 TB/month of unnecessary disk reads.
This is the step that separates "right-sizing" from "gambling." I didn't want to downgrade and watch dashboards; I wanted evidence first. So I built a fitness analysis straight from the shard logs that quantified the four things an M40 actually constrains:
With the query and index work landed, the model showed the changes had freed ~17–20% of cluster CPU and ~3–5 GB of cache pressure — enough that the working set now sat comfortably under the M40's 16 GB ceiling with headroom at peak. Then I downgraded.