Dask’s Active Memory Manager Finally Stopped Breaking My Pipelines
2 min read

I used to dread the Slack notification. You know the one. The little red dot popping up at 7:30 AM telling me my overnight batch job failed. Always the same error: KilledWorker. Dask was eating memory faster than my cluster could provision it, panicking, and dropping workers left and right.

For a long time, my solution was just throwing more hardware at the problem. But last week, running Dask 2026.2.0 on Python 3.11.8, I decided to actually sit down and figure out why my single-cell RNA sequencing pipelines were still randomly crashing. What I found completely changed how I configure my clusters.

The ghost of the state machine

If you've been using Dask for a few years, you probably remember when they completely rewrote the worker state machine. It was a massive architectural shift that stabilized a lot of the weird edge cases where workers would ghost the scheduler.

That rewrite laid the groundwork for the Active Memory Manager (AMM). The AMM watches memory usage across the cluster and proactively rebalances it, dropping redundant copies of data and moving it between workers, while the worker-level thresholds handle spilling to disk, all before the OS out-of-memory (OOM) killer steps in.

In theory? Great. In practice? I always found it a bit too conservative. By the time the AMM decided to act, my workers were already dead.
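For reference, the AMM is controlled through Dask's config system. A minimal sketch of turning it on and tightening how often it runs; the 100ms interval here is purely illustrative, not a recommendation:

```python
import dask

# Enable the Active Memory Manager on the scheduler and make it
# evaluate its policies more frequently than the default.
dask.config.set({
    "distributed.scheduler.active-memory-manager.start": True,
    "distributed.scheduler.active-memory-manager.interval": "100ms",
})

print(dask.config.get("distributed.scheduler.active-memory-manager.start"))
```

These keys can also live in `distributed.yaml` so every scheduler you launch picks them up.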

A brutal life sciences edge case

Bioinformatics workloads are uniquely terrible for distributed computing, and single-cell RNA sequencing data is the worst offender. You load a massive AnnData object, convert it to a Dask array to do some distributed filtering, and suddenly a 50GB dataset needs 300GB of RAM to compute a simple UMAP projection.
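A minimal sketch of that shape of workload, with a small dense toy matrix standing in for the (much larger, sparse) AnnData expression matrix; the sizes and the 1% detection threshold are made up for illustration:

```python
import numpy as np
import dask.array as da

# Toy stand-in for an AnnData expression matrix (adata.X): 10,000 cells
# by 500 genes. Real single-cell matrices are sparse and far larger.
rng = np.random.default_rng(0)
X = rng.poisson(0.1, size=(10_000, 500)).astype(np.float32)

# Chunk along rows (cells) only, so per-gene reductions never have to
# stitch partial columns back together.
dX = da.from_array(X, chunks=(2_500, X.shape[1]))

# A simple distributed filtering step: keep genes detected in at least
# 1% of cells.
detected = (dX > 0).mean(axis=0)
keep = (detected >= 0.01).compute()
filtered = dX[:, keep]
print(filtered.shape)
```

The chunk layout is the knob that matters later: each chunk is what the worker has to hold in memory at once.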

I was running a pipeline on a cluster of three r6i.4xlarge EC2 instances. It kept failing during the matrix multiplication phase. The data was highly sparse, which usually saves memory, but Dask's default chunking was doing something weird under the hood.

After three failed runs, I went digging through the Dask Discourse. That forum is honestly a goldmine if you know how to search it. I found a thread from a maintainer explaining that sparse matrix chunks behave unpredictably with the AMM if you don't explicitly align your chunk sizes with your memory targets.

The default worker memory target (distributed.worker.memory.target) is 0.6: at 60% of the memory limit, the worker starts spilling to disk. But with sparse biological data, memory spikes happen in milliseconds during computation. By 60%, it's already too late.
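To make "align your chunk sizes with your memory targets" concrete, here is the back-of-the-envelope arithmetic I mean. The RAM and thread counts match an r6i.4xlarge (16 vCPUs, 128 GiB), but the chunks-in-flight factor is an assumption you would tune for your own pipeline:

```python
worker_mem = 128 * 2**30   # r6i.4xlarge: 128 GiB of RAM
target = 0.60              # distributed.worker.memory.target default
threads = 16               # one worker thread per vCPU
in_flight = 4              # assumed chunks held per thread mid-computation

# Memory available before spilling kicks in, split across everything
# the worker may hold at once.
budget = worker_mem * target / (threads * in_flight)
print(f"{budget / 2**20:.0f} MiB per chunk")
```

If your chunks come out bigger than that budget, a transient spike during a matmul can blow past the spill threshold before the worker ever gets a chance to react.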

How I configure the AMM now