Compiling Fast.ai Models for Cerebras
The Deployment Wall
I was sitting at my desk at 9 PM last Thursday, staring at a CloudWatch dashboard that made absolutely no sense. My fast.ai vision model was taking 850ms per request on a standard EC2 instance. The math wasn’t mathing. I had just spent three days trying to get my models to run on the new AWS Cerebras endpoints, and I was getting nowhere.
If you missed the noise last month, AWS spun up Cerebras hardware for high-speed inference. It is fantastic for massive language models. But what about the rest of us? The ones still building custom computer vision pipelines with fast.ai?
Historically, moving a fast.ai Learner to specialized hardware meant stripping away all the elegant abstractions. You had to pull out the raw PyTorch model and write an inference script from scratch. You lost the DataLoaders. You lost the transforms. It sucked.

But the core team quietly pushed a massive refactor in fast.ai 2.8.1 last Tuesday. They finally overhauled how Learner.export handles compilation for specialized hardware backends.
The 2.8.1 Compilation Target
I didn’t believe it would work on the first try. I’ve been burned by hardware-agnostic claims too many times.
Here’s the setup. I’m running Python 3.12.2. I took an old image classification model—a ConvNeXt-Base trained on a custom manufacturing defect dataset. Usually, I’d just deploy this on a standard T4 GPU and call it a day. But the client needed sub-10ms latency.
The new approach changes the export step entirely. Instead of just pickling the learner, you pass a compilation target directly to the export method.

from fastai.vision.all import *
import torch
import torch_cerebras  # registers the "cerebras" compile backend

learn = load_learner('defect_model_v1.pkl')

# The new 2.8.1 compilation hook
learn.model = torch.compile(
    learn.model,
    backend="cerebras",
    mode="reduce-overhead"
)

# Export with the compiled graph intact
learn.export('defect_model_compiled.pkl', compile=True)
The Transform Gotcha
I ran this on a batch of 5,000 test images. The results? It dropped my inference time from 42ms per image down to 3.1ms. That’s not a typo.
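The back-of-envelope arithmetic on those two latencies is worth writing down, since the speedup and the SLA headroom both fall out of it (the 10ms budget is the client requirement mentioned earlier; everything else comes from the measurements above):

```python
# Sanity-check the measured latencies against the client's latency budget.
t4_ms = 42.0        # per-image latency on the T4 baseline
cerebras_ms = 3.1   # per-image latency after compilation
sla_ms = 10.0       # the client's sub-10ms requirement

speedup = t4_ms / cerebras_ms           # ~13.5x
throughput = 1000.0 / cerebras_ms       # single-stream images per second

print(f"{speedup:.1f}x speedup, {throughput:.0f} img/s")
assert cerebras_ms < sla_ms < t4_ms     # only the compiled path meets the SLA
```

So the T4 baseline misses the budget by more than 4x, while the compiled path clears it with roughly 3x headroom.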
But there is a massive catch. The documentation completely ignores how this interacts with fast.ai's built-in TfmdDL (transformed dataloaders). If you have random augmentations still active in your validation set, the compiler chokes: it logs a cryptic RuntimeError: Graph break at random_crop, then silently falls back to CPU execution. I wasted two hours figuring that out.

You have to explicitly call learn.dls.valid.rng.seed(None) or ensure all your transforms are strictly deterministic before hitting that export button. If you don’t, you’ll be paying for specialized hardware while your code runs on the CPU.
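If you want that check automated in a deploy script, the shape of it is simple. In real fast.ai code you would scan the validation DataLoader's item and batch transforms for RandTransform subclasses; the stand-in classes below (Resize, RandomCrop, the is_random marker) are purely illustrative, not fast.ai's actual API:

```python
# Illustrative pre-export guard. The transform classes are stand-ins;
# against a real Learner you'd check isinstance(t, RandTransform) over
# learn.dls.valid's transform pipelines instead.

class Resize:
    """Stand-in for a deterministic transform: safe to compile."""
    is_random = False

class RandomCrop:
    """Stand-in for a random augmentation: triggers the graph break."""
    is_random = True

def random_tfms(pipeline):
    # Collect anything that would make the compiled graph non-deterministic.
    return [type(t).__name__ for t in pipeline if getattr(t, "is_random", False)]

offenders = random_tfms([Resize(), RandomCrop()])
print(offenders)
```

The real version of this check belongs immediately before the export call: anything it flags either gets dropped from the validation pipeline or swapped for its deterministic equivalent before you compile.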
Right now, this integration is mostly focused on the new Cerebras endpoints and standard TensorRT. I expect we’ll see native support for AWS Trainium 2 by Q1 2027, assuming the PyTorch XLA team sorts out their dynamic shape issues.
I’m migrating the rest of my production endpoints this weekend. We’ll see if it holds up under real load.
Frequently asked questions
How do I compile a fast.ai model for AWS Cerebras endpoints?
In fast.ai 2.8.1, load your Learner with load_learner, then wrap learn.model with torch.compile using backend="cerebras" and mode="reduce-overhead". Export the compiled graph by calling learn.export('model.pkl', compile=True). This passes a compilation target directly to the export method instead of just pickling the Learner, preserving the compiled graph for specialized hardware inference.
How much faster is fast.ai inference on Cerebras compared to standard GPUs?
Testing a ConvNeXt-Base manufacturing defect classifier across 5,000 images, inference time dropped from 42ms per image on a standard T4 GPU to 3.1ms per image after compiling with the Cerebras backend in fast.ai 2.8.1. That is roughly a 13x speedup, enough to meet sub-10ms latency requirements that previously would have forced rewriting the pipeline in raw PyTorch.
Why does fast.ai torch.compile fail with RuntimeError: Graph break at random_crop?
The Cerebras compiler chokes when fast.ai’s TfmdDL still has random augmentations active on the validation set. It throws RuntimeError: Graph break at random_crop and silently falls back to CPU execution, meaning you pay for specialized hardware while running on CPU. Fix it by calling learn.dls.valid.rng.seed(None) or ensuring all validation transforms are strictly deterministic before exporting.
When will fast.ai support AWS Trainium 2 compilation?
Native AWS Trainium 2 support is expected around Q1 2027, contingent on the PyTorch XLA team resolving their dynamic shape issues. As of fast.ai 2.8.1, the new compilation hook in Learner.export is focused on the recently launched AWS Cerebras endpoints and standard TensorRT backends, so Trainium users will need to wait before getting the same one-line export workflow.
