Compiling Fast.ai Models for Cerebras
The Deployment Wall
I was sitting at my desk at 9 PM last Thursday, staring at a CloudWatch dashboard that made absolutely no sense. My fast.ai vision model was taking 850ms per request on a standard EC2 instance. The math wasn’t mathing. I had just spent three days trying to get my models to run on the new AWS Cerebras endpoints, and I was getting nowhere.
If you missed the noise last month, AWS spun up Cerebras hardware for high-speed inference. It is fantastic for massive language models. But what about the rest of us? The ones still building custom computer vision pipelines with fast.ai?
Historically, moving a fast.ai Learner to specialized hardware meant stripping away all the elegant abstractions. You had to pull out the raw PyTorch model and write an inference script from scratch. You lost the DataLoaders. You lost the transforms. It sucked.

But the core team quietly pushed a massive refactor in fast.ai 2.8.1 last Tuesday. They finally overhauled how Learner.export handles compilation for specialized hardware backends.
The 2.8.1 Compilation Target
I didn’t believe it would work on the first try. I’ve been burned by hardware-agnostic claims too many times.
Here’s the setup. I’m running Python 3.12.2. I took an old image classification model—a ConvNeXt-Base trained on a custom manufacturing defect dataset. Usually, I’d just deploy this on a standard T4 GPU and call it a day. But the client needed sub-10ms latency.
The new approach changes the export step entirely. Instead of just pickling the learner, you pass a compilation target directly to the export method.

from fastai.vision.all import *
import torch_cerebras

learn = load_learner('defect_model_v1.pkl')

# The new 2.8.1 compilation hook
learn.model = torch.compile(
    learn.model,
    backend="cerebras",
    mode="reduce-overhead",
)

# Export with the compiled graph intact
learn.export('defect_model_compiled.pkl', compile=True)
The Transform Gotcha
I ran this on a batch of 5,000 test images. The results? It dropped my inference time from 42ms per image down to 3.1ms. That’s not a typo.
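Numbers like these are easy to fumble: compilation happens lazily, so the first few calls pay a one-time cost that will wreck your average if you forget to warm up. Here's a minimal timing harness for checking per-item latency yourself; the helper and the toy stand-in model are my own sketch, not anything from fastai. With a real learner you'd pass something like lambda xb: learn.model(xb) and an actual tensor batch instead.

```python
import time

def bench(predict_fn, batch, warmup=10, iters=100):
    """Average per-item latency in ms for a predict callable.

    predict_fn and batch are stand-ins; in a real setup you would pass
    the compiled model and a representative input batch.
    """
    # Warm-up runs: lazy compilation and caching happen here,
    # so they must not count toward the measured time.
    for _ in range(warmup):
        predict_fn(batch)
    t0 = time.perf_counter()
    for _ in range(iters):
        predict_fn(batch)
    total_ms = (time.perf_counter() - t0) * 1000
    return total_ms / (iters * len(batch))

# Toy stand-in "model": doubles every item in the batch.
latency = bench(lambda xs: [x * 2 for x in xs], list(range(64)))
assert latency > 0
```

The per-item division assumes your batch dimension is the list length; adjust it if you benchmark a single image at a time.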
But there is a massive catch. The documentation completely ignores how this interacts with fast.ai's built-in TfmdDL (transformed dataloaders). If you have random augmentations still active in your validation set, the compiler chokes: it logs a cryptic RuntimeError: Graph break at random_crop and then silently falls back to CPU execution. I wasted two hours figuring that out.

You have to explicitly call learn.dls.valid.rng.seed(None) or ensure all your transforms are strictly deterministic before hitting that export button. If you don’t, you’ll be paying for specialized hardware while your code runs on the CPU.
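A cheap guard before exporting is to scan the validation pipeline's transform names for anything that smells stochastic. The helper and the marker list below are my own sketch, not a fastai API; with a real learner you'd feed it something like [type(t).__name__ for t in learn.dls.valid.after_batch].

```python
# Hypothetical pre-export guard. The marker list is an assumption:
# substrings that typically show up in the class names of random
# augmentations (RandomResizedCropGPU, Flip, Brightness jitter, etc.).
RANDOM_TFM_MARKERS = ("Rand", "Random", "Flip", "Rotate", "Jitter", "Erasing")

def nondeterministic_tfms(tfm_names):
    """Return the subset of transform names that look stochastic."""
    return [n for n in tfm_names if any(m in n for m in RANDOM_TFM_MARKERS)]

# Example: a validation pipeline still carrying training-time augmentations.
tfms = ["IntToFloatTensor", "Normalize", "RandomResizedCropGPU", "Flip"]
offenders = nondeterministic_tfms(tfms)
assert offenders == ["RandomResizedCropGPU", "Flip"]
```

A name match is obviously a heuristic, not proof of nondeterminism, but failing loudly on a match is a lot cheaper than paying for Cerebras time while your graph runs on the CPU.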
Right now, this integration is mostly focused on the new Cerebras endpoints and standard TensorRT. I expect we’ll see native support for AWS Trainium 2 by Q1 2027, assuming the PyTorch XLA team sorts out their dynamic shape issues.
I’m migrating the rest of my production endpoints this weekend. We’ll see if it holds up under real load.
