Meta’s $100B AMD Pact Actually Fixes PyTorch’s Biggest Headache
The Monopoly Tax is Getting Old
I spent three hours yesterday trying to provision a single H100 instance on AWS. Three hours. For one node. When I finally got it, the hourly rate made me physically wince.
This is the reality of machine learning right now. We are all trapped in Jensen Huang’s walled garden, paying whatever Nvidia decides to charge because CUDA is the only game in town. Or at least, it was.
When the news dropped recently about Meta signing a $100 billion AI chip deal with AMD, my feed filled up with finance bros talking about market caps and supply chains. I don’t care about any of that. I care about what happens when I type import torch.
And honestly? This deal is the best thing to happen to ML engineers in five years.
The Hardware Was Never the Problem
Let’s get one thing straight. AMD hardware has been fast for a while. The Instinct MI300X accelerators are absolute monsters on paper. But nobody wanted to use them.
Why? Because ROCm—AMD’s answer to CUDA—used to trigger PTSD in anyone who tried to deploy models in production. Two years ago, trying to run a complex training loop on AMD hardware meant fighting obscure C++ compilation errors at 2 AM. It was a mess. You’d spend more time debugging memory allocation faults than actually training your model.
But Meta isn’t just buying chips. They are the primary maintainers of PyTorch. When a company drops $100 billion on AMD silicon to train Llama 4 and power their internal recommendation engines, they are going to make damn sure the software works.
And it’s already happening.
Benchmarking the Reality
I managed to get access to a staging cluster with 8x MI300X cards last Tuesday. I wanted to see if the software stack had actually caught up to the hype. I pulled down a Llama-3-70B-Instruct model, loaded up PyTorch 2.5.1, and ran a standard inference benchmark.
My expectations were low. I figured I’d hit a wall with custom kernels.
I was wrong. It just ran.
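Part of why it just ran: PyTorch's ROCm builds expose AMD GPUs through the same torch.cuda namespace, so CUDA-era scripts run without a single line changed. Here is a minimal sanity check (it falls back to CPU if no accelerator is present; the tensor sizes are arbitrary):

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs surface through the same
# torch.cuda API that Nvidia GPUs do, so this code is identical
# on an H100 (CUDA) and an MI300X (ROCm/HIP).
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(4, 4, device=device)
y = x @ x  # same call path on either vendor's hardware
print(y.shape)  # torch.Size([4, 4])
```

No vendor branching, no special import. That is the whole point of Meta owning the software layer.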
Here are the actual numbers. On our standard H100 nodes, we typically see about 34 tokens per second per user at batch size 16. On the MI300X cluster? 32 tokens per second. That is a gap of roughly 6 percent, small enough to disappear into run-to-run variance.
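For the curious, the measurement itself is simple: time a fixed number of decode steps and divide. This is a minimal sketch of that idea, with a toy linear layer standing in for the 70B model; the model, sizes, and step count here are illustrative, not my actual harness:

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for the LLM forward pass (the real benchmark ran
# Llama-3-70B-Instruct; this keeps the sketch self-contained).
model = nn.Linear(256, 256).to(device)
batch_size, new_tokens = 16, 64
x = torch.randn(batch_size, 256, device=device)

# Warm up so compilation/caching doesn't pollute the timing.
for _ in range(3):
    model(x)
if device == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(new_tokens):
    x = model(x)  # one pseudo decode step per iteration
if device == "cuda":
    torch.cuda.synchronize()  # GPU kernels are async; wait before timing
elapsed = time.perf_counter() - start

# Per-user throughput: each of the 16 users gets new_tokens
# tokens over the same wall-clock window.
tok_per_sec_per_user = new_tokens / elapsed
print(f"{tok_per_sec_per_user:.1f} tokens/s per user")
```

The synchronize calls matter: GPU kernels launch asynchronously, and timing without them measures launch overhead, not actual throughput.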
But
