MLOps
Migrating from W&B to MLflow 2.15: Savings, Gaps, and Hidden Costs
In this article: What does migrating from W&B to MLflow 2.15 actually cost? And how do you rewrite the training loop?
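For the training-loop question, the swap is mostly one-for-one. Here is a minimal sketch of the logging migration, assuming per-step metrics were previously logged with wandb.log; the experiment name, hyperparameters, and training stubs are placeholders, not code from the article:

```python
import mlflow

def train_step(batch):           # stand-in for the real forward/backward pass
    return 0.42                  # pretend loss value

loader = range(10)               # stand-in for the DataLoader

mlflow.set_experiment("wandb-migration-demo")          # replaces wandb.init(project=...)
with mlflow.start_run():
    mlflow.log_params({"lr": 3e-4, "batch_size": 32})  # replaces wandb.config
    for step, batch in enumerate(loader):
        loss = train_step(batch)
        # replaces wandb.log({"train/loss": loss}, step=step)
        mlflow.log_metric("train/loss", loss, step=step)
```

Passing the explicit step argument is what keeps step-aligned charts intact after the move.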
Mistral-7B-v0.3 QLoRA on Modal A100-40GB: nf4 + bf16_compute Beat My RunPod H100 Spot Cost Per Step
TL;DR: For a Mistral-7B-v0.3 QLoRA fine-tune at sequence length 2048 and micro-batch 4, a Modal A100-40GB container running bitsandbytes nf4 with bfloat16 compute beat my RunPod H100 spot instance on cost per step.
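The recipe the TL;DR names maps onto a standard transformers + bitsandbytes setup. A minimal sketch, assuming the Hugging Face BitsAndBytesConfig path; double quantization is my assumption, not something the article states:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute for the matmuls
    bnb_4bit_use_double_quant=True,         # assumption: not stated in the piece
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",  # a 7B model in nf4 fits comfortably on an A100-40GB
)
```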
vLLM 0.6 Continuous Batching Cut My Llama 3 Latency in Half
Upgrading a Llama 3 8B endpoint from vLLM 0.5.4 to 0.6.x is the rare dependency bump where the numbers on the dashboard actually move.
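Part of why the bump is painless: continuous batching lives in vLLM's scheduler, so client code barely changes between 0.5.x and 0.6.x. A minimal offline sketch, with the model id and sampling settings as my assumptions:

```python
from vllm import LLM, SamplingParams

# Continuous batching happens inside vLLM's scheduler; no client-side opt-in needed.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```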
SageMaker HyperPod Finally Fixed the Checkpoint Bottleneck
I lost three days of Llama-3 fine-tuning last November because a single EC2 node decided to panic. The cluster halted.
Dropping my local tracking server for Comet’s new free tier
The 2 AM breaking point: there I was, staring at my terminal at 1:30 AM on a Thursday, watching my training loop crash for the fourth time.
Production AI Is Hell: My Love-Hate Relationship With Triton
I was staring at a Grafana dashboard at 11:30 PM on a Tuesday when I finally admitted defeat.
Optuna’s New Rust Storage Backend Is Absurdly Fast
I spent three hours last Tuesday staring at a progress bar that simply refused to move. You know the feeling.
OpenAI Weights on SageMaker: Hell Froze Over
Honestly, I had to check the URL three times. Then I checked the SSL certificate. Then I texted a buddy at Amazon to ask if their marketing team had gone…
Azure ML Security: It’s Not Magic, It’s Just Someone Else’s Computer
I had a conversation last week with a Data Science lead that nearly made me choke on my coffee. We were reviewing their infrastructure, and when I pointed…
Taming the LLM Chaos: My Real-World MLflow Setup
I still remember the exact moment I realized my “custom” MLOps setup was a disaster waiting to happen. It was 2:00 AM on a Tuesday, and I was trying to…
