vLLM 0.6 Continuous Batching Cut My Llama 3 Latency in Half
Upgrading a Llama 3 8B endpoint from vLLM 0.5.4 to 0.6.x is the rare dependency bump where the numbers on the dashboard actually move.
How to Convert PyTorch Models to ONNX Format for Faster Inference
I remember the first time I deployed a PyTorch model to production. I wrapped a beautifully trained ResNet model in a Flask API, spun up a Docker.
The Stable Kaggle CLI Fixes My Biggest Authentication Headache
I was staring at my terminal at 11pm last Tuesday, watching a GitHub Actions runner fail for the third time. The error was always the same.
Compiling Fast.ai Models for Cerebras
The Deployment Wall. I was sitting at my desk at 9 PM last Thursday, staring at a CloudWatch dashboard that made absolutely no sense.
Mastering TensorFlow: Advanced Techniques and Best Practices for Modern Developers
Introduction. In today’s rapidly evolving technological landscape, TensorFlow has emerged as a critical skill for developers.
Local Inference is Finally Good (Thanks, TensorRT)
I spent the better part of yesterday fighting with a Docker container that refused to see my GPU. You know the drill.
Production AI Is Hell: My Love-Hate Relationship With Triton
Well, I have to admit, I was staring at a Grafana dashboard at 11:30 PM on a Tuesday when I finally admitted defeat.
Azure ML Compute Security: Stop Trusting the Defaults
I spent last Tuesday arguing with a firewall. It wasn’t fun. I was trying to lock down our data science environment because, honestly, the default.
Ray joined PyTorch Foundation: Why my infra team finally relaxed
Actually, I should clarify: I was sitting in a budget meeting last November when our CTO asked the question that usually makes me sweat: “Are we sure.
Weaviate on GCP: Stop Overcomplicating Your Vector Stack
I remember the bad old days. You probably do too. Back around 2023 or 2024, if you wanted a production-grade vector database, you were basically signing.
