Performance
Qdrant Binary Quantization Cuts Sentence-Transformers Search Latency 4x
Qdrant’s binary quantization compresses each float32 vector dimension to a single bit.
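The core idea can be sketched in a few lines of plain Python (an illustration of the technique, not Qdrant's implementation): keep only the sign bit of each dimension, so distance comparisons become Hamming distance over bit vectors.

```python
def binarize(vec):
    """Quantize a float vector to one bit per dimension: its sign."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Hamming distance between bit vectors: a cheap proxy for angular distance."""
    return sum(x != y for x, y in zip(a, b))

query = binarize([0.12, -0.80, 0.33, -0.05])   # -> [1, 0, 1, 0]
doc   = binarize([0.40, -0.10, -0.90, 0.77])   # -> [1, 0, 0, 1]
print(hamming(query, doc))                      # differs in 2 of 4 positions -> 2
```

In practice the bits are packed into machine words so the distance is an XOR plus a popcount, and the quantized search is typically followed by rescoring the top candidates with the original float vectors.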
JAX Gradient Checkpointing on TPU v5e: 40% Memory Cut at 12% Speed Cost
In this article: How does JAX gradient checkpointing reduce memory on TPU v5e? What is the checkpoint policy that drives the 40% memory saving?
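Gradient checkpointing trades compute for memory: the forward pass stores activations only at every k-th layer, and the backward pass recomputes the missing ones segment by segment. A toy version in plain Python (a conceptual sketch only; in JAX this is what `jax.checkpoint`, alias `jax.remat`, arranges at the trace level):

```python
def layer(a):
    return a * a               # toy layer: f(a) = a^2, so f'(a) = 2a

def grad_with_checkpoints(x, n_layers, k):
    """Backprop through a chain of `layer`, storing activations only every k layers."""
    # Forward: keep only checkpoint activations (inputs to layers 0, k, 2k, ...).
    ckpts, a = {}, x
    for i in range(n_layers):
        if i % k == 0:
            ckpts[i] = a
        a = layer(a)
    peak_stored = len(ckpts)   # naive backprop would store all n_layers activations

    # Backward: rebuild each layer's input from the nearest checkpoint, then chain rule.
    grad = 1.0
    for i in reversed(range(n_layers)):
        start = (i // k) * k
        a = ckpts[start]
        for _ in range(start, i):
            a = layer(a)       # recompute the activation that feeds layer i
        grad *= 2.0 * a        # local derivative f'(a) = 2a
    return grad, peak_stored

g, stored = grad_with_checkpoints(3.0, n_layers=3, k=2)
print(g, stored)               # d(x**8)/dx at x=3 is 8 * 3**7 = 17496.0; 2 stored
```

The memory/speed trade in the headline falls out of this structure: stored activations drop from n to roughly n/k, at the cost of re-running each segment once during the backward pass.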
Mistral-7B-v0.3 QLoRA on Modal A100-40GB: nf4 + bf16_compute Beat My RunPod H100 Spot Cost Per Step
TL;DR: For a Mistral-7B-v0.3 QLoRA fine-tune at sequence length 2048 and micro-batch 4, a Modal A100-40GB container running bitsandbytes nf4 with bfloat16 compute.
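The nf4 data type in bitsandbytes is a blockwise 4-bit quantization: each block of weights is scaled by its absolute maximum and every value snaps to one of 16 levels. A minimal sketch with uniform levels (illustrative only; the real NF4 codebook is non-uniform, derived from a normal distribution):

```python
def quantize_block(block, n_levels=16):
    """Absmax-quantize one block of floats to 4-bit indices plus one float scale."""
    scale = max(abs(x) for x in block) or 1.0
    # Map [-scale, scale] onto levels 0..15 (uniform here; NF4's codebook is not).
    idx = [round((x / scale + 1.0) / 2.0 * (n_levels - 1)) for x in block]
    return idx, scale

def dequantize_block(idx, scale, n_levels=16):
    """Recover approximate floats from 4-bit indices and the block scale."""
    return [(i / (n_levels - 1) * 2.0 - 1.0) * scale for i in idx]

weights = [0.5, -1.0, 0.25, 1.0]
idx, scale = quantize_block(weights)
print(dequantize_block(idx, scale))
```

During the forward pass the 4-bit weights are dequantized on the fly to a higher-precision compute dtype, which is the bfloat16-compute half of the setup in the title.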
vLLM 0.6 Continuous Batching Cut My Llama 3 Latency in Half
Upgrading a Llama 3 8B endpoint from vLLM 0.5.4 to 0.6.x is the rare dependency bump where the numbers on the dashboard actually move.
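Continuous batching admits a new request the moment any in-flight request finishes, instead of waiting for a whole fixed batch to drain. A toy makespan comparison of the two policies, with request "lengths" standing in for decode steps (illustrative scheduling only, not vLLM's actual scheduler):

```python
import heapq

def static_batching(lengths, batch_size):
    """Fixed groups: each group runs for as long as its longest member."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching(lengths, batch_size):
    """Refill a freed slot immediately; makespan is the last finish time."""
    finish, now = [], 0            # min-heap of in-flight finish times
    for n in lengths:
        if len(finish) == batch_size:
            now = heapq.heappop(finish)   # a slot frees; advance to that moment
        heapq.heappush(finish, now + n)
    return max(finish)

reqs = [7, 2, 5, 3]
print(static_batching(reqs, 2))      # [7,2] then [5,3]: 7 + 5 = 12
print(continuous_batching(reqs, 2))  # short requests stop blocking long ones: 10
```

The win comes from exactly the effect the simulation shows: short generations no longer wait for the longest sequence in their batch, which is why the latency improvement shows up on real dashboards.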
torch.compile in PyTorch 2.5: Where the Speedup Comes From and Where It Disappears
PyTorch 2.5 made torch.compile good enough that you can drop it into a real training script and expect a speedup most of the time.
How to Convert PyTorch Models to ONNX Format for Faster Inference
I remember the first time I deployed a PyTorch model to production. I wrapped a beautifully trained ResNet model in a Flask API, spun up a Docker container.
Dask’s Active Memory Manager Finally Stopped Breaking My Pipelines
I used to dread the Slack notification. You know the one. The little red dot popping up at 7:30 AM telling me my overnight batch job failed.
How I Cut FLUX.1 Inference to 3 Seconds with TensorRT
I was staring at my terminal at 1:30 AM last Thursday, watching my RTX 4090 scream at 98% utilization while spitting out a single 1024×1024 image every 15 seconds.
Compiling Fast.ai Models for Cerebras
The Deployment Wall I was sitting at my desk at 9 PM last Thursday, staring at a CloudWatch dashboard that made absolutely no sense.
SageMaker HyperPod Finally Fixed the Checkpoint Bottleneck
I lost three days of Llama-3 fine-tuning last November because a single EC2 node decided to panic. The cluster halted.
