Machine learning,
from to clusters.
An open community for learning, writing, and tinkering with the infrastructure behind modern AI: inference engines, training systems, ML stacks, and everything in between.
Recently published.
The arithmetic of attention: why FlashAttention still matters
Memory bandwidth, not FLOPs, is what bounds modern inference. A walk through the numbers behind a kernel that quietly reshaped the field.
Continuous batching, revisited
Three years after the original paper, what does state-of-the-art serving actually look like? A field report from a team running 12B tokens a day.
What we've been getting wrong about MoE routing
Top-k routing has become a default. It shouldn't be. A look at the tradeoffs nobody's measuring and the experiments that changed my mind.
Browse by topic.
01 Inference & Serving
vLLM, TGI, paged attention, continuous batching, speculative decoding. 4 articles
02 Training Systems
Trainers, optimizers, recipes, debugging large runs. 1 article
03 Architecture
Transformers, MoE, SSMs, hybrids, and what's next. 1 article
04 Distributed Training
FSDP, tensor parallel, pipeline parallel, sequence parallel. 1 article
05 Quantization
PTQ, QAT, FP4, FP8, mixed precision, calibration. 1 article
06 Retrieval & RAG
Embeddings, indexes, re-rankers, and pipeline systems. 2 articles
07 Models
LLMs, VLMs, multimodal systems, capabilities, and model behavior. 1 article
08 Agents
Planning, tool use, multi-agent systems, memory, and orchestration. 1 article
09 Evaluation
Benchmarks, harnesses, contamination, signal vs noise. 1 article
10 MLOps & Deployment
Pipelines, monitoring, observability, regressions. 0 articles
The conversation.
GitHub Discussions
Long-form threads tied to GitHub identities. Best for technical questions, paper discussions, and feature proposals. Searchable, attributed, permanent.
Open Discussions →
Discord
Real-time chat for the working day. Quick questions, debugging help, paper club, and the occasional argument about whether MoE is overrated.
Join the server →
Run the math yourself.
Throughput Calculator
Estimate tokens/sec for any GPU + model + batch size combination.
Attention Visualizer
Inspect attention patterns layer-by-layer for any HF model.
Inference Cost Calculator
Compare provider pricing against self-hosting at realistic utilization.
Model Card Generator
Generate a structured model card from a checkpoint and evaluation log.
Eval Harness Playground
Run focused evaluations against any inference endpoint and compare quality, latency, and cost.
Kernel Benchmark
Compare Triton, CUDA, and PyTorch implementations across shapes and dtypes.