Most training frameworks are 50,000 lines of code in a trench coat. They have to be, because they support every possible combination of parallelism, optimizer, scheduler, dataset, checkpoint format, and callback. That generality has costs.
This is what falls out when you start from PyTorch FSDP and aggressively delete. The result is 400 lines of Python that runs research-grade experiments on up to about 32 GPUs, with no framework magic.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
# ...the trainer is here
Full code in the repo.