smoltorrent: Distributing ML Checkpoints Across a Pi Cluster
A 942 MB checkpoint. Four Raspberry Pis. ~1.5 min gather. No single point of failure. A deep dive into smoltorrent - a distributed checkpoint sharding system built over raw TCP with replication, SHA-256 integrity verification, mDNS discovery, and Prometheus monitoring. Read more