https://majun.osinfra.cn/code/dailychecks/dailycheck/66908407d838a5283a4e00e0/d1cbae2a13ac42c0a0e651177e8890ac/summary
TensorProbe (code name: kj600) is an LLM pretraining debugger that collects and aggregates tensors from the model's torch modules, the optimizer state, and collective communication. It also supports rule-based alerts.
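
As a rough illustration of the "aggregate, then alert" idea, here is a minimal sketch in plain Python. It is not TensorProbe's API: the names `aggregate`, `Rule`, and `evaluate`, the choice of statistics, and the thresholds are all hypothetical, and a flattened tensor is stood in for by a list of floats.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


def aggregate(values: Sequence[float]) -> Dict[str, float]:
    """Reduce a flattened tensor (here: a list of floats) to scalar statistics."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "norm": math.sqrt(sum(v * v for v in values)),  # L2 norm
    }


@dataclass
class Rule:
    """Hypothetical alert rule: fires when check(stat_value) returns True."""
    name: str
    stat: str
    check: Callable[[float], bool]


def evaluate(tag: str, values: Sequence[float], rules: List[Rule]) -> List[str]:
    """Aggregate one collected tensor and return a message per fired rule."""
    stats = aggregate(values)
    return [
        f"{tag}: {rule.name} ({rule.stat}={stats[rule.stat]})"
        for rule in rules
        if rule.check(stats[rule.stat])
    ]


# Example rules (thresholds are arbitrary, for illustration only).
RULES = [
    Rule("norm too large", "norm", lambda x: x > 10.0),
    Rule("non-finite value", "norm", lambda x: not math.isfinite(x)),
]

if __name__ == "__main__":
    for msg in evaluate("layer0.weight.grad", [30.0, 40.0], RULES):
        print(msg)
```

In a real training run the values would come from module hooks, optimizer state, or communication buffers rather than literal lists; the sketch only shows how per-tensor statistics can feed simple threshold rules.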