PyTorch Notes -- RuntimeError: NCCL communicator was aborted on rank 3.
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
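1. Ways to avoid the timeout:
2. Extend the timeout:
Raise the timeout from the default 30 minutes (the Timeout(ms)=1800000 shown in the log above) to a longer value, for example 90 minutes: torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400))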
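The call above can be wrapped in a small helper so the timeout is set in one place. This is only a minimal sketch: the function names build_init_kwargs and init_distributed are hypothetical, and it assumes the usual env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are already set by the launcher.

```python
import datetime


def build_init_kwargs(timeout_seconds=5400):
    """Collect the arguments passed to torch.distributed.init_process_group.

    build_init_kwargs is a hypothetical helper name; the keys mirror the
    call shown in the note above.
    """
    return {
        "backend": "nccl",
        "init_method": "env://",
        # Replaces the default 30-minute timeout with timeout_seconds.
        "timeout": datetime.timedelta(seconds=timeout_seconds),
    }


def init_distributed(timeout_seconds=5400):
    """Initialize the default process group with an extended timeout."""
    # Imported lazily so the kwargs helper can be inspected without PyTorch.
    import torch.distributed as dist

    dist.init_process_group(**build_init_kwargs(timeout_seconds))


if __name__ == "__main__":
    # Only prints the configured timeout; call init_distributed() inside a
    # real multi-GPU job started by a distributed launcher.
    print(build_init_kwargs()["timeout"])  # 1:30:00
```

A longer timeout only gives a slow collective more time to finish. If one rank has actually stalled or crashed, the job will still fail, just later.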
3. Further references: