
PyTorch Notes--RuntimeError: NCCL communicator was aborted on rank 3.

article 2025/1/19 16:58:43

1--The following bug appears during distributed parallel training:

[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.

RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1721483, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805695 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'

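Main cause:

This is a timeout error. A likely cause is that the CPU threads are busy (the server does not have enough CPU resources), so data loading stalls for a long time on some rank. The other ranks then block in the ALLREDUCE until the 30-minute watchdog timeout (Timeout(ms)=1800000 in the log) expires.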

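2--Possible solutions:

1. Avoid the timeout wait:

For example, use fewer data-loading workers (lower num_workers in the DataLoader), so that a shortage of CPU threads does not cause the timeout.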
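One way to pick a lower num_workers is to derive it from the CPU budget of the node rather than hard-coding it. The sketch below is only an illustration: the helper name pick_num_workers, its parameters, and the rule of reserving one core per training process are assumptions, not something PyTorch prescribes.

```python
import os
from typing import Optional


def pick_num_workers(procs_per_node: int,
                     requested: int,
                     reserve: int = 1,
                     total_cpus: Optional[int] = None) -> int:
    """Cap the DataLoader worker count so all ranks on a node fit the CPU budget.

    Hypothetical helper for illustration; the heuristic is an assumption.
    """
    # Fall back to the number of CPUs reported by the OS.
    cpus = total_cpus if total_cpus is not None else (os.cpu_count() or 1)
    # Split the cores evenly across the training processes on this node and
    # keep `reserve` cores per process for the main training loop.
    per_proc = cpus // max(procs_per_node, 1) - reserve
    # Never exceed the requested count, never go below 0 (0 = load in main process).
    return max(0, min(requested, per_proc))
```

The returned value can be passed as num_workers to torch.utils.data.DataLoader. With 4 training processes on a 16-core server, a request for 8 workers per process would be capped at 3.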

2. Extend the timeout:

Raise it from the default 30 min to a longer value, for example 90 min: torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400))
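The call above can be wrapped so that the timeout is tunable without editing the training script. This is a minimal sketch under stated assumptions: DIST_TIMEOUT_MINUTES is a hypothetical environment variable invented for this example (PyTorch does not read it), and process_group_kwargs and init_distributed are illustrative names.

```python
import datetime
import os


def process_group_kwargs(default_minutes: int = 90) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    DIST_TIMEOUT_MINUTES is a hypothetical variable used only in this sketch.
    """
    minutes = int(os.environ.get("DIST_TIMEOUT_MINUTES", default_minutes))
    return {
        "backend": "nccl",
        "init_method": "env://",
        # 90 minutes equals the 5400 seconds used in the call above.
        "timeout": datetime.timedelta(minutes=minutes),
    }


def init_distributed() -> None:
    """Initialize the default process group with the extended timeout."""
    # Imported here so the helper above can be checked without torch or a GPU.
    import torch.distributed as dist

    dist.init_process_group(**process_group_kwargs())
```

Each rank would call init_distributed() once at startup, with the usual env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher. A longer timeout only hides the stall, so it is best combined with fixing the slow data loading.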

3. For more solutions, see: https://github.com/huggingface/accelerate/issues/314