Conversation

ouyhlan

Description: Fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter

Main reason:
An out-of-memory error occurred while using the fastNLP library, specifically when training a model on CPU. Debugging showed that in core/trainer.py, the _check_code function does not pass the pin_memory parameter when instantiating the Tester class, and Tester initializes pin_memory to True by default.

Full error traceback:

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
Traceback (most recent call last):
  File "/data/ouyhlan/TextClassification/main.py", line 52, in <module>
    trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 558, in __init__
    _check_code(dataset=train_data, model=self.model, losser=losser, forward_func=self._forward_func, metrics=metrics,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 1013, in _check_code
    evaluate_results = tester.test()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/tester.py", line 184, in test
    for batch_x, batch_y in data_iterator:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/batch.py", line 266, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 477, in _next_data
    data = _utils.pin_memory.pin_memory(data)
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in pin_memory
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in <dictcomp>
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
    return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

After setting pin_memory to False, the problem disappeared. Also, following pytorch/pytorch#57273, I suggest that the Trainer and Tester classes not enable pin_memory by default on all torch versions.
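The mechanics of the bug can be sketched without torch: a callee that reads a keyword via kwargs.get with a True default silently overrides the caller's setting whenever the caller forgets to forward it. The MiniTrainer and MiniTester names below are hypothetical stand-ins for illustration, not fastNLP's actual classes:

```python
# Minimal sketch of the forgotten-keyword bug (hypothetical Mini* classes,
# not fastNLP's real code).

class MiniTester:
    def __init__(self, **kwargs):
        # Mirrors fastNLP's Tester: pin_memory defaults to True when not passed.
        self.pin_memory = kwargs.get('pin_memory', True)

class MiniTrainer:
    def __init__(self, pin_memory=False):
        self.pin_memory = pin_memory
        # Bug: the internal check builds a Tester without forwarding pin_memory,
        # so the Tester's default of True wins even when the user asked for False.
        self.buggy_tester = MiniTester()
        # Fix: forward the caller's value explicitly.
        self.fixed_tester = MiniTester(pin_memory=self.pin_memory)

trainer = MiniTrainer(pin_memory=False)
print(trainer.buggy_tester.pin_memory)  # True  (callee default wins)
print(trainer.fixed_tester.pin_memory)  # False (caller's value forwarded)
```

On a CPU-only run, the erroneously enabled pinning is what triggers the CUDA host-allocator failure in the traceback above.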

Checklist: check whether each item below is complete

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (e.g. [bugfix] for bug fixes, [new] for new features, [test] for test changes, [rm] for removing old code)
  • Changes are complete (i.e. I finished coding on this PR); only submit the PR once the changes are complete
  • All changes have test coverage and pass the tests. For changes under fastnlp/fastnlp/, test code must be provided under fastnlp/test/
  • Code is well-documented: write good comments, as the API documentation is extracted from them
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it. If the change affects examples or tutorials, contact a core developer

Changes: describe each modification

  • The Tester and Trainer classes no longer enable pin_memory by default

Mention: request someone to review your PR
@yhcc

@ouyhlan changed the title from "Fix the memory bug caused by the check_code function in Trainer ignoring pin_memory" to "[bugfix] Fix the memory bug caused by the check_code function in Trainer ignoring pin_memory" on Nov 29, 2021
@yhcc

Thanks again for your submission. Last time I did not check the details you mentioned carefully; I have just looked over this part of the change again. I lean toward keeping pin_memory on by default, because when people actually train neural networks they are usually on a server with GPUs, and that amount of memory consumption should be easy for a server to absorb (pin_memory really does speed up data preparation, so enabling it by default can save everyone a little time). Anyone who runs into a memory problem can also turn pin_memory off manually. I checked the code, and it looks like check_code in Trainer already disables pin_memory by default, doesn't it?

_iter = DataSetIter(dataset, batch_size=batch_size, sampler=None)

@ouyhlan

@yhcc

Thanks again for your submission. Last time I did not check the details you mentioned carefully; I have just looked over this part of the change again. I lean toward keeping pin_memory on by default, because when people actually train neural networks they are usually on a server with GPUs, and that amount of memory consumption should be easy for a server to absorb (pin_memory really does speed up data preparation, so enabling it by default can save everyone a little time). Anyone who runs into a memory problem can also turn pin_memory off manually. I checked the code, and it looks like check_code in Trainer already disables pin_memory by default, doesn't it?

_iter = DataSetIter(dataset, batch_size=batch_size, sampler=None)

The problem with check_code in Trainer comes from the following line:

tester = Tester(data=dev_data[:batch_size * DEFAULT_CHECK_NUM_BATCH], model=model, metrics=metrics,
                batch_size=batch_size, verbose=-1, use_tqdm=False)

This line constructs the Tester without passing the pin_memory parameter. Looking at Tester's initialization method:
self.pin_memory = kwargs.get('pin_memory', True)

In other words, regardless of whether pin_memory is passed to the Trainer, pin_memory here is always enabled by default.
Also, I have reproduced this memory problem on a server with GPUs. The scenario was roughly that several people were running code on different cards of the same server, leaving the server short on memory, at which point the consumption from pin_memory becomes significant. If you would rather not change the default, you could also consider the other fix I submitted earlier: ouyhlan@cac1331

@yhcc

Right, I think the other approach you mentioned would be better.

@yhcc

Thank you very much~ Please open a PR based on the code you mentioned~

@ouyhlan force-pushed the fix_trainer_pin_memory_bug branch from 3be86c6 to 2302853 on December 2, 2021 at 11:06
@ouyhlan

@yhcc

Thank you very much~ Please open a PR based on the code you mentioned~

I have already force-pushed (push -f) to this PR; please review the changes directly here~

