[NLPL Task Force (A)] cuDNN for fp16 training
Vinit Ravishankar
vinitr at ifi.uio.no
Tue May 12 19:25:41 UTC 2020
Hi! Sorry for the delay; I’d been trying (unsuccessfully) to get it to work on the ML nodes for most of yesterday. I am definitely still interested, and I’m hoping GPU capacity looks nicer after a certain deadline :-)
I’ll try your module ASAP; Saga seems to be having issues right now and won’t let me look at the queue. I can confirm good health, hopefully on a mostly permanent basis now. Thanks!
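A sketch of what that test could look like, assuming the `nlpl-fairseq/0.9.0/3.7` module named in Stephan's mail below; the dataset path and most training flags here are placeholders, though `--fp16` is fairseq's documented half-precision switch:

```shell
# load the trial module instead of a private virtualenv
module purge
module --ignore-cache load nlpl-fairseq/0.9.0/3.7

# hypothetical training run; data-bin/my-corpus is a placeholder path
fairseq-train data-bin/my-corpus \
    --arch transformer \
    --optimizer adam \
    --max-tokens 4096 \
    --fp16
```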
– Vinit
> On 12 May 2020, at 16:41, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi again, vinit:
>
> are you still interested in fairseq? it does look like an interesting
> package, though i am not quite sure we really have sufficient gpu
> capacity for it :-).
>
> in any case, the metacenter staff helped with the right NCCL version,
> and i created a trial module (also including apex) which at least
> appears to have some basic functionality:
>
> $ module purge; module --ignore-cache load nlpl-fairseq/0.9.0/3.7
>
> i would be grateful for a little more testing and hopefully a report
> of good health?
>
> cheers, oe
>
>
> On Fri, May 8, 2020 at 6:50 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>>
>> hi vinit,
>>
>> from a quick glance at the README for fairseq, i found no mention of
>> cuDNN? for all i recall, PyTorch (unlike TensorFlow) is independent
>> of cuDNN too ... so why would you expect to benefit from cuDNN?
>>
>> fairseq does mention NCCL as a prerequisite (for distributed training)
>> and optionally apex (for faster training). from my (still partial :-)
>> understanding of apex, it provides mixed-precision support, so cuDNN
>> should not be necessary, right?
>>
>> i am assuming you have everything installed yourself into a local
>> virtualenv? that will make it hard for others to debug, i fear. in
>> principle, fairseq to me looks like something we might want to support
>> as a ready-to-run NLPL module (bundle), but i cannot promise i will be
>> able to look into that so quickly.
>>
>> cheers, oe
>>
>> On Fri, May 8, 2020 at 6:10 PM Vinit Ravishankar <vinitr at ifi.uio.no> wrote:
>>>
>>> Hi! I’m trying to figure out how to enable half-precision floating points in Python. I’m using the fairseq library [1], which has an fp16 flag, in conjunction with my own virtual environment (Python 3.7.3). I’m not using any modules; I haven’t needed them for regular multi-GPU work. Unfortunately, my program falls back to full-width floats because of a lack of fp16 support. This support was introduced in NVIDIA’s cuDNN, but loading any of the provided cuDNN modules results in a core dump the moment fairseq is loaded. Is there any recommended way to use these modules? Thanks!
>>>
>>> – Vinit
>>>
>>> 1. https://github.com/pytorch/fairseq
>>>
>>>
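As background on the mixed-precision discussion above: fp16 carries only 10 mantissa bits (roughly three decimal digits of precision) and a largest finite value of 65504, which is why apex-style mixed-precision training keeps a master copy of the weights in fp32 rather than running purely in fp16. Those limits can be seen with nothing but the Python standard library, using the half-precision `struct` format (no fairseq or CUDA needed):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# fp16 keeps ~3 decimal digits: 0.1 is not representable exactly
print(to_fp16(0.1))      # 0.0999755859375

# the largest finite fp16 value is 65504
print(to_fp16(65504.0))  # 65504.0

# anything much larger cannot be packed at all
try:
    struct.pack('<e', 100000.0)
except OverflowError:
    print("100000.0 overflows fp16")
```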
More information about the infrastructure mailing list