[NLPL Task Force (A)] cuDNN for fp16 training
Vinit Ravishankar
vinitr at ifi.uio.no
Tue May 12 19:25:41 UTC 2020
Hi! Sorry for the delay; I’d been trying (unsuccessfully) to get it to work on the ML nodes for most of yesterday. I am definitely still interested, and I’m hoping GPU capacity looks nicer after a certain deadline :-)
I’ll try your module ASAP; Saga seems to be having issues right now and won’t let me look at the queue. I can confirm good health, hopefully on a mostly permanent basis now. Thanks!
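A sketch of what that test could look like, assuming the `nlpl-fairseq/0.9.0/3.7` module named in Stephan's mail below; the dataset path and most training flags here are placeholders, though `--fp16` is fairseq's documented half-precision switch:

```shell
# load the trial module instead of a private virtualenv
module purge
module --ignore-cache load nlpl-fairseq/0.9.0/3.7

# hypothetical training run; data-bin/my-corpus is a placeholder path
fairseq-train data-bin/my-corpus \
    --arch transformer \
    --optimizer adam \
    --max-tokens 4096 \
    --fp16
```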
– Vinit
> On 12 May 2020, at 16:41, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi again, vinit:
>
> are you still interested in fairseq? it does look like an interesting
> package, though i am not quite sure we really have sufficient gpu
> capacity for it :-).
>
> in any case, the metacenter staff helped with the right NCCL version,
> and i created a trial module (also including apex) which at least
> appears to have some basic functionality:
>
> $ module purge; module --ignore-cache load nlpl-fairseq/0.9.0/3.7
>
> i would be grateful for a little more testing and hopefully a report
> of good health?
>
> cheers, oe
>
>
> On Fri, May 8, 2020 at 6:50 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>>
>> hi vinit,
>>
>> from a quick glance at the README for fairseq, i found no mention of
>> cuDNN? for all i recall, PyTorch (unlike TensorFlow) is independent
>> of cuDNN too ... so why would you expect to benefit from cuDNN?
>>
>> fairseq does mention NCCL as a prerequisite (for distributed training)
>> and optionally apex (for faster training). from my (still partial :-)
>> understanding of apex, it provides mixed-precision support, so cuDNN
>> should not be necessary, right?
>>
>> i am assuming you have everything installed yourself into a local
>> virtualenv? that will make it hard for others to debug, i fear. in
>> principle, fairseq to me looks like something we might want to support
>> as a ready-to-run NLPL module (bundle), but i cannot promise i will be
>> able to look into that so quickly.
>>
>> cheers, oe
>>
>> On Fri, May 8, 2020 at 6:10 PM Vinit Ravishankar <vinitr at ifi.uio.no> wrote:
>>>
>>> Hi! I’m trying to figure out how to enable half-precision floating points in Python. I’m using the fairseq library [1], which has an fp16 flag, in conjunction with my own virtual environment (Python 3.7.3). I’m not using any modules; I haven’t needed them for regular multi-GPU work. Unfortunately, my program falls back to full-width floats because of a lack of fp16 support. This support was introduced in NVIDIA’s cuDNN, but loading any of the provided cuDNN modules results in a core dump the moment fairseq is loaded. Is there any recommended way to use these modules? Thanks!
>>>
>>> – Vinit
>>>
>>> 1. https://github.com/pytorch/fairseq
>>>
>>>
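As background on the mixed-precision discussion above: fp16 carries only 10 mantissa bits (roughly three decimal digits of precision) and a largest finite value of 65504, which is why apex-style mixed-precision training keeps a master copy of the weights in fp32 rather than running purely in fp16. Those limits can be seen with nothing but the Python standard library, using the half-precision `struct` format (no fairseq or CUDA needed):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# fp16 keeps ~3 decimal digits: 0.1 is not representable exactly
print(to_fp16(0.1))      # 0.0999755859375

# the largest finite fp16 value is 65504
print(to_fp16(65504.0))  # 65504.0

# anything much larger cannot be packed at all
try:
    struct.pack('<e', 100000.0)
except OverflowError:
    print("100000.0 overflows fp16")
```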
More information about the infrastructure mailing list