[NLPL Task Force (A)] CuDNN for fp16 training

Stephan Oepen oe at ifi.uio.no
Fri May 8 16:50:04 UTC 2020


hi vinit,

from a quick glance at the fairseq README, i found no mention of
cuDNN.  as far as i recall, PyTorch (unlike TensorFlow) does not
depend on an externally installed cuDNN either ... so why would you
expect to benefit from loading a cuDNN module?
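
one quick way to check what the PyTorch build in your virtualenv
already provides (just a sketch; the exact output depends on how
PyTorch was installed):

  import torch

  print(torch.__version__)                    # PyTorch version
  print(torch.version.cuda)                   # CUDA version the wheel was built against
  print(torch.backends.cudnn.is_available())  # whether the bundled cuDNN is usable
  print(torch.backends.cudnn.version())       # version of that bundled cuDNN

  # fp16 arithmetic itself mostly depends on the GPU, not on an extra cuDNN install
  if torch.cuda.is_available():
      print(torch.cuda.get_device_capability(0))  # (7, 0) or higher means tensor cores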

fairseq does mention NCCL as a prerequisite (for distributed training)
and, optionally, apex (for faster training).  from my (still partial :-)
understanding of apex, it is what provides the mixed-precision support,
so the cuDNN modules should not be needed, should they?
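
in case apex does turn out to be what you need, its documented usage is
roughly the following (an untested sketch, assuming apex is installed
into the same virtualenv; the model and optimizer here are only
placeholders):

  import torch
  from apex import amp

  # placeholder model and optimizer, just to show the wiring
  model = torch.nn.Linear(512, 512).cuda()
  optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

  # opt_level "O1" runs selected ops in fp16 and keeps the rest in fp32
  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  loss = model(torch.randn(8, 512).cuda()).sum()
  with amp.scale_loss(loss, optimizer) as scaled_loss:
      scaled_loss.backward()
  optimizer.step()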

i am assuming you have installed everything yourself into a local
virtualenv?  that will make it hard for others to debug, i fear.  in
principle, fairseq looks to me like something we might want to support
as a ready-to-run NLPL module (bundle), but i cannot promise i will be
able to look into that very soon.

cheers, oe

On Fri, May 8, 2020 at 6:10 PM Vinit Ravishankar <vinitr at ifi.uio.no> wrote:
>
> Hi! I’m trying to figure out how to enable half-precision (fp16) floating-point training in Python; I’m using the fairseq library [1], which has an fp16 flag, together with my own virtual environment (Python 3.7.3). I’m not using any modules; I haven’t needed them for regular multi-GPU work. Unfortunately, my program falls back to full-width floats because of a lack of fp16 support. That support was introduced in NVIDIA’s cuDNN, but loading any of the provided cuDNN modules results in a core dump the moment fairseq is imported. Is there a recommended way to use these modules? Thanks!
>
> – Vinit
>
> 1. https://github.com/pytorch/fairseq
>
>



