[NLPL Task Force (A)] Support request
Vinit Ravishankar
vinitr at ifi.uio.no
Mon Jun 15 13:16:35 UTC 2020
Sorry for picking up this thread so late, I’ve only now got back to the fairseq-relevant project.
It looks like you’re right about FP16 on these GPUs :) After a happy four days (!) of extended debugging, I’ve hit upon a configuration that works reliably for fairseq, on multiple GPUs, with tensorboard support. Tensorboard (for fairseq) appears to require tensorflow, so my current virtualenv looks something like:
tensorboard==2.2.2
tensorboard-plugin-wit==1.6.0.post3
tensorboardX==2.0
tensorflow==2.2.0
tensorflow-estimator==2.2.0
tensorflow-gpu==2.2.0
torch==1.4.0
with fairseq built from source; this environment was loaded after purging all modules, so I know exactly which versions I’m getting.
From what I can tell, it is important to have tensorboard>=2.0.0 installed for tensorboard to log properly. In addition, two things I found critical for bug-free training on multiple GPUs were:
a) /uninstall/ apex; otherwise there is a fused_layer_norm_cuda error, presumably because the GPU does not support fp16. Simply disabling --fp16 is not enough; the program still enters an incompatible code path.
b) I had previously hit a very strange bug where the model sometimes trains successfully and sometimes fails with an OpenMP error: "OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361)". I’m around 99% certain that the training process runs or fails purely stochastically, without any change to the environment or code on my part, though I haven’t been able to verify this experimentally given how random the bug seems; I wonder if it has something to do with the GPU’s state at the time of job submission. Either way, adding `export KMP_INIT_AT_FORK=FALSE` prior to training appears to fix it for good.
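In case it’s useful, the two fixes above amount to something like the following in the job script — a sketch only; the actual fairseq-train invocation and its arguments are omitted:

```shell
# (a) remove apex so fairseq falls back to the plain LayerNorm code path
#     and never touches fused_layer_norm_cuda
pip uninstall -y apex || true

# (b) stop the Intel OpenMP runtime re-initializing at fork, which is
#     what triggers the sporadic "OMP: Error #13" assertion
export KMP_INIT_AT_FORK=FALSE
echo "KMP_INIT_AT_FORK=${KMP_INIT_AT_FORK}"

# fairseq-train ...   (training command as usual)
```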
Hope this is helpful!
– Vinit
> On 22 May 2020, at 18:29, Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> RuntimeError: "mul_cpu" not implemented for 'Half'
>
> that sounds a bit like PyTorch when running on cpu?
>
> https://github.com/pytorch/pytorch/issues/36318
>
> but, either way, it almost seems that FP16 support requires more
> modern gpus than what is available on Saga?
>
> https://fairseq.readthedocs.io/en/latest/getting_started.html#training-with-half-precision-floating-point-fp16
>
>> Tensorboard is also missing from nlpl-fairseq, so I do have to use my own virtualenv in addition to the module. I’ve attached my slurm file in case you notice any obvious issues.
>
> i will look into providing tensorboard as a stand-alone module; i
> understand it can be combined with either TensorFlow or PyTorch,
> right?
>
> cheers, oe
More information about the infrastructure mailing list