[NLPL Task Force (A)] [rt.uio.no #3406027] gpu usage on Abel

Stephan Oepen via RT hpc-drift at usit.uio.no
Thu May 16 06:13:13 UTC 2019


good morning,

> Did you observe this at one point in time or have you seen this ongoing over a longer period?

the picture looks unchanged since last night:

[oe@login-0-0 ~]$ squeue --partition=accel | grep michaelm
          26981933     accel   pe_con michaelm PD       0:00      1 (Priority)
          26981931     accel   pe_con michaelm  R 1-18:16:56      1 c19-15
          26981930     accel   pe_con michaelm  R 1-18:56:39      1 c19-16
          26981928     accel   st_con michaelm  R 1-19:24:21      1 c19-5
          26981929     accel   pe_con michaelm  R 1-19:24:21      1 c19-11
          26981926     accel   st_con michaelm  R 1-19:25:08      1 c19-3
          26981927     accel   st_con michaelm  R 1-19:25:08      1 c19-8
          26981924     accel   st_con michaelm  R 1-19:25:55      1 c19-14
[oe@login-0-0 ~]$ for i in 3 5 8 11 14 15 16; do ssh c19-${i} nvidia-smi | grep Default; done
| N/A   31C    P0    87W / 235W |    673MiB /  5699MiB |     79%      Default |
| N/A   19C    P8    18W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   30C    P0    88W / 235W |    673MiB /  5699MiB |     82%      Default |
| N/A   18C    P8    18W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   31C    P0    88W / 235W |    673MiB /  5699MiB |     85%      Default |
| N/A   18C    P8    17W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   37C    P0    95W / 235W |   1128MiB /  5699MiB |     83%      Default |
| N/A   23C    P8    17W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   34C    P0    92W / 235W |    673MiB /  5699MiB |     83%      Default |
| N/A   20C    P8    18W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   35C    P0    95W / 235W |   1128MiB /  5699MiB |     86%      Default |
| N/A   20C    P8    17W / 235W |     11MiB /  5699MiB |      0%      Default |
| N/A   31C    P0    90W / 235W |   1128MiB /  5699MiB |     85%      Default |
| N/A   19C    P8    18W / 235W |     11MiB /  5699MiB |      0%      Default |
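for what it is worth, this per-node check could be automated along
the following lines (a rough sketch; i am assuming the squeue and
nvidia-smi options below are available in the versions installed on
Abel):

  #!/bin/bash
  # report per-gpu utilization and memory on every node currently
  # running a job for this user on the accel partition; node names
  # are taken from the live squeue listing rather than hard-coded.
  for node in $(squeue --partition=accel --user=michaelm --states=RUNNING \
                       --noheader --format='%N' | sort -u); do
      echo "== ${node} =="
      ssh "${node}" nvidia-smi \
          --query-gpu=index,utilization.gpu,memory.used --format=csv
  done

run from a login node, this prints one block per node with both gpu
indices, which makes an idle second card immediately visible.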

earlier this week, when i first reported this observation, i kept
watching these nodes for several days.  as far as i recall, the jobs
were fairly long-running (four to five days), and during that period
i checked repeatedly and never saw both gpus active.  so, yes, my
impression is that the user requests two gpus (or otherwise an
exclusive node), but his code only utilizes one of them.  if that is
indeed the case, it is no doubt because he does not know better: he
currently has another job waiting in the gpu queue, so he himself
would benefit from avoiding that seemingly wasteful pattern :-).
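for reference, a minimal sketch of what his job script could request
instead, assuming the accel partition hands out gpus through the
standard slurm gres mechanism (the script body and time limit are
illustrative, not taken from his actual jobs):

  #!/bin/bash
  #SBATCH --partition=accel
  #SBATCH --gres=gpu:1          # one gpu, not two; frees the second card
  #SBATCH --time=5-00:00:00     # four to five days, as observed above
  #
  # with gres, slurm normally sets CUDA_VISIBLE_DEVICES to the
  # allocated device, so the code cannot accidentally grab both gpus.
  srun ./train.sh               # hypothetical payload script

requesting a single gpu would also let his queued job start sooner,
since the second card on each node becomes schedulable for others.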

oe
