[NLPL Task Force (A)] [rt.uio.no #3406027] gpu usage on Abel
Stephan Oepen
oe at ifi.uio.no
Thu May 16 06:12:56 UTC 2019
good morning,
> Did you observe this at one point in time or have you seen this ongoing over a longer period?
the picture looks unchanged since last night:
[oe at login-0-0 ~]$ squeue --partition=accel|grep michaelm
26981933 accel pe_con michaelm PD 0:00 1 (Priority)
26981931 accel pe_con michaelm R 1-18:16:56 1 c19-15
26981930 accel pe_con michaelm R 1-18:56:39 1 c19-16
26981928 accel st_con michaelm R 1-19:24:21 1 c19-5
26981929 accel pe_con michaelm R 1-19:24:21 1 c19-11
26981926 accel st_con michaelm R 1-19:25:08 1 c19-3
26981927 accel st_con michaelm R 1-19:25:08 1 c19-8
26981924 accel st_con michaelm R 1-19:25:55 1 c19-14
[oe at login-0-0 ~]$ for i in 3 5 8 11 14 15 16; do ssh c19-${i} nvidia-smi | grep Default; done
| N/A 31C P0 87W / 235W | 673MiB / 5699MiB | 79% Default |
| N/A 19C P8 18W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 30C P0 88W / 235W | 673MiB / 5699MiB | 82% Default |
| N/A 18C P8 18W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 31C P0 88W / 235W | 673MiB / 5699MiB | 85% Default |
| N/A 18C P8 17W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 37C P0 95W / 235W | 1128MiB / 5699MiB | 83% Default |
| N/A 23C P8 17W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 34C P0 92W / 235W | 673MiB / 5699MiB | 83% Default |
| N/A 20C P8 18W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 35C P0 95W / 235W | 1128MiB / 5699MiB | 86% Default |
| N/A 20C P8 17W / 235W | 11MiB / 5699MiB | 0% Default |
| N/A 31C P0 90W / 235W | 1128MiB / 5699MiB | 85% Default |
| N/A 19C P8 18W / 235W | 11MiB / 5699MiB | 0% Default |
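for what it is worth, a small sketch of how one could summarize lines like the above into an active-vs-idle count (the awk field positions are an assumption based on this particular nvidia-smi output format; the sample lines are copied from above):

```shell
# sketch: count active vs idle gpus from nvidia-smi summary lines
# (assumption: utilization is the third-to-last whitespace field, e.g. "79%")
sample='| N/A 31C P0 87W / 235W | 673MiB / 5699MiB | 79% Default |
| N/A 19C P8 18W / 235W | 11MiB / 5699MiB | 0% Default |'
summary=$(printf '%s\n' "$sample" |
  awk '{ util = $(NF-2); sub(/%/, "", util)
         if (util + 0 > 0) active++; else idle++ }
       END { printf "active=%d idle=%d", active, idle }')
echo "$summary"
```

run over all seven nodes, this would make the one-gpu-busy, one-gpu-idle pattern immediately visible per node.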
earlier this week, when i first reported this observation, i kept
looking at the nodes for several days. as far as i recall, these jobs
were fairly long-running (four to five days), and during that period
i checked repeatedly and never saw both gpus active. so, yes, my
impression is that the user requests two gpus (or otherwise an exclusive
node), but his code only utilizes one of them. if that is indeed the
case, it is no doubt because he does not know better: he currently
has another job waiting in the gpu queue, hence would himself benefit
from avoiding that seemingly wasteful pattern :-).
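if the code really only ever drives one gpu, a minimal fix on the user's side would be to request just one in the job script (a sketch, not his actual submission script; the partition name is taken from the squeue output above, and the gres syntax is the standard slurm one):

```shell
#SBATCH --partition=accel
#SBATCH --gres=gpu:1    # request a single gpu rather than an exclusive node
```

that would free the idle gpu for other jobs in the queue, including his own pending one.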
oe