[NLPL Task Force (A)] Heavy IO load from your jobs

Stephan Oepen oe at ifi.uio.no
Fri Nov 23 21:26:30 UTC 2018


hi again, colleagues,

i am back to running some of these jobs but will limit myself to a
maximum of fifteen active ones, at any point in time.

thinking about it a little more: the python (with TensorFlow et al.)
start-up is actually somewhat i/o intensive, probably in part because
our NLPL project modularizes its python add-ons, so that there are a
handful of entries on PYTHONPATH to search through, and in part
because TensorFlow appears to depend on a large number of dynamically
loaded shared object files.

$ module purge
$ module load python3/3.5.0 nlpl-scipy nlpl-tensorflow
$ type python3
python3 is /projects/nlpl/software/tensorflow/1.11/bin/python3
$ strace -f python3 -c "import tensorflow as tf; print(tf.__version__);" > /tmp/tf.log 2>&1
$ egrep '\] (stat|open)' /tmp/tf.log | wc -l
12417

most of these calls go to ‘/projects/nlpl/’, which i believe is merely
NFS-mount(8)ed.
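
for the record, a quick (and admittedly crude) way of breaking those
calls down by top-level target directory, assuming the strace log from
above is still in ‘/tmp/tf.log’:

$ egrep '\] (stat|open)' /tmp/tf.log | egrep -o '"/[^"]*"' | tr -d '"' \
    | awk -F/ '{ print "/" $2 "/" $3 }' | sort | uniq -c | sort -rn | head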

as it gets going, each job script will successively create thousands
of quite short-lived TensorFlow processes, across eight cores in
parallel.  at the moment, python process creation alone is in fact
somewhat costly:

$ /usr/bin/time -v python3 -c "import tensorflow as tf; print(tf.__version__);"
1.10.1
        Command being timed: "python3 -c import tensorflow as tf; print(tf.__version__);"
        User time (seconds): 3.09
        System time (seconds): 5.25
        Percent of CPU this job got: 75%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.05
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 174760
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 263
        Minor (reclaiming a frame) page faults: 63934
        Voluntary context switches: 8134
        Involuntary context switches: 268240
        Swaps: 0
        File system inputs: 51669
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

for what i thought should be a light-weight script interpreter, this
feels a bit sad to me, actually!  so could it be that the excessive
i/o strain you observed was simply due to my (unnecessarily) frequent
creation of short-lived python processes?  if so, i predict that you
should see i/o load rise sharply from around 21:30 tonight, and that
it should then almost disappear once my jobs get to doing interesting
TensorFlow work, which currently should begin after a good hour of run
time.  is there some way for me to watch i/o load on the various Abel
file systems, to test these emerging theories?
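
(the closest client-side approximation i can think of myself would be
to watch the NFS client counters on a login or compute node, roughly
along the lines below; proper per-file-system numbers from the server
side would of course be much more informative.)

$ nfsstat -c                   # cumulative NFS client RPC call counts
$ watch -d -n 10 nfsstat -c    # highlight what changes every ten seconds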

if any of the above is actually relevant, how much of a difference
would it make if the NLPL python modules were on the ‘/cluster/’ file
system rather than on ‘/projects/’?

good night, oe


On Fri, Nov 23, 2018 at 4:27 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi marcin, bjørn-helge, and colleagues,
>
> no worries about killing these jobs; they were not critical.  and my apologies for putting undue strain on the system!
>
> but i am surprised about the heavy i/o load from these jobs and would like to try and get more feedback.  in a nutshell, each job repeatedly spawns eight (python, regrettably) processes that run the same code: read two smallish files from my home directory to train and evaluate a TensorFlow model of medium complexity.  each of these training and evaluation runs (on a single core) takes between five and thirty minutes, most of that inside TensorFlow.  upon completion, each process creates four small files (so, admittedly, over the course of the night from wednesday to thursday, i.e. some ten hours, i had created some 12,000 new files in my home directory; i would like to think that the file system is up to that).  unless there is i/o load from the TensorFlow part (which i would find surprising), i honestly fail to see how these jobs could be so hard on the file system, even with about one hundred of them running at the same time?
>
> could i invite one of you to take a look at the code?  it resides here:
>
> [oe at login-0-0 inf5820]$ pwd
> /usit/abel/u1/oe/src/starsem/inf5820
>
> ‘grid.slurm’ is the job script that i would like to run on at least dozens, preferably hundreds of nodes in parallel.  this is for a systematic study of the effects and interactions of a wide range of hyper-parameters, so my vision would be to explore 64,000 different combinations in a first round of experimentation, which i estimate should come to around 10,000 to 15,000 core hours.
>
> as i am about to send this message, maybe an emerging theory: the way my processes split the work among themselves is file-based: for each configuration, a new ‘.score’ file is created, and before doing any actual work each process checks for the existence of that file.  that means that by thursday morning, the first 3,000 or so configurations had been completed, such that each new instance of ‘grid.slurm’, when it started, would look for (and find) a couple of thousand files and immediately exit.  but process start-up (loading the TensorFlow et al. dynamic libraries) is slow, so i estimate that at least several seconds would pass between each successive on-disk search for a specific ‘.score’ file.  still, i may be naive about parallel file systems, but could this mildly rapid succession of stat(2) calls be the root cause of the i/o load you observed?
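>
> in (simplified, illustrative) python, that per-configuration check
> amounts to something like the sketch below; score_file() and
> train_and_evaluate() merely stand in for the actual code in
> ‘grid.slurm’ and friends:
>
> import os
>
> def score_file(configuration):
>     # one ‘.score’ file per hyper-parameter combination (path is illustrative)
>     return os.path.join(os.path.expanduser("~"), configuration + ".score")
>
> def train_and_evaluate(configuration):
>     # placeholder for the actual (single-core) TensorFlow training run
>     return "0.0\n"
>
> def maybe_run(configuration):
>     path = score_file(configuration)
>     if os.path.exists(path):   # stat(2): this configuration is already done
>         return
>     with open(path, "w") as stream:
>         stream.write(train_and_evaluate(configuration))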
>
> any advice will be appreciated; in unrelated news, god helg!  oe
>
>
> On Thu, 22 Nov 2018 at 10:36 Marcin Krotkiewski <marcin.krotkiewski at usit.uio.no> wrote:
>>
>> Hi, Stephan
>>
>> Some bad news - some 100 of your jobs have started today, essentially
>> killing the /cluster file system. We were forced to cancel most of them,
>> leaving around 15 running. This is more or less what we can handle.
>>
>> Please have a look at the damage and re-submit the jobs as appropriate.
>> But maybe in smaller quantities...
>>
>> Sorry for this, but Abel was really struggling.
>>
>> Regards,
>>
>> Marcin
>>
>> --
>> Marcin Krotkiewski
>> USIT, University of Oslo
>> Pb 1048 Blindern
>> 0316 Oslo Norway
>>



