SCRIPTS SLURM prun_batch_slurm

# -*-Python-*-
# Created by bgriers at 04 Jul 2018  10:23

"""
This script submits many batch jobs as a prun.
The script is full-featured: it avoids overwriting existing runs and also handles continues and errors.

The high-level goal is to run the script `batch_slurm_for_prun` many times, each time with a unique runid.
However, we want to avoid re-running a case that has already been run,
and we want to support the option to continue a run, in a manner similar to a large simulation code.

The logic for managing this is included in this TUTORIAL example.
All runs are stored in
  TUTORIAL['OUTPUTS']['SLURM']['RUN_DB']['sim*']
where 'sim*' is 'sim0', 'sim1', ... for a total of "nprun" simulations
The simulation is simple: all it does is write the working directory of the job to a text file.
In a real case, however, this could be a very large simulation.

We will run these "simulations" either
1. Sequentially
   Here the output is collected as the runs finish
   This is done by parallel = False
2. In parallel, monitoring each job as it runs (a grid of job statuses pops up) and waiting for each to finish
   Here the output is collected as the runs finish
   This is done by parallel = True and wait = True
3. In parallel by running the sbatch and retaining the job information
   Here we store the slurm job information in the tree for later query about the job status and run directory
   Here the output is collected as the user requests
   This is done by parallel = True and wait = False

In all cases, we do not overwrite runs; they have to be deleted manually.
If there is no output, we run the job.
If there is output and we are not continuing the job, we skip over it.
If we are continuing a run, we either run it fresh if there is no output, or continue it if there is.


"""

defaultVars(parallel=True, wait=True, cont=False)

# Prepare the output tree structure for the workflow
root['OUTPUTS'].setdefault('SLURM', OMFITtree())
root['OUTPUTS']['SLURM'].setdefault('RUN_DB', OMFITtree())

# Set the total number of simulations we will perform
nprun = 10

# Set the maximum number of simulations that are to be run simultaneously
# This is useful for clusters that have maximum resource limits per user
# For example if a user is limited to 1024 cores and each job uses 128
# then this number should be 8
nsimultaneous = 5
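# The sizing rule in the comment above can be written down explicitly. This is a
# throwaway helper using the hypothetical numbers from the example (128-core jobs
# under a 1024-core per-user limit), not values from any real cluster:
#
# ```python
# def max_simultaneous_jobs(cores_per_user_limit, cores_per_job):
#     # Largest whole number of jobs that fit inside the per-user core limit
#     return cores_per_user_limit // cores_per_job
#
# print(max_simultaneous_jobs(1024, 128))  # -> 8
# ```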

# Prepare the output tree structure for each simulation
for i in range(nprun):
    root['OUTPUTS']['SLURM']['RUN_DB'].setdefault('sim{}'.format(i), OMFITtree())

# Create the list of simulation runids that will be passed to the job script
scratch['SBATCH_runids'] = []
for i in range(nprun):
    # Check if we have already run this simulation and we are not continuing
    # If we have run it then skip it
    # Else add it to the list of runs to do
    if (
        'pwd' in root['OUTPUTS']['SLURM']['RUN_DB']['sim{}'.format(i)]
        and isinstance(root['OUTPUTS']['SLURM']['RUN_DB']['sim{}'.format(i)]['pwd'], OMFITascii)
        and not cont
    ):
        printi('Simulation sim{} already run'.format(i))
        continue
    else:
        printi('Adding simulation sim{} to the SBATCH list'.format(i))
        scratch['SBATCH_runids'].append('sim{}'.format(i))

# If there is at least one simulation to run, then run them
#
# For a parallel run:
# This will return the output of each run in a list.
# The content of the output depends on how we have set parallel and wait
# Either this will return "data" in the form of an ASCII text file, or it will
# return the job manager that contains the SLURM_JOBID and directory that the
# simulation is running in.
#
# For a sequential run:
# We just run each run, forcing parallel=False and wait=True
#
if len(scratch['SBATCH_runids']) and parallel:
    runs = root['SCRIPTS']['SLURM']['batch_slurm_for_prun'].prun(
        len(scratch['SBATCH_runids']), nsimultaneous, 'result', runid=scratch['SBATCH_runids'], cont=cont, parallel=parallel, wait=wait
    )

    # Regardless of what the run returns (result or job manager), we need to store it
    for runid, run in zip(scratch['SBATCH_runids'], runs):
        root['OUTPUTS']['SLURM']['RUN_DB'][runid] = copy.deepcopy(run)
elif len(scratch['SBATCH_runids']) and not parallel:
    for runid in scratch['SBATCH_runids']:
        root['SCRIPTS']['SLURM']['batch_slurm_for_prun'].run(runid=runid, cont=cont, parallel=False, wait=True)
        root['OUTPUTS']['SLURM']['RUN_DB'][runid] = copy.deepcopy(root['OUTPUTS']['SLURM']['OUTPUTS'][runid])
        root['OUTPUTS']['SLURM']['OUTPUTS'].clear()
else:
    printi('No simulations to run!')
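# For readers unfamiliar with how the prun call distributes its keyword arguments,
# here is a stand-in sketch: step i receives the i-th element of any list-valued
# keyword, while scalars are broadcast to every step. This mimics (but is NOT) the
# OMFIT prun API; all names below are illustrative:
#
# ```python
# def prun_sketch(task, nsteps, **kw):
#     # Scatter list-valued keywords across steps; broadcast scalars
#     results = []
#     for i in range(nsteps):
#         step_kw = {k: (v[i] if isinstance(v, list) else v) for k, v in kw.items()}
#         results.append(task(**step_kw))
#     return results
#
# out = prun_sketch(lambda runid, cont: f'{runid}:cont={cont}',
#                   2, runid=['sim0', 'sim1'], cont=False)
# print(out)  # -> ['sim0:cont=False', 'sim1:cont=False']
# ```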

# If this is a parallel=True, wait=False run, then the user must afterwards run the
# script that downsyncs the simulation results, called 'collect_prun_batch_slurm'
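# A hedged sketch of what that collect step looks like: the JobInfo stand-in below
# holds the two pieces of information stored in the tree for each wait=False run
# (the SLURM job id and the run directory). The real script would query SLURM
# itself (e.g. via squeue/sacct); all names here are illustrative:
#
# ```python
# from dataclasses import dataclass
#
# @dataclass
# class JobInfo:
#     slurm_jobid: int  # stored SLURM_JOBID of the submitted job
#     workdir: str      # remote directory the simulation ran in
#     finished: bool    # would come from querying SLURM in a real workflow
#
# def ready_to_collect(run_db):
#     """Return the runids whose jobs have finished and can be downsynced."""
#     return [runid for runid, job in sorted(run_db.items()) if job.finished]
#
# run_db = {
#     'sim0': JobInfo(1001, '/scratch/user/sim0', True),
#     'sim1': JobInfo(1002, '/scratch/user/sim1', False),
# }
# print(ready_to_collect(run_db))  # -> ['sim0']
# ```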