Long Running Processes

Abel users have a disk quota of 200 GB, but you may well have more data than that. This page summarises some good practices for running large jobs on Abel, with a specific example for WRF.

Use job scripts

From the Abel user's guide:

"When you log in to Abel, you are logged in to a login node. The login nodes are meant for logging in, copying files, editing, compiling, running short tests (no more than a couple of minutes), submitting jobs, checking job status, etc.

To run a job on the cluster, you submit a job script into a job queue, and the job is started when a suitable compute node (or group of nodes) is available. The job queue is managed by a queue system (scheduler and resource manager) called SLURM (SLURM's documentation page).

Note that it is not allowed to run jobs directly on the login nodes of Abel. If you fail to comply with this rule, your access to Abel might be suspended."

In short, all simulations should be submitted as jobs. Read more on these links:

uio.no/english/services/it/research/hpc/abel/help/user-guide/queue-system
uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts
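
As a minimal illustration (myjob.job is a placeholder name for your own job script), a job is submitted from a login node and followed with the standard SLURM commands:

sbatch myjob.job      # submit the job script; SLURM prints the job id
squeue -u $USER       # list your jobs and their state (pending/running)
scancel <jobid>       # cancel a job if something went wrong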


Work directory $SCRATCH

When submitting a job, it is good practice to copy all relevant files to a work directory (/work/...). From the Abel user's guide: "The name of the directory is stored in the environment variable $SCRATCH. As a general rule, all jobs should use the scratch directory ($SCRATCH) as its work directory. There are several reasons for using $SCRATCH:

   $SCRATCH is on a faster file system than user home directories.
   There is less risk of interfering with running jobs by accidentally modifying or deleting the jobs' input or output files.
   Temporary files are automatically cleaned up, because the scratch directory is removed when the job finishes.
   It avoids taking unneeded backups of temporary and partial files, because $SCRATCH is not backed up."


Read more here: uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts.html#Work_Directory
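
Inside a job script, the pattern looks roughly like this (the file and executable names are placeholders; jobsetup, chkfile, $SCRATCH and $SUBMITDIR are provided by the Abel job environment):

source /cluster/bin/jobsetup     # sets up the job environment, including $SCRATCH and chkfile
chkfile "output*"                # mark result files to be copied from $SCRATCH back to $SUBMITDIR when the job finishes
cp input.nml $SCRATCH            # copy the input files to the work directory
cd $SCRATCH                      # do the actual simulation in the work directory
./my_model.exe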

~/nobackup

All Abel users have a folder called nobackup. Directories named "nobackup" (and their entire sub-tree) are excluded from the daily backup, so use them as much as possible for data that does not need a backup; this also reduces the load of the daily backup.

Read more here: uio.no/english/services/it/research/hpc/abel/help/user-guide/data.html
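
For example (the directory names are only illustrative), large intermediate output can be kept out of the backup like this:

mkdir -p ~/nobackup/1992and1993            # everything below a directory called nobackup is skipped by the backup
mv ~/1992and1993/output/* ~/nobackup/1992and1993/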

Copy to vann (or other storage)

When a simulation is finished, the output should be copied to storage (for instance vann). For large datasets, or for many small files, rsync is preferred over scp. The following scripts copy results automatically from Abel to vann. Note that SSH keys between Abel and vann (sverdrup) must be set up for this to run unattended; otherwise every rsync call asks for a password.
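
Setting up the key is a one-time step. One common way to do it (the username and host below are taken from the scripts further down; adjust them to your own) is roughly:

ssh-keygen -t rsa                     # accept the defaults; an empty passphrase allows fully unattended copying
ssh-copy-id irenebn@sverdrup.uio.no   # install the public key on sverdrup
rsync -tv somefile irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/   # should now run without a password prompt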

Job script hrldas.job:

#!/bin/bash
# Job name:
#SBATCH --job-name=spinup
#
# Project:
#SBATCH --account=geofag
#
# Wall clock limit:
#SBATCH --time=20:0:0
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=2000M
#
# Number of tasks (cores):
#SBATCH --ntasks-per-node=1
#
# Number of nodes:
#SBATCH --nodes=1 
#
###SBATCH --constraint=amd
# Set up job environment
source /cluster/bin/jobsetup
ulimit -s unlimited
export LANG=en_US.UTF-8
export LC_ALL=en_US
# cp ../Noah_hrldas_beta $SCRATCH      # do not need this when using absolute file paths
# cp kopi.sh $SCRATCH
chkfile "RESTART*" "199*1200*LDASOUT*" # copy result files from $SCRATCH to $SUBMITDIR when job is finished
cp kopi.job $SCRATCH
cp namelist.input $SCRATCH
cp namelist.hrldas $SCRATCH
cp RESTART.1992123100_DOMAIN1 $SCRATCH
#
cd $SCRATCH                            # Need this to do the actual simulation in the work directory
export GRIB_ROOT=~/hrldas/hrldas-v3.6/HRLDAS_COLLECT_DATA/GRIB_TABLES
~/hrldas/hrldas-v3.6/HRLDAS_COLLECT_DATA/consolidate_grib.exe
#
~/hrldas/hrldas-v3.6/Run/Noah_hrldas_beta # IF COPY TO SCRATCH, ./Noah_hrldas_beta
sbatch kopi.job                      # start copying when the job is done (if relatively short job)
# scancel <jobid of kopi.job>        # remember to cancel the copy job.
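
The simulation job is submitted in the usual way; the files named in chkfile are copied back to the submit directory automatically when the job ends:

sbatch hrldas.job     # submit the simulation job
squeue -A geofag      # follow the jobs charged to the geofag account and note the job ids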


Copy script kopi.job:

#!/bin/bash
# Job name:
#SBATCH --job-name=kopi
#
# Project:
#SBATCH --account=geofag
#
# Wall clock limit:
#SBATCH --time=05:0:0
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=2000M
#
# Number of tasks (cores):
#SBATCH --ntasks-per-node=1
#
# Number of nodes:
#SBATCH --nodes=1 
#
###SBATCH --constraint=amd
# Set up job environment
source /cluster/bin/jobsetup
ulimit -s unlimited
export LANG=en_US.UTF-8
export LC_ALL=en_US
mpirun ./kopi_hrldas_sverdrup.sh

Copy script wrf_kopi_hjem.sh:

#!/bin/bash
# Script to copy to the home folder (a target directory must be created with mkdir before each copy)
# 1) create a directory on Sverdrup/vann for the outfiles, e.g. mkdir full1992and1993
cd "/usit/abel/u1/irenebn/nobackup/1992and1993/simulated_LDASOUT"  # Output files 199*LDASOUT* are created here
rsync -F -tv namelist.hrldas irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993
rsync -F -tv *.TBL           irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993
while true; do                        # note that this loop never ends!
   date
   ls -l 199*LDASOUT* >> ~/kopi-status.txt
   time rsync -F -tv ./199*LDASOUT* irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993 >>~/kopi-status.txt 2>&1
   rsync -F -tv ./RESTART*          irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993 >>~/kopi-status.txt 2>&1
   sleep 600                          # repeat the copy every 10 minutes (600 s)
done

Remember to run scancel to stop the copy script, because the loop is infinite.

scancel jobid     # where jobid is the number listed in front of your job in the output of squeue -A geofag