Long Running Processes
Abel users have a disk capacity of 200 GB, but you might have more data than this. This page summarises some good practices for running large jobs on Abel, with a specific example for WRF.
Use job scripts
From the Abel user's guide:
"When you log in to Abel, you are logged in to a login node. The login nodes are meant for logging in, copying files, editing, compiling, running short tests (no more than a couple of minutes), submitting jobs, checking job status, etc.
To run a job on the cluster, you submit a job script into a job queue, and the job is started when a suitable compute node (or group of nodes) is available. The job queue is managed by a queue system (scheduler and resource manager) called SLURM (SLURM's documentation page).
Note that it is not allowed to run jobs directly on the login nodes of Abel. If you fail to comply with this rule, your access to Abel might be suspended."
In short, all simulations should be submitted as jobs. Read more on these links:
uio.no/english/services/it/research/hpc/abel/help/user-guide/queue-system
uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts
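As a quick illustration (a minimal sketch only, not an official Abel template; the job name, resource limits and program name are placeholders), a job script is a short shell script with #SBATCH directives at the top, submitted with sbatch:

#!/bin/bash
# Minimal example job script (placeholder values; adjust account, time and memory to your needs)
#SBATCH --job-name=myjob
#SBATCH --account=geofag
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2000M
#SBATCH --ntasks=1

# Set up job environment (required on Abel)
source /cluster/bin/jobsetup

./my_program        # placeholder for the actual executable

Submit it with sbatch myjob.sh and check its status with squeue -u $USER.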
Work directory $SCRATCH
When submitting a job, it is good practice to copy all relevant files to a work directory (/work/...). "The name of the directory is stored in the environment variable $SCRATCH. As a general rule, all jobs should use the scratch directory ($SCRATCH) as its work directory.
There are several reasons for using $SCRATCH:
- $SCRATCH is on a faster file system than user home directories.
- There is less risk of interfering with running jobs by accidentally modifying or deleting the jobs' input or output files.
- Temporary files are automatically cleaned up, because the scratch directory is removed when the job finishes.
- It avoids taking unneeded backups of temporary and partial files, because $SCRATCH is not backed up."
Read more here: uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts.html#Work_Directory
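In practice this means the same pattern used in the full hrldas.job script further down: copy the input files to $SCRATCH, mark the result files with chkfile so they are copied back to the submit directory, and run the model from $SCRATCH. A shortened sketch (file names are placeholders):

# Inside a job script, after source /cluster/bin/jobsetup:
cp namelist.input $SCRATCH   # copy input files to the work directory
chkfile "RESULT*"            # copy files matching RESULT* back to $SUBMITDIR when the job finishes
cd $SCRATCH                  # run the simulation from the fast work directory
./model.exe                  # placeholder executable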
~/nobackup
All Abel users have a folder called nobackup. "Directories with name "nobackup" (and all the sub-tree) will be excluded from the daily backup (to be used as much as possible to help reducing the task of the daily backup !)"
Read more here: uio.no/english/services/it/research/hpc/abel/help/user-guide/data.html
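For example (the directory and file names below are only illustrative), large intermediate output can be moved into the nobackup tree so it is skipped by the nightly backup:

mkdir -p ~/nobackup/1992and1993          # anything below ~/nobackup is excluded from backup
mv ~/1992and1993/large_output/* ~/nobackup/1992and1993/   # example paths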
Copy to vann (or other storage)
When a simulation is finished, the output should be copied to storage (for instance vann). For large datasets, or many small files, the command rsync is preferred over scp. The following scripts show automatic copying from Abel to vann. Note that SSH keys between vann (sverdrup) and Abel must be set up for this to run unattended; otherwise rsync asks for a password every time.
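Setting up the key is done once from Abel with the standard OpenSSH tools; the username, host and target path below are the same examples used in the scripts and should be replaced with your own:

ssh-keygen -t rsa                      # accept the defaults; leave the passphrase empty for unattended copying
ssh-copy-id irenebn@sverdrup.uio.no    # copies the public key to sverdrup so rsync/ssh no longer ask for a password
rsync -av testfile irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/   # quick test that passwordless copying works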
Job script hrldas.job:
#!/bin/bash
# Job name:
#SBATCH --job-name=spinup
#
# Project:
#SBATCH --account=geofag
#
# Wall clock limit:
#SBATCH --time=20:0:0
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=2000M
#
# Number of tasks (cores):
#SBATCH --ntasks-per-node=1
#
# Number of nodes:
#SBATCH --nodes=1
#
###SBATCH --constraint=amd

# Set up job environment
source /cluster/bin/jobsetup
ulimit -s unlimited
export LANG=en_US.UTF-8
export LC_ALL=en_US

# cp ../Noah_hrldas_beta $SCRATCH   # do not need this when using absolute file paths
# cp kopi.sh $SCRATCH

chkfile "RESTART*" "199*1200*LDASOUT*"   # copy result files from $SCRATCH to $SUBMITDIR when job is finished

cp kopi.job $SCRATCH
cp namelist.input $SCRATCH
cp namelist.hrldas $SCRATCH
cp RESTART.1992123100_DOMAIN1 $SCRATCH

cd $SCRATCH   # Need this to do the actual simulation in the work directory

export GRIB_ROOT=~/hrldas/hrldas-v3.6/HRLDAS_COLLECT_DATA/GRIB_TABLES
~/hrldas/hrldas-v3.6/HRLDAS_COLLECT_DATA/consolidate_grib.exe
~/hrldas/hrldas-v3.6/Run/Noah_hrldas_beta
# IF COPY TO SCRATCH, ./Noah_hrldas_beta

sbatch kopi.job   # start copying when the job is done (if relatively short job)
# scancel <jobid to kopi.job>   # remember to cancel the copy job
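With this setup the whole chain is started by submitting the main job from the directory containing the script (file names follow the example above):

sbatch hrldas.job   # submit the simulation job; it submits kopi.job itself when the model run finishes
squeue -u $USER     # check that the job is queued or running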
Copy script kopi.job
#!/bin/bash
# Job name:
#SBATCH --job-name=kopi
#
# Project:
#SBATCH --account=geofag
#
# Wall clock limit:
#SBATCH --time=05:0:0
#
# Max memory usage per task:
#SBATCH --mem-per-cpu=2000M
#
# Number of tasks (cores):
#SBATCH --ntasks-per-node=1
#
# Number of nodes:
#SBATCH --nodes=1
#
###SBATCH --constraint=amd

# Set up job environment
source /cluster/bin/jobsetup
ulimit -s unlimited
export LANG=en_US.UTF-8
export LC_ALL=en_US

mpirun ./kopi_hrldas_sverdrup.sh
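Since kopi.job launches the copy script with a relative path, the script must be executable and present in the directory the job is submitted from (an assumption based on the ./ path in the job script):

chmod +x kopi_hrldas_sverdrup.sh   # make the copy script executable before submitting kopi.job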
Copy shell script wrf_kopi_hjem.sh:
#!/bin/bash
# Script to copy to the home folder (must mkdir the target directory to copy to each time)
# 1) create a directory on Sverdrup/vann for the outfiles, e.g. mkdir full1992and1993

cd "/usit/abel/u1/irenebn/nobackup/1992and1993/simulated_LDASOUT"   # Output files 199*LDASOUT* are created here

rsync -F -tv namelist.hrldas irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah-mp_1993
rsync -F -tv *.TBL irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993

while true; do
    date                                # note that this loop never ends!
    ls -l 199*LDASOUT* >> ~/kopi-status.txt
    time rsync -F -tv ./199*LDASOUT* irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993 >> ~/kopi-status.txt 2>&1
    rsync -F -tv ./RESTART* irenebn@sverdrup.uio.no:/mn/vann/metos-d2/irenebn/HRLDAS/Resultater/noah_mp_1993 >> ~/kopi-status.txt 2>&1
    sleep 600                           # the script is activated once every 10 minutes (600 s)
done
Remember to run scancel to stop the copy script, because the loop is infinite.
scancel jobid   # where jobid is the number shown in front of your job in the output of squeue -U geofag