Running CanESM on ECCC HPC systems
This document was adapted from the contents of the “Running_readme.md”
documentation in the CanESM repository.
Introduction
This document describes the five basic steps required to run
CanESM/CanAM from Version Control. If this is your first run, see the
section “One time setup” below. For instructions on how
to modify the code see the developers guide.
Runid Guidance
As part of creating a run, users must select a “runid”, or run identifier, that will be used to differentiate their runs from others. Runids are generally free form, with one restriction: they must contain only lower case alphanumeric characters ([a-z] and [0-9]), the hyphen “-”, and the period “.”.
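The character restrictions above can be checked before a runid is used. The following is a minimal sketch: `is_valid_runid` is a hypothetical helper, not part of the CanESM tooling, and it only encodes the character rules stated above (it does not check whether the runid is already in use).

```shell
# Hypothetical helper (not part of the CanESM tooling): succeeds only when the
# candidate runid is non-empty and uses only [a-z], [0-9], "-" and ".".
is_valid_runid() {
    # LC_ALL=C keeps the a-z range restricted to ASCII lower case letters.
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9.-]+$'
}

# Example usage with a made-up runid.
if is_valid_runid "my-run.01"; then
    echo "runid ok"
else
    echo "runid contains characters that are not allowed" >&2
fi
```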
Setting up a run
To set up a run, follow the steps below:
Call setup-canesm
In the directory where you want the “run directory” created and set up, call setup-canesm, specifying at a minimum the runid and the version of the code to use (see setup-canesm -h for additional information):
setup-canesm ver=ABC config=ESM runid=XXX ## use config=AMIP/OMIP if you want an AGCM or Ocean only run
where
ver is a tag/commit/SHA1 checksum from the CanESM “super repo” (see the Model versions section below; you could also use a branch name such as develop_canesm),
config defines the high level config you’d like (can be ESM, OMIP, or AMIP),
runid is the unique runid.
Note
If you do not have access to setup-canesm, see the one time setup documentation below.
If you want to set up the run from a user fork, you will need to provide the repo= argument. See information here for more info.
If you plan on doing development using the code cloned for this run, be sure to refer to the quickstart guide on modifying CanESM, noting that the CanESM source code has been cloned into CanESM_source_link within the setup directory.
Warning
Do not re-use runids of existing runs, even if setting up from a different account!
Source the run-time environment
After running setup-canesm, you will see output like:
Setup complete! Now:
cd your-runid
source env_setup_file
to set the proper environment for this run
Follow these directions to get the proper run-time environment, which will place the proper scripts on your $PATH.
Note
If you log out, you will need to source this file again when you return.
Set your configuration settings
Edit the file canesm.cfg to set dates and run options, as documented in that file, and then generate the downstream config files by executing config-canesm, i.e.
vi canesm.cfg
... # Change start and end dates, etc. runid has been set already by setup-canesm
config-canesm
This will produce the config files that get used by the run and store them in a local config directory.
Compile the executables
Executables are compiled interactively by the user. Upon sourcing the run’s environment file, compile-canesm.sh will be placed on your $PATH. Simply execute it to compile:
compile-canesm.sh
The compilation will take a few minutes. See compile-canesm.sh -h for additional options, but the default behaviour will be to compile the executables in the source repo, and link them back to $EXEC_STORAGE_DIR (defined in canesm.cfg), which by default is a local executables directory.
Save restarts to the run’s file database
Sourcing the run’s environment will also add save_restart_files and tapeload_rs to your $PATH. Use these scripts to retrieve/set up the input restarts and save them to the run’s local file database:
save_restart_files
or
tapeload_rs
where save_restart_files looks for the specified restarts in the databases defined by DATAPATH_DB in the environment, and tapeload_rs looks for them on the tape archives.
Note
These scripts must be run on the hall you plan to run the model on (defined by compute_system).
Note
By default, namelists from restarts are not used. If you would like to use them, see save_restart_files -h (or tapeload_rs -h) for information on how to do so.
Submit the job
To launch the experiment, you have two options:
Via the command line
expbegin -e ${SEQ_EXP_HOME} -d YYYYMMDDhh # where YYYYMMDDhh should be replaced by today's date (e.g. 2022062218)
Via tlclient
log onto tlclient
in a terminal, navigate to the run directory and source env_setup_file
run xflow, set the experiment date, and click the green checkmark beside it
right click on the /canesm node and select “Submit”
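For the command line option, the date argument can be generated rather than typed by hand. This is a sketch under the assumption that the current local date and hour are the desired experiment date; `exp_date` is an illustrative variable name, and the command is only printed for review, not executed.

```shell
# Build the YYYYMMDDhh string from the current local date and hour.
exp_date=$(date +%Y%m%d%H)

# Print the command for review instead of running it. Run it yourself once the
# run environment has been sourced, so that SEQ_EXP_HOME is defined.
echo "expbegin -e \${SEQ_EXP_HOME} -d ${exp_date}"
```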
Continuing an existing run
If a run has crashed before its scheduled end
If a run crashes, the normal recovery procedure is to use xflow on tlclient to
figure out what job failed, and attempt a re-submission of the individual job node
(right click on the offending node and select the “Submit and Continue” option). If this
doesn’t work, users should then select the “Listing” menu and then “Latest Abort Listing”
to see the output from the job, and debug the problem. Once the problem is fixed, simply
resubmit the job and the run should continue.
Using a Terminal
It should be noted that it is also possible to determine what job has failed by looking
under $WRK_DIR/sequencer/sequencing/status (using tree works well for this) and
looking for *abort.stop files.
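As an alternative to browsing with tree, the abort markers can be listed directly with find. This is a minimal sketch: `list_abort_files` is a hypothetical helper name, and it relies only on the directory layout and the *abort.stop naming described above.

```shell
# Hypothetical helper: list *abort.stop status files below a given directory.
list_abort_files() {
    dir=$1
    find "$dir" -name '*abort.stop' | sort
}

# Example usage, only attempted when the status directory exists.
status_dir="${WRK_DIR:-.}/sequencer/sequencing/status"
if [ -d "$status_dir" ]; then
    list_abort_files "$status_dir"
fi
```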
Once the offending job is identified, users should be able to find a corresponding, compressed,
*abort* “listing file” (output file) under $WRK_DIR/sequencer/listings/latest/canesm.
Given that these files are typically compressed via gzip, users can inspect these
via something like
gunzip -c sequencer/listings/latest/path/to/desired/listing_file | less
or via some other editor that can open gzip files natively.
Then, after the problem is fixed, the offending job can be resubmitted via something like
maestro -n /path/to/job/node/in/maestro/suite -s submit -f continue [-l loop_name=ITERATION_NUM ]
For example, if a user wishes to resubmit the model_run job for the 2nd iteration of the
model_loop, it would look like:
maestro -n /canesm/model/model_loop/model_run -s submit -f continue -l model_loop=2
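To avoid typos when assembling this command, it can be built from the node path and an optional loop argument. The sketch below uses a hypothetical helper, `print_resubmit_cmd`, which only prints the command using the flags shown above; it does not call maestro.

```shell
# Hypothetical helper: print (not run) the maestro resubmission command for a
# node, with an optional loop argument such as model_loop=2.
print_resubmit_cmd() {
    node=$1
    loop_arg=${2:-}
    if [ -n "$loop_arg" ]; then
        echo "maestro -n $node -s submit -f continue -l $loop_arg"
    else
        echo "maestro -n $node -s submit -f continue"
    fi
}

# Reproduces the example above.
print_resubmit_cmd /canesm/model/model_loop/model_run model_loop=2
```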
Note
While users can monitor/relaunch runs purely in the terminal, it requires an advanced
understanding of the maestro sequencing system, and as such the tlclient/xflow
solution is recommended.
Run has reached the end
To continue a run which has already finished, users should:
adjust the *stop_time vars in canesm.cfg and then re-run config-canesm
open xflow, and for each loop in the suite:
select the n+1 loop from the drop down menu, where n is the last completed loop from the run
right click the loop node and select “Member Submit” from the drop down menu
Note
Under the loop drop down menus, reselect “latest” to see the suite progressing
Model versions
The branch develop_canesm is used to integrate in new changes, and always
reflects the latest developments. We strive to keep develop_canesm stable
and even bit-identical to the previous tagged release, but issues can arise.
When important changes occur, a tagged release is issued. Tagged releases are thoroughly tested, stable versions of the model (although old tagged releases might stop working as the HPC systems change). The latest tagged release should always be functional and stable, and represents a reliable starting point for work.
To find the latest tagged release, visit the CanESM repository on gitlab. From the top horizontal menu
bar, select ‘repository’. Then from the secondary menu bar, select ‘Tags’. The
latest tagged release is listed at the top, with its commit number underneath.
Use that for ver= in the call to setup-canesm.
When new tagged releases are issued, users should merge these into their working branches ASAP.
One time setup
Adding your ssh keys to gitlab
Prior to accessing the necessary gitlab repositories, you must add your ssh keys to gitlab. To do this, follow these instructions to add your keys.
General Environment Guidance
To run the CanESM system, users must have a few things set up in their .profile
along with a .condarc file to pick up the necessary infrastructure environments.
Specifically, users must have the following in their ~/.profile:
export CCCMA_REF=/home/scrd102/cccma_libs/cccma/latest/ # defines the lib version to use
export PATH=$CCCMA_REF/CanESM_source_link/CCCma_tools/tools:$PATH # Access to setup/s scripts
source $CCCMA_REF/CanESM_source_link/CCCma_tools/generic/u2_site_profile # basic ordenv setup & ssm loads of maestro etc
alias load_cccma_env='source $CCCMA_REF/env_setup_file' # command to activate a full env with binaries like ggstat
umask 022 # Default read permissions on new files for group
and the following in a ~/.condarc file:
envs_dirs:
  - /home/scrd102/cccma_conda/envs
Once added, log back in for these changes to take effect.
Warning
Diverging from the above environment can cause problems! If you experience issues with the maestro sequencing system, make sure to try again with a bare .profile containing only the settings above to isolate this possibility.
Warning
The environment system used on the ECCC system has notable issues if a ~/.bashrc file is present. Make sure this is not used on your science network account.
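The environment requirements above can be checked quickly before debugging further. The following is a sketch only: `check_cccma_env` is a hypothetical helper, not part of the CanESM tooling, and it tests just three things drawn from the guidance above (no ~/.bashrc, CCCMA_REF set in .profile, envs_dirs present in .condarc).

```shell
# Hypothetical check (not part of the CanESM tooling) for the setup described
# above. Pass the home directory to inspect; it defaults to $HOME.
check_cccma_env() {
    home_dir=${1:-$HOME}
    status=0
    if [ -e "$home_dir/.bashrc" ]; then
        echo "warning: $home_dir/.bashrc exists" >&2
        status=1
    fi
    if ! grep -q 'CCCMA_REF=' "$home_dir/.profile" 2>/dev/null; then
        echo "warning: CCCMA_REF is not set in $home_dir/.profile" >&2
        status=1
    fi
    if ! grep -q 'envs_dirs' "$home_dir/.condarc" 2>/dev/null; then
        echo "warning: envs_dirs is not set in $home_dir/.condarc" >&2
        status=1
    fi
    return "$status"
}

# Example usage: a non-zero status means at least one warning was printed.
check_cccma_env "$HOME" || echo "review the warnings above" >&2
```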
One Time Maestro Setup
Prior to using the maestro sequencing system, users must set up some maestro files/links
and initialize their maestro. Assuming users have setup their .profile as laid out
above, to do this:
Set the default machine that all maestro suites will use (many specific suites will have machines explicitly defined which will override this):
mkdir -p ~/.suites
echo "SEQ_DEFAULT_MACHINE=ppp6" >> ~/.suites/default_resources.def
Set the default links, which maestro uses to find the locations to place temporary directories:
mkdir -p ~/.suites/.default_links
ss_scratch_space=/eccc/crd/ccrn/ccrn_tmp/$(whoami)/maestro
mkdir -p /space/hall6/sitestore/${ss_scratch_space}
mkdir -p /space/hall5/sitestore/${ss_scratch_space}
ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp6
ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/robert
ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp5
ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/underhill
Initialize your maestro server, which monitors your “suites”, by executing
mserver_initSCI
and following the on-screen prompts. Then execute
mserver
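After completing the steps above, it can be worth confirming that the four default links resolve to existing directories. This is a minimal sketch: `check_default_links` is a hypothetical helper, and the machine names are the ones used in the ln -s commands above.

```shell
# Hypothetical check: report which of the default links described above are
# missing or do not resolve to a directory.
check_default_links() {
    links_dir=${1:-$HOME/.suites/.default_links}
    missing=0
    for machine in ppp6 robert ppp5 underhill; do
        # -d follows symlinks, so a dangling link is reported as well.
        if [ ! -d "$links_dir/$machine" ]; then
            echo "missing or broken link: $links_dir/$machine" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage against the default location.
check_default_links || echo "some default links need attention" >&2
```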
Tips for compiling
If any of the compilations fail, you should inspect the errors in the hidden
.compile-canesm* logs that will appear in the run directory.
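Because these logs are hidden, a plain ls will not show them. The sketch below lists the logs that mention an error; `find_compile_errors` is a hypothetical helper, and searching for the word "error" is an assumption about the log contents rather than a documented format.

```shell
# Hypothetical helper: list the hidden compile logs in a run directory that
# mention "error" (case-insensitive). The log format is an assumption.
find_compile_errors() {
    run_dir=${1:-.}
    # -l prints only the names of matching files; -i ignores case.
    grep -il 'error' "$run_dir"/.compile-canesm* 2>/dev/null
}

# Example usage from within the run directory.
find_compile_errors . || echo "no compile logs mentioning errors found"
```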
All of the components directly use the source repo cloned via setup-canesm, so once you
find the problem, you can go into the source code and make the necessary modifications.
Once done, you can recompile by simply calling the compile-canesm.sh script again.
It should also be noted that if you only want to recompile a single component, you can
utilize various flags to limit the compilation - see compile-canesm.sh -h for
information on the interface.
Users should be aware that the compile script:
doesn’t trigger a full recompile by default. To trigger a full recompile, use the -f option (“force”).
generates two cpp files, depending on your configuration settings (cppdef_config.h and cppdef_sizes.h). You should clean these files if you want your new settings to be applied properly, or use the -f option, which also cleans them.