Running CanESM on ECCC HPC systems

This document was adapted from the contents of the “Running_readme.md” document in the CanESM repository.

Introduction

This document describes the six basic steps required to run CanESM/CanAM from version control. If this is your first run, see the section “One time setup” below. For instructions on how to modify the code, see the developer's guide.

Runid Guidance

As part of creating a run, users must select a “runid”, or run identifier, that will be used to differentiate their runs from others. In general, runids are free-form, with one restriction: they must contain only lower-case alphanumeric characters ([a-z] and [0-9]), the hyphen “-”, and the period “.”.
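The character restriction above can be checked before calling setup-canesm. The following is a minimal sketch only: the function name is_valid_runid is hypothetical and not part of the CanESM tooling, and rejecting an empty runid is an added assumption.

```shell
# Return 0 if the runid is non-empty and uses only the allowed characters
# (lower-case letters, digits, hyphen, period); return 1 otherwise.
# is_valid_runid is a hypothetical helper, not a CanESM script.
is_valid_runid() {
    case "$1" in
        '' | *[!abcdefghijklmnopqrstuvwxyz0123456789.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Try a few candidate runids
for candidate in "abc-01.test" "MyRun" "run_01"; do
    if is_valid_runid "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: not allowed"
    fi
done
```

The letters are spelled out rather than written as a range so that the check does not depend on the locale's collation order.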

Setting up a run

To set up a run, follow the steps below:

  1. Call setup-canesm

    In the directory where you want the “run directory” created and set up, call setup-canesm, specifying at a minimum the runid and version of the code to use (see setup-canesm -h for additional information):

    setup-canesm  ver=ABC config=ESM runid=XXX ## use config=AMIP/OMIP if you want an AGCM or Ocean only run
    

    where

    • ver is a tag/commit/SHA1 checksum from the CanESM “super repo” (see the Model versions section below; a branch name such as develop_canesm can also be used),

    • config defines the high level config you’d like (can be ESM, OMIP, or AMIP),

    • runid is the unique runid.

    Note

    • If you do not have access to setup-canesm, see the one time setup documentation below.

    • If you want to set up the run from a user fork, you will need to provide the repo= argument. See the information here for details.

    • If you plan on doing development using the code cloned for this run, be sure to refer to the quickstart guide on modifying CanESM, noting that the CanESM source code has been cloned into CanESM_source_link within the setup directory.

    Warning

    Do not re-use runids of existing runs, even if setting up from a different account!

  2. Source the run-time environment

    After running setup-canesm, you will see output like:

    Setup complete! Now:
    
       cd your-runid
       source env_setup_file
    
    to set the proper environment for this run
    

    Follow these directions to get the proper run-time environment, which will place the proper scripts on your $PATH.

    Note

    If you log out, you will need to source this file again when you return.

  3. Set your configuration settings

    Edit the file canesm.cfg to set dates and run options, as documented in that file, and then generate the downstream config files by executing config-canesm, i.e.

    vi canesm.cfg
    ...              # Change start and end dates, etc. runid has been set already by setup-canesm
    
    config-canesm
    

    This will produce the config files that get used by the run and store them in a local config directory.

  4. Compile the executables

    Executables are compiled interactively by the user. Upon sourcing the run’s environment file, compile-canesm.sh will be placed on your $PATH. Simply execute it to compile:

    compile-canesm.sh
    

    The compilation will take a few minutes. See compile-canesm.sh -h for additional options, but the default behaviour will be to compile the executables in the source repo, and link them back to $EXEC_STORAGE_DIR (defined in canesm.cfg), which by default is a local executables directory.

  5. Save restarts to the run’s file database

    Sourcing the run’s environment will also add save_restart_files and tapeload_rs to your $PATH - use these scripts to retrieve/setup the input restarts and save them to the run’s local file database:

    save_restart_files
    

    or

    tapeload_rs
    

    Where save_restart_files looks for the specified restarts in the databases defined by DATAPATH_DB in the environment, and tapeload_rs looks for them on the tape archives.

    Note

    These scripts must be run on the hall you plan to run the model on (defined by compute_system).

    Note

    By default, namelists from restarts are not used. To use them, see save_restart_files -h (or tapeload_rs -h).

  6. Submit the job

    To launch the experiment, you have two options:

    Via the command line

    expbegin -e ${SEQ_EXP_HOME} -d YYYYMMDDhh # where YYYYMMDDhh should be replaced by today's date (e.g. 2022062218)
    

    Via tlclient

    1. log onto tlclient

    2. in a terminal navigate to the run directory and source env_setup_file

    3. run xflow,

      1. set the experiment date and click the green checkmark beside it

      2. right click on the /canesm node and select “Submit”
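For the command-line option of the “Submit the job” step, the YYYYMMDDhh stamp can be generated rather than typed by hand. This is a sketch under assumptions: seq_datestamp is a hypothetical helper name, it uses the local clock, and this guide does not say whether the sequencer expects local time or UTC. The final line only prints the command, it does not run it.

```shell
# Build a YYYYMMDDhh stamp from the current local time (hh is the 24-hour hour).
# seq_datestamp is a hypothetical helper, not part of maestro or CanESM.
seq_datestamp() {
    date +%Y%m%d%H
}

# Print the submission command with today's stamp filled in
echo "expbegin -e \${SEQ_EXP_HOME} -d $(seq_datestamp)"
```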

Continuing an existing run

If a run has crashed before its scheduled end

If a run crashes, the normal recovery procedure is to use xflow on tlclient to figure out which job failed, and attempt a re-submission of the individual job node (right click on the offending node and select the “Submit and Continue” option). If this doesn’t work, users should then select the “Listing” menu and then “Latest Abort Listing” to see the output from the job, and debug the problem. Once the problem is fixed, simply resubmit the job and the run should continue.

Using a Terminal

It should be noted that it is also possible to determine what job has failed by looking under $WRK_DIR/sequencer/sequencing/status (using tree works well for this) and looking for *abort.stop files.
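As an alternative to tree, the search for *abort.stop files can be sketched with find. The function name list_abort_files is hypothetical; the directory layout is taken from the paragraph above.

```shell
# List the *abort.stop status files under a maestro status directory.
# list_abort_files is a hypothetical helper, not part of maestro.
# Usage: list_abort_files "$WRK_DIR/sequencer/sequencing/status"
list_abort_files() {
    status_dir="$1"
    if [ ! -d "$status_dir" ]; then
        echo "no such directory: $status_dir" >&2
        return 1
    fi
    # Regular files only, sorted so the output is stable between calls
    find "$status_dir" -type f -name '*abort.stop' | sort
}
```

Each path printed corresponds to a job node that aborted; an empty result means no abort files were found under that directory.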

Once the offending job is identified, users should be able to find a corresponding, compressed, *abort* “listing file” (output file) under $WRK_DIR/sequencer/listings/latest/canesm. Given that these files are typically compressed via gzip, users can inspect these via something like

gunzip -c sequencer/listings/latest/path/to/desired/listing_file | less

or via some other editor that can open gzip files natively.

Then, after the problem is fixed, the offending job can be resubmitted via something like

maestro -n /path/to/job/node/in/maestro/suite -s submit -f continue [-l loop_name=ITERATION_NUM ]

For example, if a user wishes to resubmit the model_run job for the 2nd iteration of the model_loop, the command would look like:

maestro -n /canesm/model/model_loop/model_run -s submit -f continue -l model_loop=2
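To avoid mistyping the flags, the resubmission command can be assembled by a small helper. This is only a sketch: maestro_resubmit_cmd is a hypothetical name, it handles a single optional loop argument as in the template above, and it prints the command rather than running it so that it can be reviewed first.

```shell
# Print (do not run) the maestro resubmission command for a job node.
# $1: path to the job node in the suite
# $2: optional loop_name=ITERATION_NUM argument
# maestro_resubmit_cmd is a hypothetical helper, not part of maestro.
maestro_resubmit_cmd() {
    node="$1"
    loop_arg="${2:-}"
    cmd="maestro -n $node -s submit -f continue"
    if [ -n "$loop_arg" ]; then
        cmd="$cmd -l $loop_arg"
    fi
    echo "$cmd"
}

maestro_resubmit_cmd /canesm/model/model_loop/model_run model_loop=2
# prints: maestro -n /canesm/model/model_loop/model_run -s submit -f continue -l model_loop=2
```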

Note

While users can monitor/relaunch runs purely in the terminal, it requires an advanced understanding of the maestro sequencing system, and as such the tlclient/xflow solution is recommended.

Run has reached the end

To continue a run which has already finished, users should:

  1. adjust the *stop_time vars in canesm.cfg and then re-run config-canesm

  2. open xflow, and for each loop in the suite

    1. select the n+1 loop from the drop down menu, where n is the last completed loop from the run.

    2. right click the loop node and select “Member Submit” from the drop down menu

Note

Under the loop drop down menus, reselect “latest” to see the suite progressing

Model versions

The branch develop_canesm is used to integrate new changes, and always reflects the latest developments. We strive to keep develop_canesm stable and even bit-identical to the previous tagged release, but issues can arise.

When important changes occur, a tagged release is issued. Tagged releases are thoroughly tested, stable versions of the model (although old tagged releases might stop functioning as the HPC systems change). The latest tagged release should always be functional and stable, and represents a reliable starting point for work.

To find the latest tagged release, visit the CanESM repository on gitlab. From the top horizontal menu bar, select ‘repository’. Then from the secondary menu bar, select ‘Tags’. The latest tagged release is listed at the top, with its commit number underneath. Use that for ver= in the call to setup-canesm.

When new tagged releases are issued, users should merge these into their working branches ASAP.

One time setup

Adding your ssh keys to gitlab

Prior to accessing the necessary gitlab repositories, you must add your ssh keys to gitlab. To do this, follow these instructions to add your keys.

General Environment Guidance

To run the CanESM system, users must have a few things set up in their .profile, along with a .condarc file, to pick up the necessary infrastructure environments.

Specifically, users must have the following in their ~/.profile:

export CCCMA_REF=/home/scrd102/cccma_libs/cccma/latest/                  # defines the lib version to use
export PATH=$CCCMA_REF/CanESM_source_link/CCCma_tools/tools:$PATH        # Access to setup/s scripts
source $CCCMA_REF/CanESM_source_link/CCCma_tools/generic/u2_site_profile # basic ordenv setup & ssm loads of maestro etc
alias load_cccma_env='source $CCCMA_REF/env_setup_file'                  # command to activate a full env with binaries like ggstat
umask 022                                                                # Default read permissions on new files for group

and the following in a ~/.condarc file:

envs_dirs:
   - /home/scrd102/cccma_conda/envs

Once added, log out and back in for these changes to take effect.

Warning

Diverging from the above environment can cause problems! If you experience issues with the maestro sequencing system, make sure to try again with a bare .profile containing only the settings above to isolate this possibility.

Warning

The environment system used on the ECCC system has notable issues if a ~/.bashrc file is present. Make sure this is not used on your science network account.
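A quick check of the settings described in this section can be sketched as follows. The function check_canesm_env is hypothetical, and it only tests for the presence of the CCCMA_REF and envs_dirs entries and of a ~/.bashrc file; it does not verify the paths themselves.

```shell
# Check a home directory against the environment guidance above.
# Prints one line per problem found and returns non-zero if there are any.
# check_canesm_env is a hypothetical helper, not a CanESM script.
check_canesm_env() {
    home_dir="${1:-$HOME}"
    problems=0
    # A ~/.bashrc file is known to interfere with the environment system
    if [ -e "$home_dir/.bashrc" ]; then
        echo "found $home_dir/.bashrc (should not be used)"
        problems=$((problems + 1))
    fi
    # .profile must define CCCMA_REF
    if ! grep -q 'CCCMA_REF=' "$home_dir/.profile" 2>/dev/null; then
        echo "CCCMA_REF is not set in $home_dir/.profile"
        problems=$((problems + 1))
    fi
    # .condarc must list the shared conda environments directory
    if ! grep -q 'envs_dirs' "$home_dir/.condarc" 2>/dev/null; then
        echo "envs_dirs is not set in $home_dir/.condarc"
        problems=$((problems + 1))
    fi
    [ "$problems" -eq 0 ]
}
```

Calling check_canesm_env with no argument inspects $HOME; no output and a zero exit status mean none of the three problems were detected.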

One Time Maestro Setup

Prior to using the maestro sequencing system, users must set up some maestro files/links and initialize their maestro server. Assuming users have set up their .profile as laid out above, to do this:

  1. Set the default machine that all maestro suites will use (many specific suites will have machines explicitly defined which will override this):

    mkdir -p ~/.suites
    echo "SEQ_DEFAULT_MACHINE=ppp6" >> ~/.suites/default_resources.def
    
  2. Set the default links, which maestro uses to find the locations to place temporary directories:

    mkdir -p ~/.suites/.default_links
    ss_scratch_space=/eccc/crd/ccrn/ccrn_tmp/$(whoami)/maestro
    mkdir -p /space/hall6/sitestore/${ss_scratch_space}
    mkdir -p /space/hall5/sitestore/${ss_scratch_space}
    ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp6
    ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/robert
    ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp5
    ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/underhill
    
  3. Initialize your maestro server, which monitors your “suites”, by executing

    mserver_initSCI
    

    and following the on-screen prompts. Then execute

    mserver
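After completing the steps above, the default links from step 2 can be verified with a short loop. This is a sketch: check_default_links is a hypothetical helper, and the machine names are the four used in step 2.

```shell
# Report whether each default link exists and resolves to a directory.
# check_default_links is a hypothetical helper, not part of maestro.
check_default_links() {
    links_dir="${1:-$HOME/.suites/.default_links}"
    status=0
    for machine in ppp6 robert ppp5 underhill; do
        # -d follows symlinks, so a dangling link is reported as broken
        if [ -d "$links_dir/$machine" ]; then
            echo "ok: $machine"
        else
            echo "missing or broken: $machine"
            status=1
        fi
    done
    return "$status"
}
```

Running check_default_links with no argument inspects ~/.suites/.default_links and returns non-zero if any of the four links is missing or dangling.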
    

Tips for compiling

If any of the compilations fail, you should inspect the errors in the hidden .compile-canesm* logs that will appear in the run directory.
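To narrow down which of the hidden logs to read, they can be searched for the word “error”. This is a sketch under assumptions: compile_logs_with_errors is a hypothetical helper, and the exact wording of compiler messages in the logs is not specified in this guide, so a case-insensitive match on “error” is only a starting point.

```shell
# List the hidden compile logs in a run directory that mention "error".
# compile_logs_with_errors is a hypothetical helper, not a CanESM script.
# Usage: compile_logs_with_errors /path/to/run/directory
compile_logs_with_errors() {
    run_dir="${1:-.}"
    # -i: case-insensitive match, -l: print only the names of matching files
    grep -il 'error' "$run_dir"/.compile-canesm* 2>/dev/null
}
```

The function returns non-zero when no log matches, which also covers the case where no .compile-canesm* logs exist in the directory.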

All of the components directly use the source repo cloned via setup-canesm, so once you find the problem, you can go into the source code and make the necessary modifications. Once done, you can recompile by simply calling the compile-canesm.sh script again. It should also be noted that if you only want to recompile a single component, you can utilize various flags to limit the compilation - see compile-canesm.sh -h for information on the interface.

Users should be aware that the compile script:

  1. doesn’t trigger a full recompile by default. To trigger one, use the -f option (“force”).

  2. generates two cpp files, depending on your configuration settings (cppdef_config.h and cppdef_sizes.h). You should clean these files if you want your new settings to be applied properly, or use the -f option, which also cleans them.