Running CanESM on ECCC HPC systems
==================================

This document was adapted from the contents of the "Running_readme.md" documentation in the ``CanESM`` repository.

Introduction
------------

This document describes the six basic steps required to run ``CanESM``/``CanAM`` from Version Control. If this is your first run, see the section "One time setup" :ref:`below `. For instructions on how to modify the code, see :ref:`the developers guide `.

Runid Guidance
^^^^^^^^^^^^^^

As part of creating a run, users must select a "runid", or run identifier, that will be used to differentiate their runs from others. In *general* the chosen runids are free form, but there are *some* restrictions. Specifically, they must contain **only lower case alphanumeric characters [a-z] and [0-9], the hyphen "-" and the period "."**

Setting up a run
----------------

To set up a run, follow the steps below:

1. **Call** ``setup-canesm``

   **In the directory where you want the "run directory" created and set up**, call ``setup-canesm``, specifying at a minimum the runid and version of the code to use (see ``setup-canesm -h`` for additional information):

   .. code-block:: text

      setup-canesm ver=ABC config=ESM runid=XXX  ## use config=AMIP/OMIP if you want an AGCM or Ocean only run

   where

   * ``ver`` is a tag/commit/SHA1 checksum from the ``CanESM`` "super repo" (see Model versions section :ref:`below `, or you could use the branch name such as ``develop_canesm``),
   * ``config`` defines the high level config you'd like (can be ``ESM``, ``OMIP``, or ``AMIP``),
   * ``runid`` is the unique runid.

   .. note::

      - If you do not have access to ``setup-canesm``, see the :ref:`one time setup documentation below `.
      - If you want to set up the run from a user fork, you will need to provide the ``repo=`` argument. See :ref:`here <(Internal Use Only) Launching runs from your fork>` for more information.
      - **If you plan on doing development using the code cloned for this run**, be sure to refer to :ref:`the quickstart guide on modifying CanESM `, noting that the CanESM source code has been cloned into ``CanESM_source_link`` within the setup directory.

   .. warning::

      **Do not** re-use runids of existing runs, even if setting up from a different account!

2. **Source the run-time environment**

   After running ``setup-canesm``, you will see output like:

   .. code-block:: text

      Setup complete! Now:
          cd your-runid
          source env_setup_file
      to set the proper environment for this run

   Follow these directions to get the proper run-time environment, which will place the proper scripts on your ``$PATH``.

   .. note::

      If you log out, you will need to source this file again when you return.

3. **Set your configuration settings**

   Edit the file ``canesm.cfg`` to set dates and run options, as documented in that file, and then generate the downstream config files by executing ``config-canesm``, i.e.

   .. code-block:: text

      vi canesm.cfg
      ...              # Change start and end dates, etc. runid has been set already by setup-canesm
      config-canesm

   This will produce the config files **that get used by the run** and store them in a local ``config`` directory.

4. **Compile the executables**

   Executables are compiled interactively by the user. Upon sourcing the run's environment file, ``compile-canesm.sh`` will be placed on your ``$PATH``. Simply execute it to compile:

   .. code-block:: text

      compile-canesm.sh

   The compilation will take a few minutes. See ``compile-canesm.sh -h`` for additional options, but the default behaviour is to compile the executables in the source repo and link them back to ``$EXEC_STORAGE_DIR`` (defined in ``canesm.cfg``), which by default is a local ``executables`` directory.

5. **Save restarts to the run's file database**

   Sourcing the run's environment will also add ``save_restart_files`` and ``tapeload_rs`` to your ``$PATH``. Use these scripts to retrieve/set up the input restarts and save them to the run's local file database:

   .. code-block:: text

      save_restart_files

   or

   .. code-block:: text

      tapeload_rs

   where ``save_restart_files`` looks for the specified restarts in the databases defined by ``DATAPATH_DB`` in the environment, and ``tapeload_rs`` looks for them on the tape archives.

   .. note::

      These scripts must be run on the hall you plan to run the model on (defined by ``compute_system``).

   .. note::

      By *default*, namelists from restarts are not used. If you would like to use them, see ``save_restart_files -h`` (or ``tapeload_rs -h``) for information on how to do so.

6. **Submit the job.**

   To launch the experiment, you have two options:

   *Via the command line*

   .. code-block:: text

      expbegin -e ${SEQ_EXP_HOME} -d YYYYMMDDhh # where YYYYMMDDhh should be replaced by today's date (e.g. 2022062218)

   *Via tlclient*

   1. log onto ``tlclient``
   2. in a terminal, navigate to the run directory and ``source env_setup_file``
   3. run ``xflow``,

      a. set the experiment date and click the green checkmark beside it
      b. right click on the ``/canesm`` node and select "Submit"

Continuing an existing run
--------------------------

If a run has crashed before its scheduled end
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If a run crashes, the normal recovery procedure is to use ``xflow`` on ``tlclient`` to figure out which job failed, and attempt a re-submission of the individual job node (right click on the offending node and select the "Submit and Continue" option). If this doesn't work, users should then select the "Listing" menu and then "Latest Abort Listing" to see the output from the job, and debug the problem. Once the problem is fixed, simply resubmit the job and the run should continue.

Using a Terminal
""""""""""""""""

It is also possible to determine which job has failed by looking under ``$WRK_DIR/sequencer/sequencing/status`` (using ``tree`` works well for this) and looking for ``*abort.stop`` files. Once the offending job is identified, users should be able to find a corresponding, compressed, ``*abort*`` "listing file" (output file) under ``$WRK_DIR/sequencer/listings/latest/canesm``. Given that these files are typically compressed via ``gzip``, users can inspect them via something like

.. code-block:: bash

   gunzip -c sequencer/listings/latest/path/to/desired/listing_file | less

or via some other editor that can open ``gzip`` files natively. Then, after the problem is fixed, the offending job can be resubmitted via something like

.. code-block:: bash

   maestro -n /path/to/job/node/in/maestro/suite -s submit -f continue [-l loop_name=ITERATION_NUM ]

For example, if a user wishes to resubmit the ``model_run`` job for the 2nd iteration of the ``model_loop``, it would look like:

.. code-block:: bash

   maestro -n /canesm/model/model_loop/model_run -s submit -f continue -l model_loop=2

.. note::

   While users can monitor/relaunch runs purely in the terminal, it requires an advanced understanding of the ``maestro`` sequencing system, and as such the ``tlclient``/``xflow`` solution is recommended.

Run has reached the end
^^^^^^^^^^^^^^^^^^^^^^^

To continue a run which has already finished, users should:

1. adjust the ``*stop_time`` vars in ``canesm.cfg`` and then re-run ``config-canesm``
2. open ``xflow``, and **for each loop in the suite**

   a. select the ``n+1`` loop from the drop down menu, where ``n`` is the last completed loop from the run.
   b. right click the loop node and select "Member Submit" from the drop down menu

   .. note::

      Under the loop drop down menus, reselect "latest" to see the suite progressing

Model versions
--------------

The branch ``develop_canesm`` is used to integrate new changes, and always reflects the latest developments. We strive to keep ``develop_canesm`` stable and even bit-identical to the previous tagged release, but issues can arise. When important changes occur, a tagged release is issued. Tagged releases are thoroughly tested, stable versions of the model (although old tagged releases might stop functioning as the HPC systems change). The latest tagged release should always be functional and stable, and represents a reliable starting point for work.

To find the latest tagged release, visit `CanESM repository on gitlab `_. From the top horizontal menu bar, select 'repository'. Then from the secondary menu bar, select 'Tags'. The latest tagged release is listed at the top, with its commit number underneath. Use that for ``ver=`` in the call to ``setup-canesm``.

When new tagged releases are issued, users should merge these into their working branches as soon as possible.

One time setup
--------------

Adding your ssh keys to gitlab
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Prior to accessing the necessary gitlab repositories, you must add your ssh keys to gitlab. To do this, follow `these `_ instructions to add your keys.

General Environment Guidance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run the CanESM system, users must have a few things set up in their ``.profile``, along with a ``.condarc`` file, to pick up the necessary infrastructure environments. Specifically, users must have the following in their ``~/.profile``:

.. code-block:: bash

   export CCCMA_REF=/home/scrd102/cccma_libs/cccma/latest/                    # defines the lib version to use
   export PATH=$CCCMA_REF/CanESM_source_link/CCCma_tools/tools:$PATH          # Access to setup/s scripts
   source $CCCMA_REF/CanESM_source_link/CCCma_tools/generic/u2_site_profile   # basic ordenv setup & ssm loads of maestro etc
   alias load_cccma_env='source $CCCMA_REF/env_setup_file'                    # command to activate a full env with binaries like ggstat
   umask 022                                                                  # Default read permissions on new files for group

and the following in a ``~/.condarc`` file:

.. code-block:: text

   envs_dirs:
     - /home/scrd102/cccma_conda/envs

Once added, log out and back in for these changes to take effect.

.. warning::

   Diverging from the above environment can cause problems! If you experience issues with the ``maestro`` sequencing system, make sure to try again with a bare ``.profile`` containing **only** the settings above to isolate this possibility.

.. warning::

   The environment system used on the ECCC system has **notable** issues if a ``~/.bashrc`` file is present. Make sure this file is not used on your science network account.

One Time Maestro Setup
^^^^^^^^^^^^^^^^^^^^^^

Prior to using the ``maestro`` sequencing system, users must set up some maestro files/links and initialize their ``maestro`` server. Assuming users have set up their ``.profile`` as laid out above, to do this:

1. Set the default machine that all maestro suites will use (many specific suites will have machines explicitly defined, which will override this):

   .. code-block:: bash

      mkdir -p ~/.suites
      echo "SEQ_DEFAULT_MACHINE=ppp6" >> ~/.suites/default_resources.def

2. Set the default links, which ``maestro`` uses to find the locations to place temporary directories:

   .. code-block:: bash

      mkdir -p ~/.suites/.default_links
      ss_scratch_space=/eccc/crd/ccrn/ccrn_tmp/$(whoami)/maestro
      mkdir -p /space/hall6/sitestore/${ss_scratch_space}
      mkdir -p /space/hall5/sitestore/${ss_scratch_space}
      ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp6
      ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/robert
      ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp5
      ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/underhill

3. Initialize your ``maestro`` server, which monitors your "suites", by executing

   .. code-block:: bash

      mserver_initSCI

   and following the on-screen prompts. Then execute

   .. code-block:: bash

      mserver

Tips for compiling
------------------

If any of the compilations fail, you should inspect the errors in the hidden ``.compile-canesm*`` logs that will appear in the run directory. All of the components directly use the source repo cloned via ``setup-canesm``, so once you find the problem, you can go into the source code and make the necessary modifications. Once done, you can recompile by simply calling the ``compile-canesm.sh`` script again.

If you only want to recompile a single component, you can use various flags to limit the compilation; see ``compile-canesm.sh -h`` for information on the interface.

Users should be aware that the compile script:

1. doesn't trigger a *full* recompile by default; to trigger a full recompile, use the ``-f`` option ("force").
2. generates two ``cpp`` files, depending on your configuration settings (``cppdef_config.h`` and ``cppdef_sizes.h``). You should clean these files if you want your new settings to be applied properly, or use the ``-f`` option, which also cleans them.
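
When a compilation fails, the first task is usually locating the right hidden log. The following is a minimal sketch, not part of the CanESM tooling: ``latest_compile_log`` is a hypothetical helper name, and it assumes only what is stated above, namely that the logs match ``.compile-canesm*`` in the run directory.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not shipped with CanESM): print the most recently
   # modified hidden compile log found in a run directory.
   latest_compile_log() {
       # Run directory to search; defaults to the current directory
       local run_dir="${1:-.}"
       # -t sorts newest first, -d lists matches themselves rather than their contents;
       # errors are silenced so that nothing is printed when no log exists yet
       ls -td "${run_dir}"/.compile-canesm* 2>/dev/null | head -n 1
   }

From inside the run directory, ``less "$(latest_compile_log .)"`` would then open the newest log. Once the source has been fixed, ``compile-canesm.sh`` (or ``compile-canesm.sh -f`` for a full recompile) can be run again as described above.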