Running CanESM on ECCC HPC systems
==================================

This document was adapted from the contents of the "Running_readme.md"
document in the ``CanESM`` repository.

Introduction
------------

This document describes the six basic steps required to run
``CanESM``/``CanAM`` from version control. If this is your first run, see the
:ref:`One time setup section below <One Time Setup>`. For instructions on how
to modify the code, see :ref:`the developers guide <Contributing to CanESM
(Developers guide)>`.

Runid Guidance
^^^^^^^^^^^^^^
 
As part of creating a run, users must select a "runid", or run identifier, that
will be used to differentiate their runs from others. In *general*, runids are
free form, but there are *some* restrictions: they must contain **only lower
case alphanumeric characters [a-z] and [0-9], the hyphen "-" and the period
"."**
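
The character restriction above can be checked before calling ``setup-canesm``.
The following is a minimal sketch: ``is_valid_runid`` is a hypothetical helper,
not part of the CanESM tooling, and it tests only the character set described
above, not whether the runid is already in use.

.. code-block:: bash

    # Hypothetical helper (not part of CanESM): succeeds only if the runid
    # is non-empty and consists solely of [a-z], [0-9], "-" and ".".
    is_valid_runid() {
        local LC_ALL=C   # avoid locale-dependent matching of [a-z]
        [[ -n "$1" && "$1" =~ ^[a-z0-9.-]+$ ]]
    }

    if is_valid_runid "my-run.01"; then
        echo "runid ok"
    else
        echo "runid contains characters that are not allowed" >&2
    fi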

Setting up a run 
----------------

To set up a run, follow the steps below:

1. **Call** ``setup-canesm`` 

    **In the directory where you want the run directory to be created and set up**, call ``setup-canesm``,
    specifying at a minimum the runid and version of the code to use (see ``setup-canesm -h`` for
    additional information):

    .. code-block:: text

        setup-canesm  ver=ABC config=ESM runid=XXX ## use config=AMIP/OMIP if you want an AGCM or Ocean only run

    where 
    
    * ``ver`` is a tag or commit SHA1 from the ``CanESM`` "super repo" (see the :ref:`Model versions section below <Model Versions>`), or a branch name such as ``develop_canesm``,
    * ``config`` defines the high level config you'd like (can be ``ESM``, ``OMIP``, or ``AMIP``),
    * ``runid`` is the unique runid. 

    .. note::

        - If you do not have access to ``setup-canesm``, see the :ref:`one time
          setup documentation below <One Time Setup>`.
        - If you want to set up the run from a user fork, you will need to provide
          the ``repo=`` argument. See
          :ref:`here <(Internal Use Only) Launching runs from your fork>` for more information.
        - **If you plan on doing development using the code cloned for this run**, 
          be sure to refer to 
          :ref:`the quickstart guide on modifying CanESM <Modifying CanESM>` noting
          that the CanESM source code has been cloned into ``CanESM_source_link``
          within the setup directory. 

    .. warning::

        **Do not** re-use runids of existing runs, even if setting up from a different
        account! 

2. **Source the run-time environment**

    After running ``setup-canesm``, you will see output like:

    .. code-block:: text

        Setup complete! Now:

           cd your-runid
           source env_setup_file

        to set the proper environment for this run

    Follow these directions to get the proper run-time environment, which
    will place the proper scripts on your ``$PATH``.

    .. note::

        If you log out, you will need to source this file again when you return.

3. **Set your configuration settings**

    Edit ``canesm.cfg`` to set dates and run options, as documented in
    that file, and then generate the downstream config files by executing ``config-canesm``:

    .. code-block:: text

        vi canesm.cfg
        ...              # Change start and end dates, etc. runid has been set already by setup-canesm

        config-canesm

    This will produce the config files **that get used by the run** and store them in a 
    local ``config`` directory.

4. **Compile the executables**

    Executables are compiled interactively by the user. Upon sourcing the
    run's environment file, ``compile-canesm.sh`` will be placed on your ``$PATH``. 
    Simply execute it to compile:

    .. code-block:: text

        compile-canesm.sh
    
    The compilation will take a few minutes. See ``compile-canesm.sh -h`` for 
    additional options, but the default behaviour will be to compile the 
    executables in the source repo, and link them back to ``$EXEC_STORAGE_DIR``
    (defined in ``canesm.cfg``), which by default is a local ``executables``
    directory.
    
5. **Save restarts to the run's file database**

    Sourcing the run's environment will also add ``save_restart_files`` and
    ``tapeload_rs`` to your ``$PATH``. Use these scripts to retrieve/set up the input
    restarts and save them to the run's local file database:

    .. code-block:: text

        save_restart_files

    or

    .. code-block:: text

       tapeload_rs

    where ``save_restart_files`` looks for the specified restarts in the databases
    defined by ``DATAPATH_DB`` in the environment, and ``tapeload_rs`` looks for them
    on the tape archives.

    .. note::

       These scripts must be run on the hall you plan to run the model on (defined
       by ``compute_system``).

    .. note::

       By *default*, namelists from restarts are not used. If you would like to
       use them, see ``save_restart_files -h`` (or ``tapeload_rs -h``) for
       details.

6. **Submit the job.**


    To launch the experiment, you have two options:

    *Via the command line*
        
    .. code-block:: text

        expbegin -e ${SEQ_EXP_HOME} -d YYYYMMDDhh # where YYYYMMDDhh should be replaced by today's date and hour (e.g. 2022062218)

    *Via tlclient*

       1. log onto ``tlclient``
       2. in a terminal navigate to the run directory and ``source env_setup_file``
       3. run ``xflow``,
       
          a. set the experiment date and click the green checkmark beside it
          b. right click on the ``/canesm`` node and select "Submit"
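
For the command-line launch, the ``YYYYMMDDhh`` value can be generated with
``date`` rather than typed by hand. The following is a minimal sketch, assuming
the local clock is the one you want (this document does not say whether local
time or UTC is expected). ``experiment_date`` is a hypothetical helper name, and
the sketch only prints the launch command so it can be reviewed before it is run.

.. code-block:: bash

    # Hypothetical helper: print the current date and hour as YYYYMMDDhh.
    experiment_date() {
        date +%Y%m%d%H
    }

    # Print the launch command for review instead of executing it.
    echo "expbegin -e \${SEQ_EXP_HOME} -d $(experiment_date)"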

Continuing an existing run
--------------------------

If a run has crashed before its scheduled end
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If a run crashes, the normal recovery procedure is to use ``xflow`` on ``tlclient`` to
figure out which job failed, and attempt a re-submission of the individual job node
(right click on the offending node and select the "Submit and Continue" option). If this
doesn't work, users should then select the "Listing" menu and then "Latest Abort Listing"
to see the output from the job, and debug the problem. Once the problem is fixed, simply
resubmit the job and the run should continue.

Using a Terminal
""""""""""""""""

It is also possible to determine which job has failed by looking
under ``$WRK_DIR/sequencer/sequencing/status`` (using ``tree`` works well for this)
for ``*abort.stop`` files.
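
As an alternative to ``tree``, the abort markers can be listed directly with
``find``. The following is a minimal sketch: ``list_aborted_jobs`` is a
hypothetical helper name, and the default path simply follows the location
given above.

.. code-block:: bash

    # Hypothetical helper: list the abort markers under a maestro status tree.
    # The directory can be passed explicitly; otherwise the path described
    # above is used.
    list_aborted_jobs() {
        local status_dir="${1:-$WRK_DIR/sequencer/sequencing/status}"
        find "$status_dir" -name '*abort.stop' | sort
    }

Calling ``list_aborted_jobs`` with no argument searches under ``$WRK_DIR``,
which is assumed to be set by the run's environment file.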

Once the offending job is identified, users should be able to find a corresponding, compressed,
``*abort*`` "listing file" (output file) under ``$WRK_DIR/sequencer/listings/latest/canesm``.
Given that these files are typically compressed via ``gzip``, users can inspect these
via something like

    .. code-block:: bash

       gunzip -c sequencer/listings/latest/path/to/desired/listing_file | less

or via some other editor that can open ``gzip`` files natively.
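
When a listing is long, it can be quicker to search it than to page through it.
The following is a minimal sketch: ``search_listing`` is a hypothetical helper,
and the default search term ``error`` is an assumption, since the contents of
the listing depend on the job that failed.

.. code-block:: bash

    # Hypothetical helper: print lines matching a pattern (case-insensitive)
    # from a gzip-compressed listing, prefixed with their line numbers.
    search_listing() {
        local listing="$1" pattern="${2:-error}"
        gunzip -c "$listing" | grep -n -i -- "$pattern"
    }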

Then, after the problem is fixed, the offending job can be resubmitted via something like

    .. code-block:: bash

       maestro -n /path/to/job/node/in/maestro/suite -s submit -f continue [-l loop_name=ITERATION_NUM ]

For example, if a user wishes to resubmit the ``model_run`` job for the 2nd iteration of the
``model_loop``, it would look like:

    .. code-block:: bash

       maestro -n /canesm/model/model_loop/model_run -s submit -f continue -l model_loop=2

.. note::
    
    While users can monitor/relaunch runs purely in the terminal, it requires an advanced
    understanding of the ``maestro`` sequencing system, and as such the ``tlclient``/``xflow``
    solution is recommended.

Run has reached the end
^^^^^^^^^^^^^^^^^^^^^^^

To continue a run which has already finished, users should:

1. adjust the ``*stop_time`` vars in ``canesm.cfg`` and then re-run ``config-canesm``
2. open ``xflow``, and **for each loop in the suite**

    a. select the ``n+1`` loop from the drop down menu, where ``n`` is the last completed loop from the run.
    b. right click the loop node and select "Member Submit" from the drop down menu

.. note::

   Under the loop drop down menus, reselect "latest" to see the suite progressing

  
Model versions
--------------

The branch ``develop_canesm`` is used to integrate new changes, and always
reflects the latest developments. We strive to keep ``develop_canesm`` stable
and even bit-identical to the previous tagged release, but issues can arise.

When important changes occur, a tagged release is issued. Tagged releases are
thoroughly tested, stable versions of the model (although old tagged releases
might stop functioning as the HPC systems change). The latest tagged release should always be
functional and stable, and represents a reliable starting point for work.
    
To find the latest tagged release, visit the `CanESM repository on gitlab
<https://gitlab.science.gc.ca/CCCma/CanESM>`_.  From the top horizontal menu
bar, select 'repository'. Then from the secondary menu bar, select 'Tags'. The
latest tagged release is listed at the top, with its commit number underneath.
Use that for ``ver=`` in the call to ``setup-canesm``.

When new tagged releases are issued, users should merge these into their
working branches ASAP.

One time setup
--------------

Adding your ssh keys to gitlab
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Prior to accessing the necessary gitlab repositories, you must add your ssh keys to
gitlab. To do this, follow
`these instructions <https://wiki.cmc.ec.gc.ca/wiki/Subscribing_To_Gitlab>`_.

General Environment Guidance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run the CanESM system, users must have a few things set up in their ``.profile``,
along with a ``.condarc`` file, to pick up the necessary infrastructure environments.

Specifically, users must have the following in their ``~/.profile``:

    .. code-block:: bash

       export CCCMA_REF=/home/scrd102/cccma_libs/cccma/latest/                  # defines the lib version to use
       export PATH=$CCCMA_REF/CanESM_source_link/CCCma_tools/tools:$PATH        # Access to setup/s scripts
       source $CCCMA_REF/CanESM_source_link/CCCma_tools/generic/u2_site_profile # basic ordenv setup & ssm loads of maestro etc
       alias load_cccma_env='source $CCCMA_REF/env_setup_file'                  # command to activate a full env with binaries like ggstat
       umask 022                                                                # Default read permissions on new files for group 

and the following in a ``~/.condarc`` file:

    .. code-block:: text

       envs_dirs:
          - /home/scrd102/cccma_conda/envs

Once added, log back in for these changes to take effect.

    .. warning::

       Diverging from the above environment can cause problems! If you experience issues
       with the ``maestro`` sequencing system, make sure to try again with a bare
       ``.profile`` containing **only** the settings above to isolate this possibility.

    .. warning::

       The environment system used on the ECCC system has **notable** issues if
       a ``~/.bashrc`` file is present. Make sure this is not used on your science network account.
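
The settings and warnings above can be checked after logging back in. The
following is a minimal sketch: ``check_canesm_env`` is a hypothetical helper,
not part of the CanESM tooling, and it tests only that ``CCCMA_REF`` is set,
that a ``.condarc`` exists, and that no ``.bashrc`` is present. It does not
validate the contents of either file.

.. code-block:: bash

    # Hypothetical sanity check for the environment described above.
    # The home directory is an optional argument so the check can be
    # pointed at another location.
    check_canesm_env() {
        local home_dir="${1:-$HOME}" status=0
        if [ -z "${CCCMA_REF:-}" ]; then
            echo "CCCMA_REF is not set" >&2
            status=1
        fi
        if [ ! -f "$home_dir/.condarc" ]; then
            echo "missing $home_dir/.condarc" >&2
            status=1
        fi
        if [ -e "$home_dir/.bashrc" ]; then
            echo "found $home_dir/.bashrc, which the warning above advises against" >&2
            status=1
        fi
        return "$status"
    }

Running ``check_canesm_env && echo "environment looks consistent"`` prints the
message only when all three checks pass.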

One Time Maestro Setup
^^^^^^^^^^^^^^^^^^^^^^

Prior to using the ``maestro`` sequencing system, users must set up some maestro files/links
and initialize their ``maestro`` server. Assuming users have set up their ``.profile`` as laid out
above, to do this:

1. Set the default machine that all maestro suites will use (many specific suites will have machines explicitly defined which will override this):

    .. code-block:: bash

       mkdir -p ~/.suites
       echo "SEQ_DEFAULT_MACHINE=ppp6" >> ~/.suites/default_resources.def

2. Set the default links, which ``maestro`` uses to find the locations to place temporary directories:

    .. code-block:: bash

       mkdir -p ~/.suites/.default_links
       ss_scratch_space=/eccc/crd/ccrn/ccrn_tmp/$(whoami)/maestro
       mkdir -p /space/hall6/sitestore/${ss_scratch_space}
       mkdir -p /space/hall5/sitestore/${ss_scratch_space}
       ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp6
       ln -s /space/hall6/sitestore/${ss_scratch_space} ~/.suites/.default_links/robert
       ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/ppp5
       ln -s /space/hall5/sitestore/${ss_scratch_space} ~/.suites/.default_links/underhill

3. Initialize your ``maestro`` server, which monitors your "suites", by executing

    .. code-block:: bash

       mserver_initSCI

    and following the on-screen prompts. Then execute

    .. code-block:: bash

       mserver
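
The default links created in step 2 can be verified at any time. The following
is a minimal sketch: ``check_default_links`` is a hypothetical helper, and the
four machine names are taken from the ``ln -s`` commands above.

.. code-block:: bash

    # Hypothetical helper: report which of the default links named above
    # exist as symlinks and resolve to a directory.
    check_default_links() {
        local links_dir="${1:-$HOME/.suites/.default_links}" name status=0
        for name in ppp6 robert ppp5 underhill; do
            if [ -L "$links_dir/$name" ] && [ -d "$links_dir/$name" ]; then
                echo "ok       $name -> $(readlink "$links_dir/$name")"
            else
                echo "missing  $name"
                status=1
            fi
        done
        return "$status"
    }

A non-zero return status indicates that at least one link is absent or points
at a directory that does not exist.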

Tips for compiling
------------------
If any of the compilations fail, you should inspect the errors in the hidden
``.compile-canesm*`` logs that will appear in the run directory.
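
Because these logs are hidden, they are easy to miss with a plain ``ls``. The
following is a minimal sketch for locating the ones worth reading:
``find_failed_compile_logs`` is a hypothetical helper, and searching for the
word ``error`` is an assumption, since the exact log format depends on the
compiler and component.

.. code-block:: bash

    # Hypothetical helper: list the hidden compile logs in a run directory
    # that contain the word "error" (case-insensitive).
    find_failed_compile_logs() {
        local run_dir="${1:-.}"
        grep -l -i 'error' "$run_dir"/.compile-canesm* 2>/dev/null
    }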

All of the components directly use the source repo cloned via ``setup-canesm``, so once you 
find the problem, you can go into the source code and make the necessary modifications. 
Once done, you can recompile by simply calling the ``compile-canesm.sh`` script again. 
It should also be noted that if you only want to recompile a single component, you can
utilize various flags to limit the compilation - see ``compile-canesm.sh -h`` for
information on the interface.

Users should be aware that the compile script:

1. doesn't trigger a *full* recompile by default. To trigger a full recompile, use the ``-f`` option ("force").
2. generates two ``cpp`` files (``cppdef_config.h`` and ``cppdef_sizes.h``) based on your configuration
   settings. You should clean these files if you want your new settings to be applied properly,
   or use the ``-f`` option, which also cleans them.