Structure

A Sisyphus experiment folder mainly consists of four things:
  • the config folder, containing graph definition code

  • the recipe folder, containing Job definition and pipeline code

  • the settings.py file, defining the Sisyphus parameters and the engine to use

  • the work folder, containing the actual data in the form of structured “Job” folders

When running Sisyphus, two additional folders will be added and filled automatically:
  • the alias folder, containing human-readable symlinks to Job folders

  • the output folder, containing symlinks to output files from jobs
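
Put together, a typical setup might look like this ("my_setup" is just a placeholder name for the experiment folder):

- my_setup
  - settings.py
  - config
  - recipe
  - work      (typically a symlink to a location with a lot of space)
  - alias     (created automatically)
  - output    (created automatically)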

recipe folder

The recipe folder contains Python packages and modules with the job definitions and pipeline code. Currently, Sisyphus allows the recipe folder to be used in two different ways:

  1. The recipe folder as a single python package: This means that all imports start with from recipe., and the full name of each job will be based on the package structure without the recipe prefix.

  2. The recipe folder as location for different recipe packages: This means that there are individual recipe packages that are located in the recipe folder, and the imports start with the package name, e.g. from i6_core. (see the i6_core recipes)

Note that for new setups, variant #2 should always be preferred. If you are using PyCharm to manage a setup, it is also important to mark the recipe folder as a sources root.
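
For illustration, importing a job class in pipeline code would look roughly as follows (the package, module and job names are only placeholders, reused from the example further below):

# variant 1: the recipe folder itself is the Python package
from recipe.some_jobs import Job1

# variant 2: an individual recipe package located inside the recipe folder (preferred)
from some_recipe_package.some_jobs import Job1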

In addition to the jobs, the recipe folder can also contain helper classes and functions that are used as building blocks for a pipeline. Depending on personal preference, full experiments can also be defined here, but the experiments have to be called via code in the config folder. See more on this in the next section.
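
To give a rough idea of what a job definition in the recipe folder looks like, here is a minimal sketch of a Sisyphus Job (the class name Job1 is reused from the example below; the parameter, output file and requirements are made up for illustration):

from sisyphus import Job, Task

class Job1(Job):
    # A made-up job that simply copies its input file to its output file

    def __init__(self, input, param="some_value"):
        self.input = input
        self.param = param
        # declare the output file of this job; other jobs can use it as their input
        self.output = self.output_path("out.txt")

    def tasks(self):
        # a job consists of one or more tasks that are submitted to the engine
        yield Task("run", rqmt={"cpu": 1, "mem": 1, "time": 1})

    def run(self):
        import shutil
        shutil.copy(self.input.get_path(), self.output.get_path())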

config folder

The config folder should contain the pipeline calls for the specific experiments/workflows in a hierarchical order. When creating a new setup it is important to have a config/__init__.py which contains a main() function. This will be the global entry point for the graph thread, and should call ALL experiments/workflows without any exception. If this is not the case, some console commands will lead to incorrect behavior or not work at all.

Besides the main() function, the config folder can contain further packages/modules/functions that define workflow pipelines in the form of partial Sisyphus graphs.

Example:

The config folder has the following structure:

- config
  - __init__.py
  - experiments1.py
  - experiments2.py

__init__.py:

from config.experiments1 import run_experiment1
from config.experiments2 import run_experiment2

def main():
    run_experiment1()
    run_experiment2()

experiments1.py:

from sisyphus import tk, Path

from some_recipe_package.some_jobs import Job1, Job2, Job3

def run_experiment1():
    # define any input
    input = Path("/path/to/some/input")

    # define a sequence of jobs imported from a recipe package
    job1 = Job1(input, param="some_value")
    job2 = Job2(job1.output, param="another_value")
    job3 = Job3(job2.output, param="yet_another_value")

    # register the output
    tk.register_output("experiments1/an_output_file", job3.output)
    return job1.output, job2.output, job3.output

experiments2.py:

from sisyphus import tk, Path

from another_recipe_package.some_pipelines import pipeline1, pipeline2

def run_experiment2():
    # define some inputs
    input1 = Path("/path/to/some/input")
    input2 = Path("/path/to/another/input")

    # run a pipeline (consisting of a sequence of jobs like in run_experiment1) on different inputs
    output1 = pipeline1(input1)
    output2 = pipeline1(input2)
    tk.register_output("experiments2/pipeline1/output_file1", output1)
    tk.register_output("experiments2/pipeline1/output_file2", output2)

    # run another pipeline on the same input
    output3 = pipeline2(input1)
    tk.register_output("experiments2/pipeline2/output_file1", output3)

When the pipelines are defined this way, a ./sis m call will create the full graph and run jobs in order to produce all defined outputs. Now let's say the graph code is already very large, and you only want to run a sub-graph. With a hierarchical structure, it is then possible to call the manager with a specific function, e.g. ./sis m config.experiments2.run_experiment2, to only build and run the sub-graph for experiment 2.

It is also possible to define asynchronous workflows, which allow halting the calculation of the graph until the requested jobs are finished. This makes it easy to let the graph depend on intermediate results. Given the code example above, it could work like this:

from sisyphus import tk

from config.experiments1 import run_experiment1
from config.experiments2 import run_experiment2

async def main():
    job1_output, job2_output, job3_output = run_experiment1()
    await tk.async_run(job2_output)  # The workflow will pause here until the output of job2 is available
    if job2_output.get() < some_other_value:  # Assuming Job2 returns a Variable
        run_experiment2()

The pipeline code in both the config and recipe folders can be arbitrarily complex and structured freely, but it is important to keep in mind that sub-graph functions always have to be located within the config folder.

work folder

The work folder stores all files created during the experiment in the form of one folder per created job. The directory structure matches the package structure below the recipe folder. This folder should point to a directory with a lot of available space, and it is typically a symlink to a location on a file system that is accessible from all cluster machines. The whole folder can be deleted after an experiment is done, since everything can be recomputed, assuming your experiments are deterministic.
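
For the example setup above, the folder created for Job1 would end up at a path of roughly the following form (the hash suffix is generated by Sisyphus and only illustrative here):

- work
  - some_recipe_package
    - some_jobs
      - Job1.abc123XYZ456
        - output    (the declared output files of the job)
        - ...       (log files and further bookkeeping created by Sisyphus)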

settings.py

Contains all settings that determine the general behavior of Sisyphus with respect to the specific setup. A required entry is the engine function that determines the backend job-scheduling engine. See Installing with… for examples. A detailed overview of all settings can be found here.

def engine():
    """ Create engine object used to submit jobs. The simplest setup just creates a local
    engine starting all jobs on the local machine e.g.:

        from sisyphus.localengine import LocalEngine
        return LocalEngine(max_cpu=8)

    The usually recommended setup uses both a local and a normal grid engine. The EngineSelector
    can be used to schedule tasks on different engines. The main intuition was to have an engine for
    very small jobs that don't need to be scheduled on a large grid engine (e.g. counting the lines of a file).

    Note: the engines should only be imported locally inside the function to avoid circular imports

    :return: engine
    """
    # Example of a local engine:
    from sisyphus.localengine import LocalEngine
    return LocalEngine(cpu=4, gpu=0, mem=16)

    # Alternative example using the EngineSelector (unreachable after the return above, kept only as a reference);
    # normally the 'long' engine would be a grid engine, e.g. SGE
    from sisyphus.engine import EngineSelector
    from sisyphus.localengine import LocalEngine
    from sisyphus.son_of_grid_engine import SonOfGridEngine
    return EngineSelector(
        engines={'short': LocalEngine(cpu=4),
                 'long': SonOfGridEngine(
                     default_rqmt={'cpu' : 1, 'mem' : 2, 'gpu' : 0, 'time' : 1},
                     gateway="<gateway-machine-name>")}, # a gateway is only needed if the local machine has no SGE installation
        default_engine='long')

# How long to wait before marking a job as finished, to allow network
# filesystems to synchronize; can be reduced if only the local engine and filesystem are used.
WAIT_PERIOD_JOB_FS_SYNC = 30

# How often Sisyphus checks for finished jobs
WAIT_PERIOD_BETWEEN_CHECKS = 30

# Disable automatic job directory clean up
JOB_AUTO_CLEANUP = False