How-to guides#

Running a shell command#

The simplest example is running a shell command without any arguments:

from aiida_shell import launch_shell_job
results, node = launch_shell_job('date')
print(results['stdout'].get_content())

which should print something like Thu 17 Mar 2022 10:49:52 PM CET.

Running a shell command with arguments#

To pass arguments to the shell command, pass them as a string to the arguments keyword:

from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'date',
    arguments='--iso-8601'
)
print(results['stdout'].get_content())

which should print something like 2022-03-17.

Tip

Curly braces {} carry special meaning in arguments: they are interpreted as placeholders for input files (see this section). If you need literal curly braces, escape them by doubling them, just as you would for a normal f-string:

from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'echo',
    arguments='some{{curly}}braces',
)
print(results['stdout'].get_content())

which prints some{curly}braces.
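The doubling rule is the same one used by Python format strings, so it can be previewed with plain str.format, independent of AiiDA (an illustration only; the placeholder name file_a is hypothetical):

```python
# Doubled braces render as literal braces; single braces are placeholders.
template = 'cat {file_a} some{{curly}}braces'
rendered = template.format(file_a='input.txt')
print(rendered)  # cat input.txt some{curly}braces
```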

Running a shell command with files as arguments#

For commands that take arguments that refer to files, pass those files using the nodes keyword. The keyword takes a dictionary of SinglefileData nodes. To specify where on the command line the files should be passed, use placeholder strings in the arguments keyword.

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'cat',
    arguments='{file_a} {file_b}',
    nodes={
        'file_a': SinglefileData.from_string('string a'),
        'file_b': SinglefileData.from_string('string b'),
    }
)
print(results['stdout'].get_content())

which prints string astring b.

Note

The filename with which the file is written to the working directory is taken from the SinglefileData.filename property. This is set when the node is created, e.g., SinglefileData.from_string('content', filename='some_filename.txt'). If the filename property is the default, the key of the node in the nodes dictionary is used instead.

Warning

If the filename overlaps with a reserved filename (i.e. stdout, stderr or status), the filename will be automatically changed to a unique filename by appending a random suffix.

Running a shell command with files as arguments with specific filenames#

The keys in the nodes dictionary may only contain alphanumeric characters and underscores. Each key is used as the link label of the file in the provenance graph, and as the filename in the temporary directory in which the shell command is executed. Certain commands may require specific filenames, for example including a file extension, e.g., filename.txt, but these cannot be expressed as keys of the nodes argument. To specify an explicit filename that should be used in the running directory, make sure the filename of the SinglefileData node is defined. If SinglefileData.filename was explicitly set when creating the node, that is the filename used to write the input file to the working directory:

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'cat',
    arguments='{file_a}',
    nodes={
        'file_a': SinglefileData.from_string('string a', filename='filename.txt'),
    }
)
print(results['stdout'].get_content())

which prints string a.

If the filename of the SinglefileData cannot be controlled, explicit filenames can alternatively be defined using the filenames argument:

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'cat',
    arguments='{file_a}',
    nodes={
        'file_a': SinglefileData.from_string('string a'),
    },
    filenames={
        'file_a': 'filename.txt'
    }
)
print(results['stdout'].get_content())

which prints string a. Filenames specified in the filenames input will override the filename of the SinglefileData nodes. Any parent directories in the filepath, for example some/nested/path in the filename some/nested/path/file.txt, will be automatically created.
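The automatic creation of parent directories corresponds to the following plain-pathlib pattern (an illustration of the behaviour, not aiida-shell's actual implementation):

```python
import pathlib
import tempfile

# Writing a file at a nested relative path requires creating its parent
# directories first; aiida-shell does this automatically for ``filenames``
# entries such as ``some/nested/path/file.txt``.
with tempfile.TemporaryDirectory() as tmpdir:
    workdir = pathlib.Path(tmpdir)
    filepath = workdir / 'some/nested/path/file.txt'
    filepath.parent.mkdir(parents=True, exist_ok=True)
    filepath.write_text('string a')
    created = filepath.exists()
print(created)  # True
```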

Warning

The output filename can be anything except for stdout, stderr and status, which are reserved filenames. If these names are chosen anyway, a validation error is raised when the job is launched.

Running a shell command with folders as arguments#

Certain commands might require the presence of a folder of files in the working directory. Just like a file is modeled in AiiDA’s provenance graph by a SinglefileData node, a folder is represented by a FolderData node. The following example shows how a FolderData can be created to contain multiple files and how it can be passed to launch_shell_job using the nodes argument:

import pathlib
import tempfile
from aiida.orm import FolderData
from aiida_shell import launch_shell_job

# First create a ``FolderData`` node with some arbitrary files
with tempfile.TemporaryDirectory() as tmpdir:
    dirpath = pathlib.Path(tmpdir)
    (dirpath / 'file_a.txt').write_text('content a')
    (dirpath / 'file_b.txt').write_text('content b')
    folder_data = FolderData(tree=dirpath.absolute())

results, node = launch_shell_job(
    'ls',
    nodes={
        'directory': folder_data,
    }
)
print(results['stdout'].get_content())

which prints:

_aiidasubmit.sh
file_a.txt
file_b.txt
_scheduler-stderr.txt
_scheduler-stdout.txt
stderr
stdout

The contents of the folder_data node, the file_a.txt and file_b.txt files, were copied to the working directory.

Note that by default, the contents of the FolderData are copied to the root of the working directory, as shown in the example above. If the contents should be written to a directory inside the working directory, use the filenames argument, just as for copying SinglefileData nodes. Take for example the zip command, which creates a zip archive from one or more files and folders.

import pathlib
import tempfile
from aiida.orm import FolderData
from aiida_shell import launch_shell_job

# First create a ``FolderData`` node with some arbitrary files
with tempfile.TemporaryDirectory() as tmpdir:
    dirpath = pathlib.Path(tmpdir)
    (dirpath / 'file_a.txt').write_text('content a')
    (dirpath / 'file_b.txt').write_text('content b')
    folder_data = FolderData(tree=dirpath.absolute())

results, node = launch_shell_job(
    'zip',
    arguments='-r archive.zip {folder}',
    outputs=['archive.zip'],
    nodes={
        'folder': folder_data,
    },
    filenames={
        'folder': 'directory'
    }
)

In this example, the contents of the folder_data node were copied to the subdirectory named directory inside the working directory. The results dictionary contains the archive_zip output, which is a SinglefileData node containing the zip archive. It can be unzipped as follows: verdi node repo cat <IDENTIFIER> | unzip, where <IDENTIFIER> should be replaced with the pk or UUID of the archive_zip node. Running this command writes the original files file_a.txt and file_b.txt to the current working directory.

Note

A FolderData node specified in the nodes input is not required to have a corresponding placeholder in the arguments. Just as with SinglefileData input nodes, if there is no corresponding placeholder, the contents of the folder are simply written to the working directory where the shell command is executed. This is useful for commands that expect a folder to be present in the working directory but whose name is not explicitly defined through a command line argument.

Running a shell command with remote data#

Data that is stored on a remote computing resource, which is configured in AiiDA as a Computer, can be represented in the provenance graph as a RemoteData node. This can be useful if a job needs data that is already present on the computer where the job is to run. AiiDA can simply make the remote data available in the working directory of the job without copying it through the local machine, which would be costly for large data.

For the purpose of an example, imagine there is a zip archive on a remote computer that needs to be unzipped. In the following, the remote computer is actually the localhost to keep the example reproducible, but the concept applies to any Computer:

import pathlib
import shutil
from aiida.orm import RemoteData, load_computer
from aiida_shell import launch_shell_job

# Create a temporary folder with the subdirectories ``archive`` and ``content``.
dirpath = pathlib.Path.cwd() / 'tmp_folder'
dirpath_archive = dirpath / 'archive'
dirpath_content = dirpath / 'content'
dirpath_archive.mkdir(parents=True)
dirpath_content.mkdir(parents=True)

# Write a dummy file ``content/file.txt`` and create an archive of the ``content`` dir as ``archive/archive.zip``.
(dirpath_content / 'file.txt').write_text('content')
shutil.make_archive((dirpath_archive / 'archive'), 'zip', dirpath_content)

# Create a ``RemoteData`` node that points to the ``archive`` directory on the localhost.
localhost = load_computer('localhost')
remote_data = RemoteData(computer=localhost, remote_path=str(dirpath_archive.absolute()))

results, node = launch_shell_job(
    'unzip',
    arguments='archive.zip',
    nodes={
        'remote_data': remote_data,
    },
    outputs=['file.txt']
)
print(results['file_txt'].get_content())

which prints content.

Tip

By default, the contents of RemoteData nodes are copied to the working directory. This may be undesirable for large data, in which case the metadata option use_symlinks can be set to True to symlink the contents instead of copying them.
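The option is passed through the same metadata dictionary as the other options in this guide; a minimal sketch of the structure:

```python
# Sketch: enable symlinking of RemoteData contents instead of copying.
# This dictionary would be passed as ``launch_shell_job(..., metadata=metadata)``.
metadata = {
    'options': {
        'use_symlinks': True,
    }
}
```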

Any number of RemoteData nodes can be specified in the nodes input. The entire content of each node will be recursively copied to the working directory. It is currently not possible to select only parts of a RemoteData to be copied or to have it copied with a different filename to the working directory.

Warning

If multiple RemoteData input nodes contain files with the same name, these files will be overwritten without warning. The same goes if the files overlap with any other files present in the job’s working directory.

Passing other Data types as input#

The nodes keyword accepts not only SinglefileData nodes, but also other Data types. For these node types, the content returned by the value property is cast to str and used to replace the corresponding placeholder in the arguments. As long as the Data type implements the value property, it should be supported. Whether it makes sense for the value of the node to be used directly as a command line argument for the shell job is, of course, up to the user. Typical useful examples are the base types that ship with AiiDA, such as the Float, Int and Str types:

from aiida.orm import Float, Int, Str
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'echo',
    arguments='{float} {int} {string}',
    nodes={
        'float': Float(1.0),
        'int': Int(2),
        'string': Str('string'),
    },
)
print(results['stdout'].get_content())

which prints 1.0 2 string. This example is of course contrived, but other components of AiiDA typically return outputs of this form, and those can then be used directly as inputs for launch_shell_job without having to convert the values. This ensures that provenance is kept.

Redirecting input file through stdin#

Certain shell commands require input to be passed through the stdin file descriptor. This is normally accomplished as follows:

cat < input.txt

To reproduce this behaviour, the file that should be redirected through stdin can be defined using the metadata.options.filename_stdin input:

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'cat',
    nodes={
        'input': SinglefileData.from_string('string a')
    },
    metadata={'options': {'filename_stdin': 'input'}}
)
print(results['stdout'].get_content())

which prints string a.

N.B.: one might be tempted to simply define the arguments as '< {input}', but this won't work: the < symbol would be quoted and read as a literal command line argument, not as the redirection operator. This is why passing < in the arguments input results in a validation error.
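The quoting behaviour can be illustrated with Python's shlex.quote (an analogy for how shell arguments are quoted, not necessarily the exact mechanism aiida-shell uses):

```python
import shlex

# A quoted ``<`` is passed to the command as a literal argument,
# so the shell never interprets it as the redirection operator.
quoted = shlex.quote('<')
print(quoted)  # '<'
```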

Redirecting stderr to the stdout file#

A common practice when running shell commands is to redirect the content, written to the stderr file descriptor, to stdout. This is normally accomplished as follows:

date > stdout 2>&1

To reproduce this behaviour, set the metadata.options.redirect_stderr input to True:

from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'date',
    metadata={'options': {'redirect_stderr': True}}
)

If the option is not specified, or set to False, stderr is redirected to the file named stderr, as follows:

date > stdout 2> stderr

Defining outputs#

When the shell command is executed, AiiDA captures by default the content written to the stdout and stderr file descriptors. The content is wrapped in a SinglefileData node and attached to the ShellJob with the stdout and stderr link labels, respectively. Any other output files that need to be captured can be defined using the outputs keyword argument.

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'sort',
    arguments='{input} --output sorted',
    nodes={
        'input': SinglefileData.from_string('2\n5\n3'),
    },
    outputs=['sorted']
)
print(results['sorted'].get_content())

which prints 2\n3\n5.

Defining output files with globbing#

When the exact output files that will be generated and need to be captured are not known in advance, one can use globbing. Take for example the split command, which splits a file into multiple files with a certain number of lines each. By default, the output files are named xaa, xab, xac, etc., incrementing the suffix alphabetically. These output files can be captured by specifying the outputs as ['x*']:

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'split',
    arguments='-l 1 {single_file}',
    nodes={
        'single_file': SinglefileData.from_string('line 0\nline 1\nline 2\n'),
    },
    outputs=['x*']
)
print(results.keys())

which prints dict_keys(['xab', 'xaa', 'xac', 'stderr', 'stdout']).
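The outputs patterns follow shell-style globbing, which can be previewed with Python's fnmatch module (illustration only; the file list below mimics the working directory of the example above):

```python
import fnmatch

# Files present in the working directory after ``split -l 1`` has run.
files = ['xaa', 'xab', 'xac', 'stdout', 'stderr', 'status']
# Shell-style globbing, as used by the ``outputs`` specification.
matched = sorted(fnmatch.filter(files, 'x*'))
print(matched)  # ['xaa', 'xab', 'xac']
```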

Defining output folders#

When the command produces a folder with multiple output files, it is also possible to parse this as a single output node, instead of individual outputs for each file. If a filepath specified in the outputs corresponds to a directory, it is attached as a FolderData that contains all its contents, instead of individual SinglefileData nodes. For example, imagine a compressed tarball /some/path/archive.tar.gz that contains the folder sub_folder with a number of files in it. The following example uncompresses the tarball and captures the uncompressed files in the sub_folder directory in the sub_folder output node:

from aiida.orm import SinglefileData
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'tar',
    arguments='-zxvf {archive}',
    nodes={
        'archive': SinglefileData('/some/path/archive.tar.gz'),
    },
    outputs=['sub_folder']
)
print(results.keys())

which prints dict_keys(['sub_folder', 'stderr', 'stdout']). The contents of the folder can be retrieved from the node as follows:

for filename in results['sub_folder'].list_object_names():
    content = results['sub_folder'].get_object_content(filename)
    # or, if a file-like object is preferred to stream the content
    with results['sub_folder'].open(filename) as handle:
        content = handle.read()

Defining a specific computer#

By default the shell command run by launch_shell_job is executed on the localhost, i.e., the computer where AiiDA is running. However, AiiDA also supports running commands on remote computers. See AiiDA's documentation for instructions on setting up and configuring a remote computer. To specify which computer to use for a shell command, pass it as the computer key of the metadata argument:

from aiida.orm import load_computer
from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'date',
    metadata={'computer': load_computer('some-computer')}
)
print(results['stdout'].get_content())

Here you can use aiida.orm.load_computer to load the Computer instance from its label, PK or UUID.

Defining a pre-configured code#

The first argument, command, of launch_shell_job takes the name of the command to be run as a string. Under the hood, this is automatically converted into an AbstractCode. The command argument also accepts a pre-configured code instance directly:

from aiida.orm import load_code
from aiida_shell import launch_shell_job
code = load_code('date@localhost')
results, node = launch_shell_job(code)

This approach can be used as an alternative to the previous example where the target computer is specified through the metadata argument. For more details on creating codes manually, please refer to AiiDA’s documentation.

Running with MPI#

AiiDA supports running codes that are compiled with support for the Message Passing Interface (MPI). It can be enabled by setting the metadata.options.withmpi input to True:

from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'parallel-executable',
    metadata={
        'options': {
            'withmpi': True,
            'resources': {
                'num_machines': 2,
                'num_mpiprocs_per_machine': 3,
            }
        }
    }
)

When MPI is enabled, by default AiiDA assumes Open MPI and calls the command prefixed with mpirun -np {tot_num_mpiprocs}. The {tot_num_mpiprocs} placeholder is replaced with the product of the num_machines and num_mpiprocs_per_machine keys of the metadata.options.resources input, i.e., in this example the MPI line would be mpirun -np 6.
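The arithmetic behind the placeholder expansion can be sketched in plain Python (an illustration of the computation, not aiida-shell's actual code):

```python
# ``tot_num_mpiprocs`` is the product of the two resource keys.
resources = {'num_machines': 2, 'num_mpiprocs_per_machine': 3}
tot_num_mpiprocs = resources['num_machines'] * resources['num_mpiprocs_per_machine']
mpirun_line = 'mpirun -np {tot_num_mpiprocs}'.format(tot_num_mpiprocs=tot_num_mpiprocs)
print(mpirun_line)  # mpirun -np 6
```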

Note

If the target command does not use Open MPI but some other implementation, a computer can be configured to customize the mpirun_command attribute. For example, on clusters with the SLURM job scheduler, the MPI run command could be set to srun -n {tot_num_mpiprocs}. Once the computer is correctly set up and configured, it can be passed through the metadata.computer input.

Running many shell jobs in parallel#

By default the shell command run by launch_shell_job is executed blockingly, meaning that the Python interpreter is blocked from doing anything else until the shell command finishes. This becomes inefficient when many shell commands need to be run. If the shell commands are independent and can be run in parallel, the jobs can be submitted to AiiDA's daemon by setting submit=True:

from aiida.engine.daemon.client import get_daemon_client
from aiida_shell import launch_shell_job

# Make sure the daemon is running
get_daemon_client().start_daemon()

nodes = []

for arguments in ['string_one', 'string_two']:
    results, node = launch_shell_job(
        'echo',
        arguments=arguments,
        submit=True,
    )
    nodes.append(node)
    print(f'Submitted {node}')

The results returned by launch_shell_job will now just be an empty dictionary. The reason is that the function returns immediately after submitting the job to the daemon, at which point the job is not yet finished and so the results are not yet known. To check on the status of the submitted jobs, use the verdi process list command of the CLI that ships with AiiDA, or do it programmatically:

import time

while True:
    if all(node.is_terminated for node in nodes):
        break
    time.sleep(1)

for node in nodes:
    if node.is_finished_ok:
        print(f'{node} finished successfully')
        # The outputs of each node can be accessed through `node.outputs`:
        print(node.outputs.stdout.get_content())
    else:
        print(f'{node} failed')

Custom output parsing#

By default, all outputs are parsed into SinglefileData nodes. While it is convenient not having to define a parser manually, this can also be quite restrictive. One of AiiDA's strong points is that it can store data in JSON form in a relational database, making it queryable, but the content of SinglefileData nodes is excluded from this functionality.

The parser keyword makes it possible to define a “custom” parser, which is a function with the following signature:

def custom_parser(dirpath: pathlib.Path) -> dict[str, Data]:
    """Parse any output file generated by the shell command and return it as any ``Data`` node."""

The dirpath argument receives the filepath to a directory that contains the retrieved output files that can then be read and parsed. The parsed results should be returned as a dictionary of Data nodes, such that they can be attached to the job’s node as outputs in the provenance graph.

The following example shows how a custom parser can be implemented:

from aiida_shell import launch_shell_job

def custom_parser(dirpath):
    from aiida.orm import Str
    return {'string': Str((dirpath / 'stdout').read_text().strip())}

results, node = launch_shell_job(
    'echo',
    arguments='some output',
    parser=custom_parser
)
print(results['string'].value)

which prints some output.

Important

If the output file that is parsed by the custom parser is not any of the files that are retrieved by default, i.e., stdout, stderr, status and the filenames specified in the outputs input, it has to be specified in the metadata.options.additional_retrieve input:

from json import dumps
from aiida_shell import launch_shell_job
from aiida.orm import SinglefileData

def parser(dirpath):
    """Parse the content of the ``results.json`` file and return as a ``Dict`` node."""
    import json
    from aiida.orm import Dict
    return {'json': Dict(json.load((dirpath / 'results.json').open()))}

results, node = launch_shell_job(
    'cat',
    arguments='{json}',
    nodes={'json': SinglefileData.from_string(dumps({'a': 1}))},
    parser=parser,
    metadata={
        'options': {
            'output_filename': 'results.json',
            'additional_retrieve': ['results.json']
        }
    }
)
print(results['json'].get_dict())

which prints {'a': 1}.

Optionally, the parsing function can also declare a parser argument. If defined, in addition to the dirpath, the parser receives an instance of the Parser class. This instance is useful for a number of things, such as:

  • Accessing the logger in order to log messages

  • Accessing the node that represents the ShellJob, from which, e.g., its input nodes can be retrieved

Below is an example of how the parser argument can be put to use:

from pathlib import Path
from aiida_shell import launch_shell_job
from aiida.parsers import Parser

def custom_parser(dirpath: Path, parser: Parser):
    from aiida.orm import Bool, Str

    inputs = parser.node.inputs  # Retrieve inputs of the job

    if inputs.arguments[0] == 'return-bool':
        parser.logger.warning('Arguments set to `return-bool`, returning a bool')
        return {'output': Bool(True)}
    else:
        return {'output': Str((dirpath / 'stdout').read_text().strip())}

results, node = launch_shell_job(
    'echo',
    arguments='return-bool',
    parser=custom_parser
)
print(results['output'].value)

which should print

07/18/2024 03:49:32 PM <13555> aiida.parser.ShellParser: [WARNING] Arguments set to `return-bool`, returning a bool
True

Tip

If you find yourself reusing the same parser often, you can also register it with an entry point and use that for the parser input. See the AiiDA documentation for details on how to register entry points. For example, if the parser is registered with the name some.parser in the group aiida.parsers, the parser input will accept aiida.parsers:some.parser. The entry point will automatically be validated and wrapped in an aiida_shell.data.entry_point.EntryPointData.

Keeping the command path relative#

By default, launch_shell_job() automatically converts the provided command to the absolute filepath of the corresponding executable. This serves two purposes:

  1. It checks that the command exists on the specified computer

  2. It improves the quality of the provenance

The executable that a relative command resolves to on the target computer can change as a function of the environment, or simply change over time. Storing the actual absolute filepath of the executable avoids this, although it of course remains vulnerable to the executable itself being changed over time.

Nevertheless, there may be use cases where resolving the command is not desirable. To skip this step and keep the command as specified, set the resolve_command argument to False:

from aiida_shell import launch_shell_job
results, node = launch_shell_job('date')
assert str(node.inputs.code.filepath_executable) == '/usr/bin/date'

results, node = launch_shell_job('date', resolve_command=False)
assert str(node.inputs.code.filepath_executable) == 'date'
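The resolution step is similar in spirit to Python's shutil.which, which looks a command up in the PATH and returns the absolute filepath of its executable (illustration only; launch_shell_job resolves the command on the target computer, not necessarily the local machine):

```python
import shutil

# Resolve a relative command name to the absolute path of its executable.
resolved = shutil.which('sh')
print(resolved)  # e.g. /bin/sh or /usr/bin/sh, depending on the system
```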

Customizing run environment#

By default, aiida-shell runs the specified command in a regular bash shell. The shell inherits the default environment of the system user, as if they had launched an interactive shell. This environment can be customized by specifying bash commands to run before the actual command is invoked, using the metadata option prepend_text. An example use case is loading a particular Python environment, e.g. with conda, in which the command should run:

from aiida_shell import launch_shell_job
results, node = launch_shell_job(
    'some_command_in_some_conda_env',
    metadata={
        'options': {
            'prepend_text': 'conda activate some-conda-env'
        }
    }
)

The resulting bash script that is executed will look something like the following:

#!/usr/bin/env bash
exec > _scheduler-stdout.txt
exec 2> _scheduler-stderr.txt

conda activate some-conda-env

some_command_in_some_conda_env > 'stdout' 2> 'stderr'

echo $? > status