Defining and Executing Analysis Workflows
=========================================

Figure :ref:`workflow_illustration` illustrates the basic steps of using analysis workflows, i.e.:

1) Create the analysis tasks
2) Define the analysis inputs
3) Execute

.. _workflow_illustration:

.. figure:: _static/workflow_illustration.png
    :scale: 100 %
    :alt: Figure showing an example workflow

    Illustration of an example workflow for image normalization

In the following we use a simple analysis workflow, in which we compute a peak cube from a raw MSI dataset and then compute an NMF from the peak cube, to illustrate the main steps involved in performing complex analysis workflows.

Step 1: Create the analysis tasks
---------------------------------

First we need to create our main analysis objects.

.. code-block:: python
    :linenos:

    from omsi.dataformat.omsi_file import *
    from omsi.analysis.findpeaks.omsi_findpeaks_global import omsi_findpeaks_global
    from omsi.analysis.multivariate_stats.omsi_nmf import omsi_nmf

    # Open a file to get some MSI data
    f = omsi_file('/Users/oruebel/Devel/openmsi-data/msidata/20120711_Brain.h5', 'r')
    d = f.get_experiment(0).get_msidata(0)

    # Specify the analysis workflow
    a1 = omsi_findpeaks_global()    # Create a global peak finding analysis
    a2 = omsi_nmf()                 # Create an NMF that processes our peak cube

Step 2: Define analysis inputs
------------------------------

We can define the input parameters of an analysis simply via standard dict-like assignment. Any dependencies between analysis tasks or OpenMSI files are created automatically for us.

.. code-block:: python
    :linenos:

    # Define the inputs of the global peak finder
    a1['msidata'] = d                   # Set the input msidata
    a1['mzdata'] = d.mz                 # Set the input mz data
    # Define the inputs of the NMF
    a2['msidata'] = a1['peak_cube']     # Set the input data to the peak cube
    a2['numIter'] = 2                   # Set input to perform 2 iterations only

NOTE: So far we have only specified our workflow. We have not executed any analysis yet, nor have we loaded any actual data.

Step 3: Execute
---------------

Finally we need to execute our analyses. For this we have various options, depending on which parts of our workflow we want to execute.

Executing a single analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^

To execute a single analysis, we can simply call the ``execute()`` function of our analysis. Note that ``execute()`` may raise an ``AnalysisReadyError`` in case the inputs of the analysis are not ready. E.g.:

.. code-block:: python
    :linenos:

    a2.execute()    # Will fail with an AnalysisReadyError

.. code-block:: python
    :linenos:

    a1.execute()    # Will successfully execute a1

Executing a single sub-workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To execute a single analysis including any missing dependencies, we can simply call the ``execute_recursive()`` function. E.g.:

.. code-block:: python
    :linenos:

    a2.execute_recursive()    # Will successfully execute a1 and a2

The above will execute ``a1`` as well as ``a2``, since ``a2`` depends on ``a1``.

**NOTE:** Recursive execution will only execute other analyses that are actually needed to complete our analysis, and results of dependent analyses that have been executed before will be reused. E.g., if we called ``a2.execute_recursive()`` again, then only ``a2`` would be executed again.

**NOTE:** When executing multiple dependent analyses, the execution is typically controlled by a workflow executor (see :py:mod:`omsi.workflow.executor`). By default, ``execute_recursive(..)`` will automatically create a default driver. If we want to customize the driver to be used, we can simply assign a driver to the analysis beforehand by setting the :py:attr:`omsi.analysis.base.analysis_base.driver` instance variable.
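
For example, a minimal sketch of customizing the driver (assuming the ``greedy_executor`` shown later in this section and the ``driver`` instance variable described above) could look like this:

.. code-block:: python
    :linenos:

    from omsi.workflow.driver.greedy_executor import greedy_executor

    # Sketch: assign our own workflow driver to the analysis before executing it
    # (assumes the driver instance variable behaves as described above)
    a2.driver = greedy_executor()
    a2.execute_recursive()    # Recursive execution now uses the assigned driver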

Executing all analyses
^^^^^^^^^^^^^^^^^^^^^^

To run all analyses that have been created, independent of whether they depend on each other or not, we can simply call :py:meth:`omsi.analysis.base.analysis_base.execute_all`.

.. code-block:: python
    :linenos:

    a1.execute_all()    # Execute all analyses

The above will execute any analyses that are not up-to-date.

NOTE: In contrast to :py:meth:`omsi.analysis.base.analysis_base.execute` and :py:meth:`omsi.analysis.base.analysis_base.execute_recursive`, this is a class-level method and not an object method. Again, the function uses a workflow driver, which we can customize by providing a driver as input to the function.

Executing multiple sub-workflows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To explicitly execute a subset of analyses (and all their dependencies) we can explicitly define a driver for the workflow we want to execute:

.. code-block:: python
    :linenos:

    from omsi.workflow.driver.greedy_executor import greedy_executor

    driver = greedy_executor()    # Create a driver
    driver.add_analysis(a1)       # Add one or more analyses
    driver.add_analysis(a2)
    driver.execute()              # Execute the workflow and its dependencies

.. code-block:: python
    :linenos:

    driver2 = greedy_executor()
    driver2.add_analysis_all()    # Add all analyses
    driver2.execute()             # Execute all analyses

Example: Normalizing an image
-----------------------------

The goal of this example is to 1) illustrate the general concepts of how we can define analysis workflows and 2) illustrate the use of simple wrapped functions in combination with integrated analytics to create complex analysis workflows. The example shown below defines a basic image normalization workflow in which we:

1. Compute a reduced peak cube from an MSI image using the global peak finding analysis provided by BASTet
2. Use a simple wrapped function to compute the total intensity image for the peak cube dataset computed in step 1
3. Use a simple wrapped function to normalize the peak cube computed in step 1 using the total intensity image computed in step 2

This is the same workflow as shown in Figure :ref:`workflow_illustration`.

.. code-block:: bash
    :linenos:

    # Illustration of the basic image normalization workflow defined below:
    #
    # +-----------a1------------+        +-------------a2----------------+        +-----------a3--------------+
    # +---global-peak-finder----+        +------total_intensities--------+        +---normalize_intensities---+
    # |                         |        |                               |        |                           |
    # |  msidata      peak_cube +---+---->  msidata   total_intensities  +-------->  norm_factors   output_0  |
    # |                         |   |    |                               |        |                           |
    # |  mzdata                 |   |    |  axis=2                       |   +---->  msidata                  |
    # +-------------------------+   |    +-------------------------------+   |    +---------------------------+
    #                               |                                        |
    #                               |                                        |
    #                               +----------------------------------------+

.. code-block:: python
    :linenos:
    :emphasize-lines: 23,24,25,28,29,30,33,34,35,45

    import numpy as np
    from omsi.shared.log import log_helper
    log_helper.set_log_level('DEBUG')
    from omsi.analysis.findpeaks.omsi_findpeaks_global import omsi_findpeaks_global
    from omsi.dataformat.omsi_file.main_file import omsi_file
    from omsi.analysis.generic import analysis_generic

    # Define a simple function to compute the total intensity image
    def total_intensity(msidata, axis=2):
        import numpy as np
        return np.sum(msidata, axis=axis)

    # Define a simple function to normalize an MSI data cube by per-spectrum normalization factors
    def normalize_intensities(msidata, normfactors):
        import numpy as np
        return msidata / normfactors[:, :, np.newaxis]

    # Get an example MSI image
    f = omsi_file('/Users/oruebel/Devel/openmsi-data/msidata/20120711_Brain.h5', 'r')
    d = f.get_experiment(0).get_msidata(0)

    # Define the global peak finder
    a1 = omsi_findpeaks_global()
    a1['msidata'] = d
    a1['mzdata'] = d.mz

    # Define the computation of the total intensity image
    a2 = analysis_generic.from_function(analysis_function=total_intensity,
                                        output_names=['total_intensities'])
    a2['msidata'] = a1['peak_cube']

    # Define the normalization of the peak cube
    a3 = analysis_generic.from_function(normalize_intensities)
    a3['msidata'] = a1['peak_cube']
    a3['normfactors'] = a2['total_intensities']

    # To run the workflow we now have several basic options:
    #
    # 1) a3.execute_recursive() : Recursively execute the last analysis and all its dependencies (i.e., a1, a2)
    # 2) a1.execute_all()       : Tell any analysis to execute all available analyses (i.e., a1, a2, a3)
    # 3) Create our own workflow driver to control the execution of the analyses
    # 4) Manually call execute on a1, a2, and a3 in order of their dependencies

    # Execute the workflow
    a3.execute_recursive()

Workflow Tools
==============

Similar to the :py:mod:`omsi.workflow.driver.cl_analysis_driver` module (and the corresponding tool :py:mod:`omsi.tools.run_analysis`) for running single analysis tasks, BASTet provides basic tools for executing complete workflows via the concept of workflow drivers. Users may implement their own drivers using the appropriate base classes in :py:mod:`omsi.workflow.driver.base`. Some basic drivers and tools are already available with BASTet, e.g., the :py:mod:`omsi.workflow.driver.cl_workflow_driver` module (and the corresponding tool :py:mod:`omsi.tools.run_workflow`) defines a driver for executing one or multiple workflows, defined via workflow scripts, directly from the command line.

Workflow Scripts
----------------

Workflow scripts are regular Python scripts that include i) the creation of the analysis objects and ii) the full or partial definition of analysis parameters, but usually **NOT** the actual execution of any of the analyses. Following our example from earlier, we may simply save the following code in a Python source file, e.g., ``normalize_image.py``.

.. code-block:: python
    :linenos:

    import numpy as np
    from omsi.analysis.findpeaks.omsi_findpeaks_global import omsi_findpeaks_global
    from omsi.dataformat.omsi_file.main_file import omsi_file
    from omsi.analysis.generic import analysis_generic

    # Define a simple function to compute the total intensity image
    def total_intensity(msidata, axis=2):
        import numpy as np
        return np.sum(msidata, axis=axis)

    # Define a simple function to normalize an MSI data cube by per-spectrum normalization factors
    def normalize_intensities(msidata, normfactors):
        import numpy as np
        return msidata / normfactors[:, :, np.newaxis]

    # Define the global peak finder
    a1 = omsi_findpeaks_global()

    # Define the computation of the total intensity image
    a2 = analysis_generic.from_function(analysis_function=total_intensity,
                                        output_names=['total_intensities'])
    a2['msidata'] = a1['peak_cube']

    # Define the normalization of the peak cube
    a3 = analysis_generic.from_function(normalize_intensities)
    a3['msidata'] = a1['peak_cube']
    a3['normfactors'] = a2['total_intensities']

When using our command-line tool, all parameters that are not defined for any of the analyses are automatically exposed via command-line options. In contrast to our previous example, here we, e.g., do not set the input ``msidata`` and ``mzdata`` parameters for our global peak finder (a1). In this way, we can easily set the input file we want to process directly via the command line. In cases where we want to expose a parameter via the command line but still want to provide a good default setting for the user, we can set the default value of a parameter via, e.g., ``a1.get_parameter_data_by_name('peakheight')['default'] = 3``.

To execute our above example from the command line we can now simply do the following:

.. code-block:: bash

    python run_workflow.py --script normalize_image.py --ana_0:msidata $HOME/20120711_Brain.h5:/entry_0/data_0 --ana_0:mzdata $HOME/20120711_Brain.h5:/entry_0/data_0/mz

In order to avoid collisions between parameters with the same name for different analyses, the tool prepends the unique ``analysis_identifier`` to each parameter. Since we did not set any explicit ``analysis_identifier`` (e.g., via ``a1.analysis_identifier = 'a1'``), the tool automatically generated unique identifiers (i.e., ``ana_0``, ``ana_1``, and ``ana_2`` for our three analyses).
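
For example, a hypothetical extension of ``normalize_image.py`` could set explicit identifiers and a default value for one of the exposed parameters, as described above; the identifier names used below are only for illustration:

.. code-block:: python
    :linenos:

    # Hypothetical additions to normalize_image.py:
    # give the analyses explicit identifiers so that the auto-generated
    # command-line options use readable prefixes (e.g., --peakfinder:msidata)
    a1.analysis_identifier = 'peakfinder'
    a2.analysis_identifier = 'totals'
    a3.analysis_identifier = 'normalize'

    # Provide a default for an exposed parameter while still allowing the user
    # to override it from the command line
    a1.get_parameter_data_by_name('peakheight')['default'] = 3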

To view all available command-line options we can simply call the script with ``--help``. If one or more workflow scripts are given (here via separate ``--script`` parameters), then all unfilled options of those workflows and the corresponding analyses will be listed as well. E.g.:

.. code-block:: text
    :linenos:

    newlappy:tools oruebel$ python run_workflow.py --script normalize_image.py --help
    usage: run_workflow.py --script SCRIPT [--save SAVE] [--profile] [--memprofile]
                           [--loglevel {INFO,WARNING,CRITICAL,ERROR,DEBUG,NOTSET}]
                           --ana_0:msidata ANA_0:MSIDATA --ana_0:mzdata ANA_0:MZDATA
                           [--ana_0:integration_width ANA_0:INTEGRATION_WIDTH]
                           [--ana_0:peakheight ANA_0:PEAKHEIGHT]
                           [--ana_0:slwindow ANA_0:SLWINDOW]
                           [--ana_0:smoothwidth ANA_0:SMOOTHWIDTH]
                           [--ana_1:axis ANA_1:AXIS]
                           [--reduce_memory_usage REDUCE_MEMORY_USAGE]
                           [--synchronize SYNCHRONIZE] [-h]

    Execute analysis workflow(s) based on a given set of scripts

    required arguments:
      --script SCRIPT       The workflow script to be executed. Multiple scripts
                            may be added via separate --script arguments
                            (default: None)

    optional arguments:
      --save SAVE           Define the file and experiment where all analysis
                            results should be stored. A new file will be created
                            if the given file does not exist but the directory
                            does. The filename is expected to be of the form
                            <filename>:<experiment path>. If no experiment index
                            is given, then experiment index 0 (i.e., entry_0)
                            will be assumed by default. A valid path may, e.g.,
                            be "test.h5:/entry_0" or just "test.h5"
                            (default: None)
      --profile             Enable runtime profiling of the analysis. NOTE: This
                            is intended for debugging and investigation of the
                            runtime behavior of an analysis. Enabling profiling
                            entails certain overheads in performance
                            (default: False)
      --memprofile          Enable runtime profiling of the memory usage of
                            analysis. NOTE: This is intended for debugging and
                            investigation of the runtime behavior of an analysis.
                            Enabling profiling entails certain overheads in
                            performance. (default: False)
      --loglevel {INFO,WARNING,CRITICAL,ERROR,DEBUG,NOTSET}
                            Specify the level of logging to be used.
                            (default: INFO)
      -h, --help            show this help message and exit

    ana_0:omsi.analysis.findpeaks.omsi_findpeaks_global:analysis settings:
      Analysis settings

      --ana_0:integration_width ANA_0:INTEGRATION_WIDTH
                            The window over which peaks should be integrated
                            (default: 0.1)
      --ana_0:peakheight ANA_0:PEAKHEIGHT
                            Peak height parameter (default: 2)
      --ana_0:slwindow ANA_0:SLWINDOW
                            Sliding window parameter (default: 100)
      --ana_0:smoothwidth ANA_0:SMOOTHWIDTH
                            Smooth width parameter (default: 3)

    ana_0:omsi.analysis.findpeaks.omsi_findpeaks_global:input data:
      Input data to be analyzed

      --ana_0:msidata ANA_0:MSIDATA
                            The MSI dataset to be analyzed (default: None)
      --ana_0:mzdata ANA_0:MZDATA
                            The m/z values for the spectra of the MSI dataset
                            (default: None)

    ana_1 : generic:
      --ana_1:axis ANA_1:AXIS

    optional workflow executor options:
      Additional, optional settings for the workflow execution controls

      --reduce_memory_usage REDUCE_MEMORY_USAGE
                            Reduce memory usage by pushing analyses to file each
                            time they complete, processing dependencies
                            out-of-core. (default: False)
      --synchronize SYNCHRONIZE
                            Place an MPI barrier at the beginning of the
                            execution of the workflow. This can be useful when we
                            require that all MPI ranks are fully initialized.
                            (default: False)

    how to specify ndarray data?
    ----------------------------
    n-dimensional arrays stored in OpenMSI data files may be specified as input
    parameters via the following syntax:
        -- MSI data: <filename>.h5:/entry_#/data_#
        -- Analysis data: <filename>.h5:/entry_#/analysis_#/<dataset name>
        -- Arbitrary dataset: <filename>.h5:<object path>
    E.g. a valid definition may look like: 'test_brain_convert.h5:/entry_0/data_0'
    In rare cases we may need to manually define an array (e.g., a mask).
    Here we can use standard python syntax, e.g., '[1,2,3,4]' or '[[1, 3], [4, 5]]'

    This command-line tool has been auto-generated by BASTet (Berkeley Analysis & Storage Toolkit)