datastructures Package

  • omsi.datastructures – Package with a collection of various data structures and related classes used throughout the software stack, e.g., for metadata, analysis parameter data, runtime information data etc.
  • omsi.datastructures.metadata – Package with metadata data structures
  • omsi.datastructures.metadata.metadata_data – Define infrastructure for describing metadata (in memory)
  • omsi.datastructures.metadata.metadata_ontologies – Define ontologies for metadata
  • omsi.datastructures.analysis_data – Helper module with data structures for managing analysis-related data.
  • omsi.datastructures.dependency_data – Define a dependency to another omsi object
  • omsi.datastructures.run_info_data – Module with helper data structures for recording runtime provenance data

datastructures Package

Package with a collection of various data structures and related classes used throughout the software stack, e.g., for metadata, analysis parameter data, runtime information data etc.

analysis_data Module

Helper module with data structures for managing analysis-related data.

class omsi.datastructures.analysis_data.analysis_data(name='undefined', data=None, dtype='float32')

Bases: dict

Define an output dataset for the analysis that should be written to the omsi HDF5 file

The class can be used like a dictionary but restricts the set of allowed keys to the following required keys, which should be provided during initialization.

Required Keyword Arguments:

Parameters:
  • name – The name for the dataset in the HDF5 format
  • data – The numpy array to be written to HDF5. The data write function omsi_file_experiment.create_analysis used for writing the data to file can in principle also handle other primitive data types by explicitly converting them to numpy. However, in this case the dtype is determined by the numpy conversion and correct behavior is not guaranteed. Therefore, even single scalars should be stored as a 1D numpy array here. The default value is None, which is mapped to np.empty( shape=(0) , dtype=dtype) in __init__
  • dtype

    The data type to be used during writing. For standard numpy data types this is just the dtype of the dataset, i.e., [‘data’].dtype. Other allowed datatypes are:

    • For string: omsi_format.str_type (omsi_format is located in omsi.dataformat.omsi_file )
    • To generate data links: ana_hdf5link (analysis_data)

Value used to indicate that a hard link to another dataset should be created when saving an analysis object
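The restricted-key behavior described for analysis_data can be illustrated with a minimal, self-contained sketch. It does not use the omsi package: the class name RestrictedDict is hypothetical, and a plain list stands in for the empty numpy array so that the example runs with the standard library only.

```python
class RestrictedDict(dict):
    """Hypothetical stand-in: a dict that only accepts a fixed set of keys."""

    allowed_keys = ('name', 'data', 'dtype')

    def __init__(self, name='undefined', data=None, dtype='float32'):
        super().__init__()
        # Route every assignment through __setitem__ so the key check applies.
        self['name'] = name
        self['dtype'] = dtype
        # Assumption: an empty list stands in for the empty numpy array.
        self['data'] = [] if data is None else data

    def __setitem__(self, key, value):
        if key not in self.allowed_keys:
            raise KeyError('%r is not an allowed key' % (key,))
        super().__setitem__(key, value)


entry = RestrictedDict(name='peak_mz', data=[1.0, 2.0])
try:
    entry['units'] = 'Da'   # not in allowed_keys
    rejected = False
except KeyError:
    rejected = True
```

Note that dict.update() bypasses __setitem__ in a plain dict subclass, so a production class would likely need to guard that path as well.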

class omsi.datastructures.analysis_data.data_dtypes

Bases: dict

Class providing basic functions for specifying common data types used as part of an analysis.

static bool_type(argument)

Implement conversion of boolean input parameters, since argparse (or bool, depending on the point of view) does not handle bool as a type in an intuitive fashion.

Parameters:argument – The argument to be parsed to a boolean
Returns:The converted value
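The motivation for a dedicated boolean converter is that bool('False') evaluates to True, so passing type=bool to argparse accepts any non-empty string as True. The following sketch shows the problem and one possible converter. The function name str_to_bool and the set of accepted spellings are assumptions for illustration, not the behavior of data_dtypes.bool_type.

```python
import argparse


def str_to_bool(argument):
    """Hypothetical converter; the accepted spellings are an assumption."""
    if isinstance(argument, bool):
        return argument
    text = str(argument).strip().lower()
    if text in ('true', 't', 'yes', 'y', '1'):
        return True
    if text in ('false', 'f', 'no', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError(
        'cannot interpret %r as a boolean' % (argument,))


parser = argparse.ArgumentParser()
# type=bool calls bool('False'), which is True for any non-empty string.
parser.add_argument('--naive', type=bool, default=False)
# The custom converter inspects the text instead.
parser.add_argument('--parsed', type=str_to_bool, default=False)
args = parser.parse_args(['--naive', 'False', '--parsed', 'False'])
```

After parsing, args.naive is True even though the user typed False, while args.parsed is False.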
static get_dtypes()

Get a list of available data type specifications

static ndarray(argument)

This dtype may be used to indicate numpy ndarrays as well as h5py arrays or omsi_dependencies

Parameters:argument – The argument to be parsed to ndarray
Returns:The converted ndarray
class omsi.datastructures.analysis_data.parameter_data(name, help='', dtype=None, required=False, default=None, choices=None, data=None, group=None)

Bases: dict

Define a single input parameter for an analysis.

Variables:default_keys – List of allowed dictionary keys:

Required keys:

  • name : The name of the parameter
  • help : Help string describing the parameter
  • dtype : Optional type. Default is None, indicating a dynamically typed dataset that the analysis will convert
  • required : Boolean indicating whether the parameter is required (True) or optional (False). Default False
  • default : Optional default value for the parameter. Default None.
  • choices : Optional list of choices with allowed data values. Default None, indicating no choices set.
  • data : The data assigned to the parameter. None by default.
  • group : Optional group string used to organize parameters. This may also be a dict of {‘name’:<group>, ‘description’:<description>}

In the context of the argparse package the default keys have the following mapping:

  • argparse.name = name
  • argparse.action –> The action is constant and set to save value
  • argparse.nargs –> Left as default
  • argparse.const –> Not used as action is always save value
  • argparse.type = dtype
  • argparse.choices = choices
  • argparse.required = required
  • argparse.help = help
  • argparse.metavar –> Not used. Positional arguments are not allowed for analyses
  • argparse.dest –> Automatically determined by the name of the parameter
  • argparse.add_argument_group(...) –> Automatically determined based on the required parameter and the group parameter if set.
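As a rough illustration of this mapping, the sketch below turns plain dictionaries with the keys listed above into argparse arguments. It is not the driver code of the omsi package. The parameter names (msidata, npeaks), the '--' flag prefix, and the fallback group titles 'required' and 'optional' are assumptions, and only string-valued groups are handled.

```python
import argparse

# Hypothetical parameter descriptions using the keys listed above.
parameters = [
    {'name': 'msidata', 'help': 'Input data', 'dtype': str,
     'required': True, 'default': None, 'choices': None, 'group': None},
    {'name': 'npeaks', 'help': 'Number of peaks', 'dtype': int,
     'required': False, 'default': 10, 'choices': [5, 10, 20], 'group': None},
]

parser = argparse.ArgumentParser(prog='analysis')
groups = {}
for param in parameters:
    # Use the explicit group if set, otherwise group by the required flag.
    group_name = param['group'] or (
        'required' if param['required'] else 'optional')
    if group_name not in groups:
        groups[group_name] = parser.add_argument_group(group_name)
    groups[group_name].add_argument(
        '--' + param['name'],   # optional-style flag; dest is derived from it
        type=param['dtype'],
        choices=param['choices'],
        required=param['required'],
        default=param['default'],
        help=param['help'],
    )

args = parser.parse_args(['--msidata', 'file.h5', '--npeaks', '20'])
```

Because action, nargs, const, and metavar are left at their argparse defaults, each flag stores exactly one converted value under an attribute named after the parameter.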

Initialize a new parameter description.

Parameters:
  • name – Required name for the parameter
  • help – Required help string for the parameter
  • dtype – Optional type argument. Default None.
  • required – Boolean indicating whether the parameter is required (default=False)
  • default – Optional default value for the parameter. Default None.
  • choices – Optional list of choices with allowed data values. Default None, indicating no choices set.
  • data – The data assigned to the parameter. None by default.
  • group – The parameter group to be used. None by default.
clear_data()

Remove the currently assigned data.

copy()

Return a new parameter_data object with the same data as stored in the current object

Returns:parameter_data object
data_ready()

Check whether the data points to a dependency and, if so, whether the dependency can be resolved.

data_set()

Check whether data has been assigned to the parameter.

default_keys = ['name', 'default', 'dtype', 'choices', 'required', 'help', 'data', 'group']

List of allowed keys for the parameter dict.

get_data_or_default()

Get the data of the parameter if set, otherwise get the default value if available.

Returns:The data to be used for the parameter.
Raises:KeyError is raised in case that neither ‘default’ nor ‘data’ are available. This should never be the case if the object was created properly.
get_group_description()

Get the description for the group if available.

Returns:String with the group description or None.
get_group_name()

Get the name of the group to be used.

Returns:String with the name of the group or None if not set
is_dependency()

Check whether the parameter defines a dependency.

Returns:Boolean indicating whether the parameter defines a dependency.
class omsi.datastructures.analysis_data.parameter_manager

Bases: object

Base class for objects that manage their own parameters.

Parameters are set and their values retrieved by name using dict-like slicing. Derived classes may overwrite __getitem__ and __setitem__ to implement their own behavior, but we expect that the functionality of the interface is preserved, i.e., others should still be able to set parameter values and retrieve values via dict slicing.

add_parameter(name, help, dtype=<type 'unicode'>, required=False, default=None, choices=None, data=None, group=None)

Add a new parameter for the analysis. This function is typically used in the constructor of a derived analysis to specify the parameters of the analysis.

Parameters:
  • name – The name of the parameter
  • help – Help string describing the parameter
  • dtype – Optional type. Default is string.
  • required – Boolean indicating whether the parameter is required (True) or optional (False). Default False.
  • default – Optional default value for the parameter. Default None.
  • choices – Optional list of choices with allowed data values. Default None, indicating no choices set.
  • data – The data assigned to the parameter. None by default.
  • group – Optional group string used to organize parameters. Default None, indicating that parameters are automatically organized by driver class (e.g. in required and optional parameters)
Raises:

ValueError is raised if the parameter with the given name already exists.
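The interplay of add_parameter(...) and dict-like slicing can be sketched with a small stand-alone class. ParameterStore is a hypothetical, simplified stand-in and not the parameter_manager implementation. It stores parameters as plain dicts, and the choice to have slicing read and write the 'data' entry is an assumption.

```python
class ParameterStore(object):
    """Hypothetical, simplified stand-in for a parameter-managing class."""

    def __init__(self):
        self.parameters = []   # list of plain dicts, one per parameter

    def add_parameter(self, name, help, dtype=str, required=False,
                      default=None, choices=None, data=None, group=None):
        # Reject duplicate names, as described for add_parameter above.
        if any(p['name'] == name for p in self.parameters):
            raise ValueError('A parameter named %r already exists' % (name,))
        self.parameters.append({
            'name': name, 'help': help, 'dtype': dtype,
            'required': required, 'default': default,
            'choices': choices, 'data': data, 'group': group})

    def keys(self):
        return [p['name'] for p in self.parameters]

    def __getitem__(self, key):
        for p in self.parameters:
            if p['name'] == key:
                return p['data']
        raise KeyError(key)

    def __setitem__(self, key, value):
        for p in self.parameters:
            if p['name'] == key:
                p['data'] = value
                return
        raise KeyError(key)


store = ParameterStore()
store.add_parameter(name='npeaks', help='Number of peaks', dtype=int, default=10)
store['npeaks'] = 25   # set the parameter value by name
```

A derived class would typically call add_parameter(...) in its constructor and leave the slicing interface untouched.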

clear_parameter_data()

Clear the list of parameter data

define_missing_parameters()

Set any required parameters that have not been defined to their respective default values.

This function may be overwritten in child classes to customize the definition of default parameter values and to apply any modifications (or checks) of parameters before the analysis is executed. Any changes applied here will be recorded in the parameter of the analysis.
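One possible reading of this behavior is sketched below on plain dictionaries. The helper name fill_missing is hypothetical, and treating a 'data' value of None as "not defined" is an assumption. The function returns the names it changed so that the modifications can be recorded.

```python
def fill_missing(parameters):
    """Assign the default to each required parameter whose data is None."""
    changed = []
    for param in parameters:
        if param['required'] and param['data'] is None:
            param['data'] = param['default']
            changed.append(param['name'])
    return changed


# Hypothetical parameter records for illustration.
params = [
    {'name': 'npeaks', 'required': True, 'default': 10, 'data': None},
    {'name': 'mask', 'required': False, 'default': None, 'data': None},
    {'name': 'width', 'required': True, 'default': 3, 'data': 7},
]
changed_names = fill_missing(params)
```

Only npeaks is changed here: mask is optional, and width already has user-supplied data.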

get_all_dependency_data()

Get the complete list of all direct dependencies to be written to the HDF5 file

NOTE: These are only the direct dependencies as specified by the analysis itself. Use get_all_dependency_data_recursive(..) to also get the indirect dependencies of the analysis due to dependencies of the dependencies themselves.

Returns:List of parameter_data objects that define dependencies.
get_all_parameter_data(exclude_dependencies=False)

Get the complete list of all parameter datasets to be written to the HDF5 file

Parameters:exclude_dependencies – Boolean indicating whether we should exclude parameters that define dependencies from the list
get_num_dependency_data()

Return the number of dependencies defined as part of the parameters

get_num_parameter_data()

Return the number of parameter datasets to be written to the HDF5 file

get_parameter_data(index)

Given the index return the associated dataset to be written to the HDF5 file

Parameters:index – Index of the entry to return from the private member parameters. If a string is given, then get_parameter_data_by_name(...) will be used instead.
Raises:IndexError is raised when the index is out of bounds
get_parameter_data_by_name(dataname)

Given the key name of the data return the associated parameter_data object.

Parameters:dataname – Name of the parameter requested from the parameters member.
Returns:The parameter_data object or None if not found
get_parameter_names()

Get a list of all parameter dataset names (including those that may define dependencies).

keys()

Get a list of all valid keys, i.e., a list of all parameter names.

Returns:List of strings with all input parameter and output names.
set_parameter_default_value(name, value)

Set the default value of the parameter with the given name

Parameters:
  • name – Name of the parameter
  • value – New value
Raises:

KeyError if parameter not found

dependency_data Module

Define a dependency to another omsi object

class omsi.datastructures.dependency_data.dependency_dict(param_name=None, link_name=None, omsi_object=None, selection=None, dataname=None, help=None, dependency_type=None)

Bases: dict

Define a dependency to another omsi file-based data object or in-memory analysis_base object

Required Keyword Arguments:

Variables:
  • param_name – The name of the parameter that has the dependency
  • link_name – The name for the link to be created in the HDF5 file.
  • omsi_object – The object to which a link should be established. This must be either an h5py.Dataset or the omsi_file_analysis or omsi_file_msidata or any of the other omsi_file API interface objects.
  • selection – Optional string type parameter indicating a python selection for the dependency
  • dataname – String indicating the dataset within the omsi_object. If the omsi_object is an h5py object within a managed Group, then the omsi_object is automatically split up into the parent object and dataname.
  • _data – Private key used to store the data associated with the dependency object.

Optional Keyword arguments:

Variables:dependency_type – The type of the dependency being modeled. If not defined then the default value of ‘parameter’ is assumed.

Initialize the allowed set of keys.

Parameters:
  • param_name – The name of the parameter that has the dependency
  • link_name – The name for the link to be created in the HDF5 file.
  • omsi_object – The object to which a link should be established. This must be either an h5py.Dataset or the omsi_file_analysis or omsi_file_msidata or any of the other omsi_file API interface objects.
  • selection – Optional string type parameter indicating a python selection for the dependency
  • dataname – String indicating the dataset within the omsi_object. If the omsi_object is an h5py object within a managed Group, then the omsi_object is automatically split up into the parent object and dataname.
  • help – Optional string describing the object
copy()

Return a new dependency_dict object with the same data as stored in the current object

Returns:dependency_dict object
dependency_types = {'subset': 'subset', 'undefined': None, 'contains': 'contains', 'link': 'link', 'parameter': 'parameter', 'co_modality': 'co_modality'}
get_data()

Get the data associated with the dependency.

Returns:If a selection is applied and the dependency object supports array data load (e.g., h5py.Dataset, omsi_file_msidata), then the selected data will be loaded and returned as numpy array. Otherwise the [‘omsi_object’] is returned.

run_info_data Module

Module with helper data structures for recording runtime provenance data

class omsi.datastructures.run_info_data.run_info_dict(*args, **kwargs)

Bases: dict

Simple dictionary class for collecting runtime information

The typical use is as follows:

>>> my_run_info = run_info_dict()
>>> my_run_info(my_function)(my_parameters)

With this, all runtime information is automatically collected in my_run_info. We can enable time-and-usage and memory profiling simply by calling enable_profile_time_and_usage(...) or enable_profile_memory(...), respectively, before we run our function.

We can also use the data structure directly and control the population ourselves. In this case, however, memory profiling is not supported by default: we need to set up and run the memory profiler ourselves, since memory_profiler expects that it can wrap the function.
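The call pattern my_run_info(my_function)(my_parameters) works because the dictionary itself is callable and returns a wrapped version of the function. The sketch below shows the idea with the standard library only. The class RunInfo is a hypothetical stand-in, and the key names start_time, end_time, and execution_time are assumptions based on the names mentioned in this module's documentation. Profiling is left out.

```python
import datetime
import time


class RunInfo(dict):
    """Hypothetical sketch of a dict that records timing for a wrapped call."""

    TIME_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

    def __call__(self, func):
        def wrapper(*args, **kwargs):
            # Record the wall-clock start as a formatted string.
            self['start_time'] = datetime.datetime.now().strftime(
                self.TIME_FORMAT)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so timing is always recorded.
                self['execution_time'] = time.perf_counter() - start
                self['end_time'] = datetime.datetime.now().strftime(
                    self.TIME_FORMAT)
        return wrapper


def add(a, b):
    return a + b


info = RunInfo()
result = info(add)(2, 3)
```

The wrapped call returns the function's own result, while the timing entries accumulate in info as a side effect.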

DEFAULT_TIME_FORMAT = '%Y-%m-%d %H:%M:%S.%f'
clean_up()

Clean up the runinfo object. In particular remove empty keys that either recorded None or recorded just an empty string.

This function may be overwritten to also do clean-up needed due to additional custom runtime instrumentation.

When overwriting this function we should call super(..., self).clean_up() at the end of the function to ensure that the runinfo dictionary is clean, i.e., does not contain any empty entries.
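The core of such a clean-up step, removing entries that recorded None or an empty string, can be sketched as a plain function. The helper name remove_empty_entries is hypothetical. Values such as 0 or an empty list are kept here, which is an assumption about the intended behavior.

```python
def remove_empty_entries(info):
    """Delete keys whose value is None or an empty string, in place."""
    # Iterate over a copy of the keys because the dict is modified.
    for key in list(info.keys()):
        value = info[key]
        if value is None or (isinstance(value, str) and len(value) == 0):
            del info[key]
    return info


# Hypothetical runtime record for illustration.
record = {'start_time': '2015-06-15 12:30:45.250000',
          'architecture': '',
          'execution_time': 0,
          'profile': None}
remove_empty_entries(record)
```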

clear()

Clear the dictionary and other internal parameters

Side Effects

  • Remove all key/value pairs from the dict
  • Set self.__time_and_use_profiler to None
  • Set self.__memory_profiler to None
  • Set self.__profile_memory to False if invalid (i.e, if set to True but memory profiling is unavailable)
  • Set self.__profile_time_and_usage to False if invalid (i.e., if set to True but profiling is unavailable)
enable_profile_memory(enable=True)

Enable/disable profiling of memory usage

Parameters:enable – boolean to enable (True) or disable (False) memory profiling
enable_profile_time_and_usage(enable=True)

Enable/disable time and usage profiling

Parameters:enable – boolean to enable (True) or disable (False) time and usage profiling
gather()

Simple helper function to gather the runtime information—that has been collected on multiple processes when running using MPI—on a single root process

Returns:If we have more than one process, then this function returns a dictionary with the same keys as usual for the run_info, but the values are now lists with one entry per MPI process. If we only have a single process, then the run_info object will be returned without changes. NOTE: Similar to MPI gather, the function only collects information on the root. All other processes will return just their own private runtime information.
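The shape of the gathered result can be illustrated without MPI by merging a list of per-process dictionaries, as the root process would see them after a gather. The helper merge_run_infos is hypothetical. Filling missing keys with None and taking the union of all keys are assumptions.

```python
def merge_run_infos(per_process_infos):
    """Combine one dict per process into a single dict of lists."""
    if len(per_process_infos) == 1:
        # Single process: return the record without changes.
        return per_process_infos[0]
    keys = []
    for info in per_process_infos:
        for key in info:
            if key not in keys:
                keys.append(key)
    # One list entry per process, in rank order.
    return {key: [info.get(key) for info in per_process_infos]
            for key in keys}


# Hypothetical records from two ranks.
rank_infos = [{'execution_time': 1.5, 'rank': 0},
              {'execution_time': 2.0, 'rank': 1}]
merged = merge_run_infos(rank_infos)
```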
get_profile_memory()

Check whether profiling of memory usage is enabled

Returns:Boolean indicating whether memory profiling is enabled
get_profile_stats_object(consolidate=True, stream=None)

Based on the execution profile of the execute_analysis(..) function get pstats.Stats object to help with the interpretation of the data.

Parameters:
  • consolidate – Boolean flag indicating whether multiple stats (e.g., from multiple cores) should be consolidated into a single stats object. Default is True.
  • stream – The optional stream parameter to be used for the pstats.Stats object.
Returns:

A single pstats.Stats object if consolidate is True. Otherwise the function returns a list of pstats.Stats objects, one per recorded statistic. None is returned in case that the stats objects cannot be created or no profiling data is available.
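For readers unfamiliar with pstats, the following sketch shows how profiles recorded with cProfile can be consolidated into one pstats.Stats object and printed to a custom stream. It uses only the standard library and does not reproduce how run_info_dict stores its profiling data. The functions busy_work and profile_call are hypothetical.

```python
import cProfile
import io
import pstats


def busy_work(n):
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func under a fresh profiler and return the profiler."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    return profiler


# Two separate profiles, e.g., as they might come from two processes.
profile_a = profile_call(busy_work, 1000)
profile_b = profile_call(busy_work, 2000)

# Passing several profiles to pstats.Stats consolidates them.
stream = io.StringIO()
stats = pstats.Stats(profile_a, profile_b, stream=stream)
stats.sort_stats('cumulative').print_stats()
report = stream.getvalue()
```

Writing to an io.StringIO stream instead of standard output makes it possible to store or post-process the report text.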

get_profile_time_and_usage()

Check whether time and usage profiling is enabled

Returns:Boolean indicating whether time and usage profiling is enabled
record_postexecute(execution_time=None)

Function used to record runtime information after the task we want to track is completed, e.g., the execute_analysis(...) function of a standard analysis.

The function may be overwritten in child classes to add recording of additional runtime information.

When overwriting the function we should call super(...,self).record_postexecute(execution_time) in the custom version to ensure that the execution time and end_time are properly recorded.

Parameters:
  • execution_time – The total time it took to execute the analysis. May be None, in which case the function will attempt to compute the execution time based on the start_time (if available) and the current time.
  • comm – Used for logging only. The MPI communicator to be used. Default value is None, in which case MPI.COMM_WORLD is used.
record_preexecute()

Record basic runtime information in this dict before the execution is started.

Function used to record runtime information prior to executing the process we want to track, e.g., the execute_analysis(...) of a standard analysis.

The function may be overwritten in child classes to add recording of additional runtime information. All runtime data should be recorded in the main dict (i.e, self). This ensures in the case of standard analysis that the data is stored in the HDF5 file. Other data should be stored in separate variables that we may add to the object.

When overwriting the function we should typically call super(...,self).record_preexecute() last in the custom version to ensure that the start_time is properly recorded right before the execution of the analysis.

static string_to_structime(time_string, time_format=None)

Convert a time string to a time.struct_time using time.strptime

Parameters:
  • time_string – String with the time, e.g., with the start time of a program.
  • time_format – The time format to be used or None in which case run_info_dict.DEFAULT_TIME_FORMAT will be used.
static string_to_time(time_string, time_format=None)

Convert a time string to local time object using time.mktime.

Parameters:
  • time_string – String with the time, e.g., with the start time of a program.
  • time_format – The time format to be used or None in which case run_info_dict.DEFAULT_TIME_FORMAT will be used.
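The two conversions can be reproduced with the standard time module alone. The sketch below parses a timestamp written in the '%Y-%m-%d %H:%M:%S.%f' format into a time.struct_time and then into seconds since the epoch in local time. The example timestamp is made up. Note that time.struct_time has no field for fractional seconds, so the microseconds are dropped.

```python
import time

TIME_FORMAT = '%Y-%m-%d %H:%M:%S.%f'
stamp = '2015-06-15 12:30:45.250000'   # hypothetical recorded start time

# String -> time.struct_time (fractional seconds are parsed but not kept).
as_struct = time.strptime(stamp, TIME_FORMAT)

# time.struct_time -> seconds since the epoch, interpreted as local time.
as_seconds = time.mktime(as_struct)
```

The numeric form is convenient for computing durations, e.g., by subtracting the converted start time from the converted end time.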