dataformat Package

Main module for the specification of data formats. This package contains the omsi_file module, which specifies the OpenMSI HDF5 data format. In addition, it defines the base class for third-party file readers (i.e., file_reader_base) and implements basic file readers for third-party formats, e.g., img_file and mzml_file for IMG and mzML data files, respectively (among others).

omsi.dataformat.omsi_file Module for specification of the OpenMSI file API.
omsi.dataformat.omsi_file.analysis Module for managing custom analysis data in OMSI HDF5 files.
omsi.dataformat.omsi_file.common Module for common data format classes and functionality.
omsi.dataformat.omsi_file.dependencies Base module for managing dependencies between data in OpenMSI HDF5 files.
omsi.dataformat.omsi_file.experiment OMSI file module for management of experiment data.
omsi.dataformat.omsi_file.format This module defines the basic format for storing mass spectrometry imaging data, metadata, and analysis in HDF5 in compliance with OpenMSI file format.
omsi.dataformat.omsi_file.instrument Module for managing instrument related data in OMSI files.
omsi.dataformat.omsi_file.main_file Module for managing OpenMSI HDF5 data files.
omsi.dataformat.omsi_file.metadata_collection Module for management of general metadata storage entities.
omsi.dataformat.omsi_file.methods Module for management of method-specific data in OMSI data files.
omsi.dataformat.omsi_file.msidata Module for managing MSI data in OMSI data files.
omsi.dataformat.file_reader_base Module for base classes for implementation and integration of third-party file readers.
omsi.dataformat.bruckerflex_file This module provides functionality for reading bruker flex mass spectrometry image files.
omsi.dataformat.img_file This module provides functionality for reading img mass spectrometry image files.
omsi.dataformat.imzml_file This module provides functionality for reading imzML mass spectrometry image files.
omsi.dataformat.mzml_file This module provides functionality for reading mzml mass spectrometry image files.

file_reader_base Module

Module for base classes for implementation and integration of third-party file readers.

ToDo:

  • get_number_of_regions(...) should be updated to return a list of regions, one per dataset
  • Need to add base class for multi dataset formats
  • Need to add base class for multi dataset+region formats
  • Need to implement new file format for combined raw data file (i.e., multiple raw files in one folder).
class omsi.dataformat.file_reader_base.file_reader_base(basename, requires_slicing=True)

Bases: object

Base-class used to define the basic interface that file-readers for a new format need to implement.

__init__ interface:

To avoid the need for custom code, subclasses should be constructible by providing just the basename parameter and the optional requires_slicing parameter. If additional inputs are needed, then file conversion and management scripts may need to be modified to account for the custom requirements.

Required Attributes:

Variables:
  • data_type – String indicating the data type to be used (e.g., uint16)
  • shape – Tuple indicating the shape of the data
  • mz – Numpy array with the m/z axis data. In the case of multi-data this is a list of numpy arrays, one per dataset.
  • basename – The basename provided for opening the file.

Required Interface Functions:

  • close_file : Close any opened files
  • is_valid_dataset : Check whether a given dir/file is valid under the current format
  • spectrum_iter : Generator function that iterates over all the spectra in the current data cube and yields the numpy array with the intensity and the integer x, y position of the spectrum

Optional Interface Functions:

  • __getitem__ : Implement array slicing for files. Required if the requires_slicing parameter should be supported.

  • supports_regions : Specify whether the format supports multiple regions (default=False)

  • supports_multidata: Specify whether the format supports multiple datasets (default=False)

  • supports_multiexperiment: Specify whether the format supports multiple experiments (default=False)

Construct the base class and define required attributes.

Parameters:
  • basename – The name of the file or directory with the file to be opened by the reader
  • requires_slicing – Boolean indicating whether the user requires array slicing via the __getitem__ function to work or not. This is an optimization, because many MSI data formats do not easily support arbitrary slicing of data but rather only iteration over spectra.
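
As an illustration of the interface described above, the following is a minimal sketch of a reader that provides the required attributes and functions. It is deliberately self-contained: the class InMemoryReader, its fake in-memory data, and the '.toy' file extension are assumptions made for this example, and the class derives from object rather than from omsi.dataformat.file_reader_base.file_reader_base, which a real reader would be expected to subclass.

```python
import numpy as np


class InMemoryReader(object):
    """Toy reader mirroring the documented file_reader_base interface.

    Hypothetical stand-in: a real reader would subclass
    omsi.dataformat.file_reader_base.file_reader_base instead of object.
    """

    def __init__(self, basename, requires_slicing=True):
        # Required attribute: the basename used to open the file
        self.basename = basename
        # Fake 2 x 3 pixel image with 4 m/z bins per spectrum
        self.data = np.arange(24, dtype='uint16').reshape((2, 3, 4))
        # Required attributes: data type, shape and m/z axis
        self.data_type = 'uint16'
        self.shape = self.data.shape
        self.mz = np.linspace(100.0, 400.0, 4)

    def __getitem__(self, key):
        # Optional: array slicing, delegated to the in-memory cube
        return self.data[key]

    def close_file(self):
        # Nothing to close for in-memory data
        pass

    @classmethod
    def is_valid_dataset(cls, name):
        # Assumption for this toy format: names ending in '.toy' are valid
        return name.endswith('.toy')

    def spectrum_iter(self):
        # Yield ((x, y), intensities) for every pixel of the cube
        for x in range(self.shape[0]):
            for y in range(self.shape[1]):
                yield (x, y), self.data[x, y, :]


reader = InMemoryReader('example.toy')
spectra = list(reader.spectrum_iter())
reader.close_file()
```

Note that the constructor takes only basename and requires_slicing, which matches the recommendation above that readers be constructible without further inputs.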
static available_formats()

Get dictionary of all available file formats that implement the file_reader_base API.

Returns:Dictionary where the keys are the names of the formats and the values are the corresponding file reader classes.
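
A dictionary of this shape can be combined with is_valid_dataset to pick a reader for a given path. The sketch below shows one possible way to do so; the two toy reader classes, the formats dictionary, and the helper find_format are hypothetical stand-ins, used here so that the example runs without the omsi package.

```python
class ToyImgReader(object):
    """Hypothetical reader that accepts names ending in '.img'."""

    @classmethod
    def is_valid_dataset(cls, name):
        return name.endswith('.img')


class ToyMzmlReader(object):
    """Hypothetical reader that accepts names ending in '.mzml' (any case)."""

    @classmethod
    def is_valid_dataset(cls, name):
        return name.lower().endswith('.mzml')


# Stand-in for the dictionary returned by available_formats():
# keys are format names, values are the reader classes.
formats = {'img_file': ToyImgReader, 'mzml_file': ToyMzmlReader}


def find_format(name, formats):
    """Return the name of the first format that accepts `name`, else None."""
    for format_name, reader_class in sorted(formats.items()):
        if reader_class.is_valid_dataset(name):
            return format_name
    return None
```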
close_file()

Close the file.

classmethod format_name()

Define the name of the format.

Returns:String indicating the name of the format.
get_dataset_metadata()

Get dict of additional metadata associated with the current dataset

NOTE: In the case that multiple regions and/or datasets are supported, this function should return the metadata of the currently selected dataset only. If no particular dataset is selected, then all should be returned.

Returns:Instance of omsi.shared.metadata_data.metadata_dict
get_number_of_datasets()

File readers with multi-dataset support must override this function to retrieve the true number of raw datasets in the file. The default implementation returns 1.

get_number_of_regions()

File readers with multi-region support must override this function to retrieve the true number of regions in the file. The default implementation returns 1.

classmethod is_valid_dataset(name)

Classmethod used to check whether a given directory (or file) defines a valid data file of the file format specified by the current child class.

Parameters:name (String) – Name of the dir or file.
Returns:Boolean indicating whether the given file or folder is a valid file.
classmethod size(name)

Classmethod used to check the estimated size for the given file/folder.

Parameters:name (String) – Name of the dir or file.
Returns:Integer indicating the size in bytes or None if unknown.
spectrum_iter()

Enable iteration of the spectra of the current data cube.

Iterate over all the spectra in the current data cube and yield the numpy array with the intensity and integer x, y position of the spectrum.

NOTE: As this is a generator one needs to use yield.

Returns:The function yields for each spectrum the following information:
  • tuple of (x,y) or (x,y,z) position of the spectrum
  • Numpy array with the spectrum
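
The yielded (position, spectrum) pairs can be consumed without array slicing. The sketch below, assuming a plain 3D numpy array in place of a real file, uses a free-standing generator with the same output convention to compute a total-ion-count image; the function name spectrum_iter and the variable names are chosen for this example only.

```python
import numpy as np


def spectrum_iter(cube):
    """Yield ((x, y), spectrum) for each pixel of a 3D (x, y, m/z) array."""
    for x, y in np.ndindex(cube.shape[0], cube.shape[1]):
        yield (x, y), cube[x, y, :]


# Fake data cube: 2 x 2 pixels, 5 m/z bins, all intensities equal to 1
cube = np.ones((2, 2, 5), dtype='float32')

# Sum each spectrum to obtain one total-ion-count value per pixel
tic = np.zeros(cube.shape[:2])
for (x, y), spectrum in spectrum_iter(cube):
    tic[x, y] = spectrum.sum()
```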
classmethod supports_multidata()

Define whether the file format supports multiple independent datasets.

classmethod supports_multiexperiment()

Define whether the file format supports multiple independent experiments, each of which may contain multiple datasets.

classmethod supports_regions()

Define whether the file format supports multiple regions.

class omsi.dataformat.file_reader_base.file_reader_base_multidata(basename, requires_slicing)

Bases: omsi.dataformat.file_reader_base.file_reader_base

Base-class used to define the basic interface for file-readers used to implement new file formats with support for multiple datasets (e.g., MSI dataset with multiple spectrum types). This class extends file_reader_base, and accordingly all required attributes and functions of file_reader_base must be implemented by subclasses.

In addition to the file_reader_base functions we need to implement the get_number_of_datasets(...) and get_dataset_dependencies(...) functions.

Variables:select_dataset – Unsigned integer indicating the currently selected dataset

Construct the base class and define required attributes.

get_dataset_dependencies()

Get the dependencies between the current dataset and any of the other datasets stored in the current file. If self.select_dataset is not set, then the function is expected to return a list of lists with all dependencies for all datasets.

Returns:List of dependencies (or list of lists of dependencies if self.select_dataset is None) where each dependency is a dict of the following form:
  • ‘omsi_object’: None, # The omsi file API object where the data is stored. Often None.
  • ‘link_name’: ms2_link_name, # Name for the dependency link to be used
  • ‘basename’: basename, # Basename of the file
  • ‘region’: None, # Index of the region in the dataset or None
  • ‘dataset’: ind2, # Index of the dataset within the file or None
  • ‘help’:scan_types[ms1scan], # Help describing the dependency
  • ‘dependency_type’: ... } # Type of dependency; see dependency_dict.dependency_type for available types
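
To make the expected return structure concrete, here is a small sketch that builds dependency dictionaries with the keys listed above and arranges them as a list of lists, one inner list per dataset. The helper make_dependency and all example values (link names, basename, help texts) are hypothetical; in particular, dependency_type is left as None here and would need a value from dependency_dict.dependency_type.

```python
def make_dependency(link_name, basename, dataset=None, region=None,
                    help_text='', dependency_type=None):
    """Build one dependency dict with the documented keys."""
    return {
        'omsi_object': None,        # omsi file API object; often None
        'link_name': link_name,     # name for the dependency link
        'basename': basename,       # basename of the file
        'region': region,           # index of the region or None
        'dataset': dataset,         # index of the dataset in the file or None
        'help': help_text,          # help text describing the dependency
        # Placeholder: pick a value from dependency_dict.dependency_type
        'dependency_type': dependency_type,
    }


# Case where no dataset is selected: one list of dependencies per dataset.
# Here dataset 1 depends on dataset 0, and dataset 0 has no dependencies.
all_dependencies = [
    [],
    [make_dependency('ms1_link', 'example_file', dataset=0,
                     help_text='Parent scan type')],
]
```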
get_number_of_datasets()

Get the number of available datasets.

set_dataset_selection(dataset_index)

Define the current dataset to be read.

classmethod supports_multidata()

Define whether the file format supports multiple data blocks.

class omsi.dataformat.file_reader_base.file_reader_base_with_regions(basename, requires_slicing)

Bases: omsi.dataformat.file_reader_base.file_reader_base

Base-class used to define the basic interface for file-readers used to implement new file formats with support for multiple imaging regions per file. This class extends file_reader_base, and accordingly all required attributes and functions of file_reader_base must be implemented by subclasses.

Additional required attributes:

  • select_region : Integer indicating which region should be selected. If set to None, the data should be treated as a whole. If set to a region index, then the reader should treat the data as if it only pertains to that region, i.e., the shape of the data should be set accordingly and __getitem__ should behave accordingly as well.
  • region_dicts : List of dictionaries, where each dictionary describes a given region (e.g., the origin and extent for rectangular regions).

Construct the base class and define required attributes.

get_dataset_dependencies()

Get the dependencies between the current region and any of the other region datasets stored in the current file. If self.select_region is not set, then the function is expected to return a list of lists with all dependencies for all datasets.

Returns:List of dependencies (or list of lists of dependencies if self.select_dataset is None) where each dependency is a dict of the following form:
  • ‘omsi_object’: None, # The omsi file API object where the data is stored. Often None.
  • ‘link_name’: ms2_link_name, # Name for the dependency link to be used
  • ‘basename’: basename, # Basename of the file
  • ‘region’: None, # Index of the region in the dataset or None
  • ‘dataset’: ind2, # Index of the dataset within the file or None
  • ‘help’:scan_types[ms1scan], # Help describing the dependency
  • ‘dependency_type’: ... } # Type of dependency; see dependency_dict.dependency_type for available types
get_number_of_regions()

Get the number of available regions

get_region_selection()

Get the index of the selected region

get_regions()

Get list of all region dictionaries defining for each region the origin and extent of the region. See also self.region_dicts.

set_region_selection(region_index=None)

Define which region should be selected for local data reads.

Parameters:region_index – The index of the region that should be read. The shape of the data will be adjusted accordingly. Set to None to select all regions and treat the data as a single full 3D image.
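
The effect of a region selection on the visible data can be sketched with plain numpy slicing. The example below assumes rectangular regions described by 'origin' and 'extent' keys; these key names, the helper select_region, and the fake data are assumptions for illustration and may differ from what a given reader stores in region_dicts.

```python
import numpy as np

# Fake full dataset: 4 x 6 pixels with 3 m/z bins each
full_data = np.arange(4 * 6 * 3).reshape((4, 6, 3))

# Hypothetical region descriptions: (x, y) origin and (x, y) extent
region_dicts = [
    {'origin': (0, 0), 'extent': (2, 3)},
    {'origin': (2, 3), 'extent': (2, 3)},
]


def select_region(data, region_dicts, region_index=None):
    """Return the view of `data` that belongs to the selected region.

    With region_index=None the full 3D image is returned unchanged.
    """
    if region_index is None:
        return data
    x0, y0 = region_dicts[region_index]['origin']
    dx, dy = region_dicts[region_index]['extent']
    return data[x0:x0 + dx, y0:y0 + dy, :]


whole = select_region(full_data, region_dicts)
second = select_region(full_data, region_dicts, region_index=1)
```

After the selection, indices are relative to the region, so second[0, 0, :] corresponds to full_data[2, 3, :].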
classmethod supports_regions()

Define whether the file format supports multiple regions.

img_file Module

This module provides functionality for reading img mass spectrometry image files

class omsi.dataformat.img_file.img_file(hdr_filename=None, t2m_filename=None, img_filename=None, basename=None, requires_slicing=True)

Bases: omsi.dataformat.file_reader_base.file_reader_base

Interface for reading a single 2D img file

The img format consists of three different files: i) hdr header file, ii) t2m which contains the m/z data, iii) img data file.

Open an img file for data reading.

Parameters:
  • hdr_filename (string) – The name of the hdr header file
  • t2m_filename (string) – The name of the t2m file with the m/z data
  • img_filename (string) – The name of the img data file
  • basename (string) – Instead of img_filename, t2m_filename, and hdr_filename one may also supply just a single basename. The basename is completed with the .img, .t2m, .hdr extension to load the data.
  • requires_slicing (Boolean) – Unused here. Slicing is always supported by this reader.
Raises ValueError:In case that basename as well as hdr_filename, t2m_filename, and img_filename are specified.
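
The relationship between basename and the three explicit filenames can be sketched as follows. The helper resolve_img_filenames is hypothetical and not part of img_file; it also assumes, more strictly than the description above, that supplying basename together with any one of the explicit filenames is an error.

```python
def resolve_img_filenames(hdr_filename=None, t2m_filename=None,
                          img_filename=None, basename=None):
    """Return the (hdr, t2m, img) filenames to open.

    Assumption: basename is mutually exclusive with the explicit names.
    """
    explicit = (hdr_filename, t2m_filename, img_filename)
    if basename is not None:
        if any(name is not None for name in explicit):
            raise ValueError('Specify either basename or the individual '
                             'filenames, not both.')
        # Complete the basename with the three documented extensions
        return basename + '.hdr', basename + '.t2m', basename + '.img'
    return explicit


names = resolve_img_filenames(basename='data/sample')
```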

close_file()

Close the img file

classmethod get_files_from_dir(dirname)

Get a list of all basenames of all img files in a given directory. Note: The basenames include the dirname.

classmethod is_valid_dataset(name)

Check whether the given file or directory points to an img file.

Parameters:name (unicode) – Name of the dir or file.
Returns:Boolean indicating whether the given file or folder is a valid img file.
classmethod size(name)

Classmethod used to check the estimated size for the given file/folder.

Parameters:name (unicode) – Name of the dir or file.
Returns:Integer indicating the size in bytes or None if unknown.
spectrum_iter()

Enable iteration over the spectra in the file

Returns:tuple of ((x , y) , intensities), i.e., the tuple of (x, y) integer index of the spectrum and the numpy array of the intensities

mzml_file Module

This module provides functionality for reading mzml mass spectrometry image files.

filename = ‘/Users/oruebel/Devel/openmsi-data/mzML_Data/N2A2_Serratia_spots_extract_TI.mzML’

class omsi.dataformat.mzml_file.mzml_file(basename, requires_slicing=True, resolution=5)

Bases: omsi.dataformat.file_reader_base.file_reader_base_multidata

Interface for reading a single 2D mzml file with several distinct scan types.

Variables:available_mzml_types – Dict of available mzml flavors.

Open an mzML file for data reading.

Parameters:
  • basename (string) – The name of the mzml file. If basename is a directory, then the first mzML file found in the directory will be used instead.
  • requires_slicing (bool) – Should the complete data be read into memory (this makes slicing easier). (default is True)
  • resolution (float) – For profile data only, the minimum m/z spacing to use for creating the “full” reprofiled data cube
_mzml_file__compute_coordinates()

Internal helper function used to compute the coordinates for each scan.

Returns:2D numpy integer array of shape (numScans,2) indicating for each scan its x and y coordinate
classmethod _mzml_file__compute_filetype(filename)

Internal helper function used to compute the filetype.

classmethod _mzml_file__compute_mz_axis(filename, mzml_filetype, scan_types, resolution)

Internal helper function used to compute the m/z axis of each scan type. Returns a list of numpy arrays.

static _mzml_file__compute_num_scans(filename=None)

Internal helper function used to compute the number of scans in the mzml file.

static _mzml_file__compute_scan_dependencies(scan_types=None, basename=None)

Takes a scan_types list and returns a list of tuples (x, y) indicating that scan_types[y] depends on scan_types[x].
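
The direction of the (x, y) tuples is easy to misread, so the following short sketch spells it out. The scan type labels and the helper describe_dependencies are made up for this example; real scan type strings come from the mzML file.

```python
# Hypothetical scan type labels, indexed by position in the list
scan_types = ['MS1 full scan', 'MS2 fragmentation scan']

# One dependency tuple (x, y): scan_types[y] depends on scan_types[x]
dependencies = [(0, 1)]


def describe_dependencies(scan_types, dependencies):
    """Render each (x, y) tuple as a human-readable sentence."""
    return ['%s depends on %s' % (scan_types[y], scan_types[x])
            for x, y in dependencies]


descriptions = describe_dependencies(scan_types, dependencies)
```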

_mzml_file__compute_scan_types_and_indices(filename=None)

Internal helper function used to compute a list of unique scan types in the mzml file. Also computes a 1D numpy array of ints that maps every scan to the relevant datacube.

static _mzml_file__parse_scan_parameters()

Internal helper function used to parse out scan parameters from the scan filter string

_mzml_file__read_all()

Internal helper function used to read all data. The function directly modifies the self.data entry. Data is now a list of datacubes

available_mzml_types = {'unknown': 'unknown', 'bruker': 'bruker', 'thermo': 'thermo'}
close_file()

Close the mzml file

get_dataset_dependencies()

Get the dependencies between the current dataset and any of the other datasets stored in the current file.

Inherited from

get_dataset_metadata()

Get dict of additional metadata associated with the current dataset.

Inherited from file_reader_base.file_reader_base_multidata.

Returns:Dict where keys are strings and associated values to be stored as metadata with the dataset.
classmethod get_files_from_dir(dirname)

Get a list of all basenames of all mzML files in a given directory. Note: The basenames include the dirname.

get_number_of_datasets()

Get the number of available datasets.

classmethod is_valid_dataset(name)

Check whether the given file or directory points to an mzML file.

Parameters:name (String) – Name of the dir or file.
Returns:Boolean indicating whether the given file or folder is a valid mzML file.
set_dataset_selection(dataset_index)

Define the current dataset to be read.

classmethod size(name, max_num_reads=1000)

Classmethod used to check the estimated size for the given file/folder. For mzml this is an estimate of the final size of the full 3D datacube. For efficiency the number of scans is estimated based on the size of the first 1000 scans.

Parameters:
  • name (unicode) – Name of the dir or file.
  • max_num_reads (int) – The maximum number of spectrum reads to be performed to estimate the file size
Returns:Integer indicating the size in bytes or None if unknown.

spectrum_iter()

Generator function that yields a position and associated spectrum for a selected datacube type.

Yield:(xidx, yidx) a tuple of ints representing x and y position in the image
Yield:yi, a numpy 1D-array of floats containing spectral intensities at the given position and for the selected datacube type
classmethod test()

Test method

bruckerflex_file Module

This module provides functionality for reading bruker flex mass spectrometry image files

Limitations:

  1. Currently the reader assumes a single global m/z axis for all spectra.
  2. The read of acqu files does not convert <..> entries to python values but leaves them as strings.
  3. __read_spotlist__ converts the regions to start with a 0 index. This is somewhat inconsistent in the spot list file. The spotname seems to number regions starting with 0 while the region list numbers them starting with 1.
  4. __read_spotlist__ computes the folder where the spots are located based on the filename of the spotlist. It is unclear whether this is always valid. The advantage is that we do not rely on the regions.xml file, which contains absolute paths that are in most cases invalid because the data has been copied between machines and is stored in a different location on each of them.
  5. __read_spotlist__ currently assumes that there is only one fid file per spot
  6. __read_spotlist__ currently only looks at where the acqu and fid file is located for the spot. Other files are currently ignored.
  7. __read_spotlist__ (and hence the reader at large) currently assumes that we have 2D images only.
  8. __read_spotlist__ currently generates maps for the image that assume that x and y pixel indices start at 0. Not all images may record data until the border, so that this strategy may add empty spectra rather than generating a new bounding box for the image.
  9. __read_spotlist__ assumes in the variable spotname_encoding a maximum of 24 characters in the spotname (e.g., R01X080Y013). This should in general be more than sufficient, as it allows for 7 characters for each R, X, Y entry. However, if this is not enough, then this behaviour needs to be changed.
  10. __getitem__ currently only works if we have read the full data into memory. An on-demand load should be supported as well.
  11. We can currently only select either a single region or the full data, but we cannot select multiple regions at once. E.g., if a dataset contains 3 regions, then we can either select all regions at once or region 1, 2, or 3, but one cannot select region 1+2, 1+3, or 2+3.
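
Several of the limitations above refer to spot names of the form R01X080Y013. As a rough sketch of how such a name can be split into region, x, and y indices, the example below uses a regular expression. The pattern and the helper parse_spotname are assumptions for illustration and not the parsing code used by the reader; in particular, the sketch performs no index shifting of the region number.

```python
import re

# Assumed layout: 'R' + region digits, 'X' + x digits, 'Y' + y digits
SPOTNAME_PATTERN = re.compile(r'R(\d+)X(\d+)Y(\d+)')


def parse_spotname(spotname):
    """Return (region, x, y) as integers for a name like 'R01X080Y013'."""
    match = SPOTNAME_PATTERN.search(spotname)
    if match is None:
        raise ValueError('Not a valid spot name: %r' % spotname)
    region, x, y = (int(value) for value in match.groups())
    return region, x, y


region, x, y = parse_spotname('R01X080Y013')
```

Because the pattern is searched rather than anchored, folder names with a prefix, such as 0_R00X012Y006, are handled as well.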

Example:

    import bruckerflex_file
    spotlist = "/Users/oruebel/Devel/msidata/Bruker_Data/UNC IMS Data/20130417 Bmycoides Paenibacillus Early SN03130/" + "2013 Bmyc Paeni Early LP/2013 Bmyc Paeni Early LP Spot List.txt"
    exppath = "/Users/oruebel/Devel/msidata/Bruker_Data/UNC IMS Data/20130417 Bmycoides Paenibacillus Early SN03130/" + "2013 Bmyc Paeni Early LP/2013 Bmyc Paeni Early LP/0_R00X012Y006/1/1SLin"
    # Open the reader from the spot list file (constructor argument is basename)
    f = bruckerflex_file.bruckerflex_file(basename=spotlist)
    # Read the intensities, acqu metadata, m/z axis and spot list individually
    f.s_read_fid(exppath + "/fid", f.data_type)
    testacqu = f.s_read_acqu(exppath + "/acqu")
    testmz = f.s_mz_from_acqu(testacqu)
    testspotlist = f.s_read_spotlist(spotlist)

    # Alternatively, open the reader from the folder with the spots
    a = bruckerflex_file.bruckerflex_file(dirname)

class omsi.dataformat.bruckerflex_file.bruckerflex_file(basename, fid_encoding='int32', requires_slicing=True)

Bases: omsi.dataformat.file_reader_base.file_reader_base_with_regions

Interface for reading a single bruker flex image file.

The reader supports standard array slicing for data reads, i.e., use [x,y,:] to read a spectrum and [:,:,mz] to read an ion image.
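
The two slicing patterns behave like ordinary numpy indexing on a 3D (x, y, m/z) array. The sketch below uses a plain numpy array as a stand-in for an opened reader, so the variable data and its values are assumptions for illustration.

```python
import numpy as np

# Stand-in for an opened reader: 3 x 4 pixels with 5 m/z bins each
data = np.arange(3 * 4 * 5).reshape((3, 4, 5))

# Full spectrum at pixel x=1, y=2
spectrum = data[1, 2, :]

# Ion image for the m/z bin with index 3
ion_image = data[:, :, 3]
```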

The reader supports multiple regions, i.e., reading of different independent regions that were imaged as part of the same dataset. Using get_regions, get_number_of_regions, set_region_selection, and get_region_selection, the user can interact with the region settings. Using set_region_selection, the user can define whether the data of the complete image should be read (set_region_selection(None), default) or whether data from a single region should be read. When a region is selected, the reader acts as if the region were the complete dataset, i.e., the shape variable is adjusted to fit the selected region and the __getitem__ method (which is used to implement array-like slicing, e.g., [1,1,:]) behaves as if the selected region were the full data.

Open a bruker flex file for data reading.

Parameters:
  • basename (string) – Name of the textfile with the spotlist. Alternatively this may also be the folder with the spots.
  • requires_slicing (bool) – Should the complete data be read into memory (this makes slicing easier). (default is True)
  • fid_encoding (string) – String indicating in which binary format the intensity values (fid files) are stored. (default value is ‘int32’)
Variables:
  • self.basename – Name of the file with the spotlist
  • self.pixel_dict – Dictionary with the pixel array metadata (see also s_read_spotlist(...)). Some of the main keys of the dictionary are:

    • ‘spotfolder’ : String indicating the folder where all the spot-data is located
    • ‘fid’ : 2D numpy masked array of strings indicating for each (x,y) pixel the fid file with the intensity values.
    • ‘acqu’ : 2D numpy masked array of strings indicating for each (x,y) pixel the acqu file with the metadata for that pixel.
    • ‘regions’ : 2D numpy masked array of integers indicating for each pixel the index of the region it belongs to.
    • ‘xpos’ : 2D numpy masked array of integers indicating for each pixel its x position.
    • ‘ypos’ : 2D numpy masked array of integers indicating for each pixel its y position.
    • ‘spotname’ : 2D masked numpy array of strings with the names of the spot corresponding to a pixel.
  • self.data_type – the encoding used for intensity values
  • self.shape – The 3D shape of the MSI data volume for the currently selected region.
  • self.full_shape – Shape of the full 3D MSI dataset including all regions imaged.
  • self.metadata – Dictionary with metadata from the acqu file
  • self.mz – The 1D numpy array with the m/z axis information
  • self.data – If requires_slicing is set to True, then this 3D array includes the complete data of the MSI data cube. Missing data values (e.g., from regions not imaged during the acquisition process) are filled with zero values.
  • self.region_dicts – Dictionary with description of the imaging regions
Raises ValueError:In case that no valid data is found.

close_file()

Close the bruker flex file

get_dataset_dependencies()

Get the dependencies between the current region and any of the other region datasets stored in the current file. If self.select_region is not set, then the function is expected to return a list of lists with all dependencies for all datasets.

Returns:List of dependencies (or list of lists of dependencies if self.select_dataset is None) where each dependency is a dict of the following form:
  • ‘omsi_object’: None, # The omsi file API object where the data is stored. Often None.
  • ‘link_name’: ms2_link_name, # Name for the dependency link to be used
  • ‘basename’: basename, # Basename of the file
  • ‘region’: None, # Index of the region in the dataset or None
  • ‘dataset’: ind2, # Index of the dataset within the file or None
  • ‘help’:scan_types[ms1scan], # Help describing the dependency
  • ‘dependency_type’: ... } # Type of dependency; see dependency_dict.dependency_type for available types
classmethod is_valid_dataset(name)

Determine whether the given file or name specifies a bruckerflex file

Parameters:name (string) – name of the file or dir
Returns:Boolean indicating whether the name is a valid bruckerflex file
static s_mz_from_acqu(acqu_dict)

Construct the m/z axis from the data stored in the acqu_dict dictionary. See also s_read_acqu

Parameters:acqu_dict – Python dictionary with the complete information from the acqu file. See s_read_acqu(...).
Returns:1D Numpy array of floats with the m/z axis data.
static s_read_acqu(filename)

Read and parse the metadata from the given acqu file.

Parameters:filename (string) – String with the name+path for the acqu file.
Returns:Dictionary with the parsed metadata information
static s_read_fid(filename, data_type='int32', selection=slice(None, None, None))

Read data from an fid file

Parameters:
  • filename (string) – String indicating the name+path to the fid file.
  • data_type – The numpy datatype encoding used for the fid files. (default is ‘int32’). In the instance of this class this is encoded in the data_type variable associated with the instance.
  • selection (slice or list, i.e., a selection that numpy understands) – This may be a python slice or a list of indices to be read. Default value is to read all (i.e., slice(None,None,None))
Returns:1D numpy array of intensity values read from the file.
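
Reading a flat binary file of intensities with a given encoding and selection can be sketched with numpy.fromfile. The function read_fid below is a hypothetical stand-in, not the implementation of s_read_fid, and it assumes that the file contains nothing but raw values in the given data type. The example writes a small temporary file first so that it is runnable on its own.

```python
import os
import tempfile

import numpy as np


def read_fid(filename, data_type='int32', selection=slice(None, None, None)):
    """Read raw binary intensities and apply a numpy-style selection.

    Assumption: the file holds only values encoded as `data_type`.
    """
    return np.fromfile(filename, dtype=data_type)[selection]


# Write a tiny fake fid file for the round trip
intensities = np.array([5, 0, 12, 7], dtype='int32')
fid_path = os.path.join(tempfile.mkdtemp(), 'fid')
intensities.tofile(fid_path)

all_values = read_fid(fid_path)                     # default: read everything
some_values = read_fid(fid_path, selection=[0, 2])  # list of indices
```

This sketch reads the whole file before applying the selection, which is simple but not memory-efficient for large spectra.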

static s_read_spotlist(spotlist_filename)

Parse the given spotlist file.

Parameters:spotlist_filename (string) – Name of the textfile with the spotlist
Returns:
The function returns a number of different items in the form of a python dictionary.
Most data is stored as 2D spatial maps, indicating for each (x,y) location the corresponding data. Most data is stored as 2D masked numpy arrays. The mask of the array indicates whether data has been recorded for a given pixel or not. The dict contains the following keys:
  • ‘spotfolder’ : String indicating the folder where all the spot-data is located
  • ‘fid’ : 2D numpy masked array of strings with fid file name for each (x,y) pixel (intensities).
  • ‘acqu’ : 2D numpy masked array of strings with acqu file name for each (x,y) (metadata).
  • ‘regions’ : 2D numpy masked array of integers with region index for each (x,y) pixel.
  • ‘xpos’ : 2D numpy masked array of integers indicating for each pixel its x position.
  • ‘ypos’ : 2D numpy masked array of integers indicating for each pixel its y position.
  • ‘spotname’ : 2D masked numpy array of strings with spot name for each (x,y) pixel.
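
The role of the mask in these 2D maps can be illustrated with numpy.ma. The sketch below builds a small masked 'regions' map in which only two pixels were recorded and then fills a data cube, leaving zeros wherever the map is masked. The array sizes, the spectra dictionary, and all values are assumptions for this example; a real reader would load intensities from the fid files instead.

```python
import numpy as np

nx, ny, nmz = 2, 2, 3

# 2D map of region indices; every pixel starts out masked (not recorded)
regions = np.ma.masked_all((nx, ny), dtype='int32')
regions[0, 0] = 0   # assigning a value unmasks the pixel
regions[1, 1] = 1

# Hypothetical spectra for the recorded pixels only
spectra = {(0, 0): np.array([1, 2, 3]), (1, 1): np.array([4, 5, 6])}

# Fill the cube; pixels without recorded data keep their zero values
cube = np.zeros((nx, ny, nmz), dtype='int32')
recorded = ~np.ma.getmaskarray(regions)
for x, y in np.ndindex(nx, ny):
    if recorded[x, y]:
        cube[x, y, :] = spectra[(x, y)]
```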
static s_spot_from_dir(in_dir, spot_folder_only=False)

Similar to s_read_spotlist, but instead of using a spotlist file, the structure of the data is parsed directly from the structure of the directory containing all spots.

Parameters:
  • in_dir (string) – Name of the directory with all spots
  • spot_folder_only – If set to True, then the function only constructs the spot folders but does not check for acqu files etc., and only the spotfolder list is returned.
Returns:
The function returns None in case that no valid spots were found. Returns a list of strings with the spotfolders if spot_folder_only is set to True. Otherwise, the function returns a number of different items in the form of a python dictionary. Most data is stored as 2D spatial maps, indicating for each (x,y) location the corresponding data. Most data is stored as 2D masked numpy arrays. The mask of the array indicates whether data has been recorded for a given pixel or not. The dict contains the following keys:

  • ‘spotfolder’ : String indicating the folder where all the spot-data is located
  • ‘fid’ : 2D numpy masked array of strings with fid file name for each (x,y) pixel (intensities)
  • ‘acqu’ : 2D numpy masked array of strings with acqu file name for each (x,y) pixel (metadata)
  • ‘regions’ : 2D numpy masked array of integers with the index of the region for each (x,y) pixel.
  • ‘xpos’ : 2D numpy masked array of integers indicating for each pixel its x position.
  • ‘ypos’ : 2D numpy masked array of integers indicating for each pixel its y position.
  • ‘spotname’ : 2D masked numpy array of strings with the name of the spot corresponding to a pixel.
set_region_selection(region_index=None)

Define which region should be selected for local data reads.

Parameters:region_index – The index of the region that should be read. The shape of the data will be adjusted accordingly. Set to None to select all regions and treat the data as a single full 3D image.