Python Programming API

Basic Data Structure: proteins

The main focus of the Python scripts is to turn all the different proteomics data formats into one consistent data-structure, which is a JSON-compatible dictionary called proteins. In YAML format, proteins is laid out something like this:

'example-seqid':
  sequence: 'ACDEFGHKLMNP'
  description: 'Some tasteless protein'
  attr:
    key1: value1
    key2: value2
    other_seqids:
      - another-seqid
    seqid: 'example-seqid'
  sources:
    -
      matches:
        -
          sequence: 'CDE'
          i: 1
          intensity: 1.0
          attr:
            key3: value3
            key4: value4
          spectrum:
            -
              - 501
              - 34.3
            -
              - 503.4
              - 82.3

This data-structure is written verbatim into a JSON-based javascript file, which is then processed by the javascript application that generates the visual display. Optional fields are moved into the attr dictionary.

The primary level of organization of the dictionary proteins is, obviously, at the level of proteins. All peptide-spectrum matches will be sorted into each protein match. Presumably, the results will have been sorted into protein groups and read in with a usable sub-set of representative proteins for the protein identification.

The peptide information is sorted into separate lists in the sources field. This allows a clear demarcation between different experiments. However, for a single experiment, this adds an extra layer, so the peptides must be accessed as:

source = proteins['example-seqid']['sources'][0]
peptides = source['matches']
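
As a minimal sketch of this access pattern, the snippet below builds a tiny proteins dictionary by hand, following the YAML layout shown above (where each source keeps its peptide-spectrum matches under the matches key), and walks through every source of every protein. The helper name iter_matches is hypothetical and is not part of peptagram.

```python
import json

# Hand-built example that follows the YAML layout shown earlier.
proteins = {
    'example-seqid': {
        'sequence': 'ACDEFGHKLMNP',
        'description': 'Some tasteless protein',
        'attr': {'seqid': 'example-seqid', 'other_seqids': ['another-seqid']},
        'sources': [
            {
                'matches': [
                    {
                        'sequence': 'CDE',
                        'i': 1,
                        'intensity': 1.0,
                        'attr': {},
                        'spectrum': [[501, 34.3], [503.4, 82.3]],
                    }
                ]
            }
        ],
    }
}


def iter_matches(proteins):
    """Yield (seqid, source index, match) for every peptide-spectrum match."""
    for seqid, protein in proteins.items():
        for i_source, source in enumerate(protein['sources']):
            for match in source['matches']:
                yield seqid, i_source, match


all_matches = list(iter_matches(proteins))

# The structure only holds dicts, lists, strings and numbers, so it
# survives a round trip through JSON unchanged.
round_tripped = json.loads(json.dumps(proteins))
```

Because the whole structure is plain dictionaries and lists, the same loop works whether there is one source or several.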

Peptide-spectrum match list

Each peptide entry in the peptides list represents a distinct peptide-spectrum match.

sequence: 'CDE'
i: 1
intensity: 1.0
attr:
  modifications:
    -
      i: 0
      mass: 344.4
  key3: value3
  key4: value4
spectrum:
  -
    - 501
    - 34.3
  -
    - 503.4
    - 82.3

i gives the 0-based position of the peptide within the full protein sequence, so the peptide sequence should match the protein sequence starting at that position.
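
The relationship between i and the two sequences can be checked with a simple slice. The following is a small sketch using the example values from this document; the function name match_is_consistent is hypothetical and not part of peptagram.

```python
def match_is_consistent(protein_sequence, match):
    """Return True if the peptide is found at 0-based position i."""
    i = match['i']
    peptide = match['sequence']
    return i >= 0 and protein_sequence[i:i + len(peptide)] == peptide


protein_sequence = 'ACDEFGHKLMNP'
good_match = {'sequence': 'CDE', 'i': 1}   # 'C' is at index 1 of the protein
bad_match = {'sequence': 'CDE', 'i': 2}    # index 2 starts at 'D', so no match
```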

intensity gives a value from -1.0 to 1.0, which is used to generate colors in the experiment-comparison mode. The coloring goes from a high color associated with 1.0, through a neutral color associated with 0.0, down to a low color associated with -1.0.

spectrum gives a list of peaks that will be used in the spectrum viewer. In each peak, the first number is the m/z value and the second is the intensity. In general, only the top 50 peaks are read in; you can add more, but this will probably bloat the javascript file.
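
If you build the spectrum field yourself, one way to keep the file small is to trim each peak list before storing it. The sketch below keeps the most intense peaks and returns them in m/z order; the function name keep_top_peaks, and the choice to re-sort by m/z, are assumptions made for illustration rather than a description of what the peptagram parsers do.

```python
def keep_top_peaks(spectrum, n_peak=50):
    """Keep the n_peak most intense [m/z, intensity] pairs, sorted by m/z."""
    by_intensity = sorted(spectrum, key=lambda peak: peak[1], reverse=True)
    return sorted(by_intensity[:n_peak], key=lambda peak: peak[0])


spectrum = [[501, 34.3], [503.4, 82.3], [610.2, 5.0], [720.9, 60.1]]
top_two = keep_top_peaks(spectrum, n_peak=2)
```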

modifications is an optional field that describes any amino acid modifications in the peptide. It is a list of dictionaries. In each dictionary, i gives the position and mass gives the mass of the modified amino acid. To allow for N-terminal modifications, i can take the value -1. For C-terminal modifications, i can take the value n, where n is the length of the sequence.
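
The allowed range of i, from -1 to n, can be made explicit with a small helper. This is only a sketch: the function name modification_site is hypothetical, and it assumes that i is counted within the peptide sequence rather than the full protein sequence.

```python
def modification_site(peptide_sequence, modification):
    """Describe where a modification sits: a residue letter or a terminus."""
    i = modification['i']
    n = len(peptide_sequence)
    if i == -1:
        return 'N-terminus'
    if i == n:
        return 'C-terminus'
    if 0 <= i < n:
        return peptide_sequence[i]
    raise ValueError('modification position %d is outside -1..%d' % (i, n))


on_residue = modification_site('CDE', {'i': 0, 'mass': 344.4})
on_n_term = modification_site('CDE', {'i': -1, 'mass': 42.0})
on_c_term = modification_site('CDE', {'i': 3, 'mass': 17.0})
```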

Sequence identifiers

As discussed in the Examples, handling sequence identifiers (seqids) correctly is a recurring problem in bioinformatics. If seqids are not formatted correctly, it becomes impossible to match data from different sources. So it is useful if we can transform all seqids into a consistent format. For this reason, seqid-transforming functions can be passed in throughout peptagram, so you can organize the seqids as you read them in.

In peptagram.proteins, there is a convenience function change_seqids_in_proteins that transforms the seqids found in a proteins data-structure, including the alternate seqids found in other_seqids. It takes any string function clean_seqid and applies it to all seqids found in proteins, including the protein seqid keys at the top level.

import peptagram.proteins

def clean_seqid(seqid):
  # keep only the text before the first '|'
  return seqid.split('|')[0]

peptagram.proteins.change_seqids_in_proteins(proteins, clean_seqid)
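
To make the effect on the data-structure concrete, here is a self-contained sketch of what such a transformation involves: the top-level keys, the seqid entry in attr, and the other_seqids list are all passed through the same function. The function change_seqids_sketch is an illustrative stand-in written for this document, not peptagram's implementation, and the seqids are made up.

```python
def clean_seqid(seqid):
    # Keep only the text before the first '|'.
    return seqid.split('|')[0]


def change_seqids_sketch(proteins, clean_seqid):
    """Apply clean_seqid to every seqid in proteins, modifying it in place."""
    renamed = {}
    for old_seqid, protein in proteins.items():
        attr = protein.get('attr', {})
        if 'seqid' in attr:
            attr['seqid'] = clean_seqid(attr['seqid'])
        if 'other_seqids' in attr:
            attr['other_seqids'] = [
                clean_seqid(other) for other in attr['other_seqids']]
        # If two old seqids clean to the same string, the later one wins.
        renamed[clean_seqid(old_seqid)] = protein
    proteins.clear()
    proteins.update(renamed)
    return proteins


proteins = {
    'seqA|extra': {
        'attr': {'seqid': 'seqA|extra', 'other_seqids': ['seqB|more']},
        'sources': [],
    }
}
change_seqids_sketch(proteins, clean_seqid)
```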

Multiple data sources

As each protein in proteins can contain several sources, it's important to track the names of the different experiments. Most of the parsing methods return a source_names list that contains the names of the different source files. Once cleaned up, these can serve as labels for the different experiments, especially in the experiment-comparison mode. Often the source name is a long directory name, where the unique part is in the basename. A quick way to clean this up is:

import os
source_labels = [os.path.basename(name) for name in source_names]

This source_labels can then be placed directly in the data structure used to generate the web-app.

To merge proteins from two different files you can use the merging function:

import peptagram.proteins

proteins = peptagram.proteins.merge_two_proteins(proteins1, proteins2)

Internally, the returned proteins is actually proteins1, modified in place by the contents of proteins2.
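
One practical consequence is that proteins1 no longer holds its original contents after the merge. The sketch below illustrates this with a hypothetical stand-in, merge_sketch, whose merge rule (appending the source lists of shared seqids) is an assumption chosen only to make the example run; it does not reproduce merge_two_proteins. If the unmerged data is still needed, copy it first with copy.deepcopy.

```python
import copy


def merge_sketch(proteins1, proteins2):
    """Hypothetical merge that modifies and returns proteins1."""
    for seqid, protein2 in proteins2.items():
        if seqid in proteins1:
            proteins1[seqid]['sources'].extend(protein2['sources'])
        else:
            proteins1[seqid] = protein2
    return proteins1


proteins1 = {'a': {'sources': [{'matches': []}]}}
proteins2 = {
    'a': {'sources': [{'matches': []}]},
    'b': {'sources': [{'matches': []}]},
}

# Keep an independent copy before the in-place merge.
proteins1_backup = copy.deepcopy(proteins1)
merged = merge_sketch(proteins1, proteins2)
```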

Protein sequences

Loading protein sequences into proteins is a necessary step in peptagram, as the sequences are required both to generate the visualizations and to view the protein. Several of the proteomics data formats do not provide this information, so we need to read in the sequences from another source.

The most common source is a .fasta file, preferably the one that was used for the peptide search. To load the protein sequences into the proteins structure:

import peptagram.proteins

def clean_seqid(seqid):
  # keep only the text before the first '|'
  return seqid.split('|')[0]

peptagram.proteins.load_fasta_db_into_proteins(
    proteins, 'mascot/HUMAN.fasta', clean_seqid)
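
The idea behind this step can be sketched without the library: read each FASTA record, clean its seqid with the same function, and copy the sequence into any protein with a matching key. The functions read_fasta and fill_sequences_sketch below are hypothetical illustrations, not peptagram's code, and the sketch assumes the seqid is the first whitespace-delimited word of the header line.

```python
import io


def clean_seqid(seqid):
    # Keep only the text before the first '|'.
    return seqid.split('|')[0]


def read_fasta(handle):
    """Return {seqid: sequence} from an open FASTA file-like object."""
    sequences = {}
    seqid = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            words = line[1:].split()
            seqid = words[0] if words else ''
            sequences[seqid] = ''
        elif seqid is not None:
            sequences[seqid] += line
    return sequences


def fill_sequences_sketch(proteins, sequences, clean_seqid):
    """Copy sequences into proteins whose key matches the cleaned seqid."""
    for raw_seqid, sequence in sequences.items():
        seqid = clean_seqid(raw_seqid)
        if seqid in proteins:
            proteins[seqid]['sequence'] = sequence


fasta_text = '>seqA|extra Some tasteless protein\nACDEFG\nHKLMNP\n'
sequences = read_fasta(io.StringIO(fasta_text))

proteins = {'seqA': {'sources': []}, 'seqZ': {'sources': []}}
fill_sequences_sketch(proteins, sequences, clean_seqid)
```

Proteins without a matching FASTA record, such as seqZ here, are simply left without a sequence field.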

Filters

A couple of common filtering options are provided by the filter_proteins function. It takes a params dictionary such as:

params = {
  'exclude_seqids': 'exclude_seqids.txt',
  'include_seqids': 'include_seqids.txt',
  'fasta': 'sequences.fasta',
  'include_msms': 1,
  'match_filter': 3,
}

exclude_seqids - a text file containing all the seqids that are to be excluded

include_seqids - a text file containing the seqids that are allowed

fasta - the fasta file for the full-length protein sequences; this triggers the procedure described in the section above

include_msms - a value of 0 will drop all spectra, leading to a smaller webapp page, which might be necessary for very large peptagrams

match_filter - provides a number of common peptide filters (0: no filtering; 1: keep only tryptic peptides; 2: keep only semitryptic peptides; 3: keep only modified peptides)

This is then used to filter a proteins data-structure:

peptagram.proteins.filter_proteins(proteins, params)
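
The include and exclude options amount to keeping or dropping top-level keys of proteins. The sketch below shows that logic on in-memory sets, since the layout of the seqid text files is not described here; filter_by_seqids_sketch is a hypothetical name, and this is not the implementation of filter_proteins.

```python
def filter_by_seqids_sketch(proteins, include_seqids=None, exclude_seqids=None):
    """Drop proteins that are not included, or that are excluded, in place."""
    for seqid in list(proteins.keys()):
        if include_seqids is not None and seqid not in include_seqids:
            del proteins[seqid]
        elif exclude_seqids is not None and seqid in exclude_seqids:
            del proteins[seqid]
    return proteins


proteins = {
    'seqA': {'sources': []},
    'seqB': {'sources': []},
    'seqC': {'sources': []},
}
filter_by_seqids_sketch(
    proteins,
    include_seqids={'seqA', 'seqB'},
    exclude_seqids={'seqB'},
)
```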

Finagling MS/MS spectra

Being able to easily view spectra is one of the most useful aspects of peptagram. It was important that we could reconcile the scan identifiers in the different data formats so that we could load the spectra directly if the .mzML files are available. Being able to do this has expanded the flexibility of the system. By the time each parser has generated a proteins structure, the peptide-spectrum matches should contain a valid .mzML scan identifier. This is the scan_id field in the attr dictionary of each peptide entry.

Given this, we can run:

peptagram.mzml.load_mzml(
    proteins, 0, 'example/morpheus/OK20130822_MPProtomap_KO1.mzML')

Loading data into the webapp

Once the proteins data is appropriately filled in, we can pipe it through to the web-app generating method.

The function that generates the peptagram requires a dictionary as a parameter:

import peptagram.parse
import peptagram.proteins

peptagram.proteins.make_graphical_comparison_visualisation({
  'title': 'Mascot example',
  'proteins': proteins,
  'source_labels': list(map(peptagram.parse.basename, source_names)),
  'color_names': ['P=1', 'P=0', ''],
  'out_dir': 'peptagram-mascot'
})

title gives the title that will be displayed across the top of the web-app

proteins is the dictionary that you generated consistent with the structure given above

source_labels is an optional list of the names of the different experiments that will be used in the experiment-comparison mode

out_dir gives the directory that will hold the peptagram