The main focus of the Python scripts is to turn all the different proteomics data formats into one consistent data structure, a JSON-compatible dictionary called proteins. In YAML format, proteins is laid out something like this:
'example-seqid':
  sequence: 'ACDEFGHKLMNP'
  description: 'Some tasteless protein'
  attr:
    key1: value1
    key2: value2
    other_seqids:
      - another-seqid
    seqid: 'example-seqid'
  sources:
    -
      matches:
        -
          sequence: 'CDE'
          i: 1
          intensity: 1.0
          attr:
            key3: value3
            key4: value4
          spectrum:
            -
              - 501
              - 34.3
            -
              - 503.4
              - 82.3
This data structure is written verbatim into a JSON-based JavaScript file, which is then processed by the JavaScript application that generates the visual display. Optional fields are moved into the attr dictionary.
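As a rough illustration of "JSON-compatible", the sketch below builds a small proteins dictionary by hand and serializes it with the standard json module. The variable name data and the `var data = ...;` wrapper are assumptions made for this example; the file that peptagram actually writes may be laid out differently.

```python
import json

# Hand-built entry that mirrors the layout shown above (values are made up).
proteins = {
    'example-seqid': {
        'sequence': 'ACDEFGHKLMNP',
        'description': 'Some tasteless protein',
        'attr': {'key1': 'value1'},
        'sources': [
            {
                'matches': [
                    {
                        'sequence': 'CDE',
                        'i': 1,
                        'intensity': 1.0,
                        'attr': {},
                        'spectrum': [[501, 34.3], [503.4, 82.3]],
                    }
                ]
            }
        ],
    }
}

# Assumed convention: a JavaScript file that assigns the JSON to a variable.
js_text = 'var data = ' + json.dumps(proteins, indent=2) + ';'

# Only strings, numbers, lists and dictionaries are used, so the
# structure survives a round trip through JSON unchanged.
round_trip = json.loads(json.dumps(proteins))
```

Keeping to plain strings, numbers, lists and dictionaries is what lets the same structure be read directly on the JavaScript side.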
The primary level of organization of the proteins dictionary is the protein. All peptide-spectrum matches are sorted under the protein they match. Presumably, the results will already have been sorted into protein groups and read in with a usable subset of representative proteins for the protein identification.
The peptide information is sorted into separate lists in the sources field. This allows a clear demarcation between different experiments. However, for single experiments, this adds an extra layer, where the peptides must be accessed as:
source = proteins['example-seqid']['sources'][0]
peptides = source['peptides']
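When there are several sources, the same access pattern can be wrapped in a generator. This is only a sketch: iter_peptides is a hypothetical helper, not part of peptagram, and the key that holds the list is passed as a parameter because the layout above shows it as matches while the access example uses peptides.

```python
def iter_peptides(proteins, key='peptides'):
    """Yield (seqid, source index, peptide) for every peptide entry."""
    for seqid, protein in proteins.items():
        for i_source, source in enumerate(protein['sources']):
            # A source without the key is treated as having no peptides.
            for peptide in source.get(key, []):
                yield seqid, i_source, peptide


# Made-up data: one protein seen in two experiments.
proteins = {
    'example-seqid': {
        'sources': [
            {'peptides': [{'sequence': 'CDE', 'i': 1}]},
            {'peptides': [{'sequence': 'FGH', 'i': 4},
                          {'sequence': 'KLM', 'i': 7}]},
        ]
    }
}

rows = list(iter_peptides(proteins))
sequences = [peptide['sequence'] for _, _, peptide in rows]
```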
Each peptide entry in the peptides list represents a distinct peptide-spectrum match.
sequence: 'CDE'
i: 1
intensity: 1.0
attr:
  modifications:
    -
      i: 0
      mass: 344.4
  key3: value3
  key4: value4
spectrum:
  -
    - 501
    - 34.3
  -
    - 503.4
    - 82.3
i gives the 0-based position of the peptide within the full protein sequence; the peptide sequence should match the protein sequence at that position.
intensity gives a value from -1.0 to 1.0, which is used to generate colors in the experiment-comparison mode. The coloring goes from a high color associated with 1.0, through a neutral color associated with 0.0, down to a low color associated with -1.0.
spectrum gives a list of peaks that will be used in the spectrum viewer. In each peak, the first number is the m/z value and the second is the intensity. In general, only the top 50 peaks are read in; you can add more, but this will probably bloat the JavaScript file.
modifications is an optional field that describes any amino acid modifications in the peptide. It is a list of dictionaries. In each dictionary, i gives the position and mass gives the mass of the modified amino acid. To allow for N-terminal modifications, i can take the value -1. For C-terminal modifications, i can take the value n, where n is the length of the sequence.
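The constraints on these fields can be checked with a few lines of Python. The helper below is a hypothetical sketch, not a peptagram function. It assumes that modifications is stored inside attr and that the modification index i is counted within the peptide, so that n is the peptide length.

```python
def check_peptide(protein_sequence, peptide):
    """Return a list of problems found in one peptide entry (empty if none)."""
    problems = []
    i = peptide['i']
    seq = peptide['sequence']
    # The peptide should be found at position i of the full sequence.
    if protein_sequence[i:i + len(seq)] != seq:
        problems.append('sequence does not match protein at i=%d' % i)
    if not -1.0 <= peptide['intensity'] <= 1.0:
        problems.append('intensity outside [-1.0, 1.0]')
    for mod in peptide.get('attr', {}).get('modifications', []):
        # -1 marks the N-terminus and len(seq) the C-terminus.
        if not -1 <= mod['i'] <= len(seq):
            problems.append('modification index %d out of range' % mod['i'])
    return problems


protein_sequence = 'ACDEFGHKLMNP'
good = {
    'sequence': 'CDE', 'i': 1, 'intensity': 1.0,
    'attr': {'modifications': [{'i': 0, 'mass': 344.4}]},
}
bad = {
    'sequence': 'CDE', 'i': 2, 'intensity': 1.5,
    'attr': {'modifications': [{'i': 5, 'mass': 344.4}]},
}

good_problems = check_peptide(protein_sequence, good)
bad_problems = check_peptide(protein_sequence, bad)
```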
As discussed in the Examples, handling sequence identifiers (seqids) correctly is a recurring problem in bioinformatics. If seqids are not formatted consistently, it becomes impossible to match data from different sources. So it is useful to transform all seqids into a consistent format. The idea of passing in seqid-transforming functions is available in all parts of peptagram. This way you can organize the seqids as you read them in.
In peptagram.proteins, there is a convenience function change_seqids_in_proteins that transforms the seqids found in a proteins data structure, including the alternate seqids found in other_seqids. It takes any string function clean_seqid and applies it to all seqids found in proteins, including the protein seqid keys at the top level.
import peptagram.proteins

def clean_seqid(seqid):
    # keep only the text before the first '|'
    return seqid.split('|')[0]

peptagram.proteins.change_seqids_in_proteins(proteins, clean_seqid)
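To make the effect of such a transform concrete, here is a stand-alone sketch of re-keying a dictionary with a seqid-cleaning function. rekey_proteins is a hypothetical helper written for this illustration, not peptagram's implementation. It also makes one pitfall visible: two raw seqids can clean to the same value.

```python
def clean_seqid(seqid):
    # Keep only the text before the first '|'.
    return seqid.split('|')[0]


def rekey_proteins(proteins, clean_seqid):
    """Return a new dict keyed by cleaned seqids; raise on collisions."""
    result = {}
    for seqid, protein in proteins.items():
        new_seqid = clean_seqid(seqid)
        if new_seqid in result:
            raise ValueError('duplicate seqid after cleaning: %r' % new_seqid)
        result[new_seqid] = protein
    return result


# Made-up identifiers for illustration.
proteins = {
    'P12345|first entry': {'sources': []},
    'Q67890|second entry': {'sources': []},
}
cleaned = rekey_proteins(proteins, clean_seqid)

# Two raw seqids that share the same first field collide.
try:
    rekey_proteins({'P1|a': {}, 'P1|b': {}}, clean_seqid)
    collision_detected = False
except ValueError:
    collision_detected = True
```

Checking for collisions before overwriting is a design choice of this sketch; how the library itself handles them is not stated here.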
Because each protein in proteins potentially contains several sources, it's important to track the names of the different experiments. Most of the parsing methods return a source_names list that contains the names of the source files. Once cleaned up, these can serve as labels for the different experiments, especially in the experiment-comparison mode. Often the source name is a long directory path, where the unique part is the basename. A quick way to clean this up is:
import os
source_labels = [os.path.basename(name) for name in source_names]
The source_labels list can then be placed directly in the data structure used to generate the web-app.
To merge proteins from two different files you can use the merging function:
import peptagram.proteins
proteins = peptagram.proteins.merge_two_proteins(proteins1, proteins2)
Internally, the returned proteins is actually proteins1, modified in place with the contents of proteins2.
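Because the first argument is modified, it can be worth taking a copy beforehand if the unmerged data is still needed. The sketch below uses merge_in_place, a hypothetical stand-in that only mimics the in-place behavior described above; the real merge_two_proteins may combine sources differently.

```python
import copy


def merge_in_place(proteins1, proteins2):
    """Hypothetical stand-in: fold proteins2 into proteins1 and return it."""
    for seqid, protein in proteins2.items():
        if seqid in proteins1:
            # Known protein: append the sources of the second data set.
            proteins1[seqid]['sources'].extend(protein['sources'])
        else:
            proteins1[seqid] = protein
    return proteins1


proteins1 = {'A': {'sources': [{'peptides': []}]}}
proteins2 = {
    'A': {'sources': [{'peptides': []}]},
    'B': {'sources': [{'peptides': []}]},
}

original = copy.deepcopy(proteins1)  # snapshot taken before merging
merged = merge_in_place(proteins1, proteins2)
```

After the call, merged and proteins1 are the same object, while original still holds the pre-merge state.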
Loading protein sequences into proteins is a necessary step in peptagram, as the sequences are required to generate the visualizations and to view the protein. Several of the proteomics data formats do not provide this information, so the sequences need to be read in from another source.
The most common source is a .fasta file, preferably the one that was used for the peptide search. To load the protein sequences into the proteins structure:
import peptagram.proteins

def clean_seqid(seqid):
    return seqid.split('|')[0]

peptagram.proteins.load_fasta_db_into_proteins(
    proteins, 'mascot/HUMAN.fasta', clean_seqid)
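For readers unfamiliar with the FASTA format, the following sketch shows roughly what loading sequences involves. read_fasta and fill_sequences are hypothetical helpers written for this illustration and are not the library's code. They assume the seqid is the first whitespace-delimited token of each header line.

```python
import io


def read_fasta(handle, clean_seqid=lambda s: s):
    """Parse FASTA text into {seqid: {'description': ..., 'sequence': ...}}."""
    entries = {}
    seqid = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            header = line[1:]
            fields = header.split()
            # Assumption: the first token of the header is the seqid.
            seqid = clean_seqid(fields[0]) if fields else ''
            entries[seqid] = {'description': header, 'sequence': ''}
        elif seqid is not None:
            # Sequence lines may be wrapped, so they are concatenated.
            entries[seqid]['sequence'] += line
    return entries


def fill_sequences(proteins, entries):
    """Copy sequence and description into proteins with a matching seqid."""
    for seqid, protein in proteins.items():
        if seqid in entries:
            protein['sequence'] = entries[seqid]['sequence']
            protein['description'] = entries[seqid]['description']


# Made-up FASTA text; a real run would open the .fasta file instead.
fasta_text = u'>P1|demo first protein\nACDEFG\nHKLMNP\n>P2|demo second\nMMMM\n'
entries = read_fasta(io.StringIO(fasta_text), lambda s: s.split('|')[0])

proteins = {'P1': {'sources': []}}
fill_sequences(proteins, entries)
```

Applying the same clean_seqid to both the FASTA headers and the proteins keys is what makes the lookup succeed.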
A couple of common filtering options are provided by the filter_proteins function. It takes a params dictionary such as:
params = {
    'exclude_seqids': 'exclude_seqids.txt',
    'include_seqids': 'include_seqids.txt',
    'fasta': 'sequences.fasta',
    'include_msms': 1,
    'match_filter': 3,
}
exclude_seqids - a text file containing all the seqids that are to be excluded
include_seqids - a text file containing the seqids that are allowed
fasta - the FASTA file for the full-length protein sequences; this triggers the procedure described in the section above
include_msms - a value of 0 will drop all spectra, leading to a smaller web-app page, which might be necessary for very large peptagrams
match_filter - selects one of a number of common peptide filters (0: no filtering; 1: keep only tryptic peptides; 2: keep only semitryptic peptides; 3: keep only modified peptides)
This is then used to filter a proteins data structure:
peptagram.proteins.filter_proteins(proteins, params)
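The include and exclude options amount to set membership tests on the seqids. The sketch below shows that logic in isolation, using hypothetical helpers. It assumes one seqid per line in the text files, which is not confirmed here, and it is not the code behind filter_proteins.

```python
def read_seqids(lines):
    """Collect non-empty, stripped lines as a set of seqids."""
    return set(line.strip() for line in lines if line.strip())


def filter_by_seqid(proteins, include=None, exclude=None):
    """Return a new dict with seqids in include (if given) and not in exclude."""
    kept = {}
    for seqid, protein in proteins.items():
        if include is not None and seqid not in include:
            continue
        if exclude is not None and seqid in exclude:
            continue
        kept[seqid] = protein
    return kept


proteins = {'P1': {}, 'P2': {}, 'P3': {}}

# Lists of lines stand in for the contents of the two text files.
include = read_seqids(['P1\n', 'P2\n', '\n'])
exclude = read_seqids(['P2\n'])

kept = filter_by_seqid(proteins, include, exclude)
```

In this sketch exclusion wins over inclusion when a seqid appears in both lists; that precedence is a choice made for the example.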
Being able to easily view spectra is one of the most useful aspects of peptagram. It was important to reconcile the scan identifiers in the different data formats so that spectra can be loaded directly if the .mzML files are available. Being able to do this has expanded the flexibility of the system. By the time each parser has generated a proteins structure, the peptide-spectrum matches should contain a valid .mzML scan identifier. This is the scan_id field in the attr dictionary of each peptide entry.
Given this, we can run:
import peptagram.mzml

peptagram.mzml.load_mzml(
    proteins, 0, 'example/morpheus/OK20130822_MPProtomap_KO1.mzML')
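Conceptually, loading spectra means looking up each peptide's scan_id in the scans read from the .mzML file and storing a trimmed peak list. The sketch below uses a plain dictionary in place of a parsed .mzML file. top_peaks and attach_spectra are hypothetical helpers, and treating "top 50" as the 50 most intense peaks is an assumption.

```python
def top_peaks(peaks, n=50):
    """Keep the n most intense [mz, intensity] pairs, returned in m/z order."""
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    return sorted(strongest, key=lambda peak: peak[0])


def attach_spectra(peptides, spectra_by_scan_id, n=50):
    """Fill peptide['spectrum'] wherever attr['scan_id'] has a matching scan."""
    n_attached = 0
    for peptide in peptides:
        scan_id = peptide.get('attr', {}).get('scan_id')
        if scan_id in spectra_by_scan_id:
            peptide['spectrum'] = top_peaks(spectra_by_scan_id[scan_id], n)
            n_attached += 1
    return n_attached


# Made-up scans keyed by scan identifier.
spectra_by_scan_id = {17: [[501.0, 34.3], [503.4, 82.3], [499.2, 5.0]]}
peptides = [
    {'sequence': 'CDE', 'attr': {'scan_id': 17}},
    {'sequence': 'FGH', 'attr': {'scan_id': 99}},  # no such scan
]

n_attached = attach_spectra(peptides, spectra_by_scan_id, n=2)
```

Peptides whose scan_id has no matching scan are simply left without a spectrum field.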
Once the proteins data is appropriately filled in, we can pipe it through to the web-app generating method.
The function that generates the peptagram requires a dictionary as a parameter:
import peptagram.parse
import peptagram.proteins

peptagram.proteins.make_graphical_comparison_visualisation({
    'title': 'Mascot example',
    'proteins': proteins,
    'source_labels': [peptagram.parse.basename(name) for name in source_names],
    'color_names': ['P=1', 'P=0', ''],
    'out_dir': 'peptagram-mascot'
})
title gives the title that will be displayed across the top of the web-app
proteins is the dictionary that you generated, consistent with the structure given above
source_labels is an optional list of the names of the different experiments, used in the experiment-comparison mode
out_dir gives the directory that will hold the peptagram
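Before generating the web-app, it can help to confirm that there is one label per source. The sketch below assembles such a parameter dictionary using only the standard library. make_source_labels and count_sources are hypothetical helpers, and the file names are made up.

```python
import os


def make_source_labels(source_names):
    """Use the file name without directory or extension as the label."""
    return [os.path.splitext(os.path.basename(name))[0]
            for name in source_names]


def count_sources(proteins):
    """Return the largest number of sources found on any protein."""
    return max([len(p['sources']) for p in proteins.values()] or [0])


source_names = ['/data/run1/sample_A.mzid', '/data/run2/sample_B.mzid']
proteins = {'P1': {'sources': [{'peptides': []}, {'peptides': []}]}}

labels = make_source_labels(source_names)
params = {
    'title': 'Example',
    'proteins': proteins,
    'source_labels': labels,
    'out_dir': 'peptagram-example',
}

# One label is expected for each source.
labels_ok = len(labels) == count_sources(proteins)
```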