Proteomics with OMSSA | Sivome
Proteomics is the large-scale study of proteins (from Wiki :-)). Any -omics is essentially the large-scale study of a particular biomolecule. More examples: the large-scale study of metabolome(s) is metabolomics, the large-scale study of lipids is lipidomics, and so on. Some of these biomolecules (lipids, metabolites, proteins) can be quantified using a mass spectrometer. As the name says, the mass spectrometer measures mass, using properties of ionized molecules!
To analyze the output of mass spectrometers, we need a program. OMSSA is one such tool, developed by Lewis Geer at NCBI, National Institutes of Health. This program analyzes the data using a sophisticated algorithm to ultimately identify the protein. There are other software packages for analyzing other molecules, e.g., metabolites and lipids. There are also several other software packages for analyzing proteins, such as Byonic, MODa, PEAKS, pFind3, X!TANDEM, SearchGUI, and the list goes on. Here, let’s focus on OMSSA, as I have used this tool in the past.
Review of mass-spectrometry-based proteomics: this is an article by Ruedi Aebersold and Matthias Mann.
OMSSA can be downloaded here
Link to publication on OMSSA
The OMSSA download folder has a sample data file (*.dta) for testing. The data to be analyzed comes in many formats, and the large number of instrument vendors complicates this. However, there are some standard data formats, such as mzML and mzXML, that can be analyzed by any software, and there are tools out there to convert raw data to these formats!
Let’s look at what is available in the OMSSA download:
Directory of C:\OMSSA\omssa-2.1.9.win32
02/21/2019 12:14 PM <DIR> .
02/21/2019 12:14 PM <DIR> ..
02/21/2019 12:14 PM <DIR> contrib
12/06/2010 01:31 PM 2,444 disclaimer.txt
12/06/2010 01:31 PM 113,946 mods.xml
12/06/2010 01:31 PM 159 MSHHWGYGK.dta
12/06/2010 01:31 PM 554,832 msvcp80.dll
12/06/2010 01:31 PM 632,656 msvcr80.dll
12/06/2010 01:31 PM 73,134 OMSSA.xsd
12/06/2010 01:31 PM 2,551,808 omssa2pepXML.exe
12/06/2010 01:31 PM 2,953,216 omssacl.exe
12/06/2010 01:31 PM 2,457,600 omssamerge.exe
12/06/2010 01:31 PM 15,584 usermods.xml
10 File(s) 9,355,379 bytes
3 Dir(s) 817,232,166,912 bytes free
C:\OMSSA\omssa-2.1.9.win32>
Here omssacl.exe is the core program that processes raw data to identify proteins. MSHHWGYGK.dta is a sample data file. Simply put, this .dta mass-spectrum file holds peaks. If you plot it, the x-axis is mass (specifically mass/charge, a measure of charged ions) and the y-axis is counts (i.e., how many times the instrument sees this mass). Let’s look at the .dta file.
C:\OMSSA\omssa-2.1.9.win32>cat MSHHWGYGK.dta
1102.5 1
147.11 10
204.13 10
219.08 10
356.14 10
367.20 10
424.22 10
493.20 10
610.30 10
679.28 10
736.30 10
747.36 10
884.42 10
899.36 10
956.38 10
971.45 10
The 1st column holds the masses and the 2nd column (except on the first line) holds the counts. The first line is different: it gives the mass of the whole peptide (1102.5) and its charge (1). For simplicity, all peaks have the same count, i.e., 10.
Let’s plot this with R. We’ll go into the details of ggplot later.
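library(ggplot2) # plotting package for R
# Read the space-separated .dta file; columns are named V1 (m/z) and V2 (counts)
peptide_peaks = read.csv("./MSHHWGYGK.dta", header = FALSE, sep = " ")
peptide_mass = peptide_peaks[1,1]*peptide_peaks[1,2] # 1st row elements: peptide mass times charge (charge is 1 here)
peptide_peaks = peptide_peaks[-1,] # drop the 1st row so only fragment peaks remain
ggplot(data = peptide_peaks, aes(x=V1, y=V2)) + geom_bar(stat="identity") + labs(x="m/z", y = "Intensity")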
A mass spectrometer generates tons of such .dta files, and the goal of the program is to identify all the proteins it sees in the raw data. Since proteins are large (on average about 400 amino acids), they are cut into small pieces called peptides, which are then sent into the mass spec. Small pieces allow for better ionization and hence better identification.
The mass spec cleaves the peptide further. Let’s say the peptide is MSHHWGYGK. The mass spec uses a fragmentation technique to break it into even smaller sub-units, i.e., M, MS, MSH, … and also from the opposite end, K, GK, YGK, … and so on. (The figure above shows the masses of these pieces: M, MS, MSH, … from one end and K, GK, YGK, … from the opposite end.)
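To make the fragment masses concrete, here is a minimal R sketch that computes the singly charged fragment masses from both ends of MSHHWGYGK and compares them with the peaks in the sample .dta file. The residue masses are standard monoisotopic values; the helper name fragment_ions and the 0.01 tolerance are my own choices for illustration, not anything OMSSA exposes. Prefix fragments (M, MS, MSH, …) are conventionally called b ions and suffix fragments (K, GK, YGK, …) are called y ions.

```r
# Monoisotopic residue masses (Da) for the amino acids in this peptide
residue_mass <- c(M = 131.04049, S = 87.03203, H = 137.05891,
                  W = 186.07931, G = 57.02146, Y = 163.06333,
                  K = 128.09496)
proton <- 1.00728   # added to every singly charged fragment
water  <- 18.01056  # y ions also carry the C-terminal water

# fragment_ions is a hypothetical helper name used only in this sketch
fragment_ions <- function(peptide) {
  residues <- strsplit(peptide, "")[[1]]
  masses <- unname(residue_mass[residues])
  n <- length(masses)
  # b ions: running sum from the N-terminus (M, MS, MSH, ...)
  b <- cumsum(masses)[-n] + proton
  # y ions: running sum from the C-terminus (K, GK, YGK, ...)
  y <- cumsum(rev(masses))[-n] + water + proton
  names(b) <- paste0("b", seq_len(n - 1))
  names(y) <- paste0("y", seq_len(n - 1))
  list(b = b, y = y)
}

ions <- fragment_ions("MSHHWGYGK")
theoretical <- c(ions$b, ions$y)

# Peaks copied from MSHHWGYGK.dta (first line excluded)
observed <- c(147.11, 204.13, 219.08, 356.14, 367.20, 424.22, 493.20,
              610.30, 679.28, 736.30, 747.36, 884.42, 899.36, 956.38,
              971.45)

# For each observed peak, find the closest theoretical fragment
closest <- sapply(observed, function(mz) which.min(abs(theoretical - mz)))
assignment <- data.frame(
  mz    = observed,
  ion   = names(theoretical)[closest],
  error = round(observed - theoretical[closest], 4),
  row.names = NULL
)
print(assignment)
```

With these assumptions, every peak in the file lines up with one of b2 to b8 or y1 to y8 to within 0.01. The single-residue b1 fragment (M alone) is not listed in the file.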
So, the goal of the software is to find these small pieces (M, MS, …), stitch them back into MSHHWGYGK, and use background information to see which protein the peptide (MSHHWGYGK) matches. This background information is usually given to the program in the form of a FASTA file or, in the case of OMSSA, a FASTA file formatted with makeblastdb.
All of this can be found in my git repo
The different variants of the FASTA file, such as CA2.fasta.p*, are the files created by makeblastdb. These go in as input to OMSSA as well.
Let’s run the program (finally!)
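The "which protein does this peptide belong to" step can be pictured with a small R sketch. This is only an illustration of the lookup idea, not how OMSSA searches its database internally. The two FASTA records are made-up toy sequences (not the real CA2 entry), and parse_fasta is a hypothetical helper name.

```r
# Toy FASTA records: hypothetical sequences for illustration only
fasta_lines <- c(
  ">toy_protein_A",
  "AAGMSHHW",
  "GYGKLLR",
  ">toy_protein_B",
  "MKTAYIAKQR"
)

# parse_fasta is a hypothetical helper: joins the sequence lines under each header
parse_fasta <- function(lines) {
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)            # which record each line belongs to
  headers <- sub("^>", "", lines[is_header])
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")
  setNames(as.vector(seqs), headers)
}

proteins <- parse_fasta(fasta_lines)
peptide <- "MSHHWGYGK"

# Keep the proteins whose sequence contains the peptide as an exact substring
hits <- names(proteins)[grepl(peptide, proteins, fixed = TRUE)]
start <- as.integer(regexpr(peptide, proteins[hits], fixed = TRUE))

print(data.frame(protein = hits, start = start))
```

Note that a sequence in a FASTA file may be wrapped over several lines, so the lines have to be joined before searching; in the toy record the peptide spans two lines.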
omssacl -i 1,4 -mf 3 -mv 1 -f MSHHWGYGK.dta -d CA2.fasta -oc omssa_sample.csv
More information on the arguments used can be found from the above links or in the OMSSA help file. More info here. In this command, -f names the input .dta file, -d the sequence database, and -oc the CSV output file; -i 1,4 selects the b and y ion series, and -mf and -mv set the fixed and variable modifications. Some arguments give specific information about how the sample was prepared. Other arguments describe the mass spectrometer's characteristics, e.g., which fragmentation technique was used.
Since this is a very simple input, the output, if everything goes right, looks something like this.
Truncated output from the above run
MSHHWGYGK | 2.61923815969567e-008 | 1101.493 | sp|P00918.2|CAH2_HUMAN RecName:
This means that the peak list matched “MSHHWGYGK” (which matches the file name as well, so it is very likely correct!). The very low E-value of 2.6e-8 also supports the answer being correct! The peptide has a mass of 1101.5 Da, which matches well with the mass on the first line of the .dta file (1102.5, which includes one extra proton).
From the FASTA file, the program matches the peptide to CAH2_HUMAN (UniProt P00918, carbonic anhydrase 2).
Probably soon, I’ll dive into some large-scale data and also start looking at other software.
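The mass comparison can be checked with a few lines of arithmetic. This sketch assumes the usual .dta convention that the first line holds the singly protonated peptide mass (MH+) followed by the charge, and it uses standard monoisotopic residue masses; the variable names are mine.

```r
proton <- 1.00728   # mass of a proton (Da)
water  <- 18.01056  # mass of water (Da)

# First line of MSHHWGYGK.dta: MH+ (assumed convention) and charge 1
mh_plus <- 1102.5

# Experimental neutral mass: remove the one proton included in MH+
experimental_mass <- mh_plus - proton

# Theoretical neutral mass: sum of residue masses plus one water
residue_mass <- c(M = 131.04049, S = 87.03203, H = 137.05891,
                  W = 186.07931, G = 57.02146, Y = 163.06333,
                  K = 128.09496)
residues <- strsplit("MSHHWGYGK", "")[[1]]
theoretical_mass <- sum(residue_mass[residues]) + water

delta <- experimental_mass - theoretical_mass
cat(sprintf("experimental: %.3f  theoretical: %.3f  difference: %.3f Da\n",
            experimental_mass, theoretical_mass, delta))
```

The experimental value rounds to 1101.493, the mass shown in the OMSSA output line, and it sits about 0.01 Da from the mass computed from the sequence.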