
Bioinformatics methods for protein identification using peptide mass fingerprinting data

ProQuest Dissertations and Theses, 2009
Dissertation
Author: Zhao Song
Abstract:
Protein identification using mass spectrometry is an important yet only partially solved problem in proteomics in the post-genomic era. The major techniques used in mass spectrometry are Peptide Mass Fingerprinting (PMF) and Tandem mass spectrometry (MS/MS). PMF is faster and more economical than MS/MS and is widely applicable in many fields. Our work focuses on method development for protein identification using PMF data and covers three subjects: (1) Protein identification scoring function development: we developed the Probability Based Scoring Function (PBSF), which quantifies the degree of match between PMF data and a candidate protein. The derived score is used to rank the proteins and predict the identification. (2) Confidence assessment development: a scoring function may lead to false positive identifications since the top hit from a database search may not be the target protein. In addition, the identification scores assigned by a scoring function alone (raw scores) are not normalized, so a ranking based on raw scores may be biased. To address these issues, we have developed a statistical model to evaluate the confidence of the raw score and to improve the ranking of proteins for identification. (3) Software development: we implemented our computational methods in an open source package "ProteinDecision", which is freely available upon request.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1. INTRODUCTION
    1.1 High-throughput Data in Proteomics
    1.2 Protein Identification Pipeline
        1.2.1 Protein Extraction and Digestion
        1.2.2 Protein Separation
        1.2.3 Mass Spectrometry Analysis
        1.2.4 Protein Identification
    1.3 Mass Spectrometry Technology
        1.3.1 Peptide Mass Fingerprinting (PMF)
        1.3.2 Tandem Mass (MS/MS)
    1.4 Materials and Database
        1.4.1 Materials
        1.4.2 Database
    1.5 Existing Computational Methods
        1.5.1 MOWSE
        1.5.2 Profound
        1.5.3 Protein Prospector
        1.5.4 Normal Distribution Scoring Function
        1.5.5 Protein Identification Methods for Tandem Mass
    1.6 Dissertation Structure

2. PROTEIN IDENTIFICATION SCORING FUNCTIONS
    2.1 Introduction
    2.2 Data Sources
    2.3 Scoring Function
        2.3.1 Scoring Function Review
            2.3.1.1 Mowse
            2.3.1.2 Profound
            2.3.1.3 Normal Distribution Scoring Function
        2.3.2 Probability Based Scoring Function
            2.3.2.1 Framework
            2.3.2.2 Dependency of Peptides and Protein
            2.3.2.3 Peak Selection and Normalization
                2.3.2.3.1 Peak Selection
                2.3.2.3.2 Peak Normalization
            2.3.2.4 Modified PBSF
    2.4 Results
        2.4.1 Score Schema Comparison
        2.4.2 Comparison with Mascot and Protein Prospector
    2.5 Discussion

3. CONFIDENCE ASSESSMENT
    3.1 Introduction
    3.2 Theory Fundamental
        3.2.1 Binomial Distribution
        3.2.2 Central Limit Theory
        3.2.3 Normal Approximation to Binomial Distribution
    3.3 Confidence Assessment Approaches
        3.3.1 Central Limit Theory Approach
        3.3.2 Gram-Charlier Expansion Approach
    3.4 Results
        3.4.1 Study on Individual Protein Kinase 2
        3.4.2 Bench Mark of Entire Data Set
        3.4.3 Bootstrap for Confidence Interval
        3.4.4 Confidence Interpretation
    3.5 Discussion

4. SOFTWARE
    4.1 SpotLink
        4.1.1 Clickable 2D gel
        4.1.2 Web Pages of Protein Expression Profile
    4.2 ProteinDecision
        4.2.1 Functionality
        4.2.2 Design and Implementation
    4.3 ProteomeFactory

5. SUMMARY
    5.1 Work Conclusion
    5.2 Limitations
    5.3 Future works

APPENDIX

1. LIST OF AMINO ACIDS
2. LIST OF ABBREVIATIONS

VITA

LIST OF TABLES

Table

1. Gel Spots of Protein Standards
2. Chi-square Test for Dependency of peptide and protein
3. Scale parameter estimation of 12 species for logistic distribution
4. Scoring function prediction benchmark
5. Protein identification for mitogen-activated protein kinase 2
6. Benchmark for entire dataset
7. Fitting for Shuffled Raw Scores Using Normal and 2-Normix Models

LIST OF FIGURES

Figure

1. Protein Identification Pipeline
2. Trypsin digestion of a protein
3. Protein identification
4. An example of how a peptide contributes to the frequency table
5. (a) The distribution of peptide number in terms of protein categories
   (b) The distribution of peptide frequency in terms of protein categories
6. Logistic curves with different scale parameters
7. Scoring function comparison
8. (a) Protein ranks top-1 versus selected peaks from PMF spectra
   (b) Protein ranks top-10 versus selected peaks from PMF spectra
9. Binomial distribution under different probability (Produced by Tayste May 2008)
10. Normal approximation to Binomial Distribution with different parameters
11. (a) The distribution for three groups
    (b) Boxplot of -log(P) for three groups
12. Comparison of four fitting methods
13. Clickable 2D gel
14. Protein Expression Profile
15. GUI of ProteinDecision: The multi-ways for peak selection
16. GUI of ProteinDecision: the output panel for prediction result
17. Protein Decision Design

BIOINFORMATICS METHODS FOR PROTEIN IDENTIFICATION USING PEPTIDE MASS FINGERPRINTING DATA

Zhao Song

Dr. Dong Xu, Dissertation Supervisor

ABSTRACT

Protein identification using mass spectrometry is an important yet only partially solved problem in proteomics in the post-genomic era. The major techniques used in mass spectrometry are Peptide Mass Fingerprinting (PMF) and Tandem mass spectrometry (MS/MS). PMF is faster and more economical than MS/MS and is widely applicable in many fields. Our work focuses on method development for protein identification using PMF data and covers three subjects: (1) Protein identification scoring function development: we developed the Probability Based Scoring Function (PBSF), which quantifies the degree of match between PMF data and a candidate protein. The derived score is used to rank the proteins and predict the identification. (2) Confidence assessment development: a scoring function may lead to false positive identifications since the top hit from a database search may not be the target protein. In addition, the identification scores assigned by a scoring function alone (raw scores) are not normalized, so a ranking based on raw scores may be biased. To address these issues, we have developed a statistical model to evaluate the confidence of the raw score and to improve the ranking of proteins for identification. (3) Software development: we implemented our computational methods in an open source package “ProteinDecision”, which is freely available upon request.


Chapter 1. Introduction

Bioinformatics was originally developed for the analysis of biological sequences and now covers a variety of subject areas such as genomics, proteomics and structural biology. One definition of bioinformatics is “conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying ‘informatics techniques’ (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules on a large scale” or in short “a management information system for molecular biology and has many practical applications.” [1] As “Biological data are flooding in at an unprecedented rate” [2], one of the greatest challenges in biology has become a computing problem: applying computational techniques to extract knowledge from biological data. Combining biology and computer science is appropriate for three reasons: (1) Biology itself is information oriented. Genes, which are encoded with Adenine, Thymine, Cytosine and Guanine, play the most important role in an organism’s physiology and behavior. (2) Biological data are being generated at an ever faster rate with advancing technology [3]. (3) Developments in computer technology, including CPUs [4], memory, hard-disk storage and the Internet, have kept pace with the processing demands of biological experiments. With a bioinformatics approach, researchers are able to develop tools and resources to analyze biological data and to use them for prediction and discovery.


An important branch of bioinformatics is proteomics, the large-scale study of proteins [5-6]. Proteins are the main components of the physiological metabolic pathways of cells and a vital part of any organism. A protein is a chain of amino acids, of which there are 20 types, each with a specific molecular weight (except for leucine and isoleucine, which have the same molecular weight but different structures). The sequence of amino acids folds into a specific structure, which determines the protein’s function. The study of proteins is of great value in drug discovery, disease prevention, food authentication, the agricultural industry and other fields.

1.1. High-throughput Data in Proteomics

Proteomics studies require high-throughput data, which are typically genome scale, high dimensional and collected from multiple sources. Because of the limitations of current high-throughput techniques, such data may contain different types of noise that degrade data quality. Nevertheless, the following characteristics make high-throughput data essential for bioinformatics in the post-genomic era. First, high-throughput data supply thousands of measurements per sample, and the sheer amount of related data increases the need for better models to enhance inference [7]. Second, innovative computational models are in high demand for mining biological knowledge from integrated data, such as microarray, serial analysis of gene expression (SAGE) [8] and mass spectrometry data. Third, high-throughput data are economical and time-saving in biological experiments. The challenges are to control data quality efficiently through pre-processing, reliability assessment and validation, and to find a proper model for analyzing the integrated data systematically.

Genome-wide technologies for detecting protein abundance still lag behind those that measure mRNA, and only a few studies that measure protein abundance on a large scale are currently available [9-14]. Proteins are chemically modified through Post-Translational Modification (PTM) [15], in which the properties of a protein are changed by proteolytic cleavage or by the addition of a modifying group to one or more amino acids. PTM is vital to protein function and can determine a protein’s activity state, localization and interactions with other proteins.

The basic idea of proteomics is to compare proteomes qualitatively and quantitatively under different conditions. Like microarray experiments, proteomics uses gels to separate and analyze multiple protein samples under different conditions in one batch. Gel electrophoresis can be performed in one dimension (SDS-PAGE, IEF, Native-PAGE), in two dimensions (2D-PAGE), or in a capillary. Several forms of PAGE exist and provide different types of information about the proteins. Non-denaturing PAGE, also called native PAGE, separates proteins according to their mass/charge ratio. SDS-PAGE, the most widely used electrophoresis technique, separates proteins primarily by mass. Two-dimensional PAGE (2D-PAGE) separates proteins by isoelectric point in the first dimension and by mass in the second dimension.

Another high-throughput technique is mass spectrometry, which can provide a pool of peptides from protein samples. By comparing the peptide pool with the pre-calculated peptide masses of a protein sequence, the similarity between the protein sample and a protein in the database can be scored. This approach can be applied to protein identification.

1.2. Protein Identification Pipeline

Protein identification determines the composition of proteins in a sample of animal cells, bacteria, plant tissue, etc., often by mapping these proteins to known ones. It usually follows the pipeline of extraction, separation, mass spectrometry analysis and identification shown in Figure 1. First, proteins are extracted from the organism. The extracted samples are then sent to an instrument for protein separation; the separated proteins deposit as spots in a 2D gel, each spot representing a certain kind of protein. A spot is then picked and submitted to digestion and mass spectrometry analysis. Finally, a mass spectrum is generated, and an algorithm compares it with the candidate proteins in a database and scores each of them. The top-scoring proteins are considered the best identifications.


Figure 1. Protein Identification Pipeline. [Flow chart. Steps: Collection, Extraction and Digestion, Separation (Spot Picking), Mass Analysis, Protein Identification. Materials and products: Protein Sources, Protein Samples, 2D Gel, Separated Sample, Mass Spectrum, Identified protein.]


1.2.1 Protein Extraction and Digestion

The basic principles of protein extraction are to be efficient and to avoid degradation [16]. Efficient extraction depends on breaking the interactions between proteins so that proteins bound in macromolecular assemblies are released. For this reason, solubilization methods can be applied for extraction, such as solubilization with the ionic detergent sodium dodecyl sulfate (SDS), after which the extract can be subjected to 2-D PAGE. Such extracts are close to optimal [17]. To avoid degradation, extraction should be carried out at a proper temperature and within a short time.

After the proteins are purified, the sample is ready for the actual analysis, in which the protein is treated with a specific protease and cut into small pieces at specific amino acid sites. Different proteases have different cleavage sites [18-19]. For example, Trypsin cuts proteins after Lys or Arg except when the residue is followed by Pro; Chymotrypsin preferentially cleaves at Trp, Tyr and Phe in position P1 (high specificity) and to a lesser extent at Leu, Met and His in position P1 [20]; Arg-C proteinase preferentially cleaves at Arg in position P1 [20]. The most widely used protease is Trypsin because its Arg/Lys specificity is seen very clearly with pure alpha- and beta-trypsins [20]. In-gel digestion hydrolyzes the peptide bond between amino acids, consuming a water molecule and forming the peptide fragments, as shown in Figure 2.


Figure 2. Trypsin digestion of a protein
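The Trypsin rule stated in Section 1.2.1 (cleave after Lys or Arg unless the next residue is Pro) can be illustrated with a short in-silico digestion. The following is only a minimal sketch, assuming complete digestion with no missed cleavages; the function name `trypsin_digest` and the example sequences are hypothetical and are not taken from the ProteinDecision package.

```python
import re
from typing import List


def trypsin_digest(sequence: str) -> List[str]:
    """Split a protein sequence after K or R, except when followed by P.

    Assumes complete digestion (no missed cleavages).
    """
    # Zero-width split point: preceded by K or R and not followed by P.
    # Splitting on empty matches requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # A cleavage site at the very end of the sequence yields an empty string.
    return [p for p in pieces if p]


if __name__ == "__main__":
    # The R in "RP" is not a cleavage site because it is followed by P.
    print(trypsin_digest("AKRPGK"))
```

The same pattern-based approach could be adapted to other proteases by changing the regular expression, although real cleavage rules have further exceptions that this sketch ignores.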

1.2.2 Protein Separation

There are two approaches to analyzing a protein mixture. One is based on gel electrophoresis and the other on chromatography. As discussed in Section 1.1, gels can be used to separate a protein mixture. Unlike nucleic acids, proteins carry different charges, so they migrate through a gel at different rates under an applied electric field. Because proteins are denatured in the presence of a detergent and denatured proteins lose their structure, proteins treated with a detergent such as sodium dodecyl sulfate (SDS) migrate through the gel at rates that depend only on their mass [21]. The 2D-gel technique extends the SDS technique. In the first dimension, proteins lie along a lane, separated from each other by a property such as isoelectric point (pI). In the second dimension, the aligned proteins are separated by mass using SDS. Because two proteins are unlikely to be similar in both pI and mass, proteins are separated more effectively by 2-D electrophoresis than by 1-D electrophoresis [22].


Chromatography is another widely used technique for separating or analyzing complex mixtures, whose components are distributed between a stationary phase and a mobile phase. When a mixture is processed, different components pass through the system at different rates. The components repeatedly undergo sorption and desorption as the sample moves over the stationary bed. The time a component spends in the system is determined by its interaction with the stationary phase. The main chromatographic methods include liquid chromatography (LC), gas chromatography, affinity chromatography and supercritical fluid chromatography. Among liquid chromatography methods, High Performance Liquid Chromatography (HPLC) is used most frequently.

1.2.3 Mass Spectrometry Analysis

Mass spectrometry analysis is an important step in the pipeline. Its main purpose is to quantify the sample peptide fragments and generate the mass spectrum, which visualizes peptide abundance and distribution in a two-dimensional plot. Once the protein has been separated, two processes are applied to the sample.

The first process is ionization, which converts atoms or molecules to gas-phase ions by adding or removing charged particles. Matrix-Assisted Laser Desorption/Ionization (MALDI) allows the analysis of biomolecules such as proteins and peptides [23-25]. The ionization is triggered by a laser beam, and a matrix is used to protect the biomolecule from being destroyed. Electrospray Ionization (ESI) is a powerful technique for producing ions from large and complex species [26]. It is well suited to large molecules because it can handle peaks from multiply charged ions. Whereas MALDI usually ionizes the peptide mixture directly from the solid state, ESI is typically coupled with LC/HPLC and ionizes the sample from the liquid phase.

The second process is mass analysis, in which a mass analyzer captures ions and separates them according to their mass-to-charge ratio (m/z). Several techniques are in use, including sector [27], time of flight (TOF) [28], quadrupole, quadrupole ion trap [29] and Fourier transform ion cyclotron resonance (FT-MS) [30-31]. The time-of-flight (TOF) analyzer accelerates ions with an electric field and measures the time they take to reach the detector. If MALDI is used to convert the molecules to ions, the charges are almost always identical; therefore the velocities of the ions depend only on their masses.
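The closing remark, that ion velocity depends only on mass when the charges are identical, follows from energy conservation. As a brief sketch, assume an ion of mass $m$ and charge $ze$ that is accelerated through a potential difference $U$ and then drifts over a field-free length $L$ (the symbols $U$ and $L$ are introduced here for illustration and do not appear in the original text):

$$
z e U = \tfrac{1}{2} m v^{2}
\quad\Longrightarrow\quad
v = \sqrt{\frac{2 z e U}{m}},
\qquad
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}} .
$$

Under these idealized assumptions the flight time is proportional to $\sqrt{m/z}$, so for ions of equal charge the arrival time at the detector orders the ions by mass.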

1.2.4 Protein Identification

In Sections 1.2.1 to 1.2.3, proteins are extracted, cleaned, separated and digested through a series of biological experiments. The protein data are finally transformed into mass spectra, which can be used for protein identification. Protein identification methods fall into two categories: de novo sequencing and identification by database search. De novo sequencing [32] requires a well-executed MS-based protein study, which is not available in most cases; therefore, the database search method has much broader application. The major methods of protein identification combine mass spectrometry (MS) with a database search, in which heuristic algorithms assign scores to all the candidate proteins in a database. The general approach is to match the features derived from the MS spectrum of a protein sample against a database that contains the sequence fragments of each protein as digested by a specific enzyme. The degree of match is quantified with a score, and the scores are ranked to produce the search result. A good scoring function ranks the correct protein at the top of the search result, while a bad one may place more false positives in the list.

Figure 3. Protein identification

At the experimental stage, protein samples are collected first. The proteins are then extracted from the samples and separated into different spots in a 2D gel [33] according to their molecular weights and pI values. The proteins in a selected gel spot are mixed with a specific enzyme and digested into small pieces. Finally, MALDI-TOF generates the peptide mass fingerprinting (PMF) spectrum [34-35] for protein identification.

At the computational stage, all candidate proteins in the search database are theoretically digested using the same enzyme, and a simulated spectrum is created for each candidate for comparison. PMF protein identification is commonly carried out in two steps: (1) the experimental PMF spectral peaks are compared with the simulated ones, and (2) the proteins in the sequence database with the best matches are considered the top candidates for the proteins in the experimental sample.

Unlike a 2D gel, PMF provides at least some sequence-level information for protein identification. The PMF of a protein is like a fingerprint, which is unique to the molecule or represents a small population of the proteins in the database. After enzyme digestion, the collection of peptide masses (or mass-to-charge ratios) identified from the PMF spectrum is mapped to known proteins. Using the fingerprint to identify proteins relies on the ability to search sequences that are already present in databases. Hence, it is important that the whole genome of the organism has been sequenced so that all its proteins can be determined or predicted. When the whole genome sequence is unavailable, researchers often search an MS spectrum against the whole protein database (such as Swiss-Prot), trying to identify a protein in another species that is highly similar to its homolog in the native organism.
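The two-step database search described above (compare experimental peaks with simulated ones, then rank the candidates) can be sketched with a deliberately naive score: the number of experimental peaks that fall within a mass tolerance of any theoretical peptide mass. This is not the PBSF, MOWSE or any other scoring function discussed in this dissertation; the function names, the tolerance of 0.2 Da and the toy masses are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


def count_matches(peaks: List[float], theoretical: List[float], tol: float = 0.2) -> int:
    """Count experimental peaks within `tol` Da of any theoretical peptide mass."""
    return sum(any(abs(p - t) <= tol for t in theoretical) for p in peaks)


def rank_candidates(
    peaks: List[float], database: Dict[str, List[float]], tol: float = 0.2
) -> List[Tuple[str, int]]:
    """Score every candidate protein and sort from best to worst match count."""
    scored = [(name, count_matches(peaks, masses, tol)) for name, masses in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical theoretical peptide masses for two candidate proteins.
    toy_database = {
        "P1": [500.3, 800.4, 1200.6],
        "P2": [500.3, 950.5],
    }
    experimental_peaks = [500.35, 800.45, 1200.5, 1500.0]
    print(rank_candidates(experimental_peaks, toy_database))
```

A raw match count of this kind favors large proteins, which produce many peptides; this is one reason why probability-based scores and score normalization are needed in practice.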

1.3. Mass Spectrometry Technology

Mass spectrometry (MS) is an analytical technique that identifies the chemical composition of a compound or sample based on the mass-to-charge ratio of charged particles [36]. The development of MS can be divided into three periods: the creation period from the beginning of the 20th century to the 1950s, the period of growth from the 1950s to 1980, and the period of rapid development from the 1990s until now. In 1899, Joseph John Thomson [37-38] invented the first mass spectrometer and found the isotopes Ne20/Ne22 and the electron; in 1919, Francis William Aston built the first velocity-focusing mass spectrometer; in 1946, William E. Stephens proposed the concept of time-of-flight analyzers [39]. In 1953, the first annual conference on mass spectrometry was held. During this period, chemical ionization, fast atom bombardment and reflectron TOF techniques were all developed very quickly. In the 1980s and 1990s, ESI and MALDI were invented, marking a revolutionary development in mass spectrometry technology.

In this section, we introduce the two most popular techniques: Peptide Mass Fingerprinting (PMF) and Tandem Mass Spectrometry (MS/MS).

1.3.1 Peptide Mass Fingerprinting (PMF)

The first technique used for protein identification with mass spectrometry was PMF, which became popular in the early 1980s. Mass spectrometric technology at that time was used to analyze the proteins in gel samples, both quantitatively and qualitatively. Before analysis, the gel samples need to be washed and purified, and the proteins extracted. The extracted proteins are separated using an electrophoresis technique such as 2D gel. The proteins represented by the spots are digested into peptides with a particular protease such as Trypsin, and these peptides are ionized in the mass spectrometer. A widely used mass spectrometer for the PMF approach is MALDI-TOF. The distribution of m/z values is recorded in the spectrum, and the experimental values are compared with theoretical ones. Scoring schemes are designed to quantify the match between the experimental spectrum and each protein entry in the database. The best-scoring proteins are considered the candidates for the unknown protein. PMF methods are simple and direct, and therefore economical and fast.

1.3.2 Tandem Mass (MS/MS)

Tandem Mass Spectrometry (MS/MS) is another commonly used technique in proteomics. It is typically used when accurate analysis of proteins is required. The sample proteins are digested with a specific protease such as Trypsin, the most commonly used enzyme in mass spectrometry analysis. The analysis has two steps.

The first step uses the MS as a separation unit. Individual tryptic peptides are separated and quantified in the first MS ion separation chamber, and then selected peptides are sent to the next MS chamber. The first MS produces a mass list that contains all the tryptic peptides; this is known as the precursor or parent ion scan (PIS). The list may contain peptides from other sources such as contaminant proteins. Peptides from contaminant proteins are also reflected in the mass spectrum, but they are actually noise in the list. The important task is to select effectively the masses to send to the next step for further analysis.

In the second step, the MS ion chamber breaks up each peptide sent from step one. The peptide is broken into its amino acids. This provides the signature of individual amino acids for the unknown proteins [40], so MS/MS gives more information than PMF; in particular, when peptides have equal molecular weights but different sequences, MS/MS can distinguish them. MS/MS is more accurate than PMF but is time consuming and costly.

1.4. Materials and Database

1.4.1 Materials

A given proteome often contains a high diversity of protein complexes. Since it is hard to achieve both qualitative and quantitative coverage of all these proteins, most researchers have focused on a selected subset of specific interest. Over the last few years, the number of proteomics studies has increased quickly, and many research groups have been able to generate data efficiently thanks to the improved resolving power of mass spectrometers. However, as the amount of data has grown, its overall quality has decreased. This requires compatibility of data interpretation, which makes the computational analysis of proteomics data more and more important.

1.4.2 Database

The database used for protein identification is sprot45 from UniProtKB/Swiss-Prot (last updated in January 2005), together with 40 proteins from soybean (generated after January 2005) that we have identified but that are not included in the database. The database has 163,275 proteins in total. It is stored in a specific format with eight fields for each entry: accession number, peptide number, peptide sequences, peptide masses, peptide lengths, protein sequence, protein name, and protein molecular weight. (This is a self-defined database format for preprocessing. We provide a package to transform a FASTA sequence file into this format.) The molecular weight of a peptide of n residues is calculated as

$$\mathrm{mass} = \mathrm{mass}_{\mathrm{water}} + \sum_{i=1}^{n} \mathrm{mass}_{\mathrm{residue}_i} \qquad \text{(Equation 1)}$$

In Equation 1, the amino acid mass can be calculated in two ways: as the Average mass or as the Monoisotopic mass. The two differ because of the different composition of Carbon isotopes in the chemical structure of the amino acid. Since in most cases the Monoisotopic mass results in more accurate prediction than the Average mass, we build our database on the Monoisotopic mass calculation. Equation 1 takes into account an amino-terminal hydrogen and a carboxy-terminal hydroxyl group, which sum to 18.015. In our study, we consider only complete Trypsin digestion of a protein, so no peptide includes a missed cleavage. In addition, we assume that the charge state of every peptide is 1 and that no post-translational modification exists in any peptide. We use only monoisotopic peaks.
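The preprocessing described in this section can be sketched as follows: a complete tryptic digestion without missed cleavages, the peptide mass of Equation 1, and a record with the eight fields of the database format. This is a minimal sketch under stated assumptions, not the package mentioned above. The cleavage rule (cut after K or R unless the next residue is P), the class and function names, and the example protein are hypothetical. The residue masses are standard monoisotopic values. The water constant is the monoisotopic value 18.01056, chosen to be consistent with the monoisotopic residue masses, while the text quotes 18.015; the `water` parameter can be changed to follow the text literally. The sketch stores neutral masses, since the text does not state whether a proton mass is added for the singly charged ion.

```python
from dataclasses import dataclass
from typing import List

# Standard monoisotopic residue masses in Da.
MONO_RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # assumption: monoisotopic H2O; the text quotes 18.015


def peptide_mass(sequence: str, water: float = WATER) -> float:
    """Equation 1: water mass plus the sum of the residue masses."""
    return water + sum(MONO_RESIDUE[aa] for aa in sequence)


def tryptic_digest(sequence: str) -> List[str]:
    """Complete digestion with no missed cleavages.

    Assumption: cut after K or R unless the next residue is P.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


@dataclass
class ProteinEntry:
    """Hypothetical record holding the eight fields of the database format."""
    accession: str
    peptide_number: int
    peptide_sequences: List[str]
    peptide_masses: List[float]
    peptide_lengths: List[int]
    protein_sequence: str
    protein_name: str
    protein_mw: float


def build_entry(accession: str, name: str, sequence: str) -> ProteinEntry:
    peptides = tryptic_digest(sequence)
    return ProteinEntry(
        accession=accession,
        peptide_number=len(peptides),
        peptide_sequences=peptides,
        peptide_masses=[peptide_mass(p) for p in peptides],
        peptide_lengths=[len(p) for p in peptides],
        protein_sequence=sequence,
        protein_name=name,
        # Assumption: protein weight computed with the same formula.
        protein_mw=peptide_mass(sequence),
    )


# Made-up accession, name, and sequence, for illustration only.
entry = build_entry("X00001", "example protein", "AKRPGKC")
print(entry.peptide_sequences)
print([round(m, 4) for m in entry.peptide_masses])
```

In the example sequence, the R followed by P is not cleaved, so the digest yields three peptides rather than four.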

1.5. Existing Computational Methods

Several computational tools have been developed for PMF protein identification. MOWSE [42] was an early software package for PMF protein identification, and EMOWSE (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/emowse.html) is the latest implementation of the MOWSE algorithm. MS-Fit in the Protein Prospector package (http://prospector.ucsf.edu/) [43] uses a variant of the MOWSE scoring scheme. It incorporates several new features, including constraints on the minimum number of peptides to be matched for a possible hit, the number of missed cleavages, and the target protein's molecular weight range. Mascot (Matrix Science Inc., http://www.matrixscience.com/) [44] is an extension of the MOWSE algorithm. It incorporates the same scoring scheme, but provides a probability-based score. ProFound (http://prowl.rockefeller.edu/) [45] uses Bayesian probability theory and an Expert System for protein identification, with a generalized probability score. OLAVPMF [46] applies a probabilistic model to estimate the ratio of two likelihoods between a list of experimental peptide masses and the corresponding list of expected ones. Probity [47] analytically calculates the risk of random matching between the experimental masses and the theoretical masses of a protein in a search database. ChemApplex [48] considers the peak intensity and the accuracy of the match between the experimental mass and the theoretical mass in its scoring function. Ossipova et al. [49] developed a method to optimize the parameters for PMF protein identification in database search. In this section, we briefly describe MOWSE, ProFound, Protein Prospector, and NDSF.

1.5.1 MOWSE

MOWSE [42] is one of the earliest scoring schemes in protein identification using PMF data, and it is still widely applied. The scheme is based on the number of possible matches within a target protein and the occurrence of the molecular weight of

Full document contains 112 pages