Our research lies at the intersection of statistics, computation, and biology, with a current focus on computational and statistical proteomics. Our representative research results are summarized below.
1. Algorithms and software for protein and post-translational modification identification and quantification
Searching mass spectrometry data against protein databases to identify protein sequences and post-translational modifications is central to proteomics research. In 2004, we proposed a new scoring function named "kernelized Spectral Vector Dot Product (KSDP)" and developed pFind 1.0, the first protein identification search engine in China (Bioinformatics, 2004, 20:1948-1954). pFind has been developed continuously since then and has evolved into the well-known pFind protein identification system and the pFind research group (http://pfind.ict.ac.cn).
The huge number of unexpected post-translational modifications on proteins is considered the "dark matter" of proteomic data. We have developed a variety of modification discovery algorithms. We proposed pMatch, an open spectral library search algorithm that discovers unexpected modifications by comparing the similarity between modified and unmodified spectra. The pMatch paper was accepted and presented at ISMB (2010), one of the top conferences in bioinformatics, and published in Bioinformatics (2010). pMatch is now frequently cited in the field of spectral library search and modification discovery. Building on pMatch, we recently developed pMatchGlyco, an algorithm for identifying glycosylation (BioMed Research International, 2018).
We developed DeltAMT, an algorithm that clusters mass spectra using peptide mass and retention time information to discover high-abundance modification types (Molecular & Cellular Proteomics, 2011). In a study of core-fucosylated glycoprotein identification carried out in collaboration with the State Key Laboratory of Proteomics of China, DeltAMT and other data analysis methods were used to identify what was then the largest set of core-fucosylated sites (Molecular & Cellular Proteomics, 2010).
We developed PTMiner, a high-accuracy probabilistic algorithm for modification localization and quality control in open (mass-tolerant) database search (Molecular & Cellular Proteomics, 2019). Through an iterative process, the algorithm automatically learns the prior probability, the mass-matching error distribution, and the matching-peak intensity distribution from the mass spectral data, and uses the continuously updated prior and the two distributions to estimate the posterior probability of each candidate modification site more accurately. We used PTMiner to analyze the modifications in the large-scale data of the human proteome draft and localized more than one million modifications at 1% FDR, systematically characterizing known and unknown modifications in the human proteome. The paper was at one point the second "most read" paper after it was published online. Based on the PTMiner algorithm, we developed SAVControl, a quality control method for protein amino acid mutations (which can be treated as a special type of modification), published in Journal of Proteomics (2018).
In protein quantification, mass spectrometry signals are subject to considerable randomness: 1) some peptides are detected while others are not, and 2) peptides at the same concentration can differ greatly in mass spectrometry signal intensity. This randomness seriously reduces the accuracy of protein quantification. To address these problems, we proposed the concept of the quantitative mass-spectrometry efficiency of peptides and developed LFAQ, a new absolute protein quantification algorithm based on predicted peptide quantitative efficiencies (Analytical Chemistry, 2019a). We then proposed incorporating peptide digestibility into the peptide detectability prediction model and developed AP3, a peptide detectability prediction algorithm based on random forests (Analytical Chemistry, 2019b).
2. Proteomics data FDR control methods and applications
While big data give us great opportunities to discover new knowledge, they also carry substantial risks and pitfalls of false discoveries. False discovery rate (FDR) analysis in high-dimensional statistical inference is considered one of the most important advances in statistics. In multiple hypothesis testing, the FDR is defined as the expected proportion of falsely rejected hypotheses among all rejected hypotheses. The original paper proposing the FDR (Benjamini and Hochberg, J. R. Stat. Soc. B, 1995) has been cited more than 57,000 times, which shows its importance and influence. Leading researchers on FDR include the well-known statisticians Bradley Efron, John Storey and Emmanuel Candès.
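As a concrete illustration of FDR control in multiple testing, the step-up procedure introduced in the Benjamini and Hochberg paper cited above can be sketched as follows. This is a minimal sketch for independent p-values; the function name `benjamini_hochberg` and the example p-values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (sorted) indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is below its rank-dependent threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    # Step-up: reject every hypothesis ranked at or before k.
    return sorted(order[:k])


# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)  # [0, 1]
```

Only the two smallest p-values fall below their thresholds (0.005 and 0.010), so only those two hypotheses are rejected at the 5% level.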
In particular, accurately estimating the FDR of subgroups of hypothesis tests is a difficult problem, first raised by Bradley Efron (Ann. Appl. Stat. 2:197-223, 2008). This problem is of practical importance in proteomics. We were the first to study mathematically the problem of FDR estimation for subgroups of peptide identifications (such as modified peptides) in proteomic data analysis. Using Bayesian analysis, we proved that the subgroup FDR and the combined FDR are not equal to each other under the same score threshold. We therefore proposed the principle of separate subgroup filtering and FDR estimation, and derived a series of related theoretical results (Statistics and Its Interface, 2012).
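The gap between subgroup and combined FDR can be illustrated with a small numerical sketch. The counts below are hypothetical (not taken from any dataset), and the simple decoy/target ratio is assumed as the FDR estimate; the helper name `decoy_fdr` is likewise an assumption made for this example.

```python
def decoy_fdr(n_target, n_decoy):
    """Estimate FDR as the ratio of decoy to target identifications."""
    return n_decoy / n_target if n_target else 0.0


# Hypothetical identification counts above one shared score threshold.
unmodified = {"target": 9500, "decoy": 70}
modified = {"target": 500, "decoy": 30}

combined_fdr = decoy_fdr(
    unmodified["target"] + modified["target"],
    unmodified["decoy"] + modified["decoy"],
)
unmodified_fdr = decoy_fdr(unmodified["target"], unmodified["decoy"])
modified_fdr = decoy_fdr(modified["target"], modified["decoy"])

print(f"combined:   {combined_fdr:.4f}")
print(f"unmodified: {unmodified_fdr:.4f}")
print(f"modified:   {modified_fdr:.4f}")
```

With these assumed counts, the combined estimate is 1%, while the modified subgroup alone has an estimated FDR of 6%. Filtering all identifications at a 1% combined FDR would therefore understate the error rate among modified peptides, which is the motivation for filtering subgroups separately.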
Building on this theoretical analysis, we proposed a simpler and more intuitive relationship between the subgroup FDR and the combined FDR, and developed Transfer FDR, an accurate FDR estimation method for small subgroups of peptide identifications (Molecular & Cellular Proteomics, 2014). The rationale of Transfer FDR is as follows. When the abundance of the modification to be identified is low, direct FDR estimation is severely inaccurate because the sample size is too small. Based on observation and analysis of real data, we devised a method for estimating the conditional probability that an erroneously identified peptide is a modified peptide. From this estimate, we obtain a quantitative relationship between the subgroup FDR of modified peptides and the combined FDR of all peptides. Through this relationship, the subgroup FDR can be inferred indirectly from the combined FDR, which can usually be estimated accurately. This overcomes the difficulty of estimating the FDR of small subgroups with insufficient sample size.
We applied the above subgroup FDR analysis and the Transfer FDR method to a number of special identification problems. For example, in a study of FDR estimation for novel genes identified by six-frame translation in proteogenomics, we found that if the combined FDR is used, the gene annotation ratio is the dominant factor affecting the real FDR of new genes (new peptides) (Bioinformatics, 2015). The Transfer FDR method was also successfully applied to the quality control of open modification search (Molecular & Cellular Proteomics, 2019) and amino acid mutation identification (Journal of Proteomics, 2018). In addition, it was used in a collaborative study of primate-specific gene identification (Genome Research, 2019).
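One way to read the relationship described above is through Bayes' rule: among identifications above a score threshold, P(incorrect | modified) = P(modified | incorrect) × P(incorrect) / P(modified). The sketch below implements only this identity. The function name, the parameter `gamma` (the conditional probability that an incorrect identification is a modified peptide), and the example numbers are assumptions made for illustration; how that probability is actually estimated in the published method is not reproduced here, and the exact published formula may differ.

```python
def transferred_subgroup_fdr(combined_fdr, n_all, n_sub, gamma):
    """Infer the subgroup FDR from the combined FDR via Bayes' rule.

    combined_fdr: estimated FDR of all identifications above the threshold
    n_all:        number of all identifications above the threshold
    n_sub:        number of subgroup (e.g. modified) identifications among them
    gamma:        assumed P(identification is in the subgroup | it is incorrect)
    """
    if n_sub <= 0 or n_all < n_sub:
        raise ValueError("need 0 < n_sub <= n_all")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be a probability")
    # P(subgroup) is approximated by the observed fraction n_sub / n_all.
    estimate = gamma * combined_fdr * n_all / n_sub
    return min(1.0, estimate)  # an FDR cannot exceed 1


# Hypothetical inputs: 1% combined FDR, 500 modified peptides out of 10,000
# identifications, and 30% of incorrect identifications assumed to be modified.
subgroup_fdr = transferred_subgroup_fdr(
    combined_fdr=0.01, n_all=10_000, n_sub=500, gamma=0.3
)
print(f"{subgroup_fdr:.4f}")
```

The point of the design is that `combined_fdr` is computed from all identifications and so is stable, whereas counting decoys directly inside a small subgroup would rest on only a handful of observations.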
3. Statistical inference and data mining
In the process of analyzing biological data, we developed several general statistical inference and data mining methods, moving a step beyond applied research toward methodological and theoretical research.
The target-decoy competition (TDC) strategy is the gold-standard method for FDR control in proteomic data. Although it has been used for many years, it remained an empirical method without a theoretical foundation. In this method, the ratio of the number of decoy results to the number of target results is usually used as an estimate of the FDR, but whether this controls the FDR (that is, keeps the real FDR below a specified threshold) was unknown. We found that a +1 correction to this estimate (the decoy count plus 1) strictly controls the FDR, and gave a theoretical proof of this conclusion (arXiv, 2015).
More importantly, we extended the corrected TDC method to the general multiple hypothesis testing problem (arXiv, 2018). Previous FDR control methods in multiple hypothesis testing were usually based on a null distribution of the test statistic. However, all types of null distributions, including theoretical, permutation-based and empirical ones, have inherent drawbacks. For example, a theoretical null distribution fails if the assumptions on the sample distribution are wrong. In addition, many FDR control methods require an estimate of the proportion of true null hypotheses, which is difficult to obtain and has not been fully resolved. We proposed a general TDC-based FDR control method using random permutations. Our method does not need to estimate the null distribution of the statistic or the proportion of true null hypotheses; it relies only on the ranking of the tests by some statistic or score. It constructs competitive decoy hypotheses from random sample permutations. We proved that this method rigorously controls the FDR. Simulation experiments show that our method controls the FDR more effectively than the Bayes and empirical Bayes methods and has greater statistical power.
Prof. Emmanuel Candès, a well-known statistician at Stanford University, developed the knockoff filter method in collaboration with Rina Foygel Barber (Annals of Statistics, 43:2055, 2015); it is quite similar to our general TDC-FDR method. However, our "+1" correction and FDR control theorem were given earlier, albeit in the context of mass spectrometry (Kun He, master's thesis, 2013). As recognized by Prof. William Noble of the University of Washington and Prof. Uri Keich of the University of Sydney in their recent papers (Journal of Proteome Research, 18:585-593, 2019; arXiv: 1907.01458, 2019), our results and those of Candès were obtained independently:
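The +1 correction can be sketched in a few lines. This is a minimal illustration, assuming each input entry is the winner of a target-versus-decoy competition, labeled by whether the winner was a decoy; the function name `tdc_threshold` and the example scores are hypothetical.

```python
def tdc_threshold(scores, is_decoy, alpha):
    """Find the lowest score cutoff whose corrected FDR estimate is <= alpha.

    The corrected estimate at a cutoff is (decoys + 1) / targets, counted
    over all identifications scoring at or above the cutoff.
    Returns (cutoff, n_targets, n_decoys), or None if no cutoff qualifies.
    """
    ranked = sorted(zip(scores, is_decoy), key=lambda pair: pair[0], reverse=True)
    n_target = n_decoy = 0
    best = None
    for i, (score, decoy) in enumerate(ranked):
        if decoy:
            n_decoy += 1
        else:
            n_target += 1
        # Only cut between distinct scores, so tied entries stay together.
        end_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if end_of_tie and n_target > 0 and (n_decoy + 1) / n_target <= alpha:
            best = (score, n_target, n_decoy)
    return best


# Hypothetical scores: ten targets, then one decoy, then one more target.
scores = [9.5, 9.1, 8.7, 8.2, 7.9, 7.5, 7.1, 6.8, 6.4, 6.0, 5.5, 5.1]
is_decoy = [False] * 10 + [True, False]
result = tdc_threshold(scores, is_decoy, alpha=0.1)
print(result)  # (6.0, 10, 0)
```

In this example the uncorrected ratio (decoys / targets) would accept all eleven targets down to a score of 5.1, since 1/11 is below 0.1. The corrected estimate is more conservative and stops at 6.0, where (0 + 1) / 10 equals the 10% level.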
“The +1 correction was proved by Barber and Candès (The Annals of Statistics, 43:2055, 2015) in the context of linear regression (see their “knockoff+” procedure) and by He et al. (arXiv, 2015) in the context of mass spectrometry (see their equation 25).”
—— Cited from (Journal of Proteome Research, 18:585-593, 2019)
“The TDC approach has been theoretically established (subject to a small finite-sample correction) by He et al. (arXiv, 2015) and independently, and in a much wider context, by Barber and Candès (The Annals of Statistics, 43:2055, 2015).”
—— Cited from (arXiv: 1907.01458, 2019)
In addition, to address the problem of protein homology prediction, we proposed several learning-to-rank algorithms based on kernel machines (e.g. SVM). Using local data normalization and support-vector down-sampling, we won the Champion Award (Tied for 1st Place Overall, Honorable Mentions for Squared Error and Average Precision in the protein homology prediction task) in the ACM KDDCUP-2004 data mining competition. This was the first time that Chinese researchers won the championship in KDDCUP, the most influential data mining competition worldwide. We later proposed a query-adaptive ensemble learning algorithm that achieved better performance (ISBRA, 2011).