Nucleic Acids Res. 2020 Aug 20;48(14):e83.
doi: 10.1093/nar/gkaa498.

NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses

Free PMC article

Shisheng Wang et al. Nucleic Acids Res.

Abstract

Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools tailored for the proteomics community. Herein, we developed a user-friendly and powerful stand-alone software, NAguideR, to enable the implementation and evaluation of 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quantitative consistency between different charge states of the same peptide, different peptides belonging to the same protein, and individual proteins participating in protein complexes and functional interactions. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables, respectively, all generated by data-independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to distinguish the optimal imputation methods for DIA-MS experiments from sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data analysis and interpretation. NAguideR is freely available at http://www.omicsolution.org/wukong/NAguideR/ and the source code is available at https://github.com/wangshisheng/NAguideR/.

Figures

Figure 1. The overall workflow of NAguideR. (A) Uploading of original proteomics data with missing values (NAs). (B) Optional data quality control step for removing proteins/peptides with a high proportion of NAs or a large coefficient of variation (CV). (C) Missing value imputation based on the embedded methods. (D) Performance evaluation by multiple criteria (four classic criteria and four proteomic criteria). (E) Selection of well-performing imputation methods guided by the classic criteria and proteomic criteria.
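The optional quality control step in panel (B) can be illustrated with a minimal Python sketch. This is not NAguideR's own code: the function names, the example table, and the thresholds (`max_na=0.5`, `max_cv=0.3`) are assumptions chosen only for illustration, and the CV is computed here on the observed (non-missing) values of each row.

```python
import math
from statistics import mean, stdev

def is_missing(value):
    """Treat None and NaN as missing values (NAs)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def na_fraction(values):
    """Fraction of entries in a row that are missing."""
    return sum(1 for v in values if is_missing(v)) / len(values)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of the observed values."""
    observed = [v for v in values if not is_missing(v)]
    if len(observed) < 2 or mean(observed) == 0:
        return math.inf  # CV is undefined; treat the row as failing the filter
    return stdev(observed) / mean(observed)

def quality_filter(table, max_na=0.5, max_cv=0.3):
    """Keep rows whose NA fraction and CV are both within the (hypothetical) limits."""
    return {
        name: values
        for name, values in table.items()
        if na_fraction(values) <= max_na
        and coefficient_of_variation(values) <= max_cv
    }

# Hypothetical peptide intensities across four samples.
table = {
    "PEP_A": [10.0, 11.0, None, 10.5],  # one NA, low CV: kept
    "PEP_B": [None, None, None, 9.0],   # 75% NAs: removed
    "PEP_C": [1.0, 20.0, 5.0, 40.0],    # no NAs, large CV: removed
}
kept = quality_filter(table)
print(sorted(kept))
```

In practice the CV would usually be computed within each experimental group rather than across all samples; the single-group version above keeps the sketch short.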
Figure 2. Systematic evaluation analysis of the PhosDIA dataset. (A) Pearson correlation analysis of the original intensities and imputed intensities based on 23 methods. Density plots illustrate the correlation in detail between the original values and imputed values from the minimum, SVD, and Impseq methods as examples. NA in the correlation matrix means ‘No Result’ because the standard deviations of imputed values from the zero and minimum methods are equal to 0, and hence the cor function returns NA. (B) Comparison of the distribution of the correlation coefficient among original values and 23 imputation methods under the four proteomic criteria. The distribution of the comprehensive scores of the 23 imputation methods under the four classic criteria (C) and the four proteomic criteria (D). ‘Normalized values’ here means every score is divided by the corresponding maximum value.
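Two details of this caption can be made concrete with a short Python sketch: why a constant imputation (such as zero or minimum) yields no correlation result, and what dividing every score by the maximum looks like. The function names and the numbers are hypothetical, and the sketch mirrors the behavior described in the caption rather than reproducing NAguideR's code.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    if len(set(x)) == 1 or len(set(y)) == 1:
        # A constant vector has standard deviation 0, so the coefficient
        # is undefined (the caption reports this case as NA).
        return math.nan
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def normalize_by_max(scores):
    """Divide every score by the largest score, so the best method scores 1."""
    top = max(scores.values())
    return {method: score / top for method, score in scores.items()}

# Hypothetical held-out original values and two sets of imputed values.
original = [1.0, 2.0, 3.0, 4.0]
imputed_minimum = [0.5, 0.5, 0.5, 0.5]   # every NA replaced by the same value
imputed_varied = [2.0, 4.0, 6.0, 8.0]

r_minimum = pearson(original, imputed_minimum)
r_varied = pearson(original, imputed_varied)
normalized = normalize_by_max({"method_a": 2.0, "method_b": 4.0})
print(r_minimum, r_varied, normalized)
```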
Figure 3. The score distribution of every imputation method based on the proteomic criteria in the three proteomics datasets with different numbers of biological replicates. Left panel: PhosDIA, middle: PepSWATH, right: ProtSWATH. ‘Normalized values’ denotes that every score is divided by the corresponding maximum value. ‘10 versus 10’ means that there are 10 replicates in each group (marked in dark blue), and ‘3 versus 3’ means that there are three replicates in each group (marked in red).
Figure 4. Across-sample quantitative correlation coefficients obtained by different NA imputation methods. Comparisons of the quantitative correlation coefficients for original values and imputed values are shown, derived under the ACC_Charge criterion by the 23 imputation methods and the ‘Requantification’ method for the PepSWATH dataset. The adjusted R squared (R²) of each result was also obtained with the ‘lm’ function and is shown for every imputation method. ‘Requant’ denotes the ‘Requantification’ method in the OpenSWATH software.
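For readers unfamiliar with the adjusted R squared reported by a linear model fit, the following Python sketch computes it for a simple regression with one predictor and an intercept, using the standard formula 1 - (1 - R²)(n - 1)/(n - 2). The function name and the data are hypothetical; this is an illustration of the statistic, not the analysis code behind the figure.

```python
def adjusted_r_squared(x, y):
    """Adjusted R^2 of the least-squares fit y = intercept + slope * x."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three points for one predictor plus intercept")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1 - ss_res / ss_tot
    # Penalize for the single predictor: n - 2 residual degrees of freedom.
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)

# Hypothetical correlation coefficients: original data (x) vs. imputed data (y).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
adj_r2 = adjusted_r_squared(x, y)
print(round(adj_r2, 2))  # 0.46
```

Here the plain R² is 0.64; the adjustment lowers it to 0.46 because only four points support the fit.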
Figure 5. Differential expression and simulation analysis of the PhosDIA dataset. Volcano plots of the original full data (labeled ‘Gold Standard’) (A), imputed data from the Impseq method (B), the Seq-KNN method (C), and the minimum method (D), and imputed data from the Impseq method for five (labeled ‘Random 5’) (E) and three (labeled ‘Random 3’) (F) randomly selected biological replicates in each group. ‘Down’ means down-regulated phosphopeptides and ‘Up’ means up-regulated phosphopeptides. (G) Cloud-rain plots indicating the number of differentially expressed peptides for the 100 randomly selected datasets of ‘Random 5’ and ‘Random 3’. The solid pink line marks the number of differentially expressed peptides from the gold standard samples. Dashed red, blue, and yellow lines indicate the distribution of the numbers of differentially expressed peptides from each imputation method with all, Random 5, and Random 3 samples, respectively.
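The ‘Random 5’ and ‘Random 3’ simulations in panel (G) rest on repeatedly drawing a subset of replicates from each group. A minimal Python sketch of such a draw is given below; the sample labels, the group size of 10, the seed, and the function name are assumptions for illustration, and the caption does not specify how the draws were actually implemented.

```python
import random

def subsample_replicates(group_a, group_b, k, n_draws, seed=0):
    """Draw k replicates without replacement from each group, n_draws times."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    draws = []
    for _ in range(n_draws):
        subset_a = sorted(rng.sample(group_a, k))
        subset_b = sorted(rng.sample(group_b, k))
        draws.append((subset_a, subset_b))
    return draws

# Hypothetical sample labels: 10 biological replicates per group.
group_a = [f"A{i}" for i in range(1, 11)]
group_b = [f"B{i}" for i in range(1, 11)]

random_5 = subsample_replicates(group_a, group_b, k=5, n_draws=100)
random_3 = subsample_replicates(group_a, group_b, k=3, n_draws=100)
print(len(random_5), random_5[0])
```

Each of the 100 draws would then be imputed and tested for differential expression, and the counts of significant peptides compared with the count from the full data.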

