Pseudo-Time Analysis of Single-Cell Transcriptome Data Based on Natural Language Processing
DOI: 10.12677/BIPHY.2022.102004. Supported by the National Natural Science Foundation of China.
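Authors: Yuer Lu, Huan Hu, Jianwei Shuai, Hai Lin*: Department of Physics, Xiamen University, Xiamen, Fujian; Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang. Lingling Chen, Feng Cheng: Department of Physics, Xiamen University, Xiamen, Fujian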
Keywords: Single-Cell Sequencing, Pseudo-Time, Trajectory Inference, Natural Language Processing, Genomics
Abstract: Many powerful analytical models and algorithms have been proposed for single-cell transcriptome sequencing data, supporting cell clustering, cell type identification, pseudo-time trajectory inference, cellular RNA dynamics, gene regulatory network inference, and RNA velocity analysis. This paper proposes a method that introduces natural language processing techniques into single-cell transcriptome data analysis. The algorithm first uses TF-IDF to represent how strongly the expression level of each gene influences cell function. It then treats the gene expression changes that arise during cell evolution and development as sentences in a natural language, so that natural language text analysis can be applied to the developmental progression of single-cell transcriptomes. Random walks on the gene network generate gene sequence texts, from which embedded vector representations of genes and of cells in the gene space are obtained, enabling pseudo-time visualization of single-cell transcriptome data. The results indicate that the model is an effective method for pseudo-time analysis of cell development from single-cell data.
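The first two steps named in the abstract, TF-IDF weighting and random walks on a gene network, can be illustrated with a minimal sketch. It is not the authors' implementation. The smoothed IDF variant, the correlation threshold used to build the gene graph, the toy count matrix, and the function names `tfidf` and `random_walks` are all assumptions made for illustration; the abstract does not specify how the gene network is constructed.

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weights for a cells x genes count matrix.

    Cells play the role of documents and genes the role of terms.
    The smoothed IDF used here is an assumption, not the paper's formula.
    """
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Term frequency: share of each gene in a cell's total counts.
    tf = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    n_cells = counts.shape[0]
    df = (counts > 0).sum(axis=0)  # number of cells expressing each gene
    idf = np.log((1 + n_cells) / (1 + df)) + 1.0
    return tf * idf

def random_walks(adjacency, walk_length, walks_per_node, rng):
    """Uniform random walks on an unweighted gene graph.

    Each walk is a list of gene indices, i.e. one "sentence" of genes.
    """
    adjacency = np.asarray(adjacency)
    walks = []
    for start in range(adjacency.shape[0]):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = np.flatnonzero(adjacency[walk[-1]])
                if neighbors.size == 0:  # isolated gene: stop early
                    break
                walk.append(int(rng.choice(neighbors)))
            walks.append(walk)
    return walks

# Toy data (hypothetical): 4 cells x 4 genes.
counts = np.array([[5, 0, 1, 0],
                   [4, 1, 0, 0],
                   [0, 3, 0, 2],
                   [0, 2, 1, 3]])
weights = tfidf(counts)

# Assumed graph construction: link genes whose TF-IDF profiles correlate.
corr = np.corrcoef(weights, rowvar=False)
adjacency = np.abs(corr) > 0.5
np.fill_diagonal(adjacency, False)

rng = np.random.default_rng(0)
walks = random_walks(adjacency, walk_length=5, walks_per_node=2, rng=rng)
```

In a full pipeline the walks would be passed to a word-embedding model to obtain gene vectors, and cell vectors would then be derived from them; that step is omitted here because the abstract does not give its details.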
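Citation: Lu, Y., Hu, H., Chen, L., Cheng, F., Shuai, J. and Lin, H. (2022) Pseudo-Time Analysis of Single-Cell Transcriptome Data Based on Natural Language Processing. Biophysics, 10(2), 31-38. https://doi.org/10.12677/BIPHY.2022.102004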
