Research on Statistical Analysis of Gene Splicing Sites
DOI: 10.12677/HJCB.2016.63006
Authors: Hongbin Li*, Guangzhong He: School of Medicine, Xianyang Vocational Technical College, Xianyang, Shaanxi
Keywords: Gene, Splice Site, Statistical Analysis
Abstract: Eukaryotic genes consist of alternating exons and introns. Exon sequences are retained after transcription, whereas intron sequences are spliced out. Numerous molecular biology experiments have confirmed that the splice sites between exons and introns follow the GT-AG rule. However, only a small fraction of the sequences containing GT or AG are true splice sites, and prediction accuracy still needs to be improved. In this study, the HS3D splice site training dataset was downloaded, and a statistical analysis was carried out on the sequences near the splicing site of the promoter. When the true and false sequences extended more than seven positions on both the left and the right side of the splice site, they showed high specificity. These specific sequences can be used as features for training, so that true and false splice sites can be identified accurately.
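As an illustration of the kind of position-wise statistics the abstract describes, the following minimal Python sketch extracts a window of bases on each side of a candidate GT dinucleotide and tabulates the base frequencies at every position. It is not the authors' code. The function names `flank` and `position_frequencies`, the window parameter `k`, and the toy sequences are assumptions introduced for this example, and the sequences are made up rather than taken from HS3D.

```python
from collections import Counter

BASES = "ACGT"

def flank(seq, site, k=7):
    """Return k bases left of the dinucleotide starting at `site`,
    the dinucleotide itself, and k bases to its right."""
    start, end = site - k, site + 2 + k
    if start < 0 or end > len(seq):
        raise ValueError("window extends past the sequence")
    return seq[start:end]

def position_frequencies(windows):
    """Relative frequency of each base at each position,
    computed across a list of equal-length windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    table = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        table.append({b: counts[b] / len(windows) for b in BASES})
    return table

# Toy, made-up sequences (NOT HS3D data); the candidate GT starts at index 7.
toy_sequences = [
    "ACTCAAGGTAAGTATC",
    "TTGACAGGTGAGTCCA",
    "CCATGAGGTAAGAGTT",
]
toy_windows = [flank(s, 7, k=7) for s in toy_sequences]
toy_table = position_frequencies(toy_windows)

# Print one row per position, numbered relative to the G of the GT dinucleotide.
for offset, row in enumerate(toy_table, start=-7):
    print(offset, {b: round(f, 2) for b, f in row.items()})
```

Running the same tabulation separately on true and false sites and comparing the two tables position by position is one simple way to see which flanking positions differ between the two classes.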
Cite this article: Li, H. and He, G. (2016) Research on Statistical Analysis of Gene Splicing Sites. Hans Journal of Computational Biology, 6(3), 41-49. http://dx.doi.org/10.12677/HJCB.2016.63006

References

[1] Sun, J. (1993) Predicting the Splicing Sites of mRNA by Neural Network. Acta Biophysica Sinica, 9, 127-131.
[2] Xia, H., Zhou, Q. and Li, Y. (2002) Application of Hidden Markov Model in the Recognition of Splicing Sites. Journal of Tsinghua University, 42, 1214-1217.
[3] Snyder, E.E. and Stormo, G.D. (1993) Identification of Coding Regions in Genomic DNA Sequences: An Application of Dynamic Programming and Neural Networks. Nucleic Acids Research, 21, 607-613.
http://dx.doi.org/10.1093/nar/21.3.607
[4] Zhang, L.R. and Luo, L.F. (2003) Splice Site Prediction with Quadratic Discriminant Analysis Using Diversity Measure. Nucleic Acids Research, 31, 6214-6220.
http://dx.doi.org/10.1093/nar/gkg805
[5] Cai, D., Delcher, A., Kao, B. and Kasif, S. (2000) Modeling Splice Sites with Bayes Networks. Bioinformatics, 16, 152-158.
http://dx.doi.org/10.1093/bioinformatics/16.2.152
[6] Yin, C. and Yau, S.T. (2007) Prediction of Protein Coding Regions by the 3-Base Periodicity Analysis of a DNA Sequence. Journal of Theoretical Biology, 247, 687-694.
http://dx.doi.org/10.1016/j.jtbi.2007.03.038
[7] Pollastro, P. and Rampone, S. (2002) HS3D, a Dataset of Homo Sapiens Splice Regions, and Its Extraction Procedure from a Major Public Database. International Journal of Modern Physics C, 13, 1105-1117.
http://dx.doi.org/10.1142/S0129183102003796