Crystal Structure Deduplication Algorithm Based on Interatomic Potential and Its Applications
DOI: 10.12677/CSA.2022.125131 (supported by research project funding)
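Authors: Yanting Kong: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong; Pin Chen, Hui Yan, Zhiguang Chen*: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong; National Supercomputer Center in Guangzhou, Guangzhou, Guangdong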
Keywords: Material Big Data, Structure Similarity, Data Deduplication
Abstract: With the development of science and technology, our ability to collect "big data" has far outpaced our ability to analyze it. Scientific research is shifting from traditional experimental observation, theoretical deduction, and computational simulation toward big-data research, forming the fourth research paradigm. Materials data informatics, an important branch of applied science, drives the design and discovery of new materials. To address data redundancy in the materials field, this paper proposes a deduplication algorithm based on the interatomic potential that effectively identifies identical or similar structures, thereby removing redundant data. Experimental results show that, compared with existing crystal structure deduplication algorithms, the proposed algorithm performs well in accuracy, robustness, and computational efficiency. The algorithm is further applied to more than 1.59 million entries from three well-known materials databases, ICSD, CSD, and COD, removing 101,643 identical and similar structures, and an open, shared de-redundant database is built from the result. The data are published at: https://matgen.nscc-gz.cn/.
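Cite this article: Kong, Y., Chen, P., Yan, H. and Chen, Z. (2022) Crystal Structure Deduplication Algorithm Based on Interatomic Potential and Its Applications. Computer Science and Application, 12(5), 1314-1330. https://doi.org/10.12677/CSA.2022.125131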
