构建信息检索系统:全局索引还是局部索引?
Building an Information Retrieval System: Global Indexing or Local Indexing?
DOI: 10.12677/SEA.2013.21002, PDF, HTML, XML, 下载: 3,410  浏览: 14,881  国家自然科学基金支持
作者: 王海涛, 岳 磅:深圳大学计算机与软件学院,深圳;赵艳琼:安徽移动网络部,合肥;韩家鑫:甲骨文公司,红木城,美国
关键词: 信息检索全局索引局部索引混合索引大数据分布式系统 Information Retrieval; Global Indexing; Local Indexing; Hybrid Indexing; Mass Data; Distributed System
摘要:

当今社会在生产与生活中产生的数据越来越多,要在海量的数据中搜索有用的信息,信息检索系统(IRS:Information Retrieval System,比如百度、谷歌等)是必不可少的工具。一个信息检索系统,特别是基于大规模数据集的信息检索系统,只有建立索引才能满足用户的检索需求,索引的好坏直接决定了信息检索系统的成败。数十年以来,对于信息检索系统中索引如何构建的研究一直没有中断,研究主要集中在对全局索引(Global Indexing)与局部索引(Local Indexing)及其混合类型(Hybrid Indexing)等结构的比较与探讨。本文详细介绍了几种索引的架构及其优缺点,回顾了相关的研究成果,分析了实际应用系统。最后,给出我们的观点与解决方案。

Abstract: Nowadays, there is more and more data generated in production and life. Information retrieval system (IRS, such as Baidu and Google) is an indispensable tool to search usable information from magnanimity information. For an IRS, especially based on large-scale data set, indexing is necessary. The index is good or bad directly determines the success or failure of the IRS. In the past decades, the research of indexing of IRS has been intensive. The research focus is comparison and discussion of global indexing, local indexing, hybrid indexing, etc. In this paper, these indexing are introduced, their advantages and disadvantages are discussed, and achievements of them are reviewed. Practical application system will be analyzed. Finally, our views and solutions will be given.

文章引用:王海涛, 赵艳琼, 韩家鑫, 岳磅. 构建信息检索系统:全局索引还是局部索引?[J]. 软件工程与应用, 2013, 2(1): 6-14. http://dx.doi.org/10.12677/SEA.2013.21002

参考文献

[1] M. Stonebraker. The case for shared nothing. Database Engi- neering Bulletin, 1986, 9(1): 4-9.
[2] A. Tomasic, H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrie- val systems. Proceedings of the Second International Conference on Parallel and Distributed Information Systems, San Diego, 1993.
[3] A. Abusukhon, M. P. Oakes, M. Talib and A. M. Abdalla. Com- parison between document-based, term-based and hybrid parti- tioning. IEEE Conference on Applications of Digital Informa- tion and Web Technologies, 4-6 August 2008: 90-95.
[4] B. B. Cambazoglu, A. Catal and C. Aykanat. Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems. Unpublished manuscript.
[5] J. Zhang, T. Suel. Optimized inverted list assignment in distri- buted search engine architectures. Proceedings of the 21st Paral- lel and Distributed Processing Symposium, 36-30 March 2007: 1-10.
[6] D. Bhagwat, K. Eshghi and P. Mehra. Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus, in KDD’07. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Disco- very and Data Mining, 2007: 105-112.
[7] A. Abusukhon, M. Talib and M. Oakes. An investigation into improving the load balance for term-based partitioning. In: R. Kaschek, et al., Eds., Proceedings of UNISCON 2008, The 2nd International United Information Systems Conference, Klagen- furt, 22-25 April 2008. Berlin, Heidelberg: Springer-Verlag, 2008: 380-392.
[8] C. Tang, S. Dwarkadas. Hybrid global-local indexing for effi- cient peer-to-peer information retrieval. Proceedings of Net- worked Systems Design and Implementation, Grand Hyatt, 29-31 Match 2004.
[9] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proceedings of SIGMOD, Boston, 18-21 June 1984.
[10] J. L. Bentley. Multidimensional binary search trees used for asso- ciative searching. Communications of the ACM, 1975, 18(9): 509-517.
[11] R. Mao, W. J. Xu, N. Singh and D. P. Miranker. An assessment of a metric space database index to support sequence homology. International Journal on Artificial Intelligence Tools, 2005, 14(5): 867-885.
[12] E. Chavez, G. Navarro, R. Baeza-Yates and J. L. Marroquin. Searching in metric spaces (survey). ACM Computing Surveys, 2001.
[13] R. Mao, S. R. Ramakrishnan, G. Nuckolls and D. P. Miranker. Case study: An inverted index for mass spectra similarity query and comparison with a metric-space method. Proceedings of the 3rd International Conference on Similarity Search and Appli- cations, Istanbul, 18-19 September 2010: 93-99.
[14] R. Mao, W. J. Xu, S. Ramakrishnan, G. Nuckolls and D. P. Miranker. On optimizing distance-based similarity search for biological databases. Proceedings of the 2005 IEEE Computa- tional Systems Bioinformatics Conference, Stanford University, 8-11 August 2005: 351-361.
[15] R. Mao, W. L. Miranker and D. P. Miranker. Pivot selection: Di- mension reduction for distance-based indexing. Journal of Dis- crete Algorithms, Elsevier, 2012, 13(1): 32-46.
[16] R. Mao, W. L. Miranker and D. P. Miranker. Dimension reduce- tion for distance-based indexing. Proceedings of the 3rd Inter- national Conference on Similarity Search and Applications, Is- tanbul, 18-19 September 2010: 25-32.
[17] L. A. Barroso, J. Dean and U. Holzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 2003, 23(2): 22- 28.
[18] J. Dean. Challenges in building large-scale information retrieval systems: Invited talk. Web Search and Data Mining, 2009.
[19] Hadoop Apache Project. http://hadoop.apache.org
[20] Apache HBase Home. http://hbase.apache.org
[21] D. Peng, F. Dabek. Large-scale incremental processing using distributed transactions and notifications. Proceedings of the 9th USENIX Symposium on Operating Systems Design and Imple- mentation, Vancouver, 4-6 October 2010.
[22] S. Melnik, et al. Dremel: Interactive analysis of web-scale data- sets. Proceedings of the VLDB Endowment, 2010, 3(1): 330-339.
[23] U. Hoelzle, L. A. Barroso. The datacenter as a computer: An in- troduction to the design of warehouse-scale machines. San Ra- fael: Morgan and Claypool Publishers, 2009.