Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus, in KDD’07

Authors:
D. Bhagwat, K. Eshghi, and P. Mehra

Keywords:
distributed indexing; document routing; index partitioning; scalability; similarity-based search

Abstract:
We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case when similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. To perform similarity-based search, also, only a small number of partitions are queried. Our approach is stateless and incremental. The decision as to which partitions the features of the document should be routed to (for storing at ingestion time, and for similarity based search at query time) is solely based on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
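
The routing idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the routing rule, so the rule used here (hash every feature, keep the few smallest hashes, and map each to a partition by modulo) is an assumption chosen only because it is stateless and depends solely on the document's features. The names `feature_hash`, `route`, `PartitionedIndex`, `NUM_PARTITIONS`, and `FANOUT` are hypothetical, and the partition count and fan-out are illustrative values, not figures from the paper.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 128  # assumed partition count (illustrative)
FANOUT = 3            # assumed number of partitions chosen per document


def feature_hash(feature: str) -> int:
    """Map a feature to a stable 64-bit integer (process-independent)."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(features, fanout=FANOUT, num_partitions=NUM_PARTITIONS):
    """Choose partitions from the features alone; no shared state is consulted.

    Assumed rule: take the `fanout` smallest feature hashes and reduce each
    modulo the partition count. Similar documents share most features, so
    they tend to share small hashes and land in overlapping partitions.
    """
    smallest = sorted({feature_hash(f) for f in features})[:fanout]
    return sorted({h % num_partitions for h in smallest})


class PartitionedIndex:
    """In-memory stand-in for feature indexes hosted on separate servers."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.num_partitions = num_partitions
        # Each partition maps feature -> set of document ids.
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]

    def ingest(self, doc_id, features):
        """Store all features of the document in each chosen partition."""
        for p in route(features, num_partitions=self.num_partitions):
            for f in features:
                self.partitions[p][f].add(doc_id)

    def search(self, features):
        """Query only the chosen partitions; rank by shared-feature count."""
        shared = defaultdict(set)
        for p in route(features, num_partitions=self.num_partitions):
            for f in set(features):
                for doc_id in self.partitions[p].get(f, ()):
                    shared[doc_id].add(f)
        ranked = ((doc_id, len(fs)) for doc_id, fs in shared.items())
        return sorted(ranked, key=lambda t: (-t[1], t[0]))
```

Because `route` is a pure function of the feature set, the same call serves both ingestion and query, and new documents can be added incrementally without rebalancing existing partitions. With the illustrative values above, each search touches at most 3 of 128 partitions.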
