Research Progress and Application of Document Vectorization Technology
DOI: 10.12677/jisp.2024.134036. Supported by research project funding.
Authors: Wang Tong, Lu Likun: School of Information Engineering, Beijing Institute of Graphic Communication, Beijing
Keywords: Document Vectorization, Vector Image, Deep Learning, Natural Language Processing
Abstract: Document vectorization is a technique that converts document content into a mathematical vector representation; in practice, this generally means converting a raster image into a vector image. Vectorization turns text data into a form that computers can interpret and process, so that documents can be preserved in full in vector-based formats (such as OFD and PDF), which opens up more possibilities for text processing, information retrieval, and related tasks in the printing workflow. This paper first introduces the background of document vectorization, then introduces traditional document vectorization models, next gives a comprehensive review of methods ranging from traditional approaches to deep learning-based ones and compares them, and finally discusses the application fields of document vectorization and the outlook for its development.
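Cite this article: Wang, T. and Lu, L. Research Progress and Application of Document Vectorization Technology [J]. Journal of Image and Signal Processing, 2024, 13(4): 416-426. https://doi.org/10.12677/jisp.2024.134036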
