Image Description Generation in Chinese Based on Keywords Guidance
Abstract: Image description generation can speed up the production of combined image-and-text content and therefore has broad application prospects. To meet practical needs, we propose a new method based on the encoder-decoder framework. Our model fuses image features and text features as input to guide description generation; the text features carry keyword information about the image and supplement the image features. Experimental results show that keyword information strengthens the mapping from an image to its description, that our model outperforms a model that does not fuse keyword information, and that different keywords give a degree of control over the generated description.
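The fusion step described in the abstract can be illustrated with a toy sketch. The abstract does not say how the image and keyword features are combined, so everything here is an assumption made for illustration: the keyword vectors are averaged, the result is concatenated with the image feature, and the names `mean_keyword_vector`, `fuse_features`, and the toy values are hypothetical. In a real system the fused vector would feed the decoder; that part is omitted.

```python
from typing import Dict, List, Sequence


def mean_keyword_vector(keywords: Sequence[str],
                        embeddings: Dict[str, List[float]],
                        dim: int) -> List[float]:
    """Average the embeddings of the known keywords (zeros if none are known)."""
    known = [embeddings[w] for w in keywords if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def fuse_features(image_feature: Sequence[float],
                  keywords: Sequence[str],
                  embeddings: Dict[str, List[float]],
                  dim: int) -> List[float]:
    """Concatenate the image feature with the averaged keyword vector.

    Assumption: concatenation is only one possible fusion operator; the
    source does not specify which one the model uses.
    """
    return list(image_feature) + mean_keyword_vector(keywords, embeddings, dim)


# Toy values, invented for illustration only.
toy_embeddings = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}
toy_image_feature = [0.2, 0.4, 0.6]

# "frisbee" has no embedding in the toy table, so it is skipped.
fused = fuse_features(toy_image_feature, ["dog", "grass", "frisbee"],
                      toy_embeddings, dim=2)
print(fused)  # [0.2, 0.4, 0.6, 0.5, 0.5]
```

Changing the keyword list changes only the trailing part of the fused vector, which is one simple way to picture how different keywords could steer the generated description while the image feature stays fixed.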
Cite this article: Shi Xiucong. Image Description Generation in Chinese Based on Keywords Guidance [J]. Computer Science and Application, 2020, 10(6): 1087-1097. https://doi.org/10.12677/CSA.2020.106113
