Methodological Pathways for Constructing a Zhuang Sentiment Lexicon in the Era of Large Language Models
DOI: 10.12677/ml.2026.145407
Author: Gao Yan, School of Foreign Languages, Guangxi Minzu University, Nanning, Guangxi
Keywords: Zhuang Language, Sentiment Lexicon, Large Language Models, Low-Resource Languages, Cross-Lingual Mapping
Abstract: Zhuang, the most widely spoken ethnic minority language in China, has long been a low-resource language in natural language processing, and fine-grained semantic resources such as sentiment lexicons are still missing. This paper traces the methodological evolution of sentiment lexicon construction and focuses on three emerging technical pathways of the large language model era: human-in-the-loop active learning, few-shot in-context learning, and parameter-efficient fine-tuning. For each pathway, it analyzes the methodological requirements and the conditions for practical deployment in light of Zhuang's current resources. The analysis shows that three intertwined factors, namely weak pretraining corpus coverage, the limited structure of bilingual resources, and a scarcity of specialized expertise, make it difficult for any single technical pathway to solve Zhuang sentiment lexicon construction on its own. Traditional and emerging methods therefore need to be combined flexibly according to the available resources. Among the three new pathways, few-shot in-context learning offers the strongest fit for current Zhuang research, owing to its lower technical threshold and reduced reliance on pretraining coverage.
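Cite this article: Gao, Y. (2026) Methodological Pathways for Constructing a Zhuang Sentiment Lexicon in the Era of Large Language Models. Modern Linguistics (现代语言学), 14(5), 328-334. https://doi.org/10.12677/ml.2026.145407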
