Text2SQL Training Corpus Generation Based on Bidirectional Enhancement and Multi-Stage Supervision
Abstract: To address the high annotation cost and limited scenario coverage of Text2SQL training corpora, this paper proposes a corpus generation framework based on bidirectional enhancement and multi-stage supervision. The method builds a bidirectional data flow from question-to-SQL forward enhancement and SQL-to-question reverse enhancement. Drawing on the contextual understanding and code generation capabilities of large language models (LLMs), it introduces a four-stage supervision and review mechanism (question diversity expansion, question quality review, automatic SQL generation, and generation quality review) that substantially improves the efficiency and quality of corpus generation under low-resource conditions. Experiments show that models trained on the generated corpus improve execution accuracy by 16.3% over models fine-tuned on traditional human-annotated corpora and by 35.7% over few-shot prompting methods. In terms of generalization and transferability, the generated corpus also adapts better to both model size and question difficulty than a small manually annotated corpus.
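The four review stages named in the abstract can be pictured with a minimal Python sketch of the forward (question-to-SQL) direction only. This is an illustration under stated assumptions, not the paper's implementation: the `LLM` callable, the prompt wording, every function name, and the choice to review generated SQL by executing it against a throwaway SQLite database are all hypothetical and are not taken from the source.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical LLM interface (an assumption): prompt text in, generated text out.
LLM = Callable[[str], str]


def expand_questions(llm: LLM, seed: str, n: int) -> List[str]:
    """Stage 1 (question diversity expansion): ask for paraphrases of a seed."""
    reply = llm(f"Rewrite the question in {n} different ways, one per line:\n{seed}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def review_question(llm: LLM, question: str, schema: str) -> bool:
    """Stage 2 (question quality review): keep questions judged answerable."""
    reply = llm(
        f"Schema:\n{schema}\nIs this question answerable? Reply YES or NO:\n{question}"
    )
    return reply.strip().upper().startswith("YES")


def generate_sql(llm: LLM, question: str, schema: str) -> str:
    """Stage 3 (automatic SQL generation): draft one query for the question."""
    return llm(f"Schema:\n{schema}\nWrite one SQLite query for:\n{question}").strip()


def review_sql(conn: sqlite3.Connection, sql: str) -> bool:
    """Stage 4 (generation quality review), assumed here to be an execution check.

    Only SELECT statements are run, and only against a disposable database.
    """
    if not sql.lstrip().lower().startswith("select"):
        return False
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def build_corpus(
    llm: LLM,
    conn: sqlite3.Connection,
    schema: str,
    seeds: List[str],
    n: int = 3,
) -> List[Tuple[str, str]]:
    """Run the four stages in order and collect surviving (question, SQL) pairs."""
    corpus: List[Tuple[str, str]] = []
    for seed in seeds:
        for question in expand_questions(llm, seed, n):
            if not review_question(llm, question, schema):
                continue
            sql = generate_sql(llm, question, schema)
            if review_sql(conn, sql):
                corpus.append((question, sql))
    return corpus
```

In this sketch a pair enters the corpus only if it passes both review stages, so a weak paraphrase or a query that fails to execute is dropped rather than repaired. The reverse (SQL-to-question) direction would follow the same pattern with the generation and review roles swapped.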
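Cite this article: Huang, H. Text2SQL Training Corpus Generation Based on Bidirectional Enhancement and Multi-Stage Supervision [J]. Computer Science and Application, 2025, 15(7): 1-8. https://doi.org/10.12677/csa.2025.157174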
