Abstract:
General-purpose large language models (LLMs) face several challenges in mineral exploration: scarce domain corpora, insufficient coverage of domain terminology and register, and pronounced factual hallucination. To address these challenges, we constructed a mineral-exploration corpus of approximately 25 million tokens and, on this basis, proposed a curriculum-based continual pre-training strategy that organizes training data into three stages: terminology, mechanisms, and cases. Combined with gradual unfreezing of Transformer blocks and learning-rate scheduling, we continually pre-trained Qwen3-1.7B to achieve stage-wise domain adaptation, yielding a mineral-exploration-oriented LLM, Geo-MineLLM. At inference time, we integrated a Hybrid RAG framework that couples hybrid retrieval with evidence-constrained generation to improve factual consistency. Human evaluation shows that Geo-MineLLM substantially outperforms both the base model and larger-parameter models in the same family on domain question answering, and that with Hybrid RAG enabled its overall domain QA performance approaches that of GPT-4.1. The proposed integrated training–inference framework offers a lightweight pathway for building mineral-exploration LLMs and enabling reliable domain-specific question answering.