    A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation

    Zhang Yuang, Xie Zhong, Tian Miao, Wu Qirui, Wu Liang, Qiu Qinjun, Chen Jianguo

    Citation: Zhang Yuang, Xie Zhong, Tian Miao, Wu Qirui, Wu Liang, Qiu Qinjun, Chen Jianguo, 2026. A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation. Earth Science, 51(3): 1025-1039. doi: 10.3799/dqkx.2026.032

    doi: 10.3799/dqkx.2026.032
    Funding:

    National Natural Science Foundation of China, No. 42301492

    National Natural Science Foundation of China, No. 42571487

    National Key Research and Development Program of China, No. 2023YFC2906404

    National Key Research and Development Program of China, No. 2023YFC2906400

    Article information
      About the first author:

      Zhang Yuang (1997-), male, PhD candidate, working on geological knowledge graph construction and domain-specific large language model applications. ORCID: 0009-0000-6213-9081. E-mail: zhangyuang@cug.edu.cn

      Corresponding author:

      Qiu Qinjun. ORCID: 0000-0002-9850-3751. E-mail: qiuqinjun@cug.edu.cn

    • CLC number: P628

    A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation

    • Abstract:

      To address the scarcity of domain corpora, the insufficient coverage of domain terminology and register adaptation, and the prominent factual hallucinations of general-purpose large language models in mineral exploration scenarios, this study builds a domain corpus of about 25 million tokens. On this basis, a curriculum-style continual pre-training (CPT) strategy is proposed that organizes training data into three stages (terminology, mechanisms, and cases) and combines them with progressive unfreezing of Transformer blocks and learning-rate scheduling; Qwen3-1.7B is continually pre-trained in this way to achieve staged domain adaptation, yielding Geo-MineLLM, a large language model oriented to mineral exploration. At inference time, a Hybrid RAG module is integrated, using hybrid retrieval and evidence-constrained generation to improve factual consistency. Human evaluation shows that Geo-MineLLM significantly outperforms both the base model and larger models in the same series on domain question answering, and that with Hybrid RAG integrated, its overall domain QA performance approaches that of GPT-4.1. This integrated training-and-inference scheme provides a lightweight path toward building domain LLMs and reliable question answering for mineral exploration.
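The staged domain adaptation described above pairs each curriculum phase (terms, mechanisms, cases) with progressive unfreezing of Transformer blocks. Below is a minimal PyTorch sketch of that unfreezing pattern only; the toy `nn.Linear` blocks and the 4/8/12-block schedule are illustrative assumptions, not the authors' actual configuration for Qwen3-1.7B.

```python
import torch.nn as nn

def set_trainable_blocks(blocks: nn.ModuleList, n_trainable: int) -> int:
    """Freeze all blocks except the last `n_trainable`; return the number of
    trainable parameters afterwards."""
    for i, block in enumerate(blocks):
        requires_grad = i >= len(blocks) - n_trainable
        for p in block.parameters():
            p.requires_grad = requires_grad
    return sum(p.numel() for b in blocks for p in b.parameters() if p.requires_grad)

# Toy stand-in for a stack of transformer blocks (8*8 weights + 8 biases each).
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])

# Hypothetical schedule: the terminology phase trains the top 4 blocks,
# mechanisms the top 8, and cases all 12.
for phase, n in [("P0-terms", 4), ("P1-mechanisms", 8), ("P2-cases", 12)]:
    print(phase, set_trainable_blocks(blocks, n))
```

In a real CPT run the optimizer would be rebuilt (or its parameter groups refreshed) at each phase boundary so that newly unfrozen blocks receive gradient updates.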

       

    • Fig. 1. An intelligent question-answering framework integrating Hybrid RAG

      Fig. 2. The Prospect-Curriculum CPT method

      Fig. 3. Training and validation loss curves for CPT during phase P0

      Fig. 4. Training and validation loss curves for CPT during phase P1

      Fig. 5. Training and validation loss curves for CPT during phase P2

      Fig. 6. Human evaluation and comparison
      a. basic geology dimension; b. geological and mineral resource exploration dimension; c. deposit type and distribution dimension; d. metallogenic regularity dimension

      Fig. 7. Example of model question answering

      Fig. 8. Human evaluation and comparison
      a. basic geology dimension; b. geological and mineral resource exploration dimension; c. deposit type and distribution dimension; d. metallogenic regularity dimension

      Fig. 9. Example of model question answering

      Table 1. Corpus details

      Corpus source            Dictionary   Literature   Reports
      Files                    8            106          152
      Tokens (×10⁴)            1,314        72           1,135
      Mean tokens per sample   124.36       417.76       413.49
      Samples                  105,676      1,720        27,465
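The counts in Table 1 are internally consistent: samples × mean tokens per sample reproduces each source's token count, and the three sources sum to the roughly 25-million-token corpus stated in the abstract. A quick arithmetic check (the dictionary key names are mine):

```python
# Cross-check Table 1: samples * mean tokens should reproduce the reported
# token counts, which are given in units of 10^4 ("wan") tokens.
sources = {
    "dictionary": {"samples": 105_676, "avg_tokens": 124.36, "tokens_wan": 1_314},
    "literature": {"samples": 1_720,   "avg_tokens": 417.76, "tokens_wan": 72},
    "reports":    {"samples": 27_465,  "avg_tokens": 413.49, "tokens_wan": 1_135},
}
total_wan = 0
for name, s in sources.items():
    est_wan = s["samples"] * s["avg_tokens"] / 1e4
    # Each estimate matches the reported figure to within rounding (<2%).
    assert abs(est_wan - s["tokens_wan"]) / s["tokens_wan"] < 0.02
    total_wan += s["tokens_wan"]
print(total_wan * 1e4)  # 25210000.0 tokens, i.e. ~25 million as in the abstract
```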

      Table 2. Examples of corpus samples

      Data category   Sample example
      Dictionary      Chalcopyrite, chemical composition CuFeS2. It has polymorphic variants; the common tetragonal variant is stable over the 213-550 ℃ temperature range. Its composition often carries mechanical admixtures of silver, gold, thallium, selenium, tellurium, germanium, gallium and indium, and small inclusions of sphalerite, stannite, etc. are also common in chalcopyrite. Crystals are rare and tetragonal-tetrahedral in habit; it mainly occurs as dense massive or granular aggregates. Brass-yellow in color, often with blue to purplish-red iridescent tarnish on the surface. Streak greenish black. Metallic luster. Opaque. Hardness 3-4. Brittle. Specific gravity 4.1-4.3. Chalcopyrite is very widely distributed and can form under various conditions, mainly in copper-nickel sulfide deposits hosted in ultramafic rocks, and also in contact-metasomatic skarn deposits, where it coexists with pyrite, magnetite and pyrrhotite.
      Literature      Previous studies have shown that, among all fluid types, magmatic fluids have the strongest capacity for transporting gold, yet magmatic crystallization alone cannot form a gold deposit: most of the Au in a magma is extracted into the fluid phase during crystallization differentiation and degassing, and these processes generate the driving force that moves ore-forming fluids along conduits, a drive that is essential for gold deposit formation. Magmatic rocks are widely developed in the Matoutan mining area, and the Sr, Nd, Pb, S and Si isotopes of ore-hosting wall rocks and ores in adjacent mining areas show inherited similarity, indicating that magmatism provided the precondition for Au enrichment and transport (Liu et al., 2014).
      Reports         The East Tianshan metallogenic belt comprises 8 third-order tectonic-metallogenic units with an area of about 18×10⁴ km². Sixty-three copper occurrences have been discovered in the area, including 1 deposit with super-large resource potential (Yandong), 1 large deposit (Tuwu), 2 medium deposits (Huangshan and Huangshandong), and 11 small deposits. Proven copper reserves total 145.08×10⁴ t, accounting for 29.30% of known copper reserves in Xinjiang and 55.22% of those in the Tianshan.

      Table 3. Common hyperparameter settings for the three-stage CPT

      Hyperparameter          Value
      Compute type            fp16
      Cutoff length           2 048
      Batch size              8
      Gradient accumulation   2
      LR schedule             cosine
      Pack sequences          true
      Enable thinking         true
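Table 3 fixes the schedule shape (cosine) and implies an effective batch size of 8 × 2 = 16 via gradient accumulation, but it does not report the peak learning rate or step counts; the values below are illustrative assumptions only. A minimal sketch of cosine learning-rate decay:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Effective batch size implied by Table 3: per-device batch 8 x accumulation 2.
effective_batch = 8 * 2

# peak_lr = 1e-4 and total_steps = 1000 are assumed, not from the paper.
print(effective_batch)              # 16
print(cosine_lr(0, 1000, 1e-4))     # 0.0001 (peak at the start)
print(cosine_lr(1000, 1000, 1e-4))  # 0.0 (fully decayed)
```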

      Table 4. Time performance of different models

      Model                      Response time (s)
      DeepSeek-V3                15.86
      GPT-4.1                    14.31
      Geo-MineLLM                5.39
      Qwen3-1.7B                 7.54
      Qwen3-4B                   22.95
      Geo-MineLLM + Hybrid RAG   8.22
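The "Geo-MineLLM + Hybrid RAG" row adds about 2.8 s over the bare model, the cost of retrieval and fusion. The reference list cites BM25 (Robertson and Zaragoza, 2009), dense passage retrieval (Karpukhin et al., 2020) and reciprocal rank fusion (Cormack et al., 2009); the sketch below shows RRF merging one sparse and one dense ranking. The document IDs are made up, and k = 60 is the conventional RRF constant, not necessarily the paper's setting.

```python
# Reciprocal rank fusion: each document scores sum(1 / (k + rank)) over the
# ranked lists it appears in, so items ranked highly by either retriever rise.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_porphyry", "doc_skarn", "doc_vms"]          # sparse retriever
dense_ranking = ["doc_skarn", "doc_epithermal", "doc_porphyry"]  # dense retriever
print(rrf([bm25_ranking, dense_ranking]))
# -> ['doc_skarn', 'doc_porphyry', 'doc_epithermal', 'doc_vms']
```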
    • Bengio, Y., Louradour, J., Collobert, R., et al., 2009. Curriculum Learning. The 26th Annual International Conference on Machine Learning. Montreal. https://doi.org/10.1145/1553374.1553380
      Cheng, Q. M., 2025. A New Paradigm for Mineral Resource Prediction Based on Human Intelligence-Artificial Intelligence Integration. Earth Science Frontiers, 32(4): 1-19 (in Chinese with English abstract).
      Cormack, G. V., Clarke, C. L. A., Buettcher, S., 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. The 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Boston. https://doi.org/10.1145/1571941.1572114
      Deng, C., Zhang, T. H., He, Z. M., et al., 2024. K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. The 17th ACM International Conference on Web Search and Data Mining. Merida. https://doi.org/10.1145/3616855.3635772
      Farquhar, S., Kossen, J., Kuhn, L., et al., 2024. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630(8017): 625-630. https://doi.org/10.1038/s41586-024-07421-0
      Fu, Y., Wang, M. G., Wang, C. B., et al., 2025. GeoMinLM: A Large Language Model in Geology and Mineral Survey in Yunnan Province. Ore Geology Reviews, 182: 106638. https://doi.org/10.1016/j.oregeorev.2025.106638
      Gupta, K., Thérien, B., Ibrahim, A., et al., 2023. Continual Pre-Training of Large Language Models: How to (Re) Warm Your Model? ICML2023, Hawaii. https://doi.org/10.48550/arXiv.2308.04014
      Gururangan, S., Marasović, A., Swayamdipta, S., et al., 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. The 58th Annual Meeting of the Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.acl-main.740
      He, H., Ma, C., Ye, S., et al., 2024. Low Resource Chinese Geological Text Named Entity Recognition Based on Prompt Learning. Journal of Earth Science, 35(3): 1035-1043. https://doi.org/10.1007/s12583-023-1944-8
      Hou, X. Y., Zhao, Y. J., Liu, Y., et al., 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology, 33(8): 1-79. https://doi.org/10.1145/3695988
      Howard, J., Ruder, S., 2018. Universal Language Model Fine-Tuning for Text Classification. The 56th Annual Meeting of the Association for Computational Linguistics. Melbourne. https://doi.org/10.18653/v1/p18-1031
      Jawahar, G., Sagot, B., Seddah, D., 2019. What Does BERT Learn about the Structure of Language? The 57th Annual Meeting of the Association for Computational Linguistics, Florence. https://doi.org/10.18653/v1/P19-1356
      Ji, Z. W., Lee, N., Frieske, R., et al., 2023. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12): 1-38. https://doi.org/10.1145/3571730
      Karpukhin, V., Oguz, B., Min, S., et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering. The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. https://doi.org/10.18653/v1/2020.emnlp-main.550
      Lachowycz, S., 2024. Utility of Artificial Intelligence in Geoscience. Nature Geoscience, 17(10): 953-955. https://doi.org/10.1038/s41561-024-01548-5
      Lewis, P., Perez, E., Piktus, A., et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv, 2005.11401. https://arxiv.org/abs/2005.11401
      Liu, C. P., Yang, H. M., Duan, R. C., et al., 2014. Metallogenic Age of the Matoutan Gold Deposit in East Tianshan and Its Geological Significance. Geological Bulletin of China, 33(6): 912-923 (in Chinese with English abstract).
      Qiu, Q. J., Tian, M., Xie, Z., et al., 2023a. Extracting Named Entity Using Entity Labeling in Geological Text Using Deep Learning Approach. Journal of Earth Science, 34(5): 1406-1417. https://doi.org/10.1007/s12583-022-1789-8
      Qiu, Q. J., Wang, B., Ma, K., et al., 2023b. A Practical Approach to Constructing a Geological Knowledge Graph: A Case Study of Mineral Exploration Data. Journal of Earth Science, 34(5): 1374-1389. https://doi.org/10.1007/s12583-023-1809-3
      Qiu, Q. J., Wu, L., Ma, K., et al., 2023c. A Knowledge Graph Construction Method for Geohazard Chain for Disaster Emergency Response. Earth Science, 48(5): 1875-1891 (in Chinese with English abstract). doi: 10.3799/dqkx.2022.313
      Raffel, C., Shazeer, N., Roberts, A., et al., 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1-67.
      Robertson, S., Zaragoza, H., 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4): 333-389. https://doi.org/10.1561/1500000019
      Shi, L. Y., Zuo, R. G., 2026. Foundation Model for Mineral Prospectivity Mapping. Earth Science, 53(3): 832-848 (in Chinese with English abstract).
      Wu, G., Wang, H. T., Zhang, K. Y., et al., 2025. GeoProspect: A Domain-Specific Geological Large Language Model with Enhanced Continual Learning. Neurocomputing, 650: 130801. https://doi.org/10.1016/j.neucom.2025.130801
      Wu, S. J., Irsoy, O., Lu, S., et al., 2023. BloombergGPT: A Large Language Model for Finance. arXiv, 2303.17564. https://arxiv.org/abs/2303.17564
      Yang, X., Chen, A. K., PourNejatian, N., et al., 2022. A Large Language Model for Electronic Health Records. NPJ Digital Medicine, 5: 194. https://doi.org/10.1038/s41746-022-00742-2
      Zhang, B. Y., Tang, J. C., Zhang, T. Y., et al., 2026. Knowledge Graph and Question-Answering Model for Geological Prospecting Empowered by Large Language Models. Earth Science, 53(3): 982-995 (in Chinese with English abstract). doi: 10.3799/dqkx.2025.176
      Zhang, K. P., Ma, L., Cui, B. B., et al., 2024a. Visual Large Language Model for Wheat Disease Diagnosis in the Wild. Computers and Electronics in Agriculture, 227: 109587. https://doi.org/10.1016/j.compag.2024.109587
      Zhang, Y. F., Wei, C., He, Z. T., et al., 2024b. GeoGPT: An Assistant for Understanding and Processing Geospatial Tasks. International Journal of Applied Earth Observation and Geoinformation, 131: 103976. https://doi.org/10.1016/j.jag.2024.103976
      Zhou, B., Li, K., 2025. Fusing Geoscience Large Language Models and Lightweight RAG for Enhanced Geological Question Answering. Geosciences, 15(10): 382. https://doi.org/10.3390/geosciences15100382
      Zuo, R. G., Cheng, Q. M., Xu, Y., et al., 2024. Explainable Artificial Intelligence Models for Mineral Prospectivity Mapping. Scientia Sinica (Terrae), 54(9): 2917-2928 (in Chinese with English abstract). doi: 10.1360/N072024-0018
    Publication history
    • Received: 2025-12-30
    • Published: 2026-03-25
