    A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation

    Zhang Yuang, Xie Zhong, Tian Miao, Wu Qirui, Wu Liang, Qiu Qinjun, Chen Jianguo

    Citation: Zhang Yuang, Xie Zhong, Tian Miao, Wu Qirui, Wu Liang, Qiu Qinjun, Chen Jianguo, 2026. A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation. Earth Science, 51(3): 1025-1039. doi: 10.3799/dqkx.2026.032

    doi: 10.3799/dqkx.2026.032
    Funding:

    National Natural Science Foundation of China, No. 42301492

    National Natural Science Foundation of China, No. 42571487

    National Key Research and Development Program of China, No. 2023YFC2906404

    National Key Research and Development Program of China, No. 2023YFC2906400

    Article information
      About the first author:

      Zhang Yuang (1997-), male, PhD candidate, working on geological knowledge graph construction and domain-specific large language model applications. ORCID: 0009-0000-6213-9081. E-mail: zhangyuang@cug.edu.cn

      Corresponding author:

      Qiu Qinjun. ORCID: 0000-0002-9850-3751. E-mail: qiuqinjun@cug.edu.cn

    • CLC number: P628

    A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation

    • Abstract:

      To address the scarcity of domain corpora, the insufficient coverage of domain terminology and register adaptation, and the prominent factual hallucinations of general-purpose large language models in mineral exploration scenarios, this study builds a domain corpus of about 25 million tokens. On this basis, a curriculum-style continual pre-training (CPT) strategy is proposed that organizes training data into three stages (terminology, mechanisms, and cases) and combines them with progressive unfreezing of Transformer blocks and learning-rate scheduling; Qwen3-1.7B is continually pre-trained in this way to achieve staged domain adaptation, yielding Geo-MineLLM, a large language model oriented to mineral exploration. At inference time, a Hybrid RAG module is integrated, using hybrid retrieval and evidence-constrained generation to improve factual consistency. Human evaluation shows that Geo-MineLLM significantly outperforms both the base model and larger models in the same series on domain question answering, and that with Hybrid RAG integrated, its overall domain QA performance approaches that of GPT-4.1. This integrated training-and-inference scheme provides a lightweight path toward building domain LLMs and reliable question answering for mineral exploration.
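The staged domain adaptation described above pairs each curriculum phase (terms, mechanisms, cases) with progressive unfreezing of Transformer blocks. Below is a minimal PyTorch sketch of that unfreezing pattern only; the toy `nn.Linear` blocks and the 4/8/12-block schedule are illustrative assumptions, not the authors' actual configuration for Qwen3-1.7B.

```python
import torch.nn as nn

def set_trainable_blocks(blocks: nn.ModuleList, n_trainable: int) -> int:
    """Freeze all blocks except the last `n_trainable`; return the number of
    trainable parameters afterwards."""
    for i, block in enumerate(blocks):
        requires_grad = i >= len(blocks) - n_trainable
        for p in block.parameters():
            p.requires_grad = requires_grad
    return sum(p.numel() for b in blocks for p in b.parameters() if p.requires_grad)

# Toy stand-in for a stack of transformer blocks (8*8 weights + 8 biases each).
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])

# Hypothetical schedule: the terminology phase trains the top 4 blocks,
# mechanisms the top 8, and cases all 12.
for phase, n in [("P0-terms", 4), ("P1-mechanisms", 8), ("P2-cases", 12)]:
    print(phase, set_trainable_blocks(blocks, n))
```

In a real CPT run the optimizer would be rebuilt (or its parameter groups refreshed) at each phase boundary so that newly unfrozen blocks receive gradient updates.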

       

    • Fig. 1. An intelligent question-answering framework integrating Hybrid RAG

      Fig. 2. The Prospect-Curriculum CPT method

      Fig. 3. Training and validation loss curves for CPT during phase P0

      Fig. 4. Training and validation loss curves for CPT during phase P1

      Fig. 5. Training and validation loss curves for CPT during phase P2

      Fig. 6. Human evaluation and comparison
      a. basic geology dimension; b. geological and mineral resource exploration dimension; c. deposit type and distribution dimension; d. metallogenic regularity dimension

      Fig. 7. Example of model question answering

      Fig. 8. Human evaluation and comparison
      a. basic geology dimension; b. geological and mineral resource exploration dimension; c. deposit type and distribution dimension; d. metallogenic regularity dimension

      Fig. 9. Example of model question answering

      Table 1. Corpus details

      Corpus source            Dictionary   Literature   Reports
      Files                    8            106          152
      Tokens (×10⁴)            1,314        72           1,135
      Mean tokens per sample   124.36       417.76       413.49
      Samples                  105,676      1,720        27,465
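The counts in Table 1 are internally consistent: samples × mean tokens per sample reproduces each source's token count, and the three sources sum to the roughly 25-million-token corpus stated in the abstract. A quick arithmetic check (the dictionary key names are mine):

```python
# Cross-check Table 1: samples * mean tokens should reproduce the reported
# token counts, which are given in units of 10^4 ("wan") tokens.
sources = {
    "dictionary": {"samples": 105_676, "avg_tokens": 124.36, "tokens_wan": 1_314},
    "literature": {"samples": 1_720,   "avg_tokens": 417.76, "tokens_wan": 72},
    "reports":    {"samples": 27_465,  "avg_tokens": 413.49, "tokens_wan": 1_135},
}
total_wan = 0
for name, s in sources.items():
    est_wan = s["samples"] * s["avg_tokens"] / 1e4
    # Each estimate matches the reported figure to within rounding (<2%).
    assert abs(est_wan - s["tokens_wan"]) / s["tokens_wan"] < 0.02
    total_wan += s["tokens_wan"]
print(total_wan * 1e4)  # 25210000.0 tokens, i.e. ~25 million as in the abstract
```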

      Table 2. Examples of corpus samples

      Data category   Sample example
      Dictionary      Chalcopyrite, chemical composition CuFeS2. It has polymorphic variants; the common tetragonal variant is stable over the 213-550 ℃ temperature range. Its composition often carries mechanical admixtures of silver, gold, thallium, selenium, tellurium, germanium, gallium and indium, and small inclusions of sphalerite, stannite, etc. are also common in chalcopyrite. Crystals are rare and tetragonal-tetrahedral in habit; it mainly occurs as dense massive or granular aggregates. Brass-yellow in color, often with blue to purplish-red iridescent tarnish on the surface. Streak greenish black. Metallic luster. Opaque. Hardness 3-4. Brittle. Specific gravity 4.1-4.3. Chalcopyrite is very widely distributed and can form under various conditions, mainly in copper-nickel sulfide deposits hosted in ultramafic rocks, and also in contact-metasomatic skarn deposits, where it coexists with pyrite, magnetite and pyrrhotite.
      Literature      Previous studies have shown that, among all fluid types, magmatic fluids have the strongest capacity for transporting gold, yet magmatic crystallization alone cannot form a gold deposit: most of the Au in a magma is extracted into the fluid phase during crystallization differentiation and degassing, and these processes generate the driving force that moves ore-forming fluids along conduits, a drive that is essential for gold deposit formation. Magmatic rocks are widely developed in the Matoutan mining area, and the Sr, Nd, Pb, S and Si isotopes of ore-hosting wall rocks and ores in adjacent mining areas show inherited similarity, indicating that magmatism provided the precondition for Au enrichment and transport (Liu et al., 2014).
      Reports         The East Tianshan metallogenic belt comprises 8 third-order tectonic-metallogenic units with an area of about 18×10⁴ km². Sixty-three copper occurrences have been discovered in the area, including 1 deposit with super-large resource potential (Yandong), 1 large deposit (Tuwu), 2 medium deposits (Huangshan and Huangshandong), and 11 small deposits. Proven copper reserves total 145.08×10⁴ t, accounting for 29.30% of known copper reserves in Xinjiang and 55.22% of those in the Tianshan.

      Table 3. Common hyperparameter settings for the three-stage CPT

      Hyperparameter          Value
      Compute type            fp16
      Cutoff length           2 048
      Batch size              8
      Gradient accumulation   2
      LR schedule             cosine
      Pack sequences          true
      Enable thinking         true
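Table 3 fixes the schedule shape (cosine) and implies an effective batch size of 8 × 2 = 16 via gradient accumulation, but it does not report the peak learning rate or step counts; the values below are illustrative assumptions only. A minimal sketch of cosine learning-rate decay:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Effective batch size implied by Table 3: per-device batch 8 x accumulation 2.
effective_batch = 8 * 2

# peak_lr = 1e-4 and total_steps = 1000 are assumed, not from the paper.
print(effective_batch)              # 16
print(cosine_lr(0, 1000, 1e-4))     # 0.0001 (peak at the start)
print(cosine_lr(1000, 1000, 1e-4))  # 0.0 (fully decayed)
```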

      Table 4. Time performance of different models

      Model                      Response time (s)
      DeepSeek-V3                15.86
      GPT-4.1                    14.31
      Geo-MineLLM                5.39
      Qwen3-1.7B                 7.54
      Qwen3-4B                   22.95
      Geo-MineLLM + Hybrid RAG   8.22
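The "Geo-MineLLM + Hybrid RAG" row adds about 2.8 s over the bare model, the cost of retrieval and fusion. The reference list cites BM25 (Robertson and Zaragoza, 2009), dense passage retrieval (Karpukhin et al., 2020) and reciprocal rank fusion (Cormack et al., 2009); the sketch below shows RRF merging one sparse and one dense ranking. The document IDs are made up, and k = 60 is the conventional RRF constant, not necessarily the paper's setting.

```python
# Reciprocal rank fusion: each document scores sum(1 / (k + rank)) over the
# ranked lists it appears in, so items ranked highly by either retriever rise.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_porphyry", "doc_skarn", "doc_vms"]          # sparse retriever
dense_ranking = ["doc_skarn", "doc_epithermal", "doc_porphyry"]  # dense retriever
print(rrf([bm25_ranking, dense_ranking]))
# -> ['doc_skarn', 'doc_porphyry', 'doc_epithermal', 'doc_vms']
```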
    • Bengio, Y., Louradour, J., Collobert, R., et al., 2009. Curriculum Learning. The 26th Annual International Conference on Machine Learning. Montreal. https://doi.org/10.1145/1553374.1553380
      Cheng, Q. M., 2025. A New Paradigm for Mineral Resource Prediction Based on Human Intelligence-Artificial Intelligence Integration. Earth Science Frontiers, 32(4): 1-19 (in Chinese with English abstract).
      Cormack, G. V., Clarke, C. L. A., Buettcher, S., 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. The 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Boston. https://doi.org/10.1145/1571941.1572114
      Deng, C., Zhang, T. H., He, Z. M., et al., 2024. K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. The 17th ACM International Conference on Web Search and Data Mining. Merida. https://doi.org/10.1145/3616855.3635772
      Farquhar, S., Kossen, J., Kuhn, L., et al., 2024. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630(8017): 625-630. https://doi.org/10.1038/s41586-024-07421-0
      Fu, Y., Wang, M. G., Wang, C. B., et al., 2025. GeoMinLM: A Large Language Model in Geology and Mineral Survey in Yunnan Province. Ore Geology Reviews, 182: 106638. https://doi.org/10.1016/j.oregeorev.2025.106638
      Gupta, K., Thérien, B., Ibrahim, A., et al., 2023. Continual Pre-Training of Large Language Models: How to (Re) Warm Your Model? ICML2023, Hawaii. https://doi.org/10.48550/arXiv.2308.04014
      Gururangan, S., Marasović, A., Swayamdipta, S., et al., 2020. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. The 58th Annual Meeting of the Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.acl-main.740
      He, H., Ma, C., Ye, S., et al., 2024. Low Resource Chinese Geological Text Named Entity Recognition Based on Prompt Learning. Journal of Earth Science, 35(3): 1035-1043. https://doi.org/10.1007/s12583-023-1944-8
      Hou, X. Y., Zhao, Y. J., Liu, Y., et al., 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology, 33(8): 1-79. https://doi.org/10.1145/3695988
      Howard, J., Ruder, S., 2018. Universal Language Model Fine-Tuning for Text Classification. The 56th Annual Meeting of the Association for Computational Linguistics. Melbourne. https://doi.org/10.18653/v1/p18-1031
      Jawahar, G., Sagot, B., Seddah, D., 2019. What Does BERT Learn about the Structure of Language? The 57th Annual Meeting of the Association for Computational Linguistics, Florence. https://doi.org/10.18653/v1/P19-1356
      Ji, Z. W., Lee, N., Frieske, R., et al., 2023. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12): 1-38. https://doi.org/10.1145/3571730
      Karpukhin, V., Oguz, B., Min, S., et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering. The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. https://doi.org/10.18653/v1/2020.emnlp-main.550
      Lachowycz, S., 2024. Utility of Artificial Intelligence in Geoscience. Nature Geoscience, 17(10): 953-955. https://doi.org/10.1038/s41561-024-01548-5
      Lewis, P., Perez, E., Piktus, A., et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv, 2005.11401. https://arxiv.org/abs/2005.11401
      Liu, C. P., Yang, H. M., Duan, R. C., et al., 2014. Metallogenic Age of the Matoutan Gold Deposit in East Tianshan and Its Geological Significance. Geological Bulletin of China, 33(6): 912-923 (in Chinese with English abstract).
      Qiu, Q. J., Tian, M., Xie, Z., et al., 2023a. Extracting Named Entity Using Entity Labeling in Geological Text Using Deep Learning Approach. Journal of Earth Science, 34(5): 1406-1417. https://doi.org/10.1007/s12583-022-1789-8
      Qiu, Q. J., Wang, B., Ma, K., et al., 2023b. A Practical Approach to Constructing a Geological Knowledge Graph: A Case Study of Mineral Exploration Data. Journal of Earth Science, 34(5): 1374-1389. https://doi.org/10.1007/s12583-023-1809-3
      Qiu, Q. J., Wu, L., Ma, K., et al., 2023c. A Knowledge Graph Construction Method for Geohazard Chain for Disaster Emergency Response. Earth Science, 48(5): 1875-1891 (in Chinese with English abstract). doi: 10.3799/dqkx.2022.313
      Raffel, C., Shazeer, N., Roberts, A., et al., 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1-67.
      Robertson, S., Zaragoza, H., 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval, 3(4): 333-389. https://doi.org/10.1561/1500000019
      Shi, L. Y., Zuo, R. G., 2026. Foundation Model for Mineral Prospectivity Mapping. Earth Science, 53(3): 832-848 (in Chinese with English abstract).
      Wu, G., Wang, H. T., Zhang, K. Y., et al., 2025. GeoProspect: A Domain-Specific Geological Large Language Model with Enhanced Continual Learning. Neurocomputing, 650: 130801. https://doi.org/10.1016/j.neucom.2025.130801
      Wu, S. J., Irsoy, O., Lu, S., et al., 2023. BloombergGPT: A Large Language Model for Finance. arXiv, 2303.17564. https://arxiv.org/abs/2303.17564
      Yang, X., Chen, A. K., PourNejatian, N., et al., 2022. A Large Language Model for Electronic Health Records. NPJ Digital Medicine, 5: 194. https://doi.org/10.1038/s41746-022-00742-2
      Zhang, B. Y., Tang, J. C., Zhang, T. Y., et al., 2026. Knowledge Graph and Question-Answering Model for Geological Prospecting Empowered by Large Language Models. Earth Science, 53(3): 982-995 (in Chinese with English abstract). doi: 10.3799/dqkx.2025.176
      Zhang, K. P., Ma, L., Cui, B. B., et al., 2024a. Visual Large Language Model for Wheat Disease Diagnosis in the Wild. Computers and Electronics in Agriculture, 227: 109587. https://doi.org/10.1016/j.compag.2024.109587
      Zhang, Y. F., Wei, C., He, Z. T., et al., 2024b. GeoGPT: An Assistant for Understanding and Processing Geospatial Tasks. International Journal of Applied Earth Observation and Geoinformation, 131: 103976. https://doi.org/10.1016/j.jag.2024.103976
      Zhou, B., Li, K., 2025. Fusing Geoscience Large Language Models and Lightweight RAG for Enhanced Geological Question Answering. Geosciences, 15(10): 382. https://doi.org/10.3390/geosciences15100382
      Zuo, R. G., Cheng, Q. M., Xu, Y., et al., 2024. Explainable Artificial Intelligence Models for Mineral Prospectivity Mapping. Scientia Sinica (Terrae), 54(9): 2917-2928 (in Chinese with English abstract). doi: 10.1360/N072024-0018
    Publication history
    • Received: 2025-12-30
    • Published: 2026-03-25
