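A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation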

Zhang Yuang; Xie Zhong; Tian Miao; Wu Qirui; Wu Liang; Qiu Qinjun; Chen Jianguo

doi:10.3799/dqkx.2026.032

Volume 51 Issue 3

Mar. 2026

Zhang Yuang, Xie Zhong, Tian Miao, Wu Qirui, Wu Liang, Qiu Qinjun, Chen Jianguo, 2026. A Large Language Model for Mineral Exploration via Multi-Source Continual Pre-Training and Integrated Retrieval-Augmented Generation. Earth Science, 51(3): 1025-1039. doi: 10.3799/dqkx.2026.032

Zhang Yuang^{1}, Xie Zhong^{2, 3, 4}, Tian Miao^{3}, Wu Qirui^{4}, Wu Liang^{2}, Qiu Qinjun^{2}, Chen Jianguo^{3, 5}

1. School of Geography and Information Engineering, China University of Geosciences (Wuhan), Wuhan 430078, China
2. School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430078, China
3. Key Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences (Wuhan), Wuhan 430074, China
4. School of Future Technology, China University of Geosciences (Wuhan), Wuhan 430078, China
5. School of Earth Resources, China University of Geosciences (Wuhan), Wuhan 430074, China

Received Date: 2025-12-30
Publish Date: 2026-03-25

Abstract

General-purpose large language models (LLMs) face three challenges in mineral exploration: scarce domain corpora, insufficient coverage of domain terminology and register, and pronounced factual hallucinations. To address them, we constructed a mineral-exploration corpus of approximately 25 million tokens and, on this basis, proposed a curriculum-based continual pre-training strategy that organizes the training data into three stages: terminology, mechanisms, and cases. Combining this curriculum with gradual unfreezing of Transformer blocks and learning-rate scheduling, we continually pre-trained Qwen3-1.7B to achieve stage-wise domain adaptation, yielding a mineral-exploration-oriented LLM, Geo-MineLLM. At inference time, we integrated a Hybrid RAG framework that uses hybrid retrieval and evidence-constrained generation to improve factual consistency. Human evaluation indicates that Geo-MineLLM substantially improves domain question-answering (QA) performance relative to both the base model and larger-parameter models in the same family. With Hybrid RAG enabled, overall domain QA performance approaches that of GPT-4.1. The proposed framework, which integrates training and inference, provides a lightweight pathway for building mineral-exploration LLMs and for reliable domain-specific question answering.
Keywords: large language models; continual pre-training; retrieval-augmented generation; mineral exploration; artificial intelligence
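
The abstract does not state how the hybrid retrieval step combines its rankers. As a minimal sketch, assuming a lexical ranking and a dense-embedding ranking are merged with reciprocal rank fusion (a common choice, not confirmed here as the authors' method), the fusion could look like the following. The function name, the passage ids, and the constant k=60 are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical passage ids returned by two retrievers for the same query.
lexical_hits = ["p3", "p1", "p7"]
dense_hits = ["p1", "p4", "p3"]
fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
```

Passages that rank well in both lists rise to the top of `fused`, so a generator that is restricted to the top fused passages sees evidence supported by both term overlap and semantic similarity.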
