Institute of Systems Engineering, Academy of Military Sciences
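As breakthroughs in key generative artificial intelligence (AI-Generated Content, AIGC) technologies drive the application of Multimodal Large Language Models (MLLMs) in the military vertical domain, the evaluation systems and evaluation indicators for such models remain incomplete. To address this problem, the paper combines top-down forward design with bottom-up aggregated evaluation to construct an evaluation system framework for military large models. The framework comprises "four domains" (intelligent military requirements, intelligent scenario tasks, system performance evaluation, and system-of-systems effectiveness evaluation) and "three dimensions" (basic support services, the algorithm indicator system, and comprehensive security protection). The paper proposes the main dimensions, key indicators, and basic process for evaluating large models, and gives a corresponding evaluation indicator system that combines qualitative and quantitative methods, providing evaluation support for military large models in empowering equipment systems of systems and operational effectiveness.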
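The bottom-up aggregated evaluation mentioned in the abstract can be pictured as a weighted roll-up over an indicator tree. The sketch below is only an illustration under stated assumptions: the abstract does not give the aggregation rule, so a normalized weighted average is assumed here, and every leaf indicator name, weight, and score is hypothetical. Only the three top-level dimension names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Indicator:
    """Node in a hypothetical indicator tree; leaves carry a score in [0, 1]."""
    name: str
    weight: float = 1.0
    score: Optional[float] = None              # set on leaf indicators only
    children: List["Indicator"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Roll scores up from the leaves as a weight-normalized average.

        The weighted average is an assumption; the abstract does not state
        which aggregation rule the authors use.
        """
        if not self.children:
            if self.score is None:
                raise ValueError(f"leaf indicator {self.name!r} has no score")
            return self.score
        total_weight = sum(child.weight for child in self.children)
        weighted_sum = sum(child.weight * child.aggregate() for child in self.children)
        return weighted_sum / total_weight


# The three dimensions are named in the abstract; all leaves, weights,
# and scores below are made-up placeholders for illustration.
framework = Indicator("military large model", children=[
    Indicator("basic support services", weight=0.3, children=[
        Indicator("compute availability", score=0.8),
        Indicator("data quality", score=0.6),
    ]),
    Indicator("algorithm indicator system", weight=0.4, children=[
        Indicator("task accuracy", weight=2.0, score=0.9),
        Indicator("robustness", weight=1.0, score=0.6),
    ]),
    Indicator("comprehensive security protection", weight=0.3, children=[
        Indicator("adversarial resistance", score=0.5),
    ]),
])

overall = framework.aggregate()
print(f"overall score: {overall:.2f}")
```

Qualitative indicators would first need to be mapped onto the same [0, 1] scale (for example through expert scoring) before they can be rolled up this way; that mapping is outside this sketch.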
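Basic information:
DOI: 10.16358/j.issn.1009-1300.20240112
Chinese Library Classification (CLC) number: E91
Citation:
[1] Zhang Long, Wang Shu, Lei Zhen, et al. Research on an evaluation system framework for AIGC military large models[J]. Tactical Missile Technology, 2025, No. 229(01): 42-52. DOI: 10.16358/j.issn.1009-1300.20240112.
Funding information: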