llama.cpp

GGUF
Filename extension	.gguf
Developer	Georgi Gerganov and community
Initial release	August 22, 2023
Latest version	v3
Type of format	Machine-learning tensor

llama.cpp
Original author	Georgi Gerganov
Developers	Georgi Gerganov and community
Initial release	March 10, 2023
Repository	github.com/ggml-org/llama.cpp
Written in	C++, C
Type	Library for large language models
License	MIT License

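llama.cpp is an open-source library for running inference on a variety of large language models, such as LLaMA.^[3] The library also includes command-line tools^[4] and a server with a simple web interface.^[5]^[6]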

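Background

In late September 2022, Georgi Gerganov began developing GGML, a C library that implements tensor algebra. Gerganov's goals for the library were strict memory management and multithreading. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.^[7]

Before llama.cpp, Gerganov had developed a similar library, whisper.cpp, which runs Whisper, OpenAI's speech-to-text model.^[8]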

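Development

Gerganov began developing llama.cpp in March 2023 as an implementation of the LLaMA inference code in pure C/C++ with no external dependencies. One goal of the project was to improve performance on computers without a graphics processing unit (GPU) or other dedicated hardware.^[3]^[9]^[10] Because it can run on a CPU alone, including on Android devices, llama.cpp became popular with users who lacked specialized hardware.^[9]^[11] Although it was initially designed for CPUs, GPU inference support was added later.^[12]

In March 2024, Justine Tunney contributed new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types.^[13] Tunney also created llamafile, a tool that bundles a model and llama.cpp into a single file that runs on multiple operating systems by way of the Cosmopolitan Libc library, which Tunney also developed.^[13]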

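Architecture

llama.cpp supports multiple hardware targets, including x86, ARM, CUDA, Metal, Vulkan (version 1.2 or later), and SYCL.^[14]^[15]^[16]^[17] These backends make up the GGML tensor library, which the model-specific code in llama.cpp builds on.^[18] llama.cpp supports quantizing models ahead of time rather than on the fly.^[19] It also uses several CPU instruction set extensions to optimize performance: AVX, AVX2, and AVX-512 on x86-64, and Neon on ARM. Apple silicon is also an important target for the project.^[20]^[21]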

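GGUF file format

GGUF (GGML Universal File)^[24] is a binary file format that stores tensors and metadata together in a single file so that model data can be saved and loaded quickly.^[25] The llama.cpp project adopted the format in August 2023, and it maintains backward compatibility as support for other model architectures is added.^[12]^[26]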
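To make the "tensors and metadata in one file" idea concrete, the following Python sketch reads only the fixed-size header at the start of a GGUF stream. It assumes the little-endian layout used from format version 2 onward (a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key-value counts); the names `GGUFHeader` and `read_gguf_header` are hypothetical and are not part of llama.cpp or its `gguf-py` package. Parsing the metadata entries and tensor descriptors that follow the header is out of scope here.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes, no padding


class GGUFHeader(NamedTuple):
    """Hypothetical container for the fixed-size GGUF header fields."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Read and validate the header at the start of a GGUF stream.

    Assumption: little-endian file, format version 2 or later
    (earlier files stored the counts as 32-bit values).
    """
    raw = stream.read(HEADER_SIZE)
    if len(raw) != HEADER_SIZE:
        raise ValueError("stream is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    if version < 2:
        raise ValueError(f"unsupported GGUF version for this sketch: {version}")
    return GGUFHeader(version, tensor_count, kv_count)


# Demonstration on a synthetic in-memory header (not a real model file).
sample = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 2, 5)
header = read_gguf_header(io.BytesIO(sample))
print(header)
```

With a real model, the same function could be called on a file opened in binary mode, for example `open(path, "rb")`, where `path` points at a `.gguf` file.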

GGUF files are typically created by converting models developed with another machine learning library, such as PyTorch.^[25]

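Design

The format focuses on quantization, the practice of reducing the precision of model weights. Quantization lowers memory usage and increases speed, at the cost of reduced model accuracy.^[27]^[26]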
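The trade-off between size and accuracy can be illustrated with a deliberately simplified Python sketch of block-wise 8-bit quantization: each block of weights shares one floating-point scale, and every weight is stored as a small signed integer. This is not the code llama.cpp or GGML uses; the block size of 32, the helper names, and the sample weights are all assumptions chosen for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size for this illustration


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map floats to signed 8-bit integers that share one scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when every weight in the block is zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quants]


# Made-up weights in roughly [-0.8, 0.8]; real weights come from a model.
weights = [((i * 37) % 64 - 32) / 40.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)

# Rounding to the nearest integer bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

Each weight now occupies 8 bits plus a share of the per-block scale instead of 32 bits, and the reconstruction error stays within half a quantization step. Lower bit widths shrink storage further while widening that step, which is the accuracy cost described above.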

GGUF supports quantized integer types from 2 to 8 bits^[28], common floating-point formats such as float32, float16, and bfloat16, and 1.56-bit quantization.^[4]
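As a rough back-of-the-envelope illustration of what these bit widths mean for storage, the Python sketch below estimates the size of the weights alone at several precisions. The 7-billion-parameter count is a hypothetical example, and the estimate ignores per-block scale factors, metadata, and tensors kept at higher precision, so real GGUF files will differ.

```python
def model_size_gib(n_params: int, bits_per_weight: float) -> float:
    """Estimate weight storage in GiB, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7_000_000_000  # hypothetical model size, for illustration only

for bits in (32, 16, 8, 4, 2, 1.56):
    size = model_size_gib(N_PARAMS, bits)
    print(f"{bits:>5} bits/weight -> {size:6.2f} GiB")
```

The estimate scales linearly with the bit width, so halving the bits per weight halves the approximate footprint.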

References

^[1] Initial release · ggerganov/llama.cpp@26c0846. GitHub. Retrieved 2025-07-12.
^[2] llama.cpp/LICENSE at master · ggerganov/llama.cpp. GitHub.
^[3] Connatser, Matthew. How this open source LLM chatbot runner hit the gas on x86, Arm CPUs. theregister.com. Retrieved 2025-07-12. Archived from the original on 2024-05-10.
^[4] Mann, Tobias. Honey, I shrunk the LLM! A beginner's guide to quantization – and testing it. theregister. 2024-07-14. Retrieved 2025-07-12. Archived from the original on 2025-07-06.
^[5] Alden, Daroc. Portable LLMs with llamafile [LWN.net]. lwn.net. Retrieved 2024-07-30. Archived from the original on 2025-03-06.
^[6] Mann, Tobias. Intro to speculative decoding: Cheat codes for faster LLMs. theregister. 2024-12-15.
^[7] Bringing Whisper and LLaMA to the masses with Georgi Gerganov (Changelog Interviews #532). Changelog. 2023-03-22. Retrieved 2025-07-12. Archived from the original on 2025-07-08.
^[8] ggerganov/whisper.cpp. GitHub. Retrieved 2025-07-12. Archived from the original on 2025-04-03.
^[9] Edwards, Benj. You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi. arstechnica.com. 2023-03-13. Retrieved 2025-07-12. Archived from the original on 2024-01-09.
^[10] Wiest, Isabella Catharina; Ferber, Dyke; Zhu, Jiefu; van Treeck, Marko; Meyer, Sonja K.; Juglan, Radhika; Carrero, Zunamys I.; Paech, Daniel; Kleesiek, Jens; Ebert, Matthias P.; Truhn, Daniel; Kather, Jakob Nikolas. Privacy-preserving large language models for structured medical information retrieval. npj Digital Medicine. 2024, 7 (257): 257. PMC 11415382. PMID 39304709. doi:10.1038/s41746-024-01233-2.
^[11] Democratizing AI with open-source language models. lwn.net. Retrieved 2025-07-12. Archived from the original on 2024-07-28.
^[12] Rajput, Saurabhsingh; Sharma, Tushar. Benchmarking Emerging Deep Learning Quantization Methods for Energy Efficiency. 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C). 2024-06-04: 238–242. ISBN 979-8-3503-6625-9. doi:10.1109/ICSA-C63560.2024.00049.
^[13] Connatser, Matthew. Llamafile LLM driver project boosts performance on CPU cores. www.theregister.com. Retrieved 2024-05-10. Archived from the original on 2024-05-10.
^[14] Gerganov, Georgi; Nguyen, Xuan Son; Slaren. Introduction to ggml. Huggingface. 2024-08-13. Retrieved 2025-07-12. Archived from the original on 2025-06-03.
^[15] Kluska, Piotr; Castelló, Adrián; Scheidegger, Florian; I. Malossi, A. Cristiano; Quintana-Ortí, Enrique. QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers (PDF). Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. June 2024. Retrieved 2025-07-12. Archived (PDF) from the original on 2024-11-10.
^[16] Jianyu, Zhang; Hengyu, Meng; Ying, Hu; Yu, Luo; Xiaoping, Duan; Majumder, Abhilash (Intel Corporation). Run LLMs on Intel GPUs Using llama.cpp. The Parallel Universe. No. 57 (Intel). 2024-07: 34–37.
^[17] Bolz, Jeff. Machine Learning in Vulkan with Cooperative Matrix 2 (PDF). Cambridge, UK: The Khronos Group/Nvidia. 2025-02-11. Retrieved 2025-07-12. Archived (PDF) from the original on 2025-04-17.
^[18] Pounder, Les. How To Create Your Own AI Chatbot Server With Raspberry Pi 4. tomshardware.com. 2023-03-25. Retrieved 2025-07-12. Archived from the original on 2023-08-15.
^[19] Walkowiak, Bartosz; Walkowiak, Tomasz. Implementation of language models within an infrastructure designed for Natural Language Processing (PDF). International Journal of Electronics and Telecommunications. 2024, 70 (1): 153–159. Retrieved 2025-07-12. doi:10.24425/ijet.2024.149525.
^[20] ggerganov/llama.cpp. GitHub.
^[21] Larabel, Michael. Llamafile 0.7 Brings AVX-512 Support: 10x Faster Prompt Eval Times For AMD Zen 4. www.phoronix.com. 2024-03-31. Retrieved 2025-07-12. Archived from the original on 2025-03-13.
^[22] GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp. GitHub.
^[23] ggml/docs/gguf.md at master · ggerganov/ggml. GitHub. Retrieved 2025-07-12. Archived from the original on 2025-01-31.
^[24] ggerganov/llama.cpp/gguf-py/README.md. GitHub. Retrieved 2025-07-12.
^[25] GGUF. huggingface.co. Retrieved 2025-07-12.
^[26] Mucci, Tim. GGUF versus GGML. www.ibm.com. 2024-07-03. Retrieved 2025-07-12. Archived from the original on 2025-06-04.
^[27] Labonne, Maxime. Quantize Llama models with GGUF and llama.cpp. Medium. Towards Data Science. 2023-11-29. Retrieved 2024-05-09. Archived from the original on 2024-05-09.
^[28] Cabezas, Darío; Fonseca-Delgado, Rigoberto; Reyes-Chacón, Iván; Vizcaino-Imacaña, Paulina; Morocho-Cayamcela, Manuel. Integrating a LLaMa-based Chatbot with Augmented Retrieval Generation as a Complementary Educational Tool for High School and College Students. Proceedings of the 19th International Conference on Software Technologies. 2024: 395–402. ISBN 978-989-758-706-1. doi:10.5220/0012763000003753.

