Important
bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
💫 Intel® LLM library for PyTorch*#
IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex or Max) with very low latency [1].
Note
- It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
- It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
- 50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
Latest update 🔥#
- [2024/05] ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- [2024/04] You can now run Open WebUI on Intel GPU using ipex-llm; see the quickstart here.
- [2024/04] You can now run Llama 3 on Intel GPU using llama.cpp and ollama; see the quickstart here.
- [2024/04] ipex-llm now supports Llama 3 on Intel GPU and CPU.
- [2024/04] ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
- [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
- [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
- [2024/02] Users can now use ipex-llm through the Text-Generation-WebUI GUI.
- [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
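As a rough check on the INT2 item above, weight memory can be estimated as parameter count times bits per weight. The sketch below is only back-of-the-envelope arithmetic: the parameter count for Mixtral-8x7B is an approximate figure assumed for illustration, the bit widths are nominal (real quantization formats add some overhead for scales), and the KV cache and activations are ignored.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Assumption: approximate total parameter count of Mixtral-8x7B.
MIXTRAL_8X7B_PARAMS = 46.7e9

# Nominal bit widths; actual low-bit formats use slightly more bits per weight.
for label, bits in [("FP16", 16), ("INT4", 4), ("INT2", 2)]:
    size = weight_memory_gib(MIXTRAL_8X7B_PARAMS, bits)
    print(f"{label}: ~{size:.1f} GiB of weights")
```

Under these assumptions the weights come to roughly 87 GiB at FP16, 22 GiB at INT4 and 11 GiB at INT2, so only the 2-bit case leaves headroom on a 16GB card.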
More updates
- [2023/12] ipex-llm now supports ReLoRA (see “ReLoRA: High-Rank Training Through Low-Rank Updates”).
- [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] ipex-llm now supports QA-LoRA (see “QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models”).
- [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
- [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] ipex-llm tutorial is released.
ipex-llm Performance#
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
[Figures: token generation speed on Intel Core Ultra and on Intel Arc GPU]
You may follow the guide to run the ipex-llm performance benchmark yourself.
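For a quick, informal sense of what a token generation speed figure measures, the sketch below times a generation callable and reports tokens per second. It is not the ipex-llm benchmark tool: `tokens_per_second` and `fake_generate` are hypothetical names introduced here, and `fake_generate` merely sleeps in place of a real model call.

```python
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int],
                      warmup: int = 1, runs: int = 3) -> float:
    """Average generated tokens per second over `runs` timed calls.

    `generate_fn` is a hypothetical zero-argument callable that performs one
    generation and returns the number of newly generated tokens.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up calls are excluded from the timing
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate() -> int:
    """Stand-in for a real model call: pretends to emit 32 tokens."""
    time.sleep(0.01)
    return 32


print(f"{tokens_per_second(fake_generate):.0f} tokens/s (simulated)")
```

With a real model, the callable would wrap a generation call and return the count of new tokens. On a GPU, generation may run asynchronously, so the callable should wait for the device to finish before returning.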
ipex-llm Demos#
See demos of running local LLMs on Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Iris iGPU | Intel Core Ultra iGPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
|---|---|---|---|
| llama.cpp (Phi-3-mini Q4_0) | Ollama (Mistral-7B Q4_K) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
ipex-llm Quickstart#
Docker#
- GPU Inference in C++: running llama.cpp, ollama, OpenWebUI, etc., with ipex-llm on Intel GPU
- GPU Inference in Python: running HuggingFace transformers, LangChain, LlamaIndex, ModelScope, etc., with ipex-llm on Intel GPU
- vLLM on GPU: running vLLM serving with ipex-llm on Intel GPU
- FastChat on GPU: running FastChat serving with ipex-llm on Intel GPU
Run#
- llama.cpp: running llama.cpp (using the C++ interface of ipex-llm as an accelerated backend for llama.cpp) on Intel GPU
- ollama: running ollama (using the C++ interface of ipex-llm as an accelerated backend for ollama) on Intel GPU
- FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
- LangChain-Chatchat RAG: running ipex-llm in LangChain-Chatchat (Knowledge Base QA using RAG pipeline)
- Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
- Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU
Install#
- Windows GPU: installing ipex-llm on Windows with Intel GPU
- Linux GPU: installing ipex-llm on Linux with Intel GPU
See also
For more details, please refer to the installation guide.
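After following one of the install guides, a quick way to confirm that the packages are visible to the current Python environment is to look them up without importing them. This is a minimal sketch: `is_installed` is a hypothetical helper, and the module names checked are assumed to be the import names of PyTorch and ipex-llm.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: "torch" and "ipex_llm" are the import names to look for.
for name in ("torch", "ipex_llm"):
    status = "found" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

A "not found" result usually means the install ran in a different environment than the one currently active.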
Code Examples#
- Low bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP4 inference: FP8 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Save and load
  - Low-bit models: saving and loading ipex-llm low-bit models
  - GGUF: directly loading GGUF models into ipex-llm
  - AWQ: directly loading AWQ models into ipex-llm
  - GPTQ: directly loading GPTQ models into ipex-llm
- Finetuning
- Integration with community libraries
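To give a feel for the low-bit inference and save/load items listed above, here is a minimal sketch built around the transformers-style API in `ipex_llm.transformers`. The helper names (`low_bit_kwargs`, `load_low_bit_model`, `save_and_reload`) are hypothetical, and the format strings in `LOW_BIT_FORMATS` are assumed to match the installed ipex-llm version. The linked examples remain the authoritative reference.

```python
# Precision labels from the list above mapped to `load_in_low_bit` strings.
# Assumption: these strings are accepted by the installed ipex-llm version.
LOW_BIT_FORMATS = {
    "INT4": "sym_int4",
    "INT8": "sym_int8",
    "FP8": "fp8",
    "FP4": "fp4",
}


def low_bit_kwargs(precision: str) -> dict:
    """Build the keyword arguments that select a low-bit format."""
    try:
        fmt = LOW_BIT_FORMATS[precision.upper()]
    except KeyError:
        raise ValueError(f"unsupported precision: {precision!r}") from None
    return {"load_in_low_bit": fmt}


def load_low_bit_model(model_path: str, precision: str = "INT4",
                       device: str = "cpu"):
    """Load a HuggingFace model with low-bit weights and move it to `device`."""
    # Imported lazily so the helpers above work without ipex-llm installed.
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path, **low_bit_kwargs(precision))
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model.to(device), tokenizer  # device "xpu" targets an Intel GPU


def save_and_reload(model, save_dir: str):
    """Save an already-converted low-bit model, then load it back."""
    from ipex_llm.transformers import AutoModelForCausalLM

    model.save_low_bit(save_dir)
    return AutoModelForCausalLM.load_low_bit(save_dir)


# Usage sketch ("path/to/model" is a placeholder; needs ipex-llm installed):
# model, tokenizer = load_low_bit_model("path/to/model", "INT4", device="xpu")
# input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
# output = model.generate(input_ids, max_new_tokens=32)
# print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Reloading a saved low-bit model skips the conversion step on later runs. Running on `xpu` assumes the GPU installation described in the Install section.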
See also
For more details, please refer to the ipex-llm document.
Verified Models#
| Model | CPU Example | GPU Example |
|---|---|---|
| LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link, link |
| LLaMA 2 | link1, link2 | link, link |
| LLaMA 3 | link | link |
| ChatGLM | link | |
| ChatGLM2 | link | link |
| ChatGLM3 | link | link |
| GLM-4 | link | link |
| GLM-4V | link | link |
| Mistral | link | link |
| Mixtral | link | link |
| Falcon | link | link |
| MPT | link | link |
| Dolly-v1 | link | link |
| Dolly-v2 | link | link |
| Replit Code | link | link |
| RedPajama | link1, link2 | |
| Phoenix | link1, link2 | |
| StarCoder | link1, link2 | link |
| Baichuan | link | link |
| Baichuan2 | link | link |
| InternLM | link | link |
| Qwen | link | link |
| Qwen1.5 | link | link |
| Qwen2 | link | link |
| Qwen-VL | link | link |
| Aquila | link | link |
| Aquila2 | link | link |
| MOSS | link | |
| Whisper | link | link |
| Phi-1_5 | link | link |
| Flan-t5 | link | link |
| LLaVA | link | link |
| CodeLlama | link | link |
| Skywork | link | |
| InternLM-XComposer | link | |
| WizardCoder-Python | link | |
| CodeShell | link | |
| Fuyu | link | |
| Distil-Whisper | link | link |
| Yi | link | link |
| BlueLM | link | link |
| Mamba | link | link |
| SOLAR | link | link |
| Phixtral | link | link |
| InternLM2 | link | link |
| RWKV4 | link | |
| RWKV5 | link | |
| Bark | link | link |
| SpeechT5 | link | |
| DeepSeek-MoE | link | |
| Ziya-Coding-34B-v1.0 | link | |
| Phi-2 | link | link |
| Phi-3 | link | link |
| Phi-3-vision | link | link |
| Yuan2 | link | link |
| Gemma | link | link |
| DeciLM-7B | link | link |
| Deepseek | link | link |
| StableLM | link | link |
| CodeGemma | link | link |
| Command-R/cohere | link | link |
| CodeGeeX2 | link | link |
| MiniCPM | link | link |
Get Support#
- Please report a bug or raise a feature request by opening a GitHub Issue.
- Please report a vulnerability by opening a draft GitHub Security Advisory.
[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.