LLM stands for Large Language Model, a large-scale AI model trained on an extensive amount of text and code. Besides the well-known and widespread ChatGPT, there are many powerful Open Source alternatives available today. The advantage of an Open Source LLM is that you can use such a model in your own application within your own environment. There is no dependency on an external service provider that can raise prices, shut down services, or remove models.
But the question that inevitably arises is: where to start? At least that's the question I asked myself. After some research I found out that running a local LLM is not as difficult as it sounds.
First of all, there is a platform called Hugging Face providing a kind of marketplace for all kinds of AI models. After you have registered yourself on the page you can search for and download many different models. Of course each model is different and addresses different needs and requirements. But the good news is that there is a common open-source runtime for executing an LLM, called llama.cpp. llama.cpp allows you to run an LLM with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud. And of course there is also a Python binding available (llama-cpp-python), which makes it easy to test an LLM in a Docker container.
Download your Open Source LLM
Before you can start building your own Docker container you need to download an LLM from the Hugging Face website. You need an activated account (which may take a few hours until your account is verified) and then you can search for your preferred model.
The models I have tested so far are:
- Llama 2 Chat from Meta
- Mistral-7B Instruct from Mistral AI
Both models are Open Source and available in different sizes (quantization levels) and formats. To run a model with llama.cpp we need the ‘.gguf’ format, which you can download either manually from the model pages below or with a small script (see the sketch after the links):
- https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
- https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
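If you prefer to script the download instead of using the web UI, the huggingface_hub Python package can fetch a single GGUF file directly into the local models/ directory. This is just a minimal sketch; the repo and file names below refer to the Mistral model used later in this tutorial, so adjust them for the model you want to test:

# download_model.py - requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_file = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="./models",
)
print(f"Model stored at: {model_file}")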
Build Your Dockerfile
After you have downloaded your model you can build your own Docker image with a Dockerfile like this:
# Official Python base image
FROM python

# Install the Python binding for llama.cpp
RUN pip install llama-cpp-python

# Copy the test script and the downloaded model file(s) into the image
COPY ./app /app
COPY ./models/*.gguf /app

WORKDIR /app
CMD ["python", "test.py"]
This Docker image copies your scripts from the local app/ directory and the model files from the local models/ directory. Finally it starts a Python test script:
from llama_cpp import Llama
# Put the path to the GGUF model file that you've downloaded from Hugging Face here
model_path = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
# Create a llama model
model = Llama(model_path=model_path)
# Prompt creation
system_message = "You are a helpful software developer"
user_message = "What do you know about BPMN 2.0 and Imixs-Workflow?"
prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""
# Model parameters
max_tokens = 500
# Run the model
output = model(prompt, max_tokens=max_tokens, echo=True)
# Print the model output
print(output)
This script is very simple but useful for a first test. It loads your model, builds the prompt, and runs the completion. You can customize the parameters to test the capabilities of a model, as the sketch below illustrates.
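For example, llama-cpp-python accepts parameters both when loading the model and when running the completion. The following minimal sketch shows a few common ones (the exact defaults may vary between library versions):

from llama_cpp import Llama

# Load the model with an explicit context window and without the verbose startup log
model = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=2048,      # size of the context window in tokens
    verbose=False,   # suppress the llama.cpp loading output
)

prompt = "[INST] What do you know about BPMN 2.0? [/INST]"

# Run the completion with explicit sampling parameters
output = model(
    prompt,
    max_tokens=500,    # maximum length of the generated answer
    temperature=0.7,   # higher = more creative, lower = more deterministic
    top_p=0.9,         # nucleus sampling threshold
    echo=False,        # do not repeat the prompt in the output
)

# The answer text is in the first choice of the result dictionary
print(output["choices"][0]["text"])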
Now you can build your Docker image with:
$ docker build . -t my-llm
… and run it with
$ docker run -it --rm my-llm
The output will look something like this:
....
llama_new_context_with_model: CPU output buffer size = 62.50 MiB
llama_new_context_with_model: CPU compute buffer size = 73.00 MiB
llama_new_context_with_model: graph nodes = 1060
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
Guessed chat format: mistral-instruct
llama_print_timings: load time = 3687.93 ms
llama_print_timings: sample time = 44.64 ms / 213 runs ( 0.21 ms per token, 4771.40 tokens per second)
llama_print_timings: prompt eval time = 3687.85 ms / 45 tokens ( 81.95 ms per token, 12.20 tokens per second)
llama_print_timings: eval time = 55366.43 ms / 212 runs ( 261.16 ms per token, 3.83 tokens per second)
llama_print_timings: total time = 59751.24 ms / 257 tokens
{'id': 'cmpl-432370a3-607d-46f2-8f42-1482b9aa8496', 'object': 'text_completion', 'created': 1710880430, 'model': 'mistral-7b-instruct-v0.2.Q4_K_M.gguf', 'choices': [{'text': '<s>[INST] <<SYS>>\nYou are a helpful software developer\n<</SYS>>\nWhat do you know about BPMN 2.0 and Imixs-Workflow? [/INST] BPMN 2.0, or Business Process Model and Notation 2.0, is an open standard process modeling language used to represent business processes graphically and execute them in a workflow engine. It provides a graphical notation for modeling the flow of work between people (users), systems, and organizations. It covers all aspects of process modeling, including the definition of the process flow, the definition of the data that is being processed, and the definition of the rules and conditions that determine the flow of the process.\n\nImixs-Workflow is an open source workflow engine written in Java that enables the execution of BPMN 2.0 processes. It provides features such as human workflow tasks, automatic process execution based on events, advanced form handling, and integration with external systems. It is often used in enterprise applications to manage complex business processes, especially in industries such as finance, healthcare, and manufacturing. Imixs-Workflow also provides an easy way to integrate with other Java applications and databases.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 45, 'completion_tokens': 212, 'total_tokens': 257}
I hope this short tutorial helps you get started with your own Open Source LLM.
Finally, I suggest taking a look at Ollama. It's a wrapper around llama.cpp that also offers an OpenAI-compatible API, and it's generally very user friendly.
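As a rough sketch of how that looks (assuming Ollama is running locally on its default port 11434 and you have pulled a model, e.g. with ‘ollama pull mistral’), you can talk to it with the official openai Python package:

# Minimal sketch: query a local Ollama server via its OpenAI-compatible API
# requires: pip install openai, plus a running Ollama instance with a pulled model
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # the client requires a key, Ollama ignores it
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful software developer"},
        {"role": "user", "content": "What do you know about BPMN 2.0?"},
    ],
)
print(response.choices[0].message.content)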