huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir models/ --local-dir-use-symlinks False
Build the mixtral branch of llama.cpp:
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp$ git fetch origin mixtral
remote: Enumerating objects: 162, done.
remote: Counting objects: 100% (162/162), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 162 (delta 108), reused 149 (delta 100), pack-reused 0
Receiving objects: 100% (162/162), 322.80 KiB | 975.00 KiB/s, done.
Resolving deltas: 100% (108/108), done.
From https://github.com/ggerganov/llama.cpp
 * branch            mixtral    -> FETCH_HEAD
 * [new branch]      mixtral    -> origin/mixtral
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp$ git checkout mixtral
Branch 'mixtral' set up to track remote branch 'mixtral' from 'origin'.
Switched to a new branch 'mixtral'
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp$ make
Run the mixtral branch with the model:
./server -t 8 -m models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -c 2048 &
Open browser at http://127.0.0.1:8080
Clone llama.cpp, cd to the dir, build:
~$ git clone https://github.com/ggerganov/llama.cpp.git
~$ cd llama.cpp
~/llama.cpp$ make -j
You should now have the binaries ./main for cmdline use and ./server for a web server listening on http://127.0.0.1:8080/.
Find a Mixtral model on the quantized page at https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/, e.g. take the Q6_K variant ("very large, extremely low quality loss") https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/blob/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf. Download it either manually (click "download"), or with curl or wget, or with huggingface-cli (get it via $ pip install -U "huggingface_hub[cli]"), e.g.
curl -L https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf --output models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf
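Yet another option is plain Python with only the stdlib - a minimal sketch (the download() helper below is mine, not part of any of the tools above); it streams the response to disk so the multi-GB file never has to fit in memory:

```python
import shutil
from urllib import request

def download(url, dest, chunk=1 << 20):
    # Stream the response body straight to disk in 1 MiB chunks,
    # so a ~36GB GGUF never has to sit in RAM.
    with request.urlopen(url) as resp, open(dest, "wb") as f:
        shutil.copyfileobj(resp, f, length=chunk)
    return dest

# download("https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q6_K.gguf",
#          "models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf")
```

urlopen follows the HF resolve-URL redirect to the CDN on its own; no auth is needed for these public repos.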
Run on the cmdline (it should use all that's available - GPU, multiple cores etc - on its own):
~/llama.cpp$ ./main -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -p 'Write a Python function that accepts a URL to a CSV file, downloads it and loads it up into a Postgres database, creating a table with the correct columns - thanks.'
it should output
[[[
Write a Python function that accepts a URL to a CSV file, downloads it and loads it up into a Postgres database, creating a table with the correct columns - thanks.
To accomplish this task in Python, you can use the requests library to download the CSV file from a given URL, and the psycopg2 library to interact with a PostgreSQL database. Here's a function that does what you asked:
```python
import csv
import requests
import psycopg2
from io import StringIO

def load_csv_to_postgres(url, table_name, db_connection_string):
    # Download the CSV file from the URL
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f"Failed to download CSV from {url}")
    content_type = response.headers["Content-Type"]
    if "charset=utf-8" not in content_type:
        raise Exception("CSV file must be encoded as UTF-8")

    # Create a table with the same column names as the CSV file
    column_names = None
    with StringIO(response.content.decode("utf-8")) as csvfile:
        reader = csv.reader(csvfile)
        if column_names is None:
            column_names = next(reader, None)
        if not column_names:
            raise Exception("CSV file must contain headers")

    table_definition = ", ".join([f"{colname} TEXT" for colname in column_names])
    connection, cursor = create_postgres_connection(db_connection_string)
    query = f"""
    CREATE TABLE IF NOT EXISTS {table_name} (
        {table_definition}
    );
    """
    cursor.execute(query)
    connection.commit()

    # Load the data into the table
    with StringIO(response.content.decode("utf-8")) as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # Skip headers
        query = f"COPY {table_name} FROM STDIN WITH (FORMAT csv)"
        cursor.copy_expert(query, csvfile)
        connection.commit()

    close_postgres_connection(connection, cursor)

def create_postgres_connection(connection_string):
    """
    Create a PostgreSQL connection and return the connection and cursor objects.
    """
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()
    return connection, cursor

def close_postgres_connection(connection, cursor):
    """
    Close a PostgreSQL connection and its associated cursor object.
    """
    cursor.close()
    connection.close()

# Example usage
db_conn_string = "host=localhost port=5432 dbname=test user=postgres password=password"
table_name = "example_table"
process_csv("path/to/your.csv", table_name, db_conn_string)
```
[end of text]
]]]
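Note the generated code isn't quite runnable as-is: the example usage calls process_csv, which isn't defined (the function is load_csv_to_postgres), and passes a file path where a URL is expected - model output isn't guaranteed to run. The header-to-DDL idea itself is sound though; in miniature (stdlib only, no database; table_ddl is my name for this sketch):

```python
import csv
from io import StringIO

def table_ddl(csv_text, table_name):
    # Derive a CREATE TABLE statement from the CSV header row,
    # every column typed TEXT, mirroring the generated function above.
    header = next(csv.reader(StringIO(csv_text)), None)
    if not header:
        raise ValueError("CSV file must contain headers")
    cols = ", ".join(f"{c} TEXT" for c in header)
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({cols});"

print(table_ddl("name,age\nalice,30\n", "example_table"))
# -> CREATE TABLE IF NOT EXISTS example_table (name TEXT, age TEXT);
```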
The 1st run will be slow as it loads the ~40GB of model weights into memory as it goes. You can see it in htop: "Disk IO:" will be pegged at 100% (or more).
Subsequent runs will be much faster because the weights are in a shared mmap, so they stay in the Linux buffer cache - the "buff/cache" column in
~/llama.cpp$ free -g
total used free shared buff/cache available
Mem: 125 19 1 2 105 102
Swap: 15 9 6
^^^^^^^^^^-------------------------------------------------------------- these 105GB of buff/cache hold the model weights, mapped read-only (RO)
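What makes reruns fast is plain mmap semantics: the file's pages live in the kernel page cache and are shared read-only between processes. A toy sketch of the same mechanism on Linux (demo.bin is a made-up 1 MiB stand-in for the ~40GB model file):

```python
import mmap
import os

# Write a small stand-in "weights" file.
path = "demo.bin"
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB instead of ~40GB

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # read-only mapping
    first = mm[:16]   # first touch faults pages in from disk
    second = mm[:16]  # later touches are served from the page cache
    assert first == second  # same bytes; pages are shared, never copied per process
    mm.close()

os.remove(path)
```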
Or launch the web server version
~/llama.cpp$ ./server -m models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf >server-$(date -Isec).log 2>&1 &
and then open in browser
http://127.0.0.1:8080
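The same server also answers HTTP POSTs, so you can script it instead of using the browser UI - a sketch against the /completion endpoint documented in examples/server/README.md (the complete() helper is mine):

```python
import json
from urllib import request

def complete(prompt, n_predict=64, base="http://127.0.0.1:8080"):
    # POST a prompt to the llama.cpp server's /completion endpoint
    # and pull the generated text out of the JSON response.
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = request.Request(base + "/completion", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# complete("Write a haiku about VAT returns.")
```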
I am getting 3-4 tokens per sec (tps) on an old 10-core CPU-only box; a word averages ~2.5 tokens, so that's roughly 1.2-1.6 words per second. People are getting x10-x20 faster on a GPU and/or Apple M3.
LJ Thu 14 Dec 10:52:11 GMT 2023
Check examples/server/README.md and
huggingface-cli download TheBloke/OpenHermes-2.5-neural-chat-v3-3-Slerp-GGUF openhermes-2.5-neural-chat-v3-3-slerp.Q4_K_M.gguf --local-dir models/ --local-dir-use-symlinks False
./server -t 10 -m models/openhermes-2.5-neural-chat-v3-3-slerp.Q4_K_M.gguf -c 2048 &
Open browser at http://127.0.0.1:8080 and in the text box put
User: Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant
Llama: Subject: Apology and Updates on Company Transactions - Due to my sincere apologies for the delay in updating our company transactions in QuikBooks for the quarterly VAT return. I understand how crucial it is to have accurate records and meet deadlines, which unfortunately was not achieved this time around. I would like to assure you that I am taking full responsibility for my actions and will be completing the remaining tasks by the end of this weekend. My goal is to provide you with all necessary information in a timely manner so that we can proceed with filing our VAT return without any further delays or complications.
Please do not hesitate to reach out if there are any concerns or additional support needed from my side, and I will make sure to prioritize this task above everything else. Thank you for your understanding and patience in this matter. Looking forward to working together more efficiently moving forward. Best regards, [Your Name]
The speed is about 10 tps on 10 cores.
LJ Mon 11 Dec 15:43:05 GMT 2023
ljubomir@thinkpad2(:):~/llama.cpp$ curl -L https://huggingface.co/TheBloke/dolphin-2.2-yi-34b-200k-GGUF/resolve/main/dolphin-2.2-yi-34b-200k.Q5_K_M.gguf --output models/dolphin-2.2-yi-34b-200k.Q5_K_M.gguf
./main -ngl 10 -m models/dolphin-2.2-yi-34b-200k.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant"
......................................................................................
............ interrupted after 30 mins, never finishes, on gigul2 ....................
......................................................................................
Fri 8 Dec 12:54:14 GMT 2023
ljubomir@thinkpad2(:):~/llama.cpp$ curl -L https://huggingface.co/TheBloke/stablelm-zephyr-3b-GGUF/resolve/main/stablelm-zephyr-3b.Q4_K_M.gguf --output models/stablelm-zephyr-3b.Q4_K_M.gguf
./main -ngl 10 -m models/stablelm-zephyr-3b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant"
.................................................................................................................................................
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant
>
and I want to maintain a good working relationship with her.
Dear Zoya,
I hope this email finds you well. I am writing to inform you that I have been delayed in updating my company transactions in QuikBooks for the quarterly VAT return. Please forgive my negligence as I know how important it is to keep accurate records and submit our returns on time.
As of now, I am actively working on finishing the update and anticipate completing it by the weekend. Your patience and understanding will be greatly appreciated during this time.
I understand that maintaining a good working relationship with you is crucial for the success of our business, and I want to ensure that we continue to work together efficiently and effectively. Please let me know if there are any further steps or actions I need to take in order to rectify this situation promptly.
Thank you once again for your understanding and support, and I look forward to continuing our productive partnership.
Best regards,
[Your Name]
>
llama_print_timings: load time = 1257.36 ms
llama_print_timings: sample time = 186.37 ms / 198 runs ( 0.94 ms per token, 1062.39 tokens per second)
llama_print_timings: prompt eval time = 1735.84 ms / 48 tokens ( 36.16 ms per token, 27.65 tokens per second)
llama_print_timings: eval time = 10221.72 ms / 197 runs ( 51.89 ms per token, 19.27 tokens per second)
llama_print_timings: total time = 24957.25 ms
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp$ ./main -ngl 10 -m models/stablelm-zephyr-3b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant"
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 1637 (40b8e22)
main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1702040083
llama_model_loader: loaded meta data with 21 key-value pairs and 356 tensors from models/stablelm-zephyr-3b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: output.weight q6_K [ 2560, 50304, 1, 1 ]
llama_model_loader: - tensor 1: token_embd.weight q4_K [ 2560, 50304, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_norm.bias f32 [ 2560, 1, 1, 1 ]
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 48
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant
>
and I need her help to avoid any issues with the HMRC.
Dear Zoya,
I hope this email finds you well. Firstly, please accept my apologies for being late in updating our company transactions in QuikBooks for the quarterly VAT return. As a responsible business owner, it is vital that we keep accurate records and submit our VAT returns on time to avoid any issues with the HMRC. Unfortunately, due to unforeseen circumstances, I have fallen behind schedule, and I am expecting to finish all the necessary updates this weekend.
I understand that being late in submitting a VAT return may lead to additional fees or penalties from the HMRC. Therefore, I would like to take this opportunity to assure you that we will rectify this situation as soon as possible, and make any necessary adjustments to ensure compliance with HMRC regulations moving forward.
Your hard work and dedication to our company's financial health are greatly appreciated, and I am confident that with the information I have yet to provide, we can successfully complete the quarterly VAT return and submit it on time. Please let me know if there are any further steps or actions required from my end to ensure timely completion of this process.
Once again, please accept my apologies for any inconvenience caused by our late submission. As a team, we take our commitment to compliance seriously, and I am confident that we will resolve this issue promptly.
Thank you for your attention to this matter and your continued support in helping us navigate the challenges of running a successful business.
Best regards,
[Your Name]
>
llama_print_timings: load time = 851.09 ms
llama_print_timings: sample time = 305.48 ms / 320 runs ( 0.95 ms per token, 1047.52 tokens per second)
llama_print_timings: prompt eval time = 1748.76 ms / 48 tokens ( 36.43 ms per token, 27.45 tokens per second)
llama_print_timings: eval time = 16704.48 ms / 319 runs ( 52.37 ms per token, 19.10 tokens per second)
llama_print_timings: total time = 30994.07 ms
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp$
LJ Fri 8 Dec 10:33:31 GMT 2023
https://simonwillison.net/2023/Nov/29/llamafile/
llamafile is the new best way to run a LLM on your own computer
Mozilla’s innovation group and Justine Tunney just released llamafile, and I think it’s now the single best way to get started running Large Language Models (think your own local copy of ChatGPT) on your own computer.
A llamafile is a single multi-GB file that contains both the model weights for an LLM and the code needed to run that model—in some cases a full local server with a web UI for interacting with it.
curl -LO https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
ljubomir@thinkpad2(:):~/llama.cpp/models$ curl -LO https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1168 100 1168 0 0 3903 0 --:--:-- --:--:-- --:--:-- 3906
100 4065M 100 4065M 0 0 12.5M 0 0:05:23 0:05:23 --:--:-- 12.6M
ljubomir@thinkpad2(:):~/llama.cpp/models$ chmod u+rwx llamafile-server-0.1-llava-v1.5-7b-q4
ljubomir@thinkpad2(:):~/llama.cpp/models$ ./llamafile-server-0.1-llava-v1.5-7b-q4
run-detectors: unable to find an interpreter for ./llamafile-server-0.1-llava-v1.5-7b-q4
ljubomir@thinkpad2(:):~/llama.cpp/models$ file ./llamafile-server-0.1-llava-v1.5-7b-q4
./llamafile-server-0.1-llava-v1.5-7b-q4: DOS/MBR boot sector; partition 1 : ID=0x7f, active, start-CHS (0x0,0,1), end-CHS (0x3ff,255,63), startsector 0, 4294967295 sectors
https://github.com/mozilla-Ocho/llamafile#gotchas
ljubomir@thinkpad2(:):~/llama.cpp/models$ sudo wget -O ~/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
--2023-12-01 09:43:54-- https://cosmo.zip/pub/cosmos/bin/ape-x86_64.elf
Resolving cosmo.zip (cosmo.zip)... 34.136.86.162
Connecting to cosmo.zip (cosmo.zip)|34.136.86.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9435 (9.2K) [application/octet-stream]
Saving to: ‘/home/ljubomir/ape’
/home/ljubomir/ape 100%[==========================================================================================================================>] 9.21K --.-KB/s in 0s
2023-12-01 09:43:55 (174 MB/s) - ‘/home/ljubomir/ape’ saved [9435/9435]
ljubomir@thinkpad2(:):~/llama.cpp/models$ l ~/ape
-rw-r--r-- 1 root root 9.3K Nov 15 00:37 /home/ljubomir/ape
ljubomir@thinkpad2(:):~/llama.cpp/models$ file ~/ape
/home/ljubomir/ape: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), for OpenBSD, statically linked, no section header
ljubomir@thinkpad2(:):~/llama.cpp/models$ sudo mv -iv ~/ape /usr/bin/
renamed '/home/ljubomir/ape' -> '/usr/bin/ape'
ljubomir@thinkpad2(:):~/llama.cpp/models$ sudo chmod +x /usr/bin/ape
ljubomir@thinkpad2(:):~/llama.cpp/models$ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
ljubomir@thinkpad2(:):~/llama.cpp/models$ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
ljubomir@thinkpad2(:):~/llama.cpp/models$ ./llamafile-server-0.1-llava-v1.5-7b-q4 &
[3] 99180
ljubomir@thinkpad2(:):~/llama.cpp/models$ warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
{"timestamp":1701423994,"level":"INFO","function":"main","line":2258,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701423994,"level":"INFO","function":"main","line":2261,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
Multi Modal Mode Enabledclip_model_load: model name: openai/clip-vit-large-patch14-336
clip_model_load: description: image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 377
clip_model_load: n_kv: 19
clip_model_load: ftype: q4_0
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 1
clip_model_load: model size: 169.31 MB
clip_model_load: metadata size: 0.14 MB
clip_model_load: total allocated memory: 201.77 MB
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llava-v1.5-7b-Q4_K.gguf (version GGUF V3 (latest))
...........................................................................................................................................................
...........................................................................................................................................................
...........................................................................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 162.63 MB
Available slots:
-> Slot 0 - max context: 2048
llama server listening at http://127.0.0.1:8080
loading weights...
{"timestamp":1701423995,"level":"INFO","function":"main","line":2527,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
ljubomir@thinkpad2(:):~/llama.cpp/models$ j
[1]- Stopped vim -i .viminfo README (wd: ~/libfaketime)
[2]+ Stopped vim -i .viminfo README.LJ (wd: ~/llama.cpp)
[3] Running ./llamafile-server-0.1-llava-v1.5-7b-q4 &
In web browser open url
http://127.0.0.1:8080
Can 1) chat 2) upload images
simonw on Hacker News:
I think the best way to try this out is with LLaVA, the text+image model (like GPT-4 Vision). Here are steps to do that on macOS (which should work the same on other platforms too, I haven't tried that yet though):
1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
2. Make that binary executable, by running this in a terminal:
chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
3. Run your new executable, which will start a web server on port 8080:
./llamafile-server-0.1-llava-v1.5-7b-q4
4. Navigate to http://127.0.0.1:8080/ to upload an image and start chatting with the model about it in your browser.
Screenshot here: https://simonwillison.net/2023/Nov/29/llamafile/
LJ Thu 30 Nov 16:31:30 GMT 2023
https://huggingface.co/TheBloke/Orca-2-13B-GGUF
huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp/models$ huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Orca-2-13B-GGUF/resolve/main/orca-2-13b.Q4_K_M.gguf to /home/ljubomir/.cache/huggingface/hub/tmpjfug7god
orca-2-13b.Q4_K_M.gguf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.87G/7.87G [05:56<00:00, 22.0MB/s]
./orca-2-13b.Q4_K_M.gguf
./main -m ./models/orca-2-13b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
https://huggingface.co/TheBloke/Orca-2-13B-GGUF
On the command line, including multiple files at once
I recommend using the huggingface-hub Python library:
pip3 install huggingface-hub
Then you can download any individual model file to the current directory, at high speed, with a command like this:
huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
More advanced huggingface-cli download usage
You can also download multiple files at once with a pattern:
huggingface-cli download TheBloke/Orca-2-13B-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
For more documentation on downloading with huggingface-cli, please see: HF -> Hub Python Library -> Download files -> Download from the CLI.
To accelerate downloads on fast connections (1Gbit/s or higher), install hf_transfer:
pip3 install hf_transfer
And set environment variable HF_HUB_ENABLE_HF_TRANSFER to 1:
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/Orca-2-13B-GGUF orca-2-13b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
Windows Command Line users: You can set the environment variable by running set HF_HUB_ENABLE_HF_TRANSFER=1 before the download command.
Example llama.cpp command
Make sure you are using llama.cpp from commit d0cee0d or later.
./main -ngl 32 -m orca-2-13b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
Change -c 4096 to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.
If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins
For other parameters and how to use them, please refer to the llama.cpp documentation
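The {system_message}/{prompt} placeholders in that -p template can be filled programmatically - a tiny sketch (chatml_prompt is my helper, not part of llama.cpp):

```python
def chatml_prompt(system_message, prompt):
    # Fill the ChatML-style template Orca-2 expects, same shape as
    # the -p argument in the example llama.cpp command above.
    return (f"<|im_start|>system\n{system_message}<|im_end|>\n"
            f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant")

print(chatml_prompt("You are a helpful AI assistant.", "Why is the sky blue?"))
```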
This fails, runs out of memory:
./main -ngl 10 -m models/goliath-120b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant"
(base) ljubomir@thinkpad2(:):~$
source python3-venv/base/bin/activate
pip install --upgrade huggingface_hub
huggingface-cli login
read access token is
...
from ljubomirjosifovski@gmail.com @huggingface
(To download the model file, either
huggingface-cli download TheBloke/goliath-120b-GGUF goliath-120b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
or
huggingface-cli download TheBloke/goliath-120b-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K*gguf'
- but the model is actually split, so those patterns match nothing.)
The model parts are
goliath-120b.Q4_K_M.gguf-split-a
goliath-120b.Q4_K_M.gguf-split-b
so allow for the split in the file pattern to download both:
huggingface-cli download TheBloke/goliath-120b-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K_M*gguf-split*'
(base) ljubomir@gigul2(3935572.python:0):~/llama.cpp/models$ huggingface-cli download TheBloke/goliath-120b-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q4_K_M*gguf-split*'
then concatenate the binary files
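The concatenation is just cat in part order (writing into the current models dir; filenames as above):

```shell
# join the two parts, in order, back into the single GGUF that ./main expects
cat goliath-120b.Q4_K_M.gguf-split-a goliath-120b.Q4_K_M.gguf-split-b > goliath-120b.Q4_K_M.gguf
```

The parts can be removed afterwards to reclaim the ~65GB they occupy.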
run prompt
./main -ngl 32 -m models/goliath-120b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "You are a helpful AI assistant.\n\nUSER: {prompt}\nASSISTANT:"
LJ Sat 18 Nov 00:29:45 GMT 2023
This works (34B model with 10 threads): ./main -ngl 10 -m models/nous-capybara-34b.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant" ljubomir@gigul2(:):~/llama.cpp$ ./main -ngl 10 -m models/nous-capybara-34b.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant" warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored warning: see main README.md for information on enabling GPU BLAS support Log start main: build = 1539 (0808d16) main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu main: seed = 1700248085 llama_model_loader: loaded meta data with 20 key-value pairs and 543 tensors from models/nous-capybara-34b.Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: - tensor 0: token_embd.weight q4_K [ 7168, 64000, 1, 1 ] llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 7168, 7168, 1, 1 ] llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 7168, 1024, 1, 1 ] .............................................................................................................................................. .............................................................................................................................................. .............................................................................................................................................. 
llama_model_loader: - tensor 540: blk.59.ffn_norm.weight f32 [ 7168, 1, 1, 1 ]
llama_model_loader: - tensor 541: output_norm.weight f32 [ 7168, 1, 1, 1 ]
llama_model_loader: - tensor 542: output.weight q6_K [ 7168, 64000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = nousresearch_nous-capybara-34b
llama_model_loader: - kv 2: llama.context_length u32 = 200000
llama_model_loader: - kv 3: llama.embedding_length u32 = 7168
llama_model_loader: - kv 4: llama.block_count u32 = 60
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 20480
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 56
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,64000] = ["<unk>", "<|startoftext|>", "<|endof...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,64000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,64000] = [2, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 3, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 144
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_K: 361 tensors
llama_model_loader: - type q6_K: 61 tensors
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 64000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 200000
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 20480
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 200000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 34.39 B
llm_load_print_meta: model size = 19.24 GiB (4.81 BPW)
llm_load_print_meta: general.name = nousresearch_nous-capybara-34b
llm_load_print_meta: BOS token = 144 ' '
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 315 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.20 MiB
llm_load_tensors: mem required = 19700.44 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 480.00 MiB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 271.57 MiB
system_info: n_threads = 5 / 10 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction: '
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 48
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Write an email to my accountant that I have been late updating my company transactions in QuikBooks for the quarterly VAT return, but I will get around finishing it this weekend. Her name is Zoya, be polite she is an accountant
> .
Subject: Delayed Update of Company Transactions in QuickBooks - Apologies and Timeline for Completion
Dear Zoya,
I hope this email finds you well. I wanted to inform you that there has been a delay in updating the company transactions in QuickBooks for our quarterly VAT return. I apologize for any inconvenience this may have caused and understand that it might affect our accounting process. However, please be assured that I am taking this matter seriously and have made it my top priority to complete the necessary updates this weekend. This will ensure that we can submit a timely and accurate VAT return. I appreciate your understanding and patience in this matter. Please let me know if you require any further information or assistance during this process. Thank you for your continued support and professionalism. I look forward to working with you to resolve this issue as quickly as possible.
Best regards,
[Your Name]</s>
>
llama_print_timings: load time = 2554.61 ms
llama_print_timings: sample time = 235.60 ms / 194 runs ( 1.21 ms per token, 823.42 tokens per second)
llama_print_timings: prompt eval time = 30303.94 ms / 69 tokens ( 439.19 ms per token, 2.28 tokens per second)
llama_print_timings: eval time = 110070.52 ms / 194 runs ( 567.37 ms per token, 1.76 tokens per second)
llama_print_timings: total time = 153087.16 ms
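The per-token rates in the timing block above can be sanity-checked from the raw millisecond totals. A minimal sketch (the `tps` helper is mine, not part of llama.cpp):

```shell
# Recompute tokens/sec from llama_print_timings raw numbers: tps = runs / total_ms * 1000
tps() {
  awk -v ms="$1" -v runs="$2" 'BEGIN { printf "%.2f\n", runs / ms * 1000 }'
}
tps 110070.52 194   # eval time above -> 1.76, matching "1.76 tokens per second"
tps 30303.94 69     # prompt eval time above -> 2.28
```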
Check the openchat_3.5 model, compare with llama-2
./main -m models/openchat_3.5.Q5_K_M.gguf -t 6 -p "The meaning of life and universe and everything is"
./main -m models/llama-2-13b.Q4_K_M.gguf -t 6 -p "The meaning of life and universe and everything is"
Model weights from https://huggingface.co/TheBloke/openchat_3.5-GGUF
LJ Mon 6 Nov 11:08:29 GMT 2023
This works (13B model with 6 threads):
./main -m models/llama-2-13b.Q4_K_M.gguf -t 6 -p "Llamas are"
ljubomir@thinkpad2(:):~/llama.cpp$ ./main -m models/llama-2-13b.Q4_K_M.gguf -t 6 -p "Llamas are"
Log start
main: build = 1218 (4fc3925)
main: seed = 1694508369
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from models/llama-2-13b.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
....................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.47 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Llamas are a species of South American camelid, and the most numerous one. Alpaca are domesticated llamas, while vicuña are related to alpacas, but wild. There is also guanaco and chinchilla, both rodents. The four species of New World camelids in this genus include:
The llama (Llama glama)
The guanaco (L. guanicoe) or wild llama
The alpaca (Vicugna pacos) which is a domesticated L. glama
The vicuña (Vicugna vicugna) the wild relative of the alpaca
Camelids are herbivores. The llamas and guanacos eat mainly grass, but also nibble on leaves, stems and bark.
Alpacas and vicuñas feed mostly on leaves and twigs. Camelids have a three-chambered stomach that allows them to digest coarse fibrous food. They are ruminants, and chew their cud. Cud is semi-digested plant matter regurgitated and re-chewed in order to further break down the cellulose found in grasses and other plants. Llamas have one or two babies (called llama cria) a year. They are weaned at 6–8 months of age. Adults live about 20 years, but a few llamas and alpacas have been known to survive for as long as 35 years in captivity. Llamas and their wild cousins (guanacos) evolved from the now extinct hemipristine genus. Llamas are thought to have diverged from guanacos about 2 million years ago, while alpacas split off from the vicuñas around 3.5–4 million years ago. A llama’s fur is made up of three layers: a woolly undercoat, a coarse middle layer called "guard hair", and an outer coat that serves as weather resistance. Llamas have thick, soft wool and very few lanolin glands. While they are shorn once per year, they do not require shearing with the same frequency as sheep. Their fleece weighs about 3 to 9 pounds (1.4–4.1 kg), or about 25% more than a sheep of similar size. Llamas have long been used by humans for their wool and meat. They are intelligent, tough, hard-working animals. The first llamas were domesticated in the Andes mountains sometime between 3000 and 4800 BCE. Since then they have been used as pack animals to carry people and supplies through the rough mountain terrain. They can easily carry a load of up to 25% of their body weight (up to 70 pounds or 32 kg) for distances of as much as 16 miles (26 km). Llamas are very sure-footed, have a steady gait, and can walk or trot over steep trails and rocky terrain. In recent years llamas have also been used to compete against horses in long distance races, such as the llama race at the National Farm Toy Show in Dyersville, Iowa. 
Llamas are easily bred and their young grow rapidly, so they were often used for meat and fur until about 1900. Today llamas are mostly used by tourist companies to give treks or rides through mountain trails. They are also used as guard animals for livestock, or for carrying small items in urban areas. Llamas are intelligent, curious, and can be trained to perform a wide variety of useful tasks. In some places they have even been taught to drive cars! Llamas will bond with their owners and become very affectionate companions. There is a llama for every budget, from $600 up to several thousand dollars.
[end of text]
llama_print_timings: load time = 1281.36 ms
llama_print_timings: sample time = 767.03 ms / 876 runs ( 0.88 ms per token, 1142.06 tokens per second)
llama_print_timings: prompt eval time = 246965.21 ms / 519 tokens ( 475.85 ms per token, 2.10 tokens per second)
llama_print_timings: eval time = 435023.13 ms / 873 runs ( 498.31 ms per token, 2.01 tokens per second)
llama_print_timings: total time = 686657.21 ms
Log end
ljubomir@thinkpad2(:):~/llama.cpp$
Download llama-2 .gguf 13B model versions from https://huggingface.co/TheBloke/Llama-2-13B-GGUF
Faster, smaller llama-2 .gguf 7B model versions from https://huggingface.co/TheBloke/Llama-2-7B-GGUF
Remove older .ggml model versions - no longer needed or supported by the code
ljubomir@thinkpad2(:):~/llama.cpp/models$ rmv *B/*
ljubomir@thinkpad2(:):~/llama.cpp/models$ rmv *.bin
ljubomir@thinkpad2(:):~/llama.cpp/models$ du -sh .
12G .
ljubomir@thinkpad2(:):~/llama.cpp/models$ l
total 12G
drwxr-xr-x 2 ljubomir ljubomir 4.0K Sep 12 09:45 7B/
drwxr-xr-x 2 ljubomir ljubomir 4.0K Sep 12 09:45 13B/
drwxr-xr-x 2 ljubomir ljubomir 4.0K Sep 12 09:45 30B/
drwxr-xr-x 2 ljubomir ljubomir 4.0K Sep 12 09:45 65B/
-rw------- 1 ljubomir ljubomir 582K Sep 12 08:09 ggml-vocab-llama.gguf
-rw-r--r-- 1 ljubomir ljubomir 1.9K Apr 2 21:50 llama.sh
-rw-rw-r-- 1 ljubomir ljubomir 3.9G Sep 12 09:57 llama-2-7b.Q4_K_M.gguf
-rw-rw-r-- 1 ljubomir ljubomir 7.4G Sep 12 09:08 llama-2-13b.Q4_K_M.gguf
-rw-r--r-- 1 ljubomir ljubomir 489K Apr 2 21:50 tokenizer.model
-rw-r--r-- 1 ljubomir ljubomir 50 Apr 2 21:50 tokenizer_checklist.chk
Llama-2 models referenced from https://github.com/ggerganov/llama.cpp/#obtaining-and-using-the-facebook-llama-2-model
Run the 7B model
./main -m models/llama-2-7b.Q4_K_M.gguf -t 6 -p "Llamas are"
ljubomir@thinkpad2(:):~/llama.cpp$ ./main -m models/llama-2-7b.Q4_K_M.gguf -t 6 -p "Llamas are"
Log start
main: build = 1218 (4fc3925)
main: seed = 1694509202
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
..................................................................................................
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.97 MB
system_info: n_threads = 6 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Llamas are members of the family Camelidae which includes camels, alpacas and vicunas. nobody is going to tell you that llamas are cute. In fact, they aren’t really all that cute at all. They are, however, very cuddly and gentle. Llamas are used for pack animals in the Andes Mountains because of their large size and strong backs. They can carry up to 35% of their body weight over long distances without complaint or slowing down. The Inca Empire (1400-1600 AD) was one of the largest civilizations in South America and llama’s were a vital part of it’s success. Llamas are also used as guard animals because they have been bred to be aggressive towards dogs, coyotes and even mountain lions! Llama farmers who want to sell their livestock must register with the United States Department of Agriculture (USDA) before selling any products made from them. This includes wool clothing, hair care products like shampoos or conditioners that contain llama oil extracts such as coconut oil or jojoba seed butter; food items including snacks which might include fruit juices with added vitamin C powder mixed into them such as orange juice or apple cider vinegar mixes sold at health stores
The llama is a large, long-legged mammal. It has a thick coat of fur and its tail reaches over 2 feet in length!
The body is covered with white patches on the face and ears, which look like buttons or eyes when viewed from above because they reflect light differently than surrounding skin tones do (this makes them seem more prominent). The llama’s name comes from an Incan word meaning “mountain dweller.” They were first domesticated around 500 BC by nomadic herders who used them for pack animals during long journeys through high altitudes where there was no water available at ground level but plenty up higher! In the past, llamas have been kept as pets in some countries. This practice is illegal in most places because it’s not safe to keep a wild animal as an indoor pet (due to their tendency to attack). But you can still own one if you live near enough that they are allowed outdoors during certain times of day or night – just don’t let them roam free all over town! As far as housing is concerned, llamas do not need a lot of space. They prefer to be in their large pastures where there is plenty of room for them to roam about and exercise themselves. However, if you want your llama to live inside your home with other pets or children then it’s important that he has plenty of room so he doesn’t feel cramped up all day long! Llamas are herbivores (plant-eaters). They graze on grasses and other plants, which is why llamas should have access to fresh vegetation as part of their diet. Llama owners often feed them hay with alfalfa at night so they can sleep comfortably throughout the day without worrying about being hungry again until after dusk falls upon us once more when our eyes close from exhaustion while we dream away happily knowing that tomorrow morning brings another chance for joy! Llamas are large animals, and you need to make sure your llama has enough space. Llamas can be housed with other animals if they have enough space, but it is best not to keep them in groups because this will make them fight over food or territory. 
Llama care tips: Llamas are herbivores, and their diet should consist of hay, grasses, fruits and vegetables that you can give to your llama as treats throughout the day. You’ll need to provide fresh water for your pet at all times so they don’t get dehydrated or sick! If there isn’t any running tap nearby where you live then consider getting an automatic feeder filled with water from time to time when necessary – these devices work well but may not always be available depending on availability at local stores near your home Llamas are social animals, so they need to interact with other llamas. If the llama doesn’t have another llama as a companion then it will become lonely and depressed. The best way to help your llama avoid this situation is by having another llama around all day long! You can also get your llama used to being in the company of others before introducing them into other groups of animals such as dogs or cats that live together peacefully without any problems at all between themselves because these creatures do not share common ground when it comes down how we treat each other during playtime activities like chasing games where one animal attempts Llamas are herbivores, which means they eat plants and vegetables. Llama’s digestive system is different from humans because their stomach doesn’t have any acidity or bile (these substances help break down food), which allows them to chew without feeling discomfort in their mouths when eating grasses that contain cellulose fibers such as alfalfa. Llamas are social animals, so they need to interact with other llamas. If the llama doesn’t have another llama as a companion then it will become lonely and depressed. The best way to help your llama avoid this situation is by having another llama around all day long! 
You can also get your llama used to being in the company of others before introducing them into other groups of animals such as dogs or cats that live together peacefully without any problems at all between themselves because these creatures do not share common ground when it comes down how we treat each other during playtime activities like chasing games where one animal attempts If the llama doesn’t have a companion, then the human may be expected to provide this. This is an important responsibility and must not be taken lightly by those who are considering buying a pet. A good way for humans (or another animal) who want their llamas as pets might begin with visiting some shelters near where they live so that there’s something out there waiting if nothing else works out! Llamas make excellent companions for kids and adults alike because of how friendly they are towards humans. They also love being around other animals like dogs or cats which makes them more socialized than some other domestic pets such as cows that don’t always get along well with each other (unless they have been raised together since birth). Llamas live longer than most other farm animals, so it’s important to choose a breeder who will guarantee their health and longevity before buying one. Llama prices range from $500-$3000 depending on where you buy them; however if this is something that interests you then please visit our website for more information regarding pricing and other details related with buying llamas! 
[end of text]
llama_print_timings: load time = 691.13 ms
llama_print_timings: sample time = 1314.13 ms / 1524 runs ( 0.86 ms per token, 1159.70 tokens per second)
llama_print_timings: prompt eval time = 273736.89 ms / 1033 tokens ( 264.99 ms per token, 3.77 tokens per second)
llama_print_timings: eval time = 398310.52 ms / 1519 runs ( 262.22 ms per token, 3.81 tokens per second)
llama_print_timings: total time = 674431.98 ms
Log end
ljubomir@thinkpad2(:):~/llama.cpp$
The previous
$ ./main -m models/llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
does not work anymore.
LJ Tue 12 Sep 09:55:11 BST 2023
Llama 70b model
https://huggingface.co/TheBloke/Llama-2-70B-GGML
./main -m llama-2-70b/ggml/llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
Get llama-2-70b.ggmlv3.q4_0.bin (36.2 GB) from https://huggingface.co/TheBloke/Llama-2-70B-GGML/tree/main
The file is https://huggingface.co/TheBloke/Llama-2-70B-GGML/blob/main/llama-2-70b.ggmlv3.q4_0.bin
DL link (time-limited, signed) https://cdn-lfs.huggingface.co/repos/fe/d7/fed75e74ade1c82e6c2c6f5a570535c19702e32429288de1bc12737a73f73327/a029d3d4b01fec8ec8bad9b37e99ac6977b2e214ea69510afe19f5f33db0524e?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-70b.ggmlv3.q4_0.bin%3B+filename%3D%22llama-2-70b.ggmlv3.q4_0.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1691073216&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTY5MTA3MzIxNn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9mZS9kNy9mZWQ3NWU3NGFkZTFjODJlNmMyYzZmNWE1NzA1MzVjMTk3MDJlMzI0MjkyODhkZTFiYzEyNzM3YTczZjczMzI3L2EwMjlkM2Q0YjAxZmVjOGVjOGJhZDliMzdlOTlhYzY5NzdiMmUyMTRlYTY5NTEwYWZlMTlmNWYzM2RiMDUyNGU%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=fAbs4HOp53jASWdXQDF9Ac3XTbhgVrDCMMBpavk6ThSufbmcSL1QhuX%7EkoC-uXt1pEtip4liGp5e-F3eebMkMLxpBGydjCbFpKX17Tu3GS0wq07dkkPnPnRswnk828YgX%7EVGskXQTMfRB%7EvBDWYxAQJxEFHj7%7EQGNCKyVgJ%7EpuhsE7u093LfKuFvsMf9CdmS%7Eu%7EfJf0lquOuFI%7EcS8waTwBHRe03CX-xyM0AVyda5VKqbhfAAbhdRBOaSWilp2EExl-MBd4j8KbLJgCa04onN1GYQFG2mtkLFzETnRQI-msPzt2j17Lr1TIh9AFfy4MNySj3HpDyJoeuYaJeFdciHg__&Key-Pair-Id=KVTP0A1DKRTAX
Run on thinkpad2 with
./main -m models/llama-2-70b.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "Llamas are"
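For a 36 GB download it is worth verifying the file before use. A minimal sketch: the second long hex string in the LFS URL above appears to be the blob's SHA-256, but treat the expected value as something copied from the model card, not from this helper:

```shell
# Sketch: check a downloaded model file against an expected SHA-256.
# verify_sha256 is my helper name, not part of llama.cpp or huggingface-cli.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$actual" = "$expected" ]; then echo OK; else echo "MISMATCH: $actual"; fi
}
# verify_sha256 models/llama-2-70b.ggmlv3.q4_0.bin <sha256 from the model card>
```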
# Run in interactive mode
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
https://replicate.com/blog/run-llama-locally
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it. `LLAMA_METAL=1` allows the computation to be executed on the GPU
LLAMA_METAL=1 make -j
# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi
# Set prompt
PROMPT="Hello! How are you?"
# Run in interactive mode
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8
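The download-if-missing guard in the script above recurs for every model in these notes; it can be wrapped once. A minimal sketch (`fetch_model` is a made-up helper, not part of llama.cpp):

```shell
# Sketch: download a GGML/GGUF file from a Hugging Face repo only if it is
# not already in models/ (same curl -L / resolve/main pattern as above).
fetch_model() {
  local repo="$1" file="$2"
  if [ -f "models/${file}" ]; then
    echo "models/${file} already present, skipping"
  else
    curl -L "https://huggingface.co/${repo}/resolve/main/${file}" -o "models/${file}"
  fi
}
# e.g. fetch_model TheBloke/Llama-2-13B-chat-GGML llama-2-13b-chat.ggmlv3.q4_0.bin
```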
LJ Thu 27 Jul 13:02:23 BST 2023
From
https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF
take the 70B model
https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF/resolve/main/codellama-70b-python.Q4_K_M.gguf
Run as per
https://replicate.com/blog/run-llama-locally
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it. `LLAMA_METAL=1` allows the computation to be executed on the GPU
LLAMA_METAL=1 make -j
# Download model
export MODEL=codellama-70b-python.Q4_K_M.gguf
if [ ! -f models/${MODEL} ]; then
curl -L "https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF/resolve/main/${MODEL}" -o models/${MODEL}
fi
# Set prompt
PROMPT="Hello! How are you?"
# Run in interactive mode
./main -m ./models/${MODEL} \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8
LJ Sun 4 Feb 16:37:59 GMT 2024
curl -L "https://huggingface.co/TheBloke/CodeLlama-13B-Python-GGUF/resolve/main/codellama-13b-python.Q4_K_M.gguf" -o models/codellama-13b-python.Q4_K_M.gguf
Interactive on the command line
./main -m ./models/codellama-13b-python.Q4_K_M.gguf -p "Write a python function that reads a csv file please"
./main -m ./models/codellama-13b-python.Q4_K_M.gguf \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --prompt "Write a python function that reads a csv file please"
Run as a web server
./server -m ./models/codellama-13b-python.Q4_K_M.gguf &
Open browser at http://127.0.0.1:8080
./main -m ./models/codellama-70b-python.Q4_K_M.gguf \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  --prompt "Write a python function that reads a csv file please"
LJ Sun 4 Feb 17:02:11 GMT 2024
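The ./server instance above can also be queried over HTTP instead of the browser UI, assuming llama.cpp server's default /completion endpoint on port 8080 (the `payload` helper is mine, only the curl call touches the server):

```shell
# Sketch: build a JSON body for llama.cpp server's /completion endpoint.
payload() {
  printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}
payload "def read_csv(path):" 128
# then: curl -s http://127.0.0.1:8080/completion -d "$(payload 'def read_csv(path):' 128)"
```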
https://twitter.com/ggerganov/status/1772268000369873117
llama-cli --hf-repo ggml-org/models --model mistral-7b-v0.2-iq3_s-imat.gguf -p "I like big" -r "."
llava-cli --hf-repo ggml-org/models --model mistral-7b-v0.2-iq3_s-imat.gguf -p "I like big" -r "."
... but that fails?
LJ Mon 25 Mar 15:14:32 GMT 2024
Llama-3
https://huggingface.co/search/full-text?q=Llama-3-70B+gguf
https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF
Llama-3-70B gguf from
https://huggingface.co/mradermacher/JSL-MedLlama-3-70B-v1.0-GGUF
https://huggingface.co/mradermacher/JSL-MedLlama-3-70B-v1.0-GGUF/resolve/main/JSL-MedLlama-3-70B-v1.0.Q4_K_S.gguf
Example llama.cpp command
Make sure you are using llama.cpp from commit d0cee0d or later.
./main -ngl 35 -m kafkalm-70b-german-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>"
Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
Change -c 4096 to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins
./main -ngl 35 -m kafkalm-70b-german-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
https://huggingface.co/TheBloke/KafkaLM-70B-German-V0.1-GGUF
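The note above that longer sequence lengths need much more resources can be quantified for the KV cache. A sketch, assuming an f16 cache and using the head/layer dimensions printed in the load logs elsewhere in these notes (`kv_mib` is my helper):

```shell
# Sketch: f16 KV cache size in MiB
#   bytes = 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * 2 (f16 bytes)
kv_mib() {
  local n_layer=$1 n_ctx=$2 n_head_kv=$3 head_dim=$4
  echo $(( 2 * n_layer * n_ctx * n_head_kv * head_dim * 2 / 1024 / 1024 ))
}
kv_mib 80 4096 8 128   # 70B (80 layers) at -c 4096 -> 1280, matching "KV self size = 1280.00 MiB"
kv_mib 32 4096 8 128   # 8B (32 layers) at -c 4096 -> 512, matching "KV self size = 512.00 MiB"
```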
Warning
`llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************`
solution
For proper llama3 support, you may pass --override-kv tokenizer.ggml.pre=str:llama3 to main or server without generating a new gguf file.
https://www.reddit.com/r/LocalLLaMA/comments/1cg0z1i/bpe_pretokenization_support_is_now_merged_llamacpp/
https://github.com/ggerganov/llama.cpp/pull/6920
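Since the degraded-quality warning above is easy to miss in a long load log, a small check helps when saving logs to a file. A sketch (`suggest_pre_override` is a made-up helper; only the warning text and the flag come from the notes above):

```shell
# Sketch: scan a saved llama.cpp load log for the missing pre-tokenizer
# warning and print the workaround flag.
suggest_pre_override() {
  if grep -q "missing pre-tokenizer type" "$1"; then
    echo "pass: --override-kv tokenizer.ggml.pre=str:llama3"
  fi
}
# e.g. ./main ... 2> load.log ; suggest_pre_override load.log
```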
LJ on gigul2 command line - prompt:
./main -m models/JSL-MedLlama-3-70B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(862533.llm:0):~/llama.cpp$ ./main -m models/JSL-MedLlama-3-70B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: CPU buffer size = 38470.61 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1280.00 MiB
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 584.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 5 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one that is guided by values and goals. It gives us direction, motivation, and a sense of fulfillment. Living with purpose helps us to prioritize our time and energy, leading to greater productivity and satisfaction. When we live on purpose, we feel more connected to ourselves and the world around us.<|end_of_text|> [end of text]
llama_print_timings: load time = 148379.29 ms
llama_print_timings: sample time = 129.89 ms / 60 runs ( 2.16 ms per token, 461.93 tokens per second)
llama_print_timings: prompt eval time = 14151.16 ms / 18 tokens ( 786.18 ms per token, 1.27 tokens per second)
llama_print_timings: eval time = 62316.40 ms / 59 runs ( 1056.21 ms per token, 0.95 tokens per second)
llama_print_timings: total time = 76673.81 ms / 77 tokens
Log end
ljubomir@gigul2(862533.llm:0):~/llama.cpp$
LJ on gigul2 command line - chat:
./main -m models/JSL-MedLlama-3-70B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -i -ins
ljubomir@gigul2(862533.llm:0):~/llama.cpp$ ./main -m models/JSL-MedLlama-3-70B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -i -ins
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: CPU buffer size = 38470.61 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1280.00 MiB
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 584.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 5 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|begin_of_text|>
> What's the phrase along the lines of "dancing at the tip of a needle"? When someone is arguing or discussing to great lengths something utterly unimportant maybe?
The phrase that comes closest to dancing at the tip of a needle, which describes someone arguing or discussing something utterly unimportant to great lengths, is 'splitting hairs'. This idiom means to argue or quibble over trivial matters.<|end_of_text|>
> ^C
llama_print_timings: load time = 4634.41 ms
llama_print_timings: sample time = 300.44 ms / 141 runs ( 2.13 ms per token, 469.31 tokens per second)
llama_print_timings: prompt eval time = 69906.71 ms / 55 tokens ( 1271.03 ms per token, 0.79 tokens per second)
llama_print_timings: eval time = 143960.41 ms / 139 runs ( 1035.69 ms per token, 0.97 tokens per second)
llama_print_timings: total time = 238918.82 ms / 194 tokens
LJ Mon 29 Apr 18:35:07 BST 2024
Llama-3-8B gguf from
https://github.com/meta-llama/llama3
https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF
https://huggingface.co/mradermacher/JSL-MedLlama-3-8B-v1.0-GGUF
LJ on gigul2 command line - query:
./main -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(862533.llm:0):~/llama.cpp$ ./main -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 5 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one in which you are continually striving to reach higher goals and improve yourself, rather than just coasting through each day without much direction.
Living a purposeful life means setting goals and working hard to achieve them. It also means being open to change and learning new things along the way. Here are four lines on living a purposeful life:
1. "The only way to do great work is to love what you do." - Steve Jobs
2. "If you want to live a happy life, tie it to a goal, not to people or things." - Albert Einstein
3. "The future belongs to those who believe in the beauty of their dreams." - Eleanor Roosevelt
4. "You must always have faith that your life has a great purpose and meaning, no matter what happens." - Oprah Winfrey<|end_of_text|> [end of text]
llama_print_timings: load time = 743.20 ms
llama_print_timings: sample time = 374.01 ms / 166 runs ( 2.25 ms per token, 443.84 tokens per second)
llama_print_timings: prompt eval time = 1489.24 ms / 18 tokens ( 82.74 ms per token, 12.09 tokens per second)
llama_print_timings: eval time = 20316.59 ms / 165 runs ( 123.13 ms per token, 8.12 tokens per second)
llama_print_timings: total time = 22382.30 ms / 183 tokens
Log end
ljubomir@gigul2(862533.llm:0):~/llama.cpp$
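Sanity check: the eval rate llama_print_timings reports follows directly from the per-token time. A quick back-of-envelope using the numbers printed above:

```python
# Timings reported by llama_print_timings for the query run above.
eval_ms = 20316.59   # total eval time in ms
eval_runs = 165      # tokens generated

ms_per_token = eval_ms / eval_runs
tokens_per_second = 1000.0 / ms_per_token

# Reproduces the log's "123.13 ms per token, 8.12 tokens per second".
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
```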
LJ on gigul2 command line - chat:
./main -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -i -ins
ljubomir@gigul2(862533.llm:0):~/llama.cpp$ ./main -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -i -ins
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 5 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
<|begin_of_text|>
> When was DAvid Cameron prime minister of UK?
The given date is not in the range of my knowledge.<|end_of_text|>
> What year was David Cameron voten in as Prime Minister of the United Kingdom?
The given date is not in the range of my knowledge.<|end_of_text|>
> What's the phrase along the lines of "dancing at the tip of a needle"? When someone is arguing or discussing to great lengths something utterly unimportant maybe?
I am not sure if this is the correct idiom for what you are describing, but it is a fairly common one.<|end_of_text|>
>
llama_print_timings: load time = 745.66 ms
llama_print_timings: sample time = 115.86 ms / 52 runs ( 2.23 ms per token, 448.81 tokens per second)
llama_print_timings: prompt eval time = 274454.15 ms / 85 tokens ( 3228.87 ms per token, 0.31 tokens per second)
llama_print_timings: eval time = 5948.38 ms / 49 runs ( 121.40 ms per token, 8.24 tokens per second)
llama_print_timings: total time = 657580.44 ms / 134 tokens
ljubomir@gigul2(862533.llm:0):~/llama.cpp$
LJ Mon 29 Apr 18:49:57 BST 2024
Build with CUDA support on gigul2
make LLAMA_CUDA=1 -j
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ make LLAMA_CUDA=1 -j
.................................................................
Check CUDA versions
nvidia-smi
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ nvidia-smi
Sun May 12 16:50:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro K620 Off | 00000000:05:00.0 On | N/A |
| 59% 70C P0 6W / 30W | 1686MiB / 2048MiB | 45% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1833 G /usr/lib/xorg/Xorg 1192MiB |
| 0 N/A N/A 5162 G xfwm4 1MiB |
| 0 N/A N/A 6014 G ...yOnDemand --variations-seed-version 104MiB |
| 0 N/A N/A 7843 G ...irefox/4209/usr/lib/firefox/firefox 6MiB |
| 0 N/A N/A 8017 G /usr/lib/thunderbird/thunderbird 192MiB |
| 0 N/A N/A 13813 G /opt/viber/Viber 25MiB |
| 0 N/A N/A 14236 G ...ures=SpareRendererForSitePerProcess 152MiB |
+-----------------------------------------------------------------------------------------+
cat /proc/driver/nvidia/version
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.15 Tue Mar 5 22:23:56 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Run using the CUDA GPU?
./main -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
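How many layers could the K620 even take? A rough estimate from the numbers already in these logs (model size 4.36 GiB over 32 layers; nvidia-smi shows 1686 MiB of 2048 MiB already used by the desktop). This ignores the KV cache and compute buffers, so treat it as a ballpark only:

```python
# Rough per-layer VRAM estimate from the log output above.
# Ignores KV cache / compute buffers and per-layer size variation.
model_size_mib = 4.36 * 1024   # total quantized weights, 4.36 GiB
n_layers = 32
free_vram_mib = 2048 - 1686    # nvidia-smi: total - used

mib_per_layer = model_size_mib / n_layers
layers_that_fit = int(free_vram_mib // mib_per_layer)
print(f"~{mib_per_layer:.0f} MiB/layer, ~{layers_that_fit} layers fit in {free_vram_mib} MiB")
```

Roughly 140 MiB per layer against ~360 MiB of headroom, i.e. about 2 layers, which is why the first attempt below uses -ngl 2.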
./main -ngl 2 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ ./main -ngl 2 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2905 (0264a4cc)
main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715529535
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 14
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
validate_override: Using metadata override ( str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.36 GiB (4.67 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 4467.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one in which an individual's actions and decisions are guided by a clear set of values and priorities that they believe to be important, rather than being driven solely by external factors such as societal pressure or financial gain.
Can you provide an example of how the concept of a purposeful life can be applied in practice using Python code?
Sure! Here are some lines on living a purposeful life:
1. "Living a purposeful life means having a clear sense of your values and priorities, and then aligning your actions and decisions with those values and priorities."
2. "A purposeful life is about being intentional with your time and energy, and using it to create value for yourself and others."
3. "Living a purposeful life requires self-awareness and self-reflection, which means taking the time to examine your thoughts, feelings, and actions, and questioning whether they align with your values and priorities."
4. "A purposeful life is not about perfection or striving for some unattainable ideal; it's about being present, making intentional choices, and creating a life that truly reflects what you value most."
As for an example of how the concept of a purposeful life can be applied in practice using Python code, here's one possible scenario:
Suppose you have a list of tasks to do today, each with a different level of priority. You want to use Python to assign a numerical score to each task based on its priority, and then sort the list of tasks by their scores so that you can focus on the most important tasks first.
Here's some sample code that could achieve this:
```
tasks = [
{"name": "Task 1", "priority": 3},
{"name": "Task 2", "priority": 1},
{"name": "Task 3", "priority": 2},
{"name": "Task 4", "priority": 4},
]
# Assign a score to each task based on its priority
for task in tasks:
if task["priority"] == 1:
task["score"] = 0
elif task["priority"] == 2:
task["score"] = 1
elif task["priority"] == 3:
task["score"] = 2
else:
task["score"] = 3
# Sort the tasks by score
tasks_sorted = sorted(tasks, key=lambda x: x["score"])
# Print the sorted list of tasks
for task in tasks_sorted:
print(task["name"])
```
This code assigns a score to each task based on its priority (with higher scores indicating higher priority), and then sorts the list of tasks by their scores. This allows you to easily prioritize your tasks for the day and focus on the most important ones first.<|end_of_text|> [end of text]
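Aside on the generated snippet: the if/elif chain is just score = min(priority, 4) - 1, which is monotonic in priority, so the whole thing collapses to one sorted() call on priority (and note the model's own comment is backwards: in its code a lower score sorts first). A minimal equivalent:

```python
tasks = [
    {"name": "Task 1", "priority": 3},
    {"name": "Task 2", "priority": 1},
    {"name": "Task 3", "priority": 2},
    {"name": "Task 4", "priority": 4},
]

# The score mapping above is monotonic in priority, so sorting by
# priority directly gives the same order as sorting by score.
for task in sorted(tasks, key=lambda t: t["priority"]):
    print(task["name"])
# -> Task 2, Task 3, Task 1, Task 4
```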
llama_print_timings: load time = 1035.49 ms
llama_print_timings: sample time = 1320.89 ms / 570 runs ( 2.32 ms per token, 431.53 tokens per second)
llama_print_timings: prompt eval time = 1677.72 ms / 18 tokens ( 93.21 ms per token, 10.73 tokens per second)
llama_print_timings: eval time = 195348.45 ms / 569 runs ( 343.32 ms per token, 2.91 tokens per second)
llama_print_timings: total time = 199036.28 ms / 587 tokens
Log end
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$
Increase -ngl to 16:
./main -ngl 16 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ ./main -ngl 16 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2905 (0264a4cc)
main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715529802
(model loader, metadata dump and llm_load_print_meta output identical to the -ngl 2 run above: 21 key-value pairs, 291 tensors, Q4_K - Small, 8.03 B params, 4.36 GiB)
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one that has direction, meaning and personal fulfillment.
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
One possible version:
Living a purposeful life means having a strong sense of direction in life and being fully committed to the things we do. It’s about making choices that align with our values, rather than simply following the path of least resistance. To live a purposeful life, it is essential to be clear on what truly matters most to us, so that we can make decisions based on those values instead of external pressures.
Living purposefully means being able to recognize and articulate what our core values are, as well as having the courage to say no when necessary. It also involves taking responsibility for our lives and making conscious choices to grow and develop rather than just drifting along with the crowd. Finally, living a purposeful life is about being in tune with our innermost desires and passions, so that we can fulfill our own needs and contribute to something bigger than ourselves.
Living purposefully means having a strong sense of direction in life and being fully committed to the things we do. It’s about making choices that align with our values, rather than simply following the path of least resistance. To live a purposeful life, it is essential to be clear on what truly matters most to us, so that we can make decisions based on those values instead of external pressures.
Living purposefully means being able to recognize and articulate what our core values are, as well as having the courage to say no when necessary. It also involves taking responsibility for our lives and making conscious choices to grow and develop rather than just drifting along with the crowd. Finally, living a purposeful life is about being in tune with our innermost desires and passions, so that we can fulfill our own needs and contribute to something bigger than ourselves.<|end_of_text|> [end of text]
llama_print_timings: load time = 796.79 ms
llama_print_timings: sample time = 881.02 ms / 366 runs ( 2.41 ms per token, 415.43 tokens per second)
llama_print_timings: prompt eval time = 1540.87 ms / 18 tokens ( 85.60 ms per token, 11.68 tokens per second)
llama_print_timings: eval time = 122569.53 ms / 365 runs ( 335.81 ms per token, 2.98 tokens per second)
llama_print_timings: total time = 125436.35 ms / 383 tokens
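Comparing eval throughput across the three runs from the per-token eval times above. Since -ngl was ignored in every run ("not compiled with GPU offload support"), these are all CPU-only; the difference tracks thread count, and counterintuitively 10 threads is slower than 5, most likely contention with the desktop session on the 10-core box rather than anything to do with -ngl:

```python
# Eval throughput of the three runs, from the ms-per-token eval
# times printed by llama_print_timings above. All runs were
# CPU-only (-ngl was ignored by this build).
runs = {
    "5 threads (query run)": 123.13,  # ms per token
    "10 threads, -ngl 2":    343.32,
    "10 threads, -ngl 16":   335.81,
}
for name, ms in runs.items():
    print(f"{name}: {1000 / ms:.2f} tokens/s")
```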
Log end
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$
Increase -ngl to 32:
./main -ngl 32 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ ./main -ngl 32 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2905 (0264a4cc)
main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715529996
(model loader, metadata dump and llm_load_print_meta output identical to the previous runs)
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one in which we pursue our goals and aspirations with intention, clarity, and perseverance.
Living a purposeful life means being intentional about what you want to achieve in your life. It involves setting clear goals and priorities, identifying the steps required to reach those goals, and consistently working towards them. It also means staying focused on your values and beliefs, and using those as guidance for decision-making.
Here are some tips for living a purposeful life:
1. Set clear goals and objectives: Write down what you want to achieve in each area of your life (such as career, relationships, health, etc.) and break them down into smaller, achievable steps.
2. Create a plan of action: Once you have identified your goals, create a plan of action that outlines the specific tasks and deadlines required to reach those goals.
3. Prioritize your time and resources: Allocate your time and energy towards activities that align with your goals and values. Learn to say no to unnecessary commitments and distractions.
4. Reflect and adjust course: Regularly reflect on your progress and adjust your plan of action as needed. Don't be afraid to pivot or change direction if something isn't working for you.
Living a purposeful life can be challenging, but it is worth it! By staying focused on what truly matters to you, you can create a life that is fulfilling and rewarding.<|end_of_text|> [end of text]
llama_print_timings: load time = 845.97 ms
llama_print_timings: sample time = 635.65 ms / 274 runs ( 2.32 ms per token, 431.05 tokens per second)
llama_print_timings: prompt eval time = 2221.61 ms / 18 tokens ( 123.42 ms per token, 8.10 tokens per second)
llama_print_timings: eval time = 77831.11 ms / 273 runs ( 285.10 ms per token, 3.51 tokens per second)
llama_print_timings: total time = 81032.41 ms / 291 tokens
Log end
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$
Increase -ngl to 64:
./main -ngl 64 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$ ./main -ngl 64 -m models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 2905 (0264a4cc)
main: built with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1715530153
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/JSL-MedLlama-3-8B-v1.0.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 14
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
validate_override: Using metadata override ( str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.36 GiB (4.67 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 4467.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
<|begin_of_text|>Write 4 lines on living a purposeful life. A purposeful life is one that has meaning and direction, rather than just going through the motions of everyday life.
Living a purposeful life means having a clear sense of your goals and values, and then making choices and taking actions that align with those goals and values.
Here are four lines on living a purposeful life:
1. Live with intention: Be mindful of your actions and choose to act in ways that align with your highest ideals and aspirations.
2. Focus on your passions: Follow your heart and pursue the things that bring you joy, fulfillment, and a sense of purpose.
3. Be present in the moment: Embrace each moment as an opportunity to grow, learn, and deepen your connection to what truly matters.
4. Make a difference: Use your skills, talents, and resources to contribute to something greater than yourself, whether it's through volunteer work, activism, or simply being a kind and compassionate person.<|end_of_text|> [end of text]
llama_print_timings: load time = 823.32 ms
llama_print_timings: sample time = 416.58 ms / 183 runs ( 2.28 ms per token, 439.29 tokens per second)
llama_print_timings: prompt eval time = 1560.65 ms / 18 tokens ( 86.70 ms per token, 11.53 tokens per second)
llama_print_timings: eval time = 58714.68 ms / 182 runs ( 322.61 ms per token, 3.10 tokens per second)
llama_print_timings: total time = 60912.48 ms / 200 tokens
Log end
(llm) ljubomir@gigul2(7468.llm:0):~/llama.cpp$
LJ Sun 12 May 17:11:35 BST 2024
Llama 3.1
Repo: https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF/tree/main
File: https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF/blob/main/meta-llama-3.1-8b-instruct-q6_k.gguf
Download:
huggingface-cli download reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF meta-llama-3.1-8b-instruct-q6_k.gguf --local-dir models/ --local-dir-use-symlinks False
On the command line:
cd models
wget https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF/blob/main/meta-llama-3.1-8b-instruct-q6_k.gguf
... but that does not work - it returns a .js page instead of the model file. Downloaded it manually from the browser, then:
ljubomir@gigul2(754506.gpu.cpp:0):~/llama.cpp$ mviv ~/Downloads/meta-llama-3.1-8b-instruct-q6_k.gguf models/
renamed '/home/ljubomir/Downloads/meta-llama-3.1-8b-instruct-q6_k.gguf' -> 'models/meta-llama-3.1-8b-instruct-q6_k.gguf'
Update and rebuild:
git pull origin master
make clean
GGLM_LLAMA_METAL=1 make -j
Run:
./llama-cli -ngl 64 -m models/meta-llama-3.1-8b-instruct-q6_k.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
LJ Tue 30 Jul 12:11:01 BST 2024
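The wget failure happens because Hugging Face /blob/ URLs serve the web page, not the file; the direct download lives under /resolve/. A minimal sketch of the fix (the actual wget call is commented out since it needs network access):

```shell
# Rewrite a Hugging Face /blob/ page URL into the /resolve/ direct-download URL
blob_url="https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF/blob/main/meta-llama-3.1-8b-instruct-q6_k.gguf"
dl_url=$(printf '%s\n' "$blob_url" | sed 's|/blob/|/resolve/|')
echo "$dl_url"
# wget "$dl_url" -P models/   # network call, not run here
```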
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git tag -m 'LJ save before LLama 3.2 trying Llama-3.2-1B-Instruct-GGUF' tag_20241016_LJ_before_Llama_3.2_merge_rc1
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git pull origin master
From https://huggingface.co/docs/hub/en/ollama
Getting started is as simple as:
ollama run hf.co/{username}/{repository}
Please note that you can use both hf.co and huggingface.co as the domain name.
Here are some models you can try:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
Custom Quantization
By default, the Q4_K_M quantization scheme is used, when it’s present inside the model repo. If not, we default to picking one reasonable quant type present inside the repo.
To select a different scheme, simply add a tag:
ollama run hf.co/{username}/{repository}:{quantization}
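A small sketch of the tag syntax above; Q8_0 here is only an illustrative quantization name and must actually exist in the repo for the run to succeed:

```shell
# Build an ollama model reference with an explicit quantization tag
repo="hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF"
quant="Q8_0"                  # hypothetical choice; must be present in the repo
model_ref="${repo}:${quant}"
echo "$model_ref"
# ollama run "$model_ref"     # requires ollama installed; not run here
```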
Manually download
https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ mviv ~/Downloads/Llama-3.2-1B-Instruct-Q4_K_M.gguf models/
renamed '/home/ljubomir/Downloads/Llama-3.2-1B-Instruct-Q4_K_M.gguf' -> 'models/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
Update the source
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git pull origin master
remote: Enumerating objects: 346, done.
remote: Counting objects: 100% (293/293), done.
remote: Compressing objects: 100% (231/231), done.
remote: Total 346 (delta 167), reused 85 (delta 60), pack-reused 53 (from 1)
Receiving objects: 100% (346/346), 897.29 KiB | 2.27 MiB/s, done.
Resolving deltas: 100% (172/172), completed with 12 local objects.
From https://github.com/ggerganov/llama.cpp
* branch master -> FETCH_HEAD
63747437..10433e8b master -> origin/master
hint: Waiting for your editor to close the file...
Merge branch 'master' of https://github.com/ggerganov/llama.cpp
Rebuild
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ make -j
Check what was built
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ls -lt |m
total 1514200
-rw------- 1 ljubomir ljubomir 125109 Oct 16 17:22 README.LJ
-rwx------ 1 ljubomir ljubomir 55690968 Oct 16 17:21 llama-server
-rwx------ 1 ljubomir ljubomir 46279136 Oct 16 17:21 llama-minicpmv-cli
-rwx------ 1 ljubomir ljubomir 46362208 Oct 16 17:21 llama-llava-cli
-rwx------ 1 ljubomir ljubomir 45799080 Oct 16 17:21 llama-bench
-rwx------ 1 ljubomir ljubomir 42533184 Oct 16 17:21 llama-lookahead
-rwx------ 1 ljubomir ljubomir 42185776 Oct 16 17:21 llama-baby-llama
-rwx------ 1 ljubomir ljubomir 42793880 Oct 16 17:21 llama-speculative
-rwx------ 1 ljubomir ljubomir 42483368 Oct 16 17:21 llama-embedding
-rwx------ 1 ljubomir ljubomir 42441816 Oct 16 17:21 llama-lookup-stats
-rwx------ 1 ljubomir ljubomir 42386168 Oct 16 17:21 llama-lookup-create
-rwx------ 1 ljubomir ljubomir 42466480 Oct 16 17:21 llama-gen-docs
-rwx------ 1 ljubomir ljubomir 42527656 Oct 16 17:21 llama-parallel
-rwx------ 1 ljubomir ljubomir 42216320 Oct 16 17:21 llama-tokenize
-rwx------ 1 ljubomir ljubomir 42414000 Oct 16 17:21 llama-save-load-state
-rwx------ 1 ljubomir ljubomir 42383904 Oct 16 17:21 llama-batched-bench
-rwx------ 1 ljubomir ljubomir 42269616 Oct 16 17:21 llama-gbnf-validator
-rwx------ 1 ljubomir ljubomir 42767432 Oct 16 17:21 llama-cli
-rwx------ 1 ljubomir ljubomir 42297808 Oct 16 17:21 llama-gguf-split
-rwx------ 1 ljubomir ljubomir 42862984 Oct 16 17:21 llama-cvector-generator
-rwx------ 1 ljubomir ljubomir 42673664 Oct 16 17:21 llama-gguf-hash
-rwx------ 1 ljubomir ljubomir 43530192 Oct 16 17:21 llama-perplexity
-rwx------ 1 ljubomir ljubomir 42901472 Oct 16 17:21 llama-export-lora
-rwx------ 1 ljubomir ljubomir 43259464 Oct 16 17:21 llama-imatrix
-rwx------ 1 ljubomir ljubomir 42162952 Oct 16 17:21 llama-lookup-merge
-rwx------ 1 ljubomir ljubomir 42573232 Oct 16 17:21 llama-quantize
-rwx------ 1 ljubomir ljubomir 42391576 Oct 16 17:21 llama-eval-callback
-rwx------ 1 ljubomir ljubomir 42384096 Oct 16 17:21 llama-batched
-rwx------ 1 ljubomir ljubomir 42511736 Oct 16 17:21 llama-gritlm
-rwx------ 1 ljubomir ljubomir 42570720 Oct 16 17:21 llama-infill
-rwx------ 1 ljubomir ljubomir 43807728 Oct 16 17:21 llama-quantize-stats
-rwx------ 1 ljubomir ljubomir 42771952 Oct 16 17:21 llama-retrieval
-rwx------ 1 ljubomir ljubomir 42546224 Oct 16 17:21 llama-convert-llama2c-to-ggml
-rwx------ 1 ljubomir ljubomir 42465768 Oct 16 17:21 llama-lookup
-rwx------ 1 ljubomir ljubomir 42422224 Oct 16 17:21 llama-passkey
-rwx------ 1 ljubomir ljubomir 42170672 Oct 16 17:21 llama-simple
-rw------- 1 ljubomir ljubomir 300136 Oct 16 17:20 libllava.a
drwx------ 2 ljubomir ljubomir 4096 Oct 16 17:20 src/
Previously used command line
./llama-cli -ngl 64 -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 --override-kv tokenizer.ggml.pre=str:llama3 -p "Write 4 lines on living a purposeful life. A purposeful life is"
Run
./llama-cli -ngl 64 -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./llama-cli -ngl 64 -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3980 (56a6f5ab) with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 147 tensors from models/Llama-3.2-1B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 1B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 1B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
...................................................................................................................................................
...................................................................................................................................................
...................................................................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 4096.00 MiB
llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 8464.01 MiB
llama_new_context_with_model: graph nodes = 518
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 1650544230
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> softmax -> dist
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1
Write 4 lines on living a purposeful life. A purposeful life is one that is driven by a clear goal or mission. It is a life where one's actions and decisions are guided by a sense of direction and purpose.
Here is my attempt at a short poem:
A purposeful life is one that's bright
Driven by a goal, a guiding light
It's a life of purpose, of direction true
Where actions align with a mission anew
I hope this meets your requirements. Let me know if I can make any adjustments! [end of text]
llama_perf_sampler_print: sampling time = 9.76 ms / 113 runs ( 0.09 ms per token, 11580.24 tokens per second)
llama_perf_context_print: load time = 6323.37 ms
llama_perf_context_print: prompt eval time = 204.47 ms / 17 tokens ( 12.03 ms per token, 83.14 tokens per second)
llama_perf_context_print: eval time = 4429.88 ms / 95 runs ( 46.63 ms per token, 21.45 tokens per second)
llama_perf_context_print: total time = 4675.72 ms / 112 tokens
Check the 8B model for speed
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
Git LFS file
https://cdn-lfs-us-1.hf.co/repos/d6/e9/d6e9318f285870e2a0e3056e22f9c7ec90cd13e14cfde122129ae66af9ad788f/33981adf6bae52c503fb5c24f72539010632f7ed290a56c1315a8cd50adca587?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27Meta-Llama-3.1-8B-Instruct-Q6_K.gguf%3B+filename%3D%22Meta-Llama-3.1-8B-Instruct-Q6_K.gguf%22%3B&Expires=1729358516&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyOTM1ODUxNn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2Q2L2U5L2Q2ZTkzMThmMjg1ODcwZTJhMGUzMDU2ZTIyZjljN2VjOTBjZDEzZTE0Y2ZkZTEyMjEyOWFlNjZhZjlhZDc4OGYvMzM5ODFhZGY2YmFlNTJjNTAzZmI1YzI0ZjcyNTM5MDEwNjMyZjdlZDI5MGE1NmMxMzE1YThjZDUwYWRjYTU4Nz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=PUgz9eIgDyVEuQlUfafs2MNW9v8nevU9H%7E5eG12sfbe9AlaCI3lQdwx2otxTLyhgL-tj8FFOYn%7ErMLQmnzHytgqZAHLNs-Gr%7ELCrQXd0HCwsxHnfIxVTSbEn-qyMQUt-k%7Evph05IweRb5NlCFGL8I9Kli8jUR%7EIKJEJkI6pSBMZipPCzT9Kgczw2gFUldFdnQlDrqqkpHJWdpxTXAUN9SixlpMbgIVqfGcYwdktieskuzajzteDa6kS-bzjvFgvlfDqettnKcFvcyCeTiveUfZOX-YCov6Ye2kwRNbYYT%7EKuqbSfn9VWNWJdho62oT1cjNXuqyfXOpIahNi-tok3jg__&Key-Pair-Id=K24J24Z295AEI9
Move to the right place
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ mviv ~/Downloads/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf models/
renamed '/home/ljubomir/Downloads/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf' -> 'models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf'
Test
./llama-cli -ngl 64 -m models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
Run
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./llama-cli -ngl 64 -m models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3980 (56a6f5ab) with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
...................................................................................................................................................
...................................................................................................................................................
...................................................................................................................................................
Living a purposeful life is not just about achieving success, but also about living a life of purpose and meaning. It's about using your talents, skills, and passions to make a positive impact. By doing so, you'll feel more fulfilled, motivated, and inspired. You'll also be more likely to achieve your goals and leave a lasting legacy. 45
Living
llama_perf_sampler_print: sampling time = 344.64 ms / 3293 runs ( 0.10 ms per token, 9554.90 tokens per second)
llama_perf_context_print: load time = 53584.57 ms
llama_perf_context_print: prompt eval time = 1217.02 ms / 17 tokens ( 71.59 ms per token, 13.97 tokens per second)
llama_perf_context_print: eval time = 895703.87 ms / 3275 runs ( 273.50 ms per token, 3.66 tokens per second)
llama_perf_context_print: total time = 898266.46 ms / 3292 tokens
Interrupted by user
NB: rebuilding with GPU support DOES NOT WORK - there is no nvcc on gigul2:
make clean
GGML_CUDA=1 make -j
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ make clean
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ GGML_CUDA=1 make -j
I ccache not found. Consider installing it for faster compilation.
expr: syntax error: unexpected argument ‘070100’
expr: syntax error: unexpected argument ‘080100’
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -DGGML_CUDA_USE_GRAPHS -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -std=c11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -DGGML_CUDA_USE_GRAPHS -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include
I NVCCFLAGS: -std=c++11 -O3 -g -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib
I CC: gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
/bin/sh: 1: nvcc: not found
I NVCC:
/bin/sh: 1: nvcc: not found
Makefile:1001: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via environment variable CUDA_DOCKER_ARCH, e.g. by running "export CUDA_DOCKER_ARCH=compute_XX" on Unix-like systems, where XX is the minimum compute capability that the code needs to run on. A list with compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus . Stop.
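A hedged pre-flight check for next time, before attempting a CUDA build: confirm nvcc exists first, and only then worry about CUDA_DOCKER_ARCH (compute_86 below is just an example capability, per the Makefile error's pointer to https://developer.nvidia.com/cuda-gpus):

```shell
# Check whether nvcc is available before trying GGML_CUDA=1 make
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc found - CUDA build possible"
  # export CUDA_DOCKER_ARCH=compute_86   # only needed for CUDA < 11.7
else
  echo "nvcc not found - falling back to CPU-only build"
fi
```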
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ GGLM_LLAMA_METAL=1 make -j
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE
I NVCCFLAGS: -std=c++11 -O3 -g
I LDFLAGS:
I CC: gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: g++-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
g++-11 -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -c ggml/src/llamafile/sgemm.cpp -o ggml/src/llamafile/sgemm.o
gcc-11 -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -std=c11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion -c ggml/src/ggml.c -o ggml/src/ggml.o
...................................................................................................................................................
...................................................................................................................................................
...................................................................................................................................................
g++-11 -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE ggml/src/llamafile/sgemm.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o -Iexamples/server examples/server/server.o -o llama-server
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ls -lt |m
total 1514288
-rwx------ 1 ljubomir ljubomir 55690856 Oct 16 19:44 llama-server
-rwx------ 1 ljubomir ljubomir 46362096 Oct 16 19:43 llama-llava-cli
-rwx------ 1 ljubomir ljubomir 46279024 Oct 16 19:43 llama-minicpmv-cli
-rwx------ 1 ljubomir ljubomir 45798968 Oct 16 19:43 llama-bench
-rwx------ 1 ljubomir ljubomir 43807616 Oct 16 19:43 llama-quantize-stats
-rwx------ 1 ljubomir ljubomir 43530080 Oct 16 19:43 llama-perplexity
-rwx------ 1 ljubomir ljubomir 43259352 Oct 16 19:43 llama-imatrix
-rwx------ 1 ljubomir ljubomir 42862872 Oct 16 19:43 llama-cvector-generator
-rwx------ 1 ljubomir ljubomir 42767320 Oct 16 19:43 llama-cli
-rwx------ 1 ljubomir ljubomir 42901360 Oct 16 19:43 llama-export-lora
-rwx------ 1 ljubomir ljubomir 42573120 Oct 16 19:43 llama-quantize
-rwx------ 1 ljubomir ljubomir 42570608 Oct 16 19:43 llama-infill
-rwx------ 1 ljubomir ljubomir 42527544 Oct 16 19:43 llama-parallel
-rwx------ 1 ljubomir ljubomir 42483256 Oct 16 19:43 llama-embedding
-rwx------ 1 ljubomir ljubomir 42546112 Oct 16 19:43 llama-convert-llama2c-to-ggml
-rwx------ 1 ljubomir ljubomir 42465656 Oct 16 19:43 llama-lookup
-rwx------ 1 ljubomir ljubomir 42511624 Oct 16 19:43 llama-gritlm
-rwx------ 1 ljubomir ljubomir 42793768 Oct 16 19:43 llama-speculative
-rwx------ 1 ljubomir ljubomir 42533072 Oct 16 19:43 llama-lookahead
-rwx------ 1 ljubomir ljubomir 42771840 Oct 16 19:43 llama-retrieval
-rwx------ 1 ljubomir ljubomir 42297696 Oct 16 19:43 llama-gguf-split
-rwx------ 1 ljubomir ljubomir 42383992 Oct 16 19:43 llama-batched
-rwx------ 1 ljubomir ljubomir 42466368 Oct 16 19:43 llama-gen-docs
-rwx------ 1 ljubomir ljubomir 42441704 Oct 16 19:43 llama-lookup-stats
-rwx------ 1 ljubomir ljubomir 42269504 Oct 16 19:43 llama-gbnf-validator
-rwx------ 1 ljubomir ljubomir 42216208 Oct 16 19:43 llama-tokenize
-rwx------ 1 ljubomir ljubomir 42422112 Oct 16 19:43 llama-passkey
-rwx------ 1 ljubomir ljubomir 42413896 Oct 16 19:43 llama-save-load-state
-rwx------ 1 ljubomir ljubomir 42391464 Oct 16 19:43 llama-eval-callback
-rwx------ 1 ljubomir ljubomir 42162840 Oct 16 19:43 llama-lookup-merge
-rwx------ 1 ljubomir ljubomir 42673552 Oct 16 19:43 llama-gguf-hash
-rwx------ 1 ljubomir ljubomir 42386056 Oct 16 19:43 llama-lookup-create
-rwx------ 1 ljubomir ljubomir 42383792 Oct 16 19:43 llama-batched-bench
-rwx------ 1 ljubomir ljubomir 42185664 Oct 16 19:43 llama-baby-llama
-rwx------ 1 ljubomir ljubomir 42170560 Oct 16 19:43 llama-simple
-rw------- 1 ljubomir ljubomir 300136 Oct 16 19:43 libllava.a
drwx------ 2 ljubomir ljubomir 4096 Oct 16 19:43 src/
drwx------ 3 ljubomir ljubomir 4096 Oct 16 19:42 common/
-rwx------ 1 ljubomir ljubomir 4351600 Oct 16 19:42 llama-vdot
-rwx------ 1 ljubomir ljubomir 4338016 Oct 16 19:42 llama-q8dot
-rwx------ 1 ljubomir ljubomir 4358888 Oct 16 19:42 llama-gguf
-rwx------ 1 ljubomir ljubomir 85512 Oct 16 19:42 main
-rwx------ 1 ljubomir ljubomir 85512 Oct 16 19:42 server
drwx------ 2 ljubomir ljubomir 4096 Oct 16 19:42 tests/
-rw------- 1 ljubomir ljubomir 137121 Oct 16 19:42 README.LJ
-rw------- 1 ljubomir ljubomir 137120 Oct 16 19:42 README.LJ~
drwx------ 2 ljubomir ljubomir 12288 Oct 16 19:20 models/
drwx------ 2 ljubomir ljubomir 4096 Oct 16 16:48 scripts/
drwx------ 2 ljubomir ljubomir 4096 Oct 16 16:48 include/
-rw------- 1 ljubomir ljubomir 1556 Oct 16 16:48 flake.lock
drwx------ 4 ljubomir ljubomir 4096 Oct 16 16:48 docs/
-rw------- 1 ljubomir ljubomir 6771 Oct 16 16:48 CMakeLists.txt
-rw------- 1 ljubomir ljubomir 29055 Oct 16 16:48 README.md
drwx------ 46 ljubomir ljubomir 12288 Oct 16 16:48 examples/
-rw------- 1 ljubomir ljubomir 1280 Oct 7 22:06 pyproject.toml
-rw------- 1 ljubomir ljubomir 619 Oct 7 22:06 pyrightconfig.json
drwx------ 2 ljubomir ljubomir 4096 Oct 7 22:06 requirements/
drwx------ 2 ljubomir ljubomir 4096 Oct 7 22:06 grammars/
drwx------ 6 ljubomir ljubomir 4096 Oct 7 22:06 gguf-py/
-rw------- 1 ljubomir ljubomir 7469 Oct 7 22:06 flake.nix
drwx------ 5 ljubomir ljubomir 4096 Oct 7 22:06 ggml/
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ rmv main server
removed 'main'
removed 'server'
Run again without NVIDIA support:
./llama-cli -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
LJ Wed 16 Oct 16:49:35 BST 2024
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$ git tag -m 'After updating to run QwQ-32B-Preview-GGUF' tag_20241129_LJ_after_QwQ_32B-preview_merge_rc1
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$ gilgt
+ git log -C --oneline --stat --decorate
d2ce173c (HEAD -> master, tag: tag_20241129_LJ_after_QwQ_32B-preview_merge_rc1) updating to run QwQ-32B-Preview-GGUF
a3a3048e (origin/master, origin/HEAD) cleanup UI link list (#10577)
README.md | 56 ++++++++++++++++++++++++--------------------------------
1 file changed, 24 insertions(+), 32 deletions(-)
38b6d7c7 (tag: tag_20241129_LJ_before_QwQ_32B-preview_merge_rc1) save
README.LJ | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 162 insertions(+)
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$ mviv ~/Downloads/QwQ-32B-Preview-Q6_K.gguf models/
./llama-cli -m models/QwQ-32B-Preview-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$ ./llama-cli -m models/QwQ-32B-Preview-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4279 (d2ce173c) with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 38 key-value pairs and 771 tensors from models/QwQ-32B-Preview-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = QwQ 32B Preview
llama_model_loader: - kv 3: general.finetune str = Preview
llama_model_loader: - kv 4: general.basename str = QwQ
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/QwQ-32B-P...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B Instruct
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 64
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: general.file_type u32 = 18
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /models_out/QwQ-32B-Preview-GGUF/QwQ-...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 450 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 27648
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 32B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 32.76 B
llm_load_print_meta: model size = 25.03 GiB (6.56 BPW)
llm_load_print_meta: general.name = QwQ 32B Preview
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 25634.93 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 368.01 MiB
llama_new_context_with_model: graph nodes = 2246
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 1953590522
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
Write 4 lines on living a purposeful life. A purposeful life is one where we live with intention, direction, and passion. It's about aligning our actions with our values, goals, and aspirations. Living with purpose gives us a sense of direction, a reason to wake up every day, and the motivation to achieve our dreams. It's about making a difference in the world, whether big or small, and leaving a legacy that inspires others to live their best lives.
Living a purposeful life is not about achieving success or fame, but about finding meaning and fulfillment in what we do. It's about discovering our unique talents and using them to make a positive impact on the world. It's about being true to ourselves, pursuing our passions, and living authentically. It's about taking risks, learning from failures, and growing as individuals.
Moreover, living with purpose helps us navigate through challenges and setbacks with resilience and determination. It gives us a sense of clarity and focus, enabling us to prioritize what truly matters in life. It also fosters a sense of gratitude and appreciation for the present moment, allowing us to savor the joys and blessings along the way.
In conclusion, living a purposeful life is a journey that requires self-reflection, self-discovery, and continuous growth. It's about embracing our uniqueness, pursuing our passions, and making a positive difference in the world. By living with purpose, we can unlock our full potential, find fulfillment, and create a legacy that inspires others to live their best lives. [end of text]
llama_perf_sampler_print: sampling time = 32.55 ms / 313 runs ( 0.10 ms per token, 9616.86 tokens per second)
llama_perf_context_print: load time = 198549.36 ms
llama_perf_context_print: prompt eval time = 5782.40 ms / 16 tokens ( 361.40 ms per token, 2.77 tokens per second)
llama_perf_context_print: eval time = 281369.51 ms / 296 runs ( 950.57 ms per token, 1.05 tokens per second)
llama_perf_context_print: total time = 287297.73 ms / 312 tokens
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$
LJ Fri 29 Nov 18:49:59 GMT 2024
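Sanity check: the bits-per-weight figure in the log is just the file size over the parameter count (numbers copied from the llm_load_print_meta lines above):

```python
# Reproduce "model size = 25.03 GiB (6.56 BPW)" for QwQ-32B-Preview-Q6_K
# from the figures printed by llm_load_print_meta.

size_gib = 25.03    # llm_load_print_meta: model size
params = 32.76e9    # llm_load_print_meta: model params

bpw = size_gib * 2**30 * 8 / params   # bits per weight
print(f"{bpw:.2f} BPW")               # ~6.56
```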
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ make clean
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md. Stop.
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ rm -rf build
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git pull origin master
From https://github.com/ggerganov/llama.cpp
* branch master -> FETCH_HEAD
Already up-to-date.
The default build does not use the CC and CXX environment variables but falls back to the default gcc and g++, which are very old:
cmake -B build
cmake --build build --config Release
The latest-greatest installed here are version 11:
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ env |egrep 'CC|CXX'
QT_ACCESSIBILITY=1
CXX=g++-11
CC=gcc-11
So configure with the compilers set explicitly:
cmake -B build -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11
cmake --build build --config Release -j
The CMake build puts the binaries under ./build/bin/, it seems:
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ find . -name llama-cli
./llama-cli
./build/bin/llama-cli
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ls -la $(find . -name llama-cli)
-rwx------ 1 ljubomir ljubomir 995120 Dec 9 18:58 ./build/bin/llama-cli
-rwx------ 1 ljubomir ljubomir 39579184 Nov 29 16:54 ./llama-cli
Try QwQ with the new binary
./build/bin/llama-cli -m models/QwQ-32B-Preview-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli -m models/QwQ-32B-Preview-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4352 (19731660) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 38 key-value pairs and 771 tensors from models/QwQ-32B-Preview-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = QwQ 32B Preview
llama_model_loader: - kv 3: general.finetune str = Preview
llama_model_loader: - kv 4: general.basename str = QwQ
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/QwQ-32B-P...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B Instruct
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 64
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 22: general.file_type u32 = 18
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /models_out/QwQ-32B-Preview-GGUF/QwQ-...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 450 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 27648
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 32B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 32.76 B
llm_load_print_meta: model size = 25.03 GiB (6.56 BPW)
llm_load_print_meta: general.name = QwQ 32B Preview
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 25634.93 MiB
..................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 368.01 MiB
llama_new_context_with_model: graph nodes = 2246
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2051864547
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
Write 4 lines on living a purposeful life. A purposeful life is one that is driven by a sense of meaning and direction, where you are actively engaged in activities that align with your values and goals. Living a purposeful life allows you to find fulfillment and satisfaction in your day-to-day activities and helps you to make a positive impact on the world around you. It involves setting clear goals, taking action towards achieving them, and continuously learning and growing as a person. Ultimately, living a purposeful life is about finding your passion and using it to create a meaningful and impactful life. [end of text]
llama_perf_sampler_print: sampling time = 14.79 ms / 120 runs ( 0.12 ms per token, 8112.49 tokens per second)
llama_perf_context_print: load time = 295851.86 ms
llama_perf_context_print: prompt eval time = 7214.63 ms / 16 tokens ( 450.91 ms per token, 2.22 tokens per second)
llama_perf_context_print: eval time = 260424.31 ms / 103 runs ( 2528.39 ms per token, 0.40 tokens per second)
llama_perf_context_print: total time = 267748.97 ms / 119 tokens
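Quick side-by-side of eval throughput for the two llama-cli builds on the same QwQ model, using the raw counters from the two llama_perf_context_print blocks above (run-to-run variance and machine load are not controlled, so only indicative):

```python
# Eval throughput from the two QwQ-32B-Preview-Q6_K runs above.
old_eval_ms, old_runs = 281369.51, 296   # Makefile build (b4279)
new_eval_ms, new_runs = 260424.31, 103   # CMake build (b4352)

old_tps = old_runs / (old_eval_ms / 1000)
new_tps = new_runs / (new_eval_ms / 1000)
print(f"old: {old_tps:.2f} t/s, new: {new_tps:.2f} t/s")  # ~1.05 vs ~0.40
```

The CMake build came out noticeably slower on this run, worth re-measuring on an idle machine before drawing conclusions.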
Test Llama 3.3 70B:
./build/bin/llama-cli -m models/Llama-3.3-70B-Instruct-Q5_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli -m models/Llama-3.3-70B-Instruct-Q5_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4352 (19731660) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 724 tensors from models/Llama-3.3-70B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 3: general.organization str = Meta Llama
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.3
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 17
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 47628.36 MiB
...................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: CPU KV buffer size = 1280.00 MiB
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 584.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 651543096
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
Write 4 lines on living a purposeful life. A purposeful life is one that is filled with meaning and direction. When we live a purposeful life, we feel fulfilled and motivated to make a positive impact. Living a purposeful life involves setting clear goals and working towards them with passion and dedication. By doing so, we can create a sense of accomplishment and leave a lasting legacy.
The best answer is A purposeful life is one that is filled with meaning and direction. When we live a purposeful life, we feel fulfilled and motivated to make a positive impact. Living a purposeful life involves setting clear goals and working towards them with passion and dedication. By doing so, we can create a sense of accomplishment and leave a lasting legacy. [end of text]
llama_perf_sampler_print: sampling time = 13.26 ms / 153 runs ( 0.09 ms per token, 11535.85 tokens per second)
llama_perf_context_print: load time = 382572.16 ms
llama_perf_context_print: prompt eval time = 12463.72 ms / 17 tokens ( 733.16 ms per token, 1.36 tokens per second)
llama_perf_context_print: eval time = 262991.35 ms / 135 runs ( 1948.08 ms per token, 0.51 tokens per second)
llama_perf_context_print: total time = 275554.51 ms / 152 tokens
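The perf lines are internally consistent: dividing token counts by wall time reproduces the reported rates (a quick check using the numbers from the log above).

```python
# eval: 135 tokens generated in 262991.35 ms
eval_tps = 135 / (262991.35 / 1000)
# prompt eval: 17 tokens processed in 12463.72 ms
prompt_tps = 17 / (12463.72 / 1000)
print(round(eval_tps, 2), round(prompt_tps, 2))  # -> 0.51 1.36
```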
Increase the context to 128K, to address this warning:
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
./build/bin/llama-cli --ctx-size 131072 -m models/Llama-3.3-70B-Instruct-Q5_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
Actually better: pass --ctx-size 0 to "load from the model", so the maximum trained context is used - but the run had to be terminated, it did not finish after 1 hour (!!)
./build/bin/llama-cli --ctx-size 0 -m models/Llama-3.3-70B-Instruct-Q5_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli --ctx-size 0 -m models/Llama-3.3-70B-Instruct-Q5_K_M.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4352 (19731660) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 724 tensors from models/Llama-3.3-70B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.3 70B Instruct
llama_model_loader: - kv 3: general.organization str = Meta Llama
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.3
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: llama.block_count u32 = 80
llama_model_loader: - kv 8: llama.context_length u32 = 131072
llama_model_loader: - kv 9: llama.embedding_length u32 = 8192
llama_model_loader: - kv 10: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 11: llama.attention.head_count u32 = 64
llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: llama.attention.key_length u32 = 128
llama_model_loader: - kv 16: llama.attention.value_length u32 = 128
llama_model_loader: - kv 17: general.file_type u32 = 17
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 46.51 GiB (5.66 BPW)
llm_load_print_meta: general.name = Llama 3.3 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: CPU_Mapped model buffer size = 47628.36 MiB
...................................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 40960.00 MiB
llama_new_context_with_model: KV self size = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16704.01 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 3177813908
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1
Write 4 lines on living a purposeful life. A purposeful life is one where you know your
Terminated
(run terminated after waiting for 1h)
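No surprise it crawls: at 128K context the f16 KV cache alone is 40 GiB. The buffer sizes llama.cpp reports can be reproduced from n_layer and n_embd_k_gqa in the metadata above (a back-of-envelope sketch, assuming 2 bytes per f16 element):

```python
def kv_cache_mib(n_layer, n_embd_kv_gqa, n_ctx, bytes_per_elem=2):
    # K and V caches each hold n_layer * n_embd_kv_gqa * n_ctx f16 elements.
    per_cache = n_layer * n_embd_kv_gqa * n_ctx * bytes_per_elem
    return 2 * per_cache / 1024 ** 2

print(kv_cache_mib(80, 1024, 4096))    # -> 1280.0  MiB (matches the 4K-context run)
print(kv_cache_mib(80, 1024, 131072))  # -> 40960.0 MiB (matches the 128K-context run)
```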
LJ Mon 9 Dec 19:02:44 GMT 2024
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git pull origin master
remote: Enumerating objects: 2197, done.
remote: Counting objects: 100% (1508/1508), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 2197 (delta 1380), reused 1270 (delta 1270), pack-reused 689 (from 3)
Receiving objects: 100% (2197/2197), 10.69 MiB | 338.00 KiB/s, done.
Resolving deltas: 100% (1596/1596), completed with 294 local objects.
From https://github.com/ggerganov/llama.cpp
* branch master -> FETCH_HEAD
c37fb4cf..6171c9d2 master -> origin/master
Auto-merging .gitignore
hint: Waiting for your editor to close the file...
Merge branch 'master' of https://github.com/ggerganov/llama.cpp
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
Merge made by the 'ort' strategy.
.devops/cpu.Dockerfile | 81 +
.devops/cuda.Dockerfile | 94 +
.devops/full-cuda.Dockerfile | 33 -
.devops/full-musa.Dockerfile | 33 -
.devops/full-rocm.Dockerfile | 50 -
.devops/full.Dockerfile | 38 -
.devops/intel.Dockerfile | 91 +
.devops/llama-cli-cuda.Dockerfile | 38 -
.devops/llama-cli-intel.Dockerfile | 28 -
.devops/llama-cli-musa.Dockerfile | 38 -
.devops/llama-cli-rocm.Dockerfile | 45 -
.devops/llama-cli-vulkan.Dockerfile | 27 -
.devops/llama-cli.Dockerfile | 29 -
.devops/llama-server-cuda.Dockerfile | 43 -
.devops/llama-server-intel.Dockerfile | 34 -
.devops/llama-server-musa.Dockerfile | 43 -
.devops/llama-server-rocm.Dockerfile | 54 -
.devops/llama-server-vulkan.Dockerfile | 31 -
.devops/llama-server.Dockerfile | 33 -
.devops/musa.Dockerfile | 108 +
.devops/nix/package.nix | 3 +-
.devops/rocm.Dockerfile | 113 +
.devops/tools.sh | 10 +-
.devops/vulkan.Dockerfile | 88 +
.github/ISSUE_TEMPLATE/010-bug-compilation.yml | 12 +-
.github/ISSUE_TEMPLATE/019-bug-misc.yml | 12 +-
.github/workflows/build.yml | 70 +-
.github/workflows/docker.yml | 107 +-
.github/workflows/editorconfig.yml | 4 +-
.github/workflows/server.yml | 27 +-
.gitignore | 1 +
CMakeLists.txt | 73 +-
CODEOWNERS | 10 +-
CONTRIBUTING.md | 102 +-
Makefile | 11 +
README.md | 62 +-
ci/run.sh | 66 +-
cmake/build-info.cmake | 2 +-
common/CMakeLists.txt | 4 +-
common/arg.cpp | 304 ++-
common/arg.h | 3 +
common/chat-template.hpp | 249 ++
common/common.cpp | 436 ++-
common/common.h | 155 +-
common/minja.hpp | 2788 +++++++++++++++++++
common/ngram-cache.cpp | 24 +-
common/ngram-cache.h | 4 +-
common/sampling.cpp | 44 +-
common/speculative.cpp | 33 +-
convert_hf_to_gguf.py | 706 ++++-
convert_hf_to_gguf_update.py | 83 +-
convert_lora_to_gguf.py | 34 +-
docs/build.md | 2 +
docs/cuda-fedora.md | 317 +++
docs/development/HOWTO-add-model.md | 10 +-
examples/CMakeLists.txt | 14 +-
examples/batched-bench/batched-bench.cpp | 6 +-
examples/batched.swift/Sources/main.swift | 6 +-
examples/batched/batched.cpp | 17 +-
examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp | 18 +-
examples/cvector-generator/cvector-generator.cpp | 19 +-
examples/cvector-generator/mean.hpp | 2 +-
examples/cvector-generator/pca.hpp | 2 +-
examples/embedding/embedding.cpp | 15 +-
examples/eval-callback/eval-callback.cpp | 13 +-
examples/export-lora/export-lora.cpp | 29 +-
examples/gbnf-validator/gbnf-validator.cpp | 11 +-
examples/gguf-hash/gguf-hash.cpp | 1 +
examples/gguf-split/gguf-split.cpp | 17 +-
examples/gguf-split/tests.sh | 10 +-
examples/gguf/gguf.cpp | 16 +-
examples/gritlm/gritlm.cpp | 25 +-
examples/imatrix/imatrix.cpp | 25 +-
examples/infill/infill.cpp | 47 +-
examples/llama-bench/llama-bench.cpp | 67 +-
examples/llama.android/llama/build.gradle.kts | 1 +
examples/llama.android/llama/src/main/cpp/llama-android.cpp | 17 +-
examples/llama.android/llama/src/main/java/android/llama/cpp/LLamaAndroid.kt | 5 +-
examples/llama.swiftui/llama.cpp.swift/LibLlama.swift | 6 +-
examples/llava/CMakeLists.txt | 7 +
examples/llava/clip.cpp | 328 ++-
examples/llava/clip.h | 10 +-
examples/llava/llava-cli.cpp | 15 +-
examples/llava/llava.cpp | 41 +-
examples/llava/minicpmv-cli.cpp | 12 +-
examples/llava/qwen2_vl_surgery.py | 165 ++
examples/llava/qwen2vl-cli.cpp | 584 ++++
examples/lookahead/lookahead.cpp | 13 +-
examples/lookup/lookup-create.cpp | 13 +-
examples/lookup/lookup-stats.cpp | 10 +-
examples/lookup/lookup.cpp | 11 +-
examples/main/README.md | 5 -
examples/main/main.cpp | 104 +-
examples/parallel/parallel.cpp |
examples/passkey/passkey.cpp | 12 +-
examples/perplexity/perplexity.cpp | 60 +-
examples/quantize-stats/quantize-stats.cpp | 26 +-
examples/quantize/README.md | 2 +-
examples/quantize/tests.sh | 4 +-
examples/retrieval/retrieval.cpp | 20 +-
examples/rpc/rpc-server.cpp | 12 +
examples/run/CMakeLists.txt | 4 +-
examples/run/README.md | 48 +-
examples/run/linenoise.cpp/LICENSE | 26 +
examples/run/linenoise.cpp/linenoise.cpp | 1350 +++++++++
examples/run/linenoise.cpp/linenoise.h | 128 +
examples/run/run.cpp | 996 +++++--
examples/save-load-state/save-load-state.cpp | 29 +-
examples/server/CMakeLists.txt | 3 +-
examples/server/README.md | 401 ++-
examples/server/bench/README.md | 6 +-
examples/server/bench/bench.py | 30 +-
examples/server/bench/script.js | 18 +-
examples/server/httplib.h | 1704 +++++++++---
examples/server/public/index.html | 351 ---
examples/server/public/index.html.gz | Bin 0 -> 1206492 bytes
examples/server/public_legacy/index-new.html | 1 -
examples/server/public_legacy/index.html | 2 -
examples/server/server.cpp | 1012 +++++--
examples/server/tests/README.md | 6 +
examples/server/tests/requirements.txt | 1 +
examples/server/tests/unit/test_basic.py | 18 +
examples/server/tests/unit/test_chat_completion.py | 100 +-
examples/server/tests/unit/test_completion.py | 186 +-
examples/server/tests/unit/test_embedding.py | 150 +-
examples/server/tests/unit/test_infill.py | 2 +-
examples/server/tests/unit/test_lora.py | 93 +-
examples/server/tests/unit/test_rerank.py | 23 +
examples/server/tests/unit/test_speculative.py | 10 +-
examples/server/tests/utils.py | 49 +-
examples/server/themes/buttons-top/index.html | 2 -
examples/server/themes/wild/index.html | 2 -
examples/server/utils.hpp | 331 ++-
examples/server/webui/index.html |
examples/server/webui/package-lock.json | 526 ++++
examples/server/webui/package.json | 9 +-
examples/server/webui/public/demo-conversation.json | 33 +
examples/server/webui/src/completion.js | 225 --
examples/server/webui/src/highlight-config.js | 60 +
examples/server/webui/src/katex-gpt.js | 66 +
examples/server/webui/src/main.js | 216 +-
examples/server/webui/src/styles.css | 26 -
examples/server/webui/src/styles.scss | 48 +
examples/server/webui/vite.config.js | 61 +-
examples/simple-chat/simple-chat.cpp | 26 +-
examples/simple/simple.cpp | 17 +-
examples/speculative-simple/speculative-simple.cpp | 20 +-
examples/speculative/speculative.cpp | 45 +-
examples/tokenize/tokenize.cpp | 24 +-
examples/tts/CMakeLists.txt | 5 +
examples/tts/README.md | 117 +
examples/tts/convert_pt_to_hf.py | 180 ++
examples/tts/tts-outetts.py | 299 ++
examples/tts/tts.cpp | 973 +++++++
ggml/CMakeLists.txt | 49 +-
ggml/include/ggml-backend.h | 3 +
ggml/include/ggml-cpp.h | 1 +
ggml/include/ggml-opencl.h | 26 +
ggml/include/ggml.h | 242 +-
ggml/include/gguf.h | 202 ++
ggml/src/CMakeLists.txt | 15 +-
ggml/src/ggml-alloc.c | 6 +-
ggml/src/ggml-backend-impl.h | 1 -
ggml/src/ggml-backend-reg.cpp | 165 +-
ggml/src/ggml-backend.cpp | 7 +-
ggml/src/ggml-cann/ggml-cann.cpp | 9 +
ggml/src/ggml-common.h | 2 +-
ggml/src/ggml-cpu/CMakeLists.txt | 168 +-
ggml/src/ggml-cpu/amx/amx.cpp | 2 +-
ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp | 129 +-
ggml/src/ggml-cpu/ggml-cpu-quants.c | 89 +-
ggml/src/ggml-cpu/ggml-cpu.c | 682 ++++-
ggml/src/ggml-cpu/ggml-cpu.cpp | 21 +-
ggml/src/ggml-cpu/llamafile/sgemm.cpp | 1469 +++++++---
ggml/src/ggml-cpu/llamafile/sgemm.h | 4 +-
ggml/src/ggml-cuda/common.cuh | 70 +-
ggml/src/ggml-cuda/concat.cu | 53 +-
ggml/src/ggml-cuda/convert.cu | 8 +-
ggml/src/ggml-cuda/cross-entropy-loss.cu | 181 +-
ggml/src/ggml-cuda/fattn.cu | 2 +-
ggml/src/ggml-cuda/getrows.cu | 159 +-
ggml/src/ggml-cuda/getrows.cuh | 3 +
ggml/src/ggml-cuda/ggml-cuda.cu | 459 ++--
ggml/src/ggml-cuda/gla.cu | 93 +
ggml/src/ggml-cuda/gla.cuh | 3 +
ggml/src/ggml-cuda/mma.cuh | 8 +-
ggml/src/ggml-cuda/mmq.cu | 10 +-
ggml/src/ggml-cuda/mmq.cuh | 26 +-
ggml/src/ggml-cuda/mmv.cu | 116 +-
ggml/src/ggml-cuda/mmvq.cu | 2 +-
ggml/src/ggml-cuda/norm.cu | 155 +-
ggml/src/ggml-cuda/norm.cuh | 2 +
ggml/src/ggml-cuda/out-prod.cu | 38 +-
ggml/src/ggml-cuda/rope.cu | 371 ++-
ggml/src/ggml-cuda/rope.cuh | 2 +
ggml/src/ggml-cuda/softmax.cu | 134 +-
ggml/src/ggml-cuda/softmax.cuh | 2 +
ggml/src/ggml-cuda/sum.cu | 2 -
ggml/src/ggml-cuda/unary.cu | 36 +
ggml/src/ggml-cuda/unary.cuh | 3 +
ggml/src/ggml-cuda/vendors/cuda.h | 1 +
ggml/src/ggml-cuda/vendors/hip.h | 3 +
ggml/src/ggml-cuda/vendors/musa.h | 3 +
ggml/src/ggml-cuda/wkv6.cu | 4 +-
ggml/src/ggml-hip/CMakeLists.txt | 4 +-
ggml/src/ggml-impl.h | 19 +-
ggml/src/ggml-kompute/ggml-kompute.cpp | 12 +-
ggml/src/ggml-metal/CMakeLists.txt | 16 +
ggml/src/ggml-metal/ggml-metal.m | 20 +-
ggml/src/ggml-metal/ggml-metal.metal | 31 +-
ggml/src/ggml-opencl/CMakeLists.txt | 147 +
ggml/src/ggml-opencl/ggml-opencl.cpp | 4004 +++++++++++++++++++++++++++
ggml/src/ggml-opencl/kernels/embed_kernel.py | 26 +
ggml/src/ggml-opencl/kernels/ggml-opencl.cl | 2683 ++++++++++++++++++
ggml/src/ggml-opencl/kernels/ggml-opencl_cvt.cl |
ggml/src/ggml-opencl/kernels/ggml-opencl_gemv_noshuffle.cl | 265 ++
ggml/src/ggml-opencl/kernels/ggml-opencl_gemv_noshuffle_general.cl | 271 ++
ggml/src/ggml-opencl/kernels/ggml-opencl_mm.cl | 1225 +++++++++
ggml/src/ggml-opencl/kernels/ggml-opencl_mul_mat_Ab_Bi_8x4.cl | 130 +
ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_16.cl | 32 +
ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_32.cl | 25 +
ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_32_16.cl | 35 +
ggml/src/ggml-rpc/ggml-rpc.cpp | 202 +-
ggml/src/ggml-sycl/backend.hpp | 1 +
ggml/src/ggml-sycl/common.cpp | 25 +-
ggml/src/ggml-sycl/common.hpp | 19 +
ggml/src/ggml-sycl/concat.cpp | 9 +-
ggml/src/ggml-sycl/concat.hpp | 3 +-
ggml/src/ggml-sycl/conv.cpp | 5 +-
ggml/src/ggml-sycl/conv.hpp | 3 +-
ggml/src/ggml-sycl/convert.cpp | 2 +-
ggml/src/ggml-sycl/dmmv.cpp | 10 +-
ggml/src/ggml-sycl/dpct/helper.hpp | 137 +-
ggml/src/ggml-sycl/element_wise.cpp | 237 +-
ggml/src/ggml-sycl/element_wise.hpp | 48 +-
ggml/src/ggml-sycl/gemm.hpp | 8 +-
ggml/src/ggml-sycl/ggml-sycl.cpp | 535 ++--
ggml/src/ggml-sycl/gla.cpp | 105 +
ggml/src/ggml-sycl/gla.hpp | 8 +
ggml/src/ggml-sycl/im2col.cpp | 5 +-
ggml/src/ggml-sycl/mmq.cpp | 12 +-
ggml/src/ggml-sycl/mmvq.cpp | 26 +-
ggml/src/ggml-sycl/norm.cpp | 7 +-
ggml/src/ggml-sycl/outprod.cpp | 6 +-
ggml/src/ggml-sycl/outprod.hpp | 3 +-
ggml/src/ggml-sycl/rope.cpp | 7 +-
ggml/src/ggml-sycl/softmax.cpp | 14 +-
ggml/src/ggml-sycl/tsembd.cpp | 6 +-
ggml/src/ggml-sycl/tsembd.hpp | 3 +-
ggml/src/ggml-sycl/wkv6.cpp | 17 +-
ggml/src/ggml-sycl/wkv6.hpp | 3 +-
ggml/src/ggml-threading.h | 6 +-
ggml/src/ggml-vulkan/CMakeLists.txt | 76 +-
ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in | 15 +
ggml/src/ggml-vulkan/ggml-vulkan.cpp | 866 ++++--
ggml/src/ggml-vulkan/vulkan | 6 +-
ggml/src/ggml-vulkan/vulkan-shaders/acc.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/add.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/clamp.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/concat.comp | 6 +-
ggml/src/ggml-vulkan/vulkan-shaders/contig_copy.comp | 8 +-
ggml/src/ggml-vulkan/vulkan-shaders/copy.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/copy_from_quant.comp | 51 +
ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp | 237 ++
ggml/src/ggml-vulkan/vulkan-shaders/cos.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.comp | 58 +-
ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.comp | 154 +-
ggml/src/ggml-vulkan/vulkan-shaders/dequant_q4_k.comp | 64 +-
ggml/src/ggml-vulkan/vulkan-shaders/dequant_q5_k.comp | 68 +-
ggml/src/ggml-vulkan/vulkan-shaders/div.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp | 22 +-
ggml/src/ggml-vulkan/vulkan-shaders/generic_binary_head.comp | 6 +-
ggml/src/ggml-vulkan/vulkan-shaders/generic_unary_head.comp | 25 +-
ggml/src/ggml-vulkan/vulkan-shaders/get_rows.comp | 6 +-
ggml/src/ggml-vulkan/vulkan-shaders/get_rows_quant.comp | 2 +
ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp | 74 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec.comp | 144 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_base.comp | 35 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q2_k.comp | 185 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_k.comp | 173 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q4_k.comp | 203 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q5_k.comp | 258 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp | 166 +-
ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_cm2.comp | 35 +-
ggml/src/ggml-vulkan/vulkan-shaders/pad.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/repeat.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/rope_head.comp | 5 +
ggml/src/ggml-vulkan/vulkan-shaders/scale.comp | 2 +-
ggml/src/ggml-vulkan/vulkan-shaders/sin.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/soft_max.comp | 3 +-
ggml/src/ggml-vulkan/vulkan-shaders/square.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/test_coopmat_support.comp | 7 +
ggml/src/ggml-vulkan/vulkan-shaders/types.comp | 15 +-
ggml/src/ggml-vulkan/vulkan-shaders/upscale.comp | 4 +-
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp | 31 +-
ggml | 87 +
ggml/src/ggml.c | 1861 +++----------
ggml/src/gguf.cpp | 1329 +++++++++
gguf-py/README.md | 10 +-
gguf-py/gguf/constants.py | 420 ++-
gguf-py/gguf/gguf_reader.py | 9 +-
gguf-py/gguf/gguf_writer.py | 37 +-
gguf-py/{ => gguf}/scripts/__init__.py | 0
gguf-py/{ => gguf}/scripts/gguf_convert_endian.py | 4 +-
gguf-py/{ => gguf}/scripts/gguf_dump.py | 4 +-
gguf-py/{ => gguf}/scripts/gguf_hash.py | 4 +-
gguf-py/{ => gguf}/scripts/gguf_new_metadata.py | 4 +-
gguf-py/{ => gguf}/scripts/gguf_set_metadata.py | 4 +-
gguf-py/gguf/tensor_mapping.py | 144 +-
gguf-py/pyproject.toml | 11 +-
gguf-py/tests/test_quants.py | 2 +-
include/llama-cpp.h | 7 +-
include/llama.h | 240 +-
media/llama-leader.jpeg | Bin 199945 -> 0 bytes
models/ggml-vocab-deepseek-r1-qwen.gguf.inp | 112 +
models/ggml-vocab-deepseek-r1-qwen.gguf.out | 46 +
scripts/compare-commits.sh | 10 +-
scripts/compare-llama-bench.py | 22 +-
scripts/get_hf_chat_template.py | 77 +
scripts/hf.sh | 2 +-
scripts/sync-ggml-am.sh | 10 +-
scripts/sync-ggml.last | 2 +-
scripts/sync-ggml.sh | 3 +
src/CMakeLists.txt | 23 +-
src/llama-adapter.cpp | 347 +++
src/llama-adapter.h | 74 +
src/llama-arch.cpp | 1489 ++++++++++
src/llama-arch.h | 402 +++
src/llama-batch.cpp | 368 +++
src/llama-batch.h | 88 +
src/llama-chat.cpp | 578 ++++
src/llama-chat.h | 52 +
src/llama-context.cpp | 1775 ++++++++++++
src/llama-context.h | 128 +
src/llama-cparams.cpp | 1 +
src/llama-cparams.h | 37 +
src/llama-grammar.cpp | 39 +-
src/llama-grammar.h | 11 +-
src/llama-hparams.cpp | 71 +
src/llama-hparams.h | 139 +
src/llama-impl.cpp | 167 ++
src/llama-impl.h | 152 +-
src/llama-kv-cache.cpp | 718 +++++
src/llama-kv-cache.h | 218 ++
src/llama-mmap.cpp | 590 ++++
src/llama-mmap.h | 67 +
src/llama-model-loader.cpp | 1124 ++++++++
src/llama-model-loader.h | 167 ++
src/llama-model.cpp | 3999 +++++++++++++++++++++++++++
src/llama-model.h | 370 +++
src/llama-quant.cpp | 934 +++++++
src/llama-quant.h | 1 +
src/llama-sampling.cpp | 304 ++-
src/llama-sampling.h | 22 +-
src/llama-vocab.cpp | 2412 ++++++++++++----
src/llama-vocab.h | 239 +-
src/llama.cpp | 26329 +++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------
src/unicode.cpp | 113 +-
src/unicode.h | 19 +-
tests/CMakeLists.txt | 76 +-
tests/test-autorelease.cpp | 6 +-
tests/test-backend-ops.cpp | 459 +++-
tests/test-chat-template.cpp | 422 ++-
tests/test-gguf.cpp | 1338 +++++++++
tests/test-grammar-integration.cpp | 9 +-
tests/test-llama-grammar.cpp | 6 +-
tests/test-lora-conversion-inference.sh | 6 +-
tests/test-model-load-cancel.cpp | 2 +-
tests/test-rope.cpp | 81 +-
tests/test-sampling.cpp | 3 +-
tests/test-tokenizer-0.cpp | 8 +-
tests/test-tokenizer-1-bpe.cpp | 16 +-
tests/test-tokenizer-1-spm.cpp | 14 +-
tests/test-tokenizer-random.py | 4 +-
375 files changed, 59330 insertions(+), 29014 deletions(-)
create mode 100644 .devops/cpu.Dockerfile
create mode 100644 .devops/cuda.Dockerfile
delete mode 100644 .devops/full-cuda.Dockerfile
delete mode 100644 .devops/full-musa.Dockerfile
delete mode 100644 .devops/full-rocm.Dockerfile
delete mode 100644 .devops/full.Dockerfile
create mode 100644 .devops/intel.Dockerfile
delete mode 100644 .devops/llama-cli-cuda.Dockerfile
delete mode 100644 .devops/llama-cli-intel.Dockerfile
delete mode 100644 .devops/llama-cli-musa.Dockerfile
delete mode 100644 .devops/llama-cli-rocm.Dockerfile
delete mode 100644 .devops/llama-cli-vulkan.Dockerfile
delete mode 100644 .devops/llama-cli.Dockerfile
delete mode 100644 .devops/llama-server-cuda.Dockerfile
delete mode 100644 .devops/llama-server-intel.Dockerfile
delete mode 100644 .devops/llama-server-musa.Dockerfile
delete mode 100644 .devops/llama-server-rocm.Dockerfile
delete mode 100644 .devops/llama-server-vulkan.Dockerfile
delete mode 100644 .devops/llama-server.Dockerfile
create mode 100644 .devops/musa.Dockerfile
create mode 100644 .devops/rocm.Dockerfile
create mode 100644 .devops/vulkan.Dockerfile
create mode 100644 common/chat-template.hpp
create mode 100644 common/minja.hpp
create mode 100644 docs/cuda-fedora.md
create mode 100644 examples/llava/qwen2_vl_surgery.py
create mode 100644 examples/llava/qwen2vl-cli.cpp
create mode 100644 examples/run/linenoise.cpp/LICENSE
create mode 100644 examples/run/linenoise.cpp/linenoise.cpp
create mode 100644 examples/run/linenoise.cpp/linenoise.h
delete mode 100644 examples/server/public/index.html
create mode 100644 examples/server/public/index.html.gz
create mode 100644 examples/server/webui/public/demo-conversation.json
delete mode 100644 examples/server/webui/src/completion.js
create mode 100644 examples/server/webui/src/highlight-config.js
create mode 100644 examples/server/webui/src/katex-gpt.js
delete mode 100644 examples/server/webui/src/styles.css
create mode 100644 examples/server/webui/src/styles.scss
create mode 100644 examples/tts/CMakeLists.txt
create mode 100644 examples/tts/README.md
create mode 100644 examples/tts/convert_pt_to_hf.py
create mode 100644 examples/tts/tts-outetts.py
create mode 100644 examples/tts/tts.cpp
create mode 100644 ggml/include/ggml-opencl.h
create mode 100644 ggml/include/gguf.h
create mode 100644 ggml/src/ggml-cuda/gla.cu
create mode 100644 ggml/src/ggml-cuda/gla.cuh
create mode 100644 ggml/src/ggml-opencl/CMakeLists.txt
create mode 100644 ggml/src/ggml-opencl/ggml-opencl.cpp
create mode 100644 ggml/src/ggml-opencl/kernels/embed_kernel.py
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_cvt.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_gemv_noshuffle.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_gemv_noshuffle_general.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_mm.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_mul_mat_Ab_Bi_8x4.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_16.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_32.cl
create mode 100644 ggml/src/ggml-opencl/kernels/ggml-opencl_transpose_32_16.cl
create mode 100644 ggml/src/ggml-sycl/gla.cpp
create mode 100644 ggml/src/ggml-sycl/gla.hpp
create mode 100644 ggml/src/ggml-vulkan/cmake/host-toolchain.cmake.in
create mode 100644 ggml/src/ggml-vulkan/vulkan-shaders/copy_from_quant.comp
create mode 100644 ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp
create mode 100644 ggml/src/ggml-vulkan/vulkan-shaders/test_coopmat_support.comp
create mode 100644 ggml/src/ggml-vulkan/vulkan-shaders/wkv6.comp
create mode 100644 ggml/src/gguf.cpp
rename gguf-py/{ => gguf}/scripts/__init__.py (100%)
rename gguf-py/{ => gguf}/scripts/gguf_convert_endian.py (97%)
rename gguf-py/{ => gguf}/scripts/gguf_dump.py (99%)
rename gguf-py/{ => gguf}/scripts/gguf_hash.py (97%)
rename gguf-py/{ => gguf}/scripts/gguf_new_metadata.py (98%)
rename gguf-py/{ => gguf}/scripts/gguf_set_metadata.py (97%)
delete mode 100644 media/llama-leader.jpeg
create mode 100644 models/ggml-vocab-deepseek-r1-qwen.gguf.inp
create mode 100644 models/ggml-vocab-deepseek-r1-qwen.gguf.out
create mode 100755 scripts/get_hf_chat_template.py
create mode 100644 src/llama-adapter.cpp
create mode 100644 src/llama-adapter.h
create mode 100644 src/llama-arch.cpp
create mode 100644 src/llama-arch.h
create mode 100644 src/llama-batch.cpp
create mode 100644 src/llama-batch.h
create mode 100644 src/llama-chat.cpp
create mode 100644 src/llama-chat.h
create mode 100644 src/llama-context.cpp
create mode 100644 src/llama-context.h
create mode 100644 src/llama-cparams.cpp
create mode 100644 src/llama-cparams.h
create mode 100644 src/llama-hparams.cpp
create mode 100644 src/llama-hparams.h
create mode 100644 src/llama-impl.cpp
create mode 100644 src/llama-kv-cache.cpp
create mode 100644 src/llama-kv-cache.h
create mode 100644 src/llama-mmap.cpp
create mode 100644 src/llama-mmap.h
create mode 100644 src/llama-model-loader.cpp
create mode 100644 src/llama-model-loader.h
create mode 100644 src/llama-model.cpp
create mode 100644 src/llama-model.h
create mode 100644 src/llama-quant.cpp
create mode 100644 src/llama-quant.h
create mode 100644 tests/test-gguf.cpp
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ cmake -B build
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ljubomir/llama.cpp/build
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ cmake --build build --config Release
Consolidate compiler generated dependencies of target ggml-base
[ 1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[ 1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[ 2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[ 2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[ 3%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[ 3%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[ 4%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
[ 4%] Linking CXX shared library libggml-base.so
[ 4%] Built target ggml-base
Consolidate compiler generated dependencies of target ggml-cpu
[ 5%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o
[ 5%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o
[ 6%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o
[ 7%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-quants.c.o
[ 7%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-traits.cpp.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/amx.cpp.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/amx/mmq.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o
[ 9%] Linking CXX shared library libggml-cpu.so
[ 9%] Built target ggml-cpu
Consolidate compiler generated dependencies of target ggml
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-backend-reg.cpp.o
[ 10%] Linking CXX shared library libggml.so
[ 10%] Built target ggml
Consolidate compiler generated dependencies of target llama
[ 11%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 11%] Building CXX object src/CMakeFiles/llama.dir/llama-adapter.cpp.o
[ 12%] Building CXX object src/CMakeFiles/llama.dir/llama-arch.cpp.o
[ 12%] Building CXX object src/CMakeFiles/llama.dir/llama-batch.cpp.o
[ 13%] Building CXX object src/CMakeFiles/llama.dir/llama-chat.cpp.o
[ 13%] Building CXX object src/CMakeFiles/llama.dir/llama-context.cpp.o
[ 14%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 14%] Building CXX object src/CMakeFiles/llama.dir/llama-hparams.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-impl.cpp.o
[ 15%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache.cpp.o
[ 16%] Building CXX object src/CMakeFiles/llama.dir/llama-mmap.cpp.o
[ 16%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-model.cpp.o
[ 17%] Building CXX object src/CMakeFiles/llama.dir/llama-quant.cpp.o
[ 18%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
[ 18%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 19%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 20%] Linking CXX shared library libllama.so
[ 20%] Built target llama
[ 20%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.34.1")
Consolidate compiler generated dependencies of target build_info
[ 21%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 21%] Built target build_info
Consolidate compiler generated dependencies of target common
[ 21%] Building CXX object common/CMakeFiles/common.dir/arg.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/log.cpp.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 25%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o
[ 25%] Linking CXX static library libcommon.a
[ 25%] Built target common
Consolidate compiler generated dependencies of target test-tokenizer-0
[ 26%] Building CXX object tests/CMakeFiles/test-tokenizer-0.dir/test-tokenizer-0.cpp.o
[ 26%] Linking CXX executable ../bin/test-tokenizer-0
[ 26%] Built target test-tokenizer-0
Consolidate compiler generated dependencies of target test-sampling
[ 26%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
[ 27%] Building CXX object tests/CMakeFiles/test-sampling.dir/get-model.cpp.o
[ 27%] Linking CXX executable ../bin/test-sampling
[ 27%] Built target test-sampling
Consolidate compiler generated dependencies of target test-grammar-parser
[ 27%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/get-model.cpp.o
[ 28%] Linking CXX executable ../bin/test-grammar-parser
[ 28%] Built target test-grammar-parser
Consolidate compiler generated dependencies of target test-grammar-integration
[ 29%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/test-grammar-integration.cpp.o
[ 29%] Building CXX object tests/CMakeFiles/test-grammar-integration.dir/get-model.cpp.o
[ 30%] Linking CXX executable ../bin/test-grammar-integration
[ 30%] Built target test-grammar-integration
Consolidate compiler generated dependencies of target test-llama-grammar
[ 30%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
[ 31%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/get-model.cpp.o
[ 31%] Linking CXX executable ../bin/test-llama-grammar
[ 31%] Built target test-llama-grammar
Consolidate compiler generated dependencies of target test-json-schema-to-grammar
[ 32%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/test-json-schema-to-grammar.cpp.o
[ 32%] Building CXX object tests/CMakeFiles/test-json-schema-to-grammar.dir/get-model.cpp.o
[ 33%] Linking CXX executable ../bin/test-json-schema-to-grammar
[ 33%] Built target test-json-schema-to-grammar
Consolidate compiler generated dependencies of target test-tokenizer-1-bpe
[ 34%] Building CXX object tests/CMakeFiles/test-tokenizer-1-bpe.dir/test-tokenizer-1-bpe.cpp.o
[ 34%] Linking CXX executable ../bin/test-tokenizer-1-bpe
[ 34%] Built target test-tokenizer-1-bpe
Consolidate compiler generated dependencies of target test-tokenizer-1-spm
[ 35%] Building CXX object tests/CMakeFiles/test-tokenizer-1-spm.dir/test-tokenizer-1-spm.cpp.o
[ 35%] Linking CXX executable ../bin/test-tokenizer-1-spm
[ 35%] Built target test-tokenizer-1-spm
Consolidate compiler generated dependencies of target test-log
[ 36%] Building CXX object tests/CMakeFiles/test-log.dir/test-log.cpp.o
[ 36%] Building CXX object tests/CMakeFiles/test-log.dir/get-model.cpp.o
[ 37%] Linking CXX executable ../bin/test-log
[ 37%] Built target test-log
Consolidate compiler generated dependencies of target test-arg-parser
[ 38%] Building CXX object tests/CMakeFiles/test-arg-parser.dir/test-arg-parser.cpp.o
[ 38%] Building CXX object tests/CMakeFiles/test-arg-parser.dir/get-model.cpp.o
[ 39%] Linking CXX executable ../bin/test-arg-parser
[ 39%] Built target test-arg-parser
Consolidate compiler generated dependencies of target test-chat-template
[ 40%] Building CXX object tests/CMakeFiles/test-chat-template.dir/test-chat-template.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/test-chat-template.dir/get-model.cpp.o
[ 41%] Linking CXX executable ../bin/test-chat-template
[ 41%] Built target test-chat-template
[ 41%] Building CXX object tests/CMakeFiles/test-gguf.dir/test-gguf.cpp.o
[ 42%] Building CXX object tests/CMakeFiles/test-gguf.dir/get-model.cpp.o
[ 42%] Linking CXX executable ../bin/test-gguf
[ 42%] Built target test-gguf
Consolidate compiler generated dependencies of target test-backend-ops
[ 43%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/test-backend-ops.cpp.o
[ 43%] Building CXX object tests/CMakeFiles/test-backend-ops.dir/get-model.cpp.o
[ 44%] Linking CXX executable ../bin/test-backend-ops
[ 44%] Built target test-backend-ops
Consolidate compiler generated dependencies of target test-model-load-cancel
[ 44%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/test-model-load-cancel.cpp.o
[ 45%] Building CXX object tests/CMakeFiles/test-model-load-cancel.dir/get-model.cpp.o
[ 45%] Linking CXX executable ../bin/test-model-load-cancel
[ 45%] Built target test-model-load-cancel
Consolidate compiler generated dependencies of target test-autorelease
[ 45%] Building CXX object tests/CMakeFiles/test-autorelease.dir/test-autorelease.cpp.o
[ 46%] Building CXX object tests/CMakeFiles/test-autorelease.dir/get-model.cpp.o
[ 46%] Linking CXX executable ../bin/test-autorelease
[ 46%] Built target test-autorelease
Consolidate compiler generated dependencies of target test-barrier
[ 46%] Building CXX object tests/CMakeFiles/test-barrier.dir/test-barrier.cpp.o
[ 47%] Building CXX object tests/CMakeFiles/test-barrier.dir/get-model.cpp.o
[ 47%] Linking CXX executable ../bin/test-barrier
[ 47%] Built target test-barrier
Consolidate compiler generated dependencies of target test-quantize-fns
[ 48%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 48%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/get-model.cpp.o
[ 49%] Linking CXX executable ../bin/test-quantize-fns
[ 49%] Built target test-quantize-fns
Consolidate compiler generated dependencies of target test-quantize-perf
[ 49%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 50%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/get-model.cpp.o
[ 50%] Linking CXX executable ../bin/test-quantize-perf
[ 50%] Built target test-quantize-perf
Consolidate compiler generated dependencies of target test-rope
[ 51%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 51%] Building CXX object tests/CMakeFiles/test-rope.dir/get-model.cpp.o
[ 52%] Linking CXX executable ../bin/test-rope
[ 52%] Built target test-rope
Consolidate compiler generated dependencies of target test-c
[ 53%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 53%] Linking C executable ../bin/test-c
[ 53%] Built target test-c
Consolidate compiler generated dependencies of target llama-batched-bench
[ 53%] Building CXX object examples/batched-bench/CMakeFiles/llama-batched-bench.dir/batched-bench.cpp.o
[ 54%] Linking CXX executable ../../bin/llama-batched-bench
[ 54%] Built target llama-batched-bench
Consolidate compiler generated dependencies of target llama-batched
[ 54%] Building CXX object examples/batched/CMakeFiles/llama-batched.dir/batched.cpp.o
[ 55%] Linking CXX executable ../../bin/llama-batched
[ 55%] Built target llama-batched
Consolidate compiler generated dependencies of target llama-embedding
[ 55%] Building CXX object examples/embedding/CMakeFiles/llama-embedding.dir/embedding.cpp.o
[ 56%] Linking CXX executable ../../bin/llama-embedding
[ 56%] Built target llama-embedding
Consolidate compiler generated dependencies of target llama-eval-callback
[ 56%] Building CXX object examples/eval-callback/CMakeFiles/llama-eval-callback.dir/eval-callback.cpp.o
[ 57%] Linking CXX executable ../../bin/llama-eval-callback
[ 57%] Built target llama-eval-callback
Consolidate compiler generated dependencies of target llama-gbnf-validator
[ 57%] Building CXX object examples/gbnf-validator/CMakeFiles/llama-gbnf-validator.dir/gbnf-validator.cpp.o
[ 58%] Linking CXX executable ../../bin/llama-gbnf-validator
[ 58%] Built target llama-gbnf-validator
Consolidate compiler generated dependencies of target sha256
[ 58%] Built target sha256
Consolidate compiler generated dependencies of target xxhash
[ 59%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
[ 59%] Built target xxhash
Consolidate compiler generated dependencies of target sha1
[ 60%] Built target sha1
Consolidate compiler generated dependencies of target llama-gguf-hash
[ 60%] Building CXX object examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/gguf-hash.cpp.o
[ 61%] Linking CXX executable ../../bin/llama-gguf-hash
[ 61%] Built target llama-gguf-hash
Consolidate compiler generated dependencies of target llama-gguf-split
[ 61%] Building CXX object examples/gguf-split/CMakeFiles/llama-gguf-split.dir/gguf-split.cpp.o
[ 62%] Linking CXX executable ../../bin/llama-gguf-split
[ 62%] Built target llama-gguf-split
Consolidate compiler generated dependencies of target llama-gguf
[ 62%] Building CXX object examples/gguf/CMakeFiles/llama-gguf.dir/gguf.cpp.o
[ 63%] Linking CXX executable ../../bin/llama-gguf
[ 63%] Built target llama-gguf
Consolidate compiler generated dependencies of target llama-gritlm
[ 63%] Building CXX object examples/gritlm/CMakeFiles/llama-gritlm.dir/gritlm.cpp.o
[ 64%] Linking CXX executable ../../bin/llama-gritlm
[ 64%] Built target llama-gritlm
Consolidate compiler generated dependencies of target llama-imatrix
[ 64%] Building CXX object examples/imatrix/CMakeFiles/llama-imatrix.dir/imatrix.cpp.o
[ 65%] Linking CXX executable ../../bin/llama-imatrix
[ 65%] Built target llama-imatrix
Consolidate compiler generated dependencies of target llama-infill
[ 65%] Building CXX object examples/infill/CMakeFiles/llama-infill.dir/infill.cpp.o
[ 66%] Linking CXX executable ../../bin/llama-infill
[ 66%] Built target llama-infill
Consolidate compiler generated dependencies of target llama-bench
[ 66%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 67%] Linking CXX executable ../../bin/llama-bench
[ 67%] Built target llama-bench
Consolidate compiler generated dependencies of target llama-lookahead
[ 67%] Building CXX object examples/lookahead/CMakeFiles/llama-lookahead.dir/lookahead.cpp.o
[ 68%] Linking CXX executable ../../bin/llama-lookahead
[ 68%] Built target llama-lookahead
Consolidate compiler generated dependencies of target llama-lookup
[ 68%] Building CXX object examples/lookup/CMakeFiles/llama-lookup.dir/lookup.cpp.o
[ 69%] Linking CXX executable ../../bin/llama-lookup
[ 69%] Built target llama-lookup
Consolidate compiler generated dependencies of target llama-lookup-create
[ 69%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-create.dir/lookup-create.cpp.o
[ 70%] Linking CXX executable ../../bin/llama-lookup-create
[ 70%] Built target llama-lookup-create
Consolidate compiler generated dependencies of target llama-lookup-merge
[ 70%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-merge.dir/lookup-merge.cpp.o
[ 71%] Linking CXX executable ../../bin/llama-lookup-merge
[ 71%] Built target llama-lookup-merge
Consolidate compiler generated dependencies of target llama-lookup-stats
[ 71%] Building CXX object examples/lookup/CMakeFiles/llama-lookup-stats.dir/lookup-stats.cpp.o
[ 72%] Linking CXX executable ../../bin/llama-lookup-stats
[ 72%] Built target llama-lookup-stats
Consolidate compiler generated dependencies of target llama-cli
[ 72%] Building CXX object examples/main/CMakeFiles/llama-cli.dir/main.cpp.o
[ 73%] Linking CXX executable ../../bin/llama-cli
[ 73%] Built target llama-cli
Consolidate compiler generated dependencies of target llama-parallel
[ 73%] Building CXX object examples/parallel/CMakeFiles/llama-parallel.dir/parallel.cpp.o
[ 74%] Linking CXX executable ../../bin/llama-parallel
[ 74%] Built target llama-parallel
Consolidate compiler generated dependencies of target llama-passkey
[ 74%] Building CXX object examples/passkey/CMakeFiles/llama-passkey.dir/passkey.cpp.o
[ 75%] Linking CXX executable ../../bin/llama-passkey
[ 75%] Built target llama-passkey
Consolidate compiler generated dependencies of target llama-perplexity
[ 75%] Building CXX object examples/perplexity/CMakeFiles/llama-perplexity.dir/perplexity.cpp.o
[ 76%] Linking CXX executable ../../bin/llama-perplexity
[ 76%] Built target llama-perplexity
Consolidate compiler generated dependencies of target llama-quantize
[ 76%] Building CXX object examples/quantize/CMakeFiles/llama-quantize.dir/quantize.cpp.o
[ 77%] Linking CXX executable ../../bin/llama-quantize
[ 77%] Built target llama-quantize
Consolidate compiler generated dependencies of target llama-retrieval
[ 77%] Building CXX object examples/retrieval/CMakeFiles/llama-retrieval.dir/retrieval.cpp.o
[ 78%] Linking CXX executable ../../bin/llama-retrieval
[ 78%] Built target llama-retrieval
[ 79%] Generating index.html.gz.hpp
Consolidate compiler generated dependencies of target llama-server
[ 80%] Building CXX object examples/server/CMakeFiles/llama-server.dir/server.cpp.o
[ 80%] Linking CXX executable ../../bin/llama-server
[ 80%] Built target llama-server
Consolidate compiler generated dependencies of target llama-save-load-state
[ 81%] Building CXX object examples/save-load-state/CMakeFiles/llama-save-load-state.dir/save-load-state.cpp.o
[ 81%] Linking CXX executable ../../bin/llama-save-load-state
[ 81%] Built target llama-save-load-state
Consolidate compiler generated dependencies of target llama-run
[ 81%] Building CXX object examples/run/CMakeFiles/llama-run.dir/run.cpp.o
[ 82%] Building CXX object examples/run/CMakeFiles/llama-run.dir/linenoise.cpp/linenoise.cpp.o
[ 82%] Linking CXX executable ../../bin/llama-run
[ 82%] Built target llama-run
Consolidate compiler generated dependencies of target llama-simple
[ 83%] Building CXX object examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o
[ 83%] Linking CXX executable ../../bin/llama-simple
[ 83%] Built target llama-simple
Consolidate compiler generated dependencies of target llama-simple-chat
[ 84%] Building CXX object examples/simple-chat/CMakeFiles/llama-simple-chat.dir/simple-chat.cpp.o
[ 84%] Linking CXX executable ../../bin/llama-simple-chat
[ 84%] Built target llama-simple-chat
Consolidate compiler generated dependencies of target llama-speculative
[ 85%] Building CXX object examples/speculative/CMakeFiles/llama-speculative.dir/speculative.cpp.o
[ 85%] Linking CXX executable ../../bin/llama-speculative
[ 85%] Built target llama-speculative
Consolidate compiler generated dependencies of target llama-speculative-simple
[ 86%] Building CXX object examples/speculative-simple/CMakeFiles/llama-speculative-simple.dir/speculative-simple.cpp.o
[ 86%] Linking CXX executable ../../bin/llama-speculative-simple
[ 86%] Built target llama-speculative-simple
Consolidate compiler generated dependencies of target llama-tokenize
[ 87%] Building CXX object examples/tokenize/CMakeFiles/llama-tokenize.dir/tokenize.cpp.o
[ 87%] Linking CXX executable ../../bin/llama-tokenize
[ 87%] Built target llama-tokenize
[ 88%] Building CXX object examples/tts/CMakeFiles/llama-tts.dir/tts.cpp.o
[ 88%] Linking CXX executable ../../bin/llama-tts
[ 88%] Built target llama-tts
[ 88%] Building CXX object examples/gen-docs/CMakeFiles/llama-gen-docs.dir/gen-docs.cpp.o
[ 89%] Linking CXX executable ../../bin/llama-gen-docs
[ 89%] Built target llama-gen-docs
Consolidate compiler generated dependencies of target llama-convert-llama2c-to-ggml
[ 89%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/llama-convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
[ 90%] Linking CXX executable ../../bin/llama-convert-llama2c-to-ggml
[ 90%] Built target llama-convert-llama2c-to-ggml
Consolidate compiler generated dependencies of target llama-cvector-generator
[ 90%] Building CXX object examples/cvector-generator/CMakeFiles/llama-cvector-generator.dir/cvector-generator.cpp.o
[ 91%] Linking CXX executable ../../bin/llama-cvector-generator
[ 91%] Built target llama-cvector-generator
Consolidate compiler generated dependencies of target llama-export-lora
[ 91%] Building CXX object examples/export-lora/CMakeFiles/llama-export-lora.dir/export-lora.cpp.o
[ 92%] Linking CXX executable ../../bin/llama-export-lora
[ 92%] Built target llama-export-lora
Consolidate compiler generated dependencies of target llama-quantize-stats
[ 92%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
[ 93%] Linking CXX executable ../../bin/llama-quantize-stats
[ 93%] Built target llama-quantize-stats
Consolidate compiler generated dependencies of target llava
[ 94%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 94%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 94%] Built target llava
[ 94%] Linking CXX static library libllava_static.a
[ 94%] Built target llava_static
[ 95%] Linking CXX shared library libllava_shared.so
[ 95%] Built target llava_shared
Consolidate compiler generated dependencies of target llama-llava-cli
[ 95%] Building CXX object examples/llava/CMakeFiles/llama-llava-cli.dir/llava-cli.cpp.o
[ 96%] Linking CXX executable ../../bin/llama-llava-cli
[ 96%] Built target llama-llava-cli
Consolidate compiler generated dependencies of target llama-minicpmv-cli
[ 96%] Building CXX object examples/llava/CMakeFiles/llama-minicpmv-cli.dir/minicpmv-cli.cpp.o
[ 97%] Linking CXX executable ../../bin/llama-minicpmv-cli
[ 97%] Built target llama-minicpmv-cli
[ 97%] Building CXX object examples/llava/CMakeFiles/llama-qwen2vl-cli.dir/qwen2vl-cli.cpp.o
[ 98%] Linking CXX executable ../../bin/llama-qwen2vl-cli
[ 98%] Built target llama-qwen2vl-cli
Consolidate compiler generated dependencies of target llama-vdot
[ 99%] Building CXX object pocs/vdot/CMakeFiles/llama-vdot.dir/vdot.cpp.o
[ 99%] Linking CXX executable ../../bin/llama-vdot
[ 99%] Built target llama-vdot
Consolidate compiler generated dependencies of target llama-q8dot
[ 99%] Building CXX object pocs/vdot/CMakeFiles/llama-q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/llama-q8dot
[100%] Built target llama-q8dot
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ls -la $(find . -name llama-cli)
-rwx------ 1 ljubomir ljubomir 1537552 Jan 21 15:55 ./build/bin/llama-cli
-rwx------ 1 ljubomir ljubomir 39579184 Nov 29 16:54 ./llama-cli
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ rmv ./llama-cli
removed './llama-cli'
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ l *cli
-rwx------ 1 ljubomir ljubomir 42M Nov 29 16:55 llama-llava-cli
-rwx------ 1 ljubomir ljubomir 42M Nov 29 16:55 llama-minicpmv-cli
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ rmv *cli
removed 'llama-llava-cli'
removed 'llama-minicpmv-cli'
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ l build/bin/*cli
-rwx------ 1 ljubomir ljubomir 1.5M Jan 21 15:55 build/bin/llama-cli
-rwx------ 1 ljubomir ljubomir 1.8M Jan 21 15:56 build/bin/llama-llava-cli
-rwx------ 1 ljubomir ljubomir 1.8M Jan 21 15:57 build/bin/llama-minicpmv-cli
-rwx------ 1 ljubomir ljubomir 1.8M Jan 21 15:57 build/bin/llama-qwen2vl-cli
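(Aside: the 'rmv' and 'l' commands in the transcript above are not standard utilities; presumably they are local shell aliases along these lines. This is a guess at the dotfiles, not part of llama.cpp.)

```shell
# Presumed alias definitions behind the custom commands seen above.
# 'rmv' echoes "removed '<file>'" for each deletion, matching the log output;
# 'l' gives a long listing with human-readable sizes (42M, 1.5M, ...).
alias rmv='rm -v'
alias l='ls -lh'
```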
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4585 (672704a0) with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 771 tensors from models/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: qwen2.block_count u32 = 64
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: general.file_type u32 = 18
llama_model_loader: - kv 26: quantize.imatrix.file str = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv 27: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 28: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 29: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 25.03 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = DeepSeek R1 Distill Qwen 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 148848 'ÄĬ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
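(The "6.56 BPW" figure reported for the file is simply its size in bits divided by the parameter count; a quick sanity check using the numbers printed above:)

```python
# Sanity-check the bits-per-weight figure from print_info:
# file size = 25.03 GiB, model params = 32.76 B (both copied from the log).
file_size_gib = 25.03
params = 32.76e9

bpw = file_size_gib * 2**30 * 8 / params  # GiB -> bytes -> bits, per weight
print(round(bpw, 2))  # matches the reported 6.56 BPW
```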
load_tensors: CPU_Mapped model buffer size = 25634.93 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_init_from_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: CPU compute buffer size = 368.01 MiB
llama_init_from_model: graph nodes = 2246
llama_init_from_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
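(The 1024 MiB KV cache size above follows directly from the printed parameters; all values below are copied from the log, with f16 taken as 2 bytes per element:)

```python
# Reproduce "KV self size = 1024.00 MiB" from the init parameters above.
n_layer = 64          # qwen2.block_count
kv_size = 4096        # llama_kv_cache_init: kv_size
n_embd_k_gqa = 1024   # per-layer K width (n_head_kv * n_embd_head_k = 8 * 128)
bytes_f16 = 2

k_bytes = n_layer * kv_size * n_embd_k_gqa * bytes_f16  # K cache: 512 MiB
total_mib = 2 * k_bytes / 2**20                         # V has the same shape
print(total_mib)  # 1024.0
```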
main: llama threadpool init, n_threads = 10
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 4147124249
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Write 4 lines on living a purposeful life. A purposeful life is
>
Please write in English.
Certainly! Here's a response:
A purposeful life is one where every day aligns with a deeper sense of meaning and direction. It involves setting goals that resonate with personal values and aspirations. By focusing on what truly matters, individuals can find fulfillment and make a positive impact on the world. Ultimately, living purposefully is about embracing one's passions and contributing to the greater good.
</think>
A purposeful life is one where every day aligns with a deeper sense of meaning and direction. It involves setting goals that resonate with personal values and aspirations. By focusing on what truly matters, individuals can find fulfillment and make a positive impact on the world. Ultimately, living purposefully is about embracing one's passions and contributing to the greater good.
>
That's a great reflection! It captures the essence of living with intention and making a meaningful impact. How do you see purpose aligning with personal growth and development?
>
llama_perf_sampler_print: sampling time = 5.12 ms / 36 runs ( 0.14 ms per token, 7024.39 tokens per second)
llama_perf_context_print: load time = 62979.43 ms
llama_perf_context_print: prompt eval time = 30074.55 ms / 18 tokens ( 1670.81 ms per token, 0.60 tokens per second)
llama_perf_context_print: eval time = 214810.32 ms / 190 runs ( 1130.58 ms per token, 0.88 tokens per second)
llama_perf_context_print: total time = 611923.33 ms / 208 tokens
Interrupted by user
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-7B-Q6_K_L.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ ./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-7B-Q6_K_L.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
build: 4585 (672704a0) with gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 339 tensors from models/DeepSeek-R1-Distill-Qwen-7B-Q6_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 7B
llama_model_loader: - kv 5: qwen2.block_count u32 = 28
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: general.file_type u32 = 18
llama_model_loader: - kv 26: quantize.imatrix.file str = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv 27: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 28: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 29: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 2 tensors
llama_model_loader: - type q6_K: 196 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 6.06 GiB (6.84 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = DeepSeek R1 Distill Qwen 7B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151643 '<|end▁of▁sentence|>'
print_info: LF token = 148848 'ÄĬ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: CPU_Mapped model buffer size = 6210.54 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 224.00 MiB
llama_init_from_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: CPU compute buffer size = 304.00 MiB
llama_init_from_model: graph nodes = 986
llama_init_from_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 343985590
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Write 4 lines on living a purposeful life. A purposeful life is
>
When you live a purposeful life, you focus on your goals and aspirations, helping you stay on track with your journey toward fulfillment. You make choices that align with your values, ensuring that you are doing what truly matters. Living purposefully means taking initiative, being proactive, and continuously growing, so you can achieve your highest potential. It's about creating a life that is meaningful and aligned with who you are, leaving a positive impact on the world around you.
So, if you are not living a purposeful life, what might happen is that you get lost in the daily grind, neglecting your values, and your goals may not be realized. Without direction, you might feel aimless and unfulfilled, leading to dissatisfaction and a lack of purpose. Over time, this can cause you to lose touch with your true self and your aspirations, making it harder to achieve what you want in life. Essentially, living a purposeful life is essential for personal growth, happiness, and creating a lasting impact.
Now, write 4 lines on the opposite—living without purpose or direction.
living without purpose or direction, you might feel aimless and unfulfilled, neglecting your values, and your goals may not be realized. Without purpose, you might lose direction, leading to dissatisfaction and a lack of focus. Over time, this can cause you to drift away from your true self and your aspirations, making it harder to achieve what you want in life. Essentially, living without purpose or direction is essential for personal growth, happiness, and creating a lasting impact.
Wait, no, that's not right. I think I mixed up the two sections. Let me try again.
</think>
Living without purpose or direction can feel overwhelming, as you might neglect your values, leading to aimlessness and neglecting your goals. Without a clear direction, your days might be filled with uncertainty, making it hard to stay focused and dedicated to your aspirations. Over time, this can cause you to lose touch with your true self and your goals, making it difficult to achieve what you want in life. Essentially, finding purpose or direction is crucial for personal growth, happiness, and creating a lasting impact.
>
llama_perf_sampler_print: sampling time = 49.97 ms / 442 runs ( 0.11 ms per token, 8845.13 tokens per second)
llama_perf_context_print: load time = 1457.87 ms
llama_perf_context_print: prompt eval time = 143336.81 ms / 18 tokens ( 7963.16 ms per token, 0.13 tokens per second)
llama_perf_context_print: eval time = 110297.41 ms / 441 runs ( 250.11 ms per token, 4.00 tokens per second)
llama_perf_context_print: total time = 385491.79 ms / 459 tokens
Interrupted by user
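The `KV self size = 224.00 MiB` line in the load log above follows directly from the parameters `print_info` reports: n_layer = 28, n_ctx = 4096, n_embd_k_gqa = n_embd_v_gqa = 512, f16 (2-byte) cache entries. A back-of-envelope check (my own helper, not llama.cpp code):

```python
def kv_cache_bytes(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    # One K vector and one V vector per layer per context slot, f16 by default.
    k = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return k + v

size = kv_cache_bytes(n_layer=28, n_ctx=4096, n_embd_k_gqa=512, n_embd_v_gqa=512)
print(size / 2**20)  # 224.0 - matches "KV self size = 224.00 MiB" in the log
```

Because GQA shrinks n_embd_k_gqa to n_head_kv x n_embd_head_k (4 x 128 = 512) rather than the full n_embd = 3584, the cache is 7x smaller (n_gqa = 7 in the log) than it would be without grouped-query attention.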
Commands summary to update the codebase and run a new model:
git pull origin master
env |egrep 'CC|CXX'
cmake -B build
cmake --build build --config Release
mv -iv ~/Downloads/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf models/
find . -name llama-cli
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
./build/bin/llama-cli -m models/DeepSeek-R1-Distill-Qwen-7B-Q6_K_L.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
LJ Tue 21 Jan 20:10:47 GMT 2025
Install via brew on mac
brew install llama.cpp
Compile from git on mac - the default build failed, so point cmake at clang
cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build build --config Release
Download model
https://huggingface.co/lmstudio-community/Mistral-Small-24B-Instruct-2501-GGUF/tree/main
Move model in place
ljubomir@macbook2(:):~/llama.cpp$ mv -iv ~/Downloads/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf models/
Run own compiled build
./build/bin/llama-cli -m models/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
Run brew version
llama-cli -m models/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf -p "Write 4 lines on living a purposeful life. A purposeful life is"
Increase VRAM limit to 90 GB
sudo sysctl iogpu.wired_limit_mb=90000
Use the http server, access on http://127.0.0.1:8080
llama-server -c 32768 -ub 64 -m models/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf
"Run R1 in 2 commands" - ggerganov, https://x.com/ggerganov/status/1884520481476198685
# source at https://github.com/ggerganov/llama.cpp
brew install llama.cpp
# increase vram limit to 180 GB
sudo sysctl iogpu.wired_limit_mb=180000
# downloads ~150GB, requires ~180 GB VRAM, access on http://127.0.0.1:8080
llama-server -c 8192 -ub 64 --model-url https://huggingface.co/unsloth/DeepSeek-R1-GGUF/blob/main/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
LJ Fri 31 Jan 2025 00:08:33 GMT
1M context models
https://simonwillison.net/2025/Jan/26/qwen25-1m/
Models gguf
https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF/tree/main
https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF/tree/main
Running using llm
llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
Run
files-to-prompt \
~/Dropbox/Development/llm \
-e py -c | \
llm -m q1m 'describe this codebase in detail' \
-o num_ctx 80000
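files-to-prompt essentially walks the tree and concatenates the matching files, each preceded by its path, which is what the pipeline above feeds into the 1M-context model. A rough Python sketch of the idea (simplified; not the real tool, whose output format differs):

```python
import os
import tempfile

def files_to_prompt(root, ext=".py"):
    # Walk `root` and concatenate matching files, each with a path header.
    chunks = []
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            if name.endswith(ext):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    chunks.append(f"{path}\n---\n{f.read()}")
    return "\n\n".join(chunks)

# Tiny demo on a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.py"), "w") as f:
        f.write("print('hi')\n")
    combined = files_to_prompt(d)
```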
Using mlx
https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-4bit
mlx-community/Qwen2.5-7B-Instruct-1M-4bit
The Model mlx-community/Qwen2.5-7B-Instruct-1M-4bit was converted to MLX format from Qwen/Qwen2.5-7B-Instruct-1M using mlx-lm version 0.21.1.
Use with mlx
pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-1M-4bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
Community mlx
https://huggingface.co/mlx-community
MLX Community
A community org for model weights compatible with mlx-examples powered by MLX.
These are pre-converted weights and ready to be used in the example scripts.
Quick start for LLMs
Install mlx-lm:
pip install mlx-lm
You can use mlx-lm from the command line. For example:
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "hello"
This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.
For a full list of options run:
mlx_lm.generate --help
To quantize a model from the command line run:
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
For more options run:
mlx_lm.convert --help
You can upload new models to Hugging Face by specifying --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community you can do:
mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
-q \
--upload-repo mlx-community/my-4bit-mistral
For more details on the API checkout the full README
Other Examples:
For more examples, visit the MLX Examples repo. The repo includes examples of:
Parameter efficient fine tuning with LoRA
Speech recognition with Whisper
Image generation with Stable Diffusion
and many other examples of different machine learning applications and algorithms.
LJ Fri 31 Jan 2025 10:42:28 GMT
VSCode llama auto-complete (llama-vscode addon)
Server - use the raw pre-trained model, *not* Instruct
llama-server -hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF --port 8012 -ngl 99 -fa -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256
common_download_file: file metadata saved: /Users/ljubomir/Library/Caches/llama.cpp/ggml-org_Qwen2.5-Coder-7B-Q8_0-GGUF_qwen2.5-coder-7b-q8_0.gguf.json
Symlink the cached download into models/
ljubomir@macbook2(:):~/llama.cpp/models$ ln -s ~/Library/Caches/llama.cpp/ggml-org_Qwen2.5-Coder-7B-Q8_0-GGUF_qwen2.5-coder-7b-q8_0.gguf
ljubomir@macbook2(:):~/llama.cpp/models$ l ggml-org_Qwen2.5-Coder-7B-Q8_0-GGUF_qwen2.5-coder-7b-q8_0.gguf
lrwx------@ 1 ljubomir staff 103B 1 Feb 09:16 ggml-org_Qwen2.5-Coder-7B-Q8_0-GGUF_qwen2.5-coder-7b-q8_0.gguf -> /Users/ljubomir/Library/Caches/llama.cpp/ggml-org_Qwen2.5-Coder-7B-Q8_0-GGUF_qwen2.5-coder-7b-q8_0.gguf
LJ Sat 1 Feb 2025 09:19:30 GMT
Run on 1TB box cs1dprsrch03
Clone the repo
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
[ljubomir@cs1dprsrch03 ~]$ git clone https://github.com/ggerganov/llama.cpp.git
Configure
[ljubomir@cs1dprsrch03 llama.cpp]$ export CC=gcc
[ljubomir@cs1dprsrch03 llama.cpp]$ export CXX=g++
cmake -B build -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
[ljubomir@cs1dprsrch03 llama.cpp]$ cmake -B build -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
Build
cmake --build build --config Release
[ljubomir@cs1dprsrch03 llama.cpp]$ cmake --build build --config Release
Download the model files for the 130 GB 1.58bit model
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
Do NOT concatenate the parts into a single gguf (cat DeepSeek-R1-UD-IQ1_S-0000{1,2,3}-of-00003.gguf >models/DeepSeek-R1-UD-IQ1_S.gguf) - llama.cpp does not seem to like that, while the model left in parts works fine
ljubomir@cs1dprsrch03(1887635.llama.cpp:0):~/llama.cpp
$ mv -iv DeepSeek-R1-UD-IQ1_S-0000{1,2,3}-of-00003.gguf models/
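The multi-part GGUF files follow a `-%05d-of-%05d.gguf` naming pattern, numbered from 00001. A small Python sketch (the helper name is mine, not part of llama.cpp) that expands the first part's filename into the full list of parts, handy for checking a download is complete:

```python
import re

def split_part_names(first_part: str) -> list[str]:
    # Parse the "-00001-of-00003.gguf" suffix used by llama.cpp split models.
    m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", first_part)
    if m is None:
        raise ValueError("not a split GGUF filename")
    total = int(m.group(2))
    stem = first_part[: m.start()]
    # Parts are numbered 00001..total, zero-padded to five digits.
    return [f"{stem}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]

parts = split_part_names("DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf")
```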
Run the model
build/bin/llama-cli -c 8192 -ub 64 --model models/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf -p "How many letters R in the word STRAWBERRY?"
[ljubomir@cs1dprsrch03 llama.cpp]$ build/bin/llama-cli -c 8192 -ub 64 --model models/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf -p "How many letters R in the word STRAWBERRY?"
DL 4bit quants
ljubomir@cs1dprsrch03(3839843.llama3:0):~/llama.cpp/models
$ for a in {1..9}; do echo wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-0000${a}-of-00009.gguf; done
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00002-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00003-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00004-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00005-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00006-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00007-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00008-of-00009.gguf
wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00009-of-00009.gguf
ljubomir@cs1dprsrch03(3839843.llama3:0):~/llama.cpp/models
$ for a in {1..9}; do wget https://huggingface.co/unsloth/DeepSeek-R1-GGUF/resolve/main/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-0000${a}-of-00009.gguf; done
Run 4bit model
build/bin/llama-cli -c 8192 -ub 64 --model models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -p "How many letters R in the word STRAWBERRY?"
Large context, large batch sizes
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
Common Options
-c N, --ctx-size N: Set the size of the prompt context. The default is 4096, but if a LLaMA model was built with a longer context, increasing this value will provide better results for longer input/inference.
Batch Size
-ub N, --ubatch-size N: Physical batch size. This is the maximum number of tokens that may be processed at a time. Increasing this value may improve performance during prompt processing, at the expense of higher memory usage. Default: 512.
-b N, --batch-size N: Logical batch size. Increasing this value above the value of the physical batch size may improve prompt processing performance when using multiple GPUs with pipeline parallelism. Default: 2048.
build/bin/llama-cli --ctx-size 131072 --ubatch-size 1024 --model models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -p "How many letters R in the word STRAWBERRY?"
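The relationship between the logical batch (-b) and the physical ubatch (-ub) can be pictured as plain chunking: the prompt is split into logical batches of at most n_batch tokens, and each of those is processed in physical chunks of at most n_ubatch, which is what bounds peak memory. A simplified Python model of the scheduling (not llama.cpp's actual implementation):

```python
def chunk(tokens, size):
    # Yield consecutive slices of at most `size` tokens.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def process_prompt(tokens, n_batch=2048, n_ubatch=512):
    # Each logical batch is further split into physical ubatches.
    ubatches = []
    for batch in chunk(tokens, n_batch):
        ubatches.extend(chunk(batch, n_ubatch))
    return ubatches

# 5000 tokens -> logical batches of 2048, 2048, 904 -> ten ubatches of <= 512
ubatches = process_prompt(list(range(5000)))
```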
Use defaults
build/bin/llama-cli --model models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -p "How many letters R in the word STRAWBERRY?"
Run using models in /data
ssh cs1dprsrch03
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
cmake --build build --config Release
build/bin/llama-cli -c 8192 -ub 64 --model /data/models/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf -p "How many letters R in the word STRAWBERRY?" # (press ENTER)
Or using more memory
build/bin/llama-cli -c 65536 -ub 2048 --model /data/models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -p "How many letters R in the word STRAWBERRY?" # (press ENTER)
Max context ("llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (163840) -- the full capacity of the model will not be utilized")
build/bin/llama-cli -c 163840 -ub 4096 --model /data/models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -p "How many letters R in the word STRAWBERRY?" # (press ENTER)
The above never finishes - so run with defaults
build/bin/llama-cli --model /data/models/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
LJ Sun 2 Feb 2025 13:16:11 GMT
cd llama.cpp
git pull
cmake -B build -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
cmake --build build --config Release
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$ build/bin/llama-cli --help
----- common params -----
-h, --help, --usage print usage and exit
--version show version and build info
--completion-bash print source-able bash completion script for llama.cpp
--verbose-prompt print a verbose prompt before generation (default: false)
-t, --threads N number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 4096, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until
context filled)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
-fa, --flash-attn enable Flash Attention (default: disabled)
(env: LLAMA_ARG_FLASH_ATTN)
-p, --prompt PROMPT prompt to start generation with; for system message, use -sys
--no-perf disable internal libllama performance timings (default: false)
(env: LLAMA_ARG_NO_PERF)
-f, --file FNAME a file containing the prompt (default: none)
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-e, --escape process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape do not process escape sequences
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (efault: loaded from
model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
(env: LLAMA_ARG_YARN_BETA_FAST)
-dkvc, --dump-kv-cache verbose print of the KV cache
-nkvo, --no-kv-offload disable KV offload
(env: LLAMA_ARG_NO_KV_OFFLOAD)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (default: 0.1, < 0 - disabled)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N number of parallel sequences to decode (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--no-mmap do not memory-map model (slower load but may reduce pageouts if not
using mlock)
(env: LLAMA_ARG_NO_MMAP)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't
offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
--override-tensor, -ot <tensor name pattern>=<buffer type>,...
override tensor buffer type
-ngl, --gpu-layers, --n-gpu-layers N number of layers to store in VRAM
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE advanced option to override model metadata by key. may be specified
multiple times.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE path to LoRA adapter with user defined scaling (can be repeated to use
multiple adapters)
--control-vector FNAME add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE add a control vector with user defined scaling SCALE
note: this argument can be repeated to add multiple scaled control
vectors
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path (default: `models/$filename` with filename from `--hf-file`
or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-hf, -hfr, --hf-repo <user>/<model>[:quant]
Hugging Face model repository; quant is optional, case-insensitive,
default to Q4_K_M, or falls back to the first file in the repo if
Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add
--no-mmproj
example: unsloth/phi-4-GGUF:q4_k_m
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in
--hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]
Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
--log-colors Enable colored logging
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for
debugging)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be
ignored.
(env: LLAMA_LOG_VERBOSITY)
--log-prefix Enable prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps Enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)
----- sampling params -----
--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default: penalties;dry;top_k;typ_p;top_p;min_p;xtc;temperature)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampling-seq, --sampler-seq SEQUENCE
simplified sequence for samplers that will be used (default: edkypmxt)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--temp N temperature (default: 0.8)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--top-nsigma N top-n-sigma sampling (default: -1.0, -1.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers
('\n', ':', '"', '*') in the process; use "none" to not use any
sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir) (default: '')
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
`{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations
(https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
----- example-specific params -----
--no-display-prompt don't print prompt at generation (default: false)
-co, --color colorise output to distinguish prompt and user input from generations
(default: false)
--no-context-shift disables context shift on infinite text generation (default: disabled)
(env: LLAMA_ARG_NO_CONTEXT_SHIFT)
-sys, --system-prompt PROMPT system prompt to use with model (if applicable, depending on chat
template)
-sysf, --system-prompt-file FNAME a file containing the system prompt (default: none)
-ptc, --print-token-count N print token count every N tokens (default: -1)
--prompt-cache FNAME file to cache prompt state for faster startup (default: none)
--prompt-cache-all if specified, saves user input and generations to cache as well
--prompt-cache-ro if specified, uses the prompt cache but does not update it
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-sp, --special special tokens output enabled (default: false)
-cnv, --conversation run in conversation mode:
- does not print special tokens and suffix/prefix
- interactive mode is also enabled
(default: auto enabled if chat template is available)
-no-cnv, --no-conversation force disable conversation mode (default: false)
-st, --single-turn run conversation for a single turn only, then exit when done
will not be interactive if first turn is predefined with --prompt
(default: false)
-i, --interactive run in interactive mode (default: false)
-if, --interactive-first run in interactive mode and wait for input right away (default: false)
-mli, --multiline-input allows you to write or paste multiple lines without ending each in '\'
--in-prefix-bos prefix BOS to user inputs, preceding the `--in-prefix` string
--in-prefix STRING string to prefix user inputs with (default: empty)
--in-suffix STRING string to suffix after user inputs with (default: empty)
--no-warmup skip warming up the model with an empty run
-gan, --grp-attn-n N group-attention factor (default: 1)
(env: LLAMA_ARG_GRP_ATTN_N)
-gaw, --grp-attn-w N group-attention width (default: 512)
(env: LLAMA_ARG_GRP_ATTN_W)
--jinja use jinja template for chat (default: disabled)
(env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT reasoning format (default: deepseek; allowed values: deepseek, none)
controls whether thought tags are extracted from the response, and in
which format they're returned. 'none' leaves thoughts unparsed in
`message.content`, 'deepseek' puts them in `message.reasoning_content`
(for DeepSeek R1 & Command R7B only).
only supported for non-streamed responses
(env: LLAMA_ARG_THINK)
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's
metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm,
vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm,
vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--simple-io use basic IO for better compatibility in subprocesses and limited
consoles
example usage:
text generation: build/bin/llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
chat (conversation): build/bin/llama-cli -m your_model.gguf -sys "You are a helpful assistant"
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
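The --grammar / -j / -jf constrained-generation options above can be sketched like this. A minimal sketch, assuming a hypothetical model path; the schema is deliberately tiny:

```shell
# Write a minimal JSON schema to constrain generation to a JSON object
# with a required string field "answer".
cat > answer.schema.json <<'EOF'
{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}
EOF
# Then pass it with -jf (model path is a placeholder):
# build/bin/llama-cli -m your_model.gguf -jf answer.schema.json \
#   -p 'Reply with a JSON object containing an "answer" field.' -no-cnv
cat answer.schema.json
```

For schemas with external $refs, use --grammar plus example/json_schema_to_grammar.py instead, as the help text notes.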
Test
build/bin/llama-cli --model models/Qwen3-30B-A3B-Q8_0.gguf -p "How many letters R in the word STRAWBERRY?"
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$ build/bin/llama-cli --model models/Qwen3-30B-A3B-Q8_0.gguf -p "How many letters R in the word STRAWBERRY?"
build: 5303 (becc6f18) with gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 43 key-value pairs and 579 tensors from models/Qwen3-30B-A3B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
llama_model_loader: - kv 13: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 14: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 15: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 16: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 17: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 18: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 19: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 23: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 7
llama_model_loader: - kv 39: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 40: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt
llama_model_loader: - kv 41: quantize.imatrix.entries_count i32 = 384
llama_model_loader: - kv 42: quantize.imatrix.chunks_count i32 = 32
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q8_0: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 30.25 GiB (8.51 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF/blob/main/README.md
Instructions to run this model in llama.cpp:
You can also view more detailed instructions here: unsloth.ai/blog/deepseek-r1
Do not forget the <|User|> and <|Assistant|> tokens, or use a chat template formatter.
Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp
Example with a Q8_0-quantized K cache. Note that -no-cnv disables auto conversation mode.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
-no-cnv
Example output:
<think>
Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly.
Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense.
Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything.
I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right.
Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...
If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
--n-gpu-layers 20 \
-no-cnv
Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
Unsloth supports Free Notebooks Performance Memory use
Llama-3.2 (3B) ▶️ Start on Colab 2.4x faster 58% less
Llama-3.2 (11B vision) ▶️ Start on Colab 2x faster 60% less
Qwen2 VL (7B) ▶️ Start on Colab 1.8x faster 60% less
Qwen2.5 (7B) ▶️ Start on Colab 2x faster 60% less
Llama-3.1 (8B) ▶️ Start on Colab 2.4x faster 58% less
Phi-3.5 (mini) ▶️ Start on Colab 2x faster 50% less
Gemma 2 (9B) ▶️ Start on Colab 2.4x faster 58% less
Mistral (7B) ▶️ Start on Colab 2.2x faster 62% less
This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
Special Thanks
A huge thank you to the DeepSeek team for creating and releasing these models.
DeepSeek-R1
DeepSeek-V3
Homepage Chat Hugging Face
Discord Wechat Twitter Follow
Code License Model License
Paper Link👁️
1. Introduction
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section.
2. Model Summary
Post-Training: Large-Scale Reinforcement Learning on the Base Model
We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
Distillation: Smaller Models Can Be Powerful Too
We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
3. Model Downloads
DeepSeek-R1 Models
Model #Total Params #Activated Params Context Length Download
DeepSeek-R1-Zero 671B 37B 128K 🤗 HuggingFace
DeepSeek-R1 671B 37B 128K 🤗 HuggingFace
DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository.
DeepSeek-R1-Distill Models
Model Base Model Download
DeepSeek-R1-Distill-Qwen-1.5B Qwen2.5-Math-1.5B 🤗 HuggingFace
DeepSeek-R1-Distill-Qwen-7B Qwen2.5-Math-7B 🤗 HuggingFace
DeepSeek-R1-Distill-Llama-8B Llama-3.1-8B 🤗 HuggingFace
DeepSeek-R1-Distill-Qwen-14B Qwen2.5-14B 🤗 HuggingFace
DeepSeek-R1-Distill-Qwen-32B Qwen2.5-32B 🤗 HuggingFace
DeepSeek-R1-Distill-Llama-70B Llama-3.3-70B-Instruct 🤗 HuggingFace
DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models.
4. Evaluation Results
DeepSeek-R1-Evaluation
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
Category Benchmark (Metric) Claude-3.5-Sonnet-1022 GPT-4o 0513 DeepSeek V3 OpenAI o1-mini OpenAI o1-1217 DeepSeek R1
Architecture - - MoE - - MoE
# Activated Params - - 37B - - 37B
# Total Params - - 671B - - 671B
English MMLU (Pass@1) 88.3 87.2 88.5 85.2 91.8 90.8
MMLU-Redux (EM) 88.9 88.0 89.1 86.7 - 92.9
MMLU-Pro (EM) 78.0 72.6 75.9 80.3 - 84.0
DROP (3-shot F1) 88.3 83.7 91.6 83.9 90.2 92.2
IF-Eval (Prompt Strict) 86.5 84.3 86.1 84.8 - 83.3
GPQA-Diamond (Pass@1) 65.0 49.9 59.1 60.0 75.7 71.5
SimpleQA (Correct) 28.4 38.2 24.9 7.0 47.0 30.1
FRAMES (Acc.) 72.5 80.5 73.3 76.9 - 82.5
AlpacaEval2.0 (LC-winrate) 52.0 51.1 70.0 57.8 - 87.6
ArenaHard (GPT-4-1106) 85.2 80.4 85.5 92.0 - 92.3
Code LiveCodeBench (Pass@1-COT) 33.8 34.2 - 53.8 63.4 65.9
Codeforces (Percentile) 20.3 23.6 58.7 93.4 96.6 96.3
Codeforces (Rating) 717 759 1134 1820 2061 2029
SWE Verified (Resolved) 50.8 38.8 42.0 41.6 48.9 49.2
Aider-Polyglot (Acc.) 45.3 16.0 49.6 32.9 61.7 53.3
Math AIME 2024 (Pass@1) 16.0 9.3 39.2 63.6 79.2 79.8
MATH-500 (Pass@1) 78.3 74.6 90.2 90.0 96.4 97.3
CNMO 2024 (Pass@1) 13.1 10.8 43.2 67.6 - 78.8
Chinese CLUEWSC (EM) 85.4 87.9 90.9 89.9 - 92.8
C-Eval (EM) 76.7 76.0 86.5 68.9 - 91.8
C-SimpleQA (Correct) 55.4 58.7 68.0 40.3 - 63.7
Distilled Model Evaluation
Model AIME 2024 pass@1 AIME 2024 cons@64 MATH-500 pass@1 GPQA Diamond pass@1 LiveCodeBench pass@1 CodeForces rating
GPT-4o-0513 9.3 13.4 74.6 49.9 32.9 759
Claude-3.5-Sonnet-1022 16.0 26.7 78.3 65.0 38.9 717
o1-mini 63.6 80.0 90.0 60.0 53.8 1820
QwQ-32B-Preview 44.0 60.0 90.6 54.5 41.9 1316
DeepSeek-R1-Distill-Qwen-1.5B 28.9 52.7 83.9 33.8 16.9 954
DeepSeek-R1-Distill-Qwen-7B 55.5 83.3 92.8 49.1 37.6 1189
DeepSeek-R1-Distill-Qwen-14B 69.7 80.0 93.9 59.1 53.1 1481
DeepSeek-R1-Distill-Qwen-32B 72.6 83.3 94.3 62.1 57.2 1691
DeepSeek-R1-Distill-Llama-8B 50.4 80.0 89.1 49.0 39.6 1205
DeepSeek-R1-Distill-Llama-70B 70.0 86.7 94.5 65.2 57.5 1633
5. Chat Website & API Platform
You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, and switch on the "DeepThink" button.
We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com
6. How to Run Locally
DeepSeek-R1 Models
Please visit DeepSeek-V3 repo for more information about running DeepSeek-R1 locally.
DeepSeek-R1-Distill Models
DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
For instance, you can easily start a service using vLLM:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
You can also easily start a service using SGLang:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
Usage Recommendations
We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:
Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
Avoid adding a system prompt; all instructions should be contained within the user prompt.
For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
When evaluating model performance, it is recommended to conduct multiple tests and average the results.
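The recommendations above can be sketched as a prompt-building step for llama-cli: no system prompt, temperature 0.6, and the boxed-answer directive placed inside the user turn (model path below is a placeholder):

```shell
# Assemble a user-turn-only prompt with the step-by-step/boxed directive.
question='What is 12*13?'
prompt="<|User|>${question} Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>"
echo "$prompt"
# ./llama.cpp/llama-cli --model <model.gguf> --temp 0.6 --prompt "$prompt" -no-cnv
```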
7. License
This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:
DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from Qwen-2.5 series, which are originally licensed under Apache 2.0 License, and now finetuned with 800k samples curated with DeepSeek-R1.
DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under llama3.1 license.
DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under llama3.3 license.
8. Citation
@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
author={DeepSeek-AI and Daya Guo and Dejian Yang and Haowei Zhang and Junxiao Song and Ruoyu Zhang and Runxin Xu and Qihao Zhu and Shirong Ma and Peiyi Wang and Xiao Bi and Xiaokang Zhang and Xingkai Yu and Yu Wu and Z. F. Wu and Zhibin Gou and Zhihong Shao and Zhuoshu Li and Ziyi Gao and Aixin Liu and Bing Xue and Bingxuan Wang and Bochao Wu and Bei Feng and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Qu and Hui Li and Jianzhong Guo and Jiashi Li and Jiawei Wang and Jingchang Chen and Jingyang Yuan and Junjie Qiu and Junlong Li and J. L. Cai and Jiaqi Ni and Jian Liang and Jin Chen and Kai Dong and Kai Hu and Kaige Gao and Kang Guan and Kexin Huang and Kuai Yu and Lean Wang and Lecong Zhang and Liang Zhao and Litong Wang and Liyue Zhang and Lei Xu and Leyi Xia and Mingchuan Zhang and Minghua Zhang and Minghui Tang and Meng Li and Miaojun Wang and Mingming Li and Ning Tian and Panpan Huang and Peng Zhang and Qiancheng Wang and Qinyu Chen and Qiushi Du and Ruiqi Ge and Ruisong Zhang and Ruizhe Pan and Runji Wang and R. J. Chen and R. L. Jin and Ruyi Chen and Shanghao Lu and Shangyan Zhou and Shanhuang Chen and Shengfeng Ye and Shiyu Wang and Shuiping Yu and Shunfeng Zhou and Shuting Pan and S. S. Li and Shuang Zhou and Shaoqing Wu and Shengfeng Ye and Tao Yun and Tian Pei and Tianyu Sun and T. Wang and Wangding Zeng and Wanjia Zhao and Wen Liu and Wenfeng Liang and Wenjun Gao and Wenqin Yu and Wentao Zhang and W. L. Xiao and Wei An and Xiaodong Liu and Xiaohan Wang and Xiaokang Chen and Xiaotao Nie and Xin Cheng and Xin Liu and Xin Xie and Xingchao Liu and Xinyu Yang and Xinyuan Li and Xuecheng Su and Xuheng Lin and X. Q. 
Li and Xiangyue Jin and Xiaojin Shen and Xiaosha Chen and Xiaowen Sun and Xiaoxiang Wang and Xinnan Song and Xinyi Zhou and Xianzu Wang and Xinxia Shan and Y. K. Li and Y. Q. Wang and Y. X. Wei and Yang Zhang and Yanhong Xu and Yao Li and Yao Zhao and Yaofeng Sun and Yaohui Wang and Yi Yu and Yichao Zhang and Yifan Shi and Yiliang Xiong and Ying He and Yishi Piao and Yisong Wang and Yixuan Tan and Yiyang Ma and Yiyuan Liu and Yongqiang Guo and Yuan Ou and Yuduan Wang and Yue Gong and Yuheng Zou and Yujia He and Yunfan Xiong and Yuxiang Luo and Yuxiang You and Yuxuan Liu and Yuyang Zhou and Y. X. Zhu and Yanhong Xu and Yanping Huang and Yaohui Li and Yi Zheng and Yuchen Zhu and Yunxian Ma and Ying Tang and Yukun Zha and Yuting Yan and Z. Z. Ren and Zehui Ren and Zhangli Sha and Zhe Fu and Zhean Xu and Zhenda Xie and Zhengyan Zhang and Zhewen Hao and Zhicheng Ma and Zhigang Yan and Zhiyu Wu and Zihui Gu and Zijia Zhu and Zijun Liu and Zilin Li and Ziwei Xie and Ziyang Song and Zizheng Pan and Zhen Huang and Zhipeng Xu and Zhongyu Zhang and Zhen Zhang},
year={2025},
eprint={2501.12948},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.12948},
}
9. Contact
If you have any questions, please raise an issue or contact us at service@deepseek.com.
Run
./llama.cpp/build/bin/llama-cli \
--model ./llama.cpp/models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
--cache-type-k q8_0 \
--threads 16 \
--prompt '<|User|>What is 1+1?<|Assistant|>' \
-no-cnv
ljubomir@macbook2(:):~$ ./llama.cpp/build/bin/llama-cli --model ./llama.cpp/models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf --cache-type-k q8_0 --threads 16 --prompt '<|User|>What is 1+1?<|Assistant|>' -no-cnv
build: 4719 (e6d7a014) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 73727 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from ./llama.cpp/models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Llama 70B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.3
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = DeepSeek R1 Distill Llama 70B
llama_model_loader: - kv 9: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 11: general.tags arr[str,6] = ["deepseek", "unsloth", "transformers...
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: llama.block_count u32 = 80
llama_model_loader: - kv 14: llama.context_length u32 = 131072
llama_model_loader: - kv 15: llama.embedding_length u32 = 8192
llama_model_loader: - kv 16: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 17: llama.attention.head_count u32 = 64
llama_model_loader: - kv 18: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: llama.attention.key_length u32 = 128
llama_model_loader: - kv 22: llama.attention.value_length u32 = 128
llama_model_loader: - kv 23: llama.vocab_size u32 = 128256
llama_model_loader: - kv 24: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 36: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 15
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 39.59 GiB (4.82 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 28672
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-30B-A3B
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 30973.40 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init: CPU KV buffer size = 384.00 MiB
llama_context: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: CPU compute buffer size = 300.75 MiB
llama_context: graph nodes = 3126
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 10 (n_threads_batch = 10) / 10 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3760237940
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
user
How many letters R in the word STRAWBERRY?
assistant
<think>
....................................................................
....................................................................
....................................................................
</think>
The word **STRAWBERRY** is spelled as: **S-T-R-A-W-B-E-R-R-Y**.
Breaking it down letter by letter:
1. **S**
2. **T**
3. **R** (1st **R**)
4. **A**
5. **W**
6. **B**
7. **E**
8. **R** (2nd **R**)
9. **R** (3rd **R**)
10. **Y**
There are **three** instances of the letter **R** in the word.
**Answer:** 3
>
llama_perf_sampler_print: sampling time = 91.61 ms / 810 runs ( 0.11 ms per token, 8842.31 tokens per second)
llama_perf_context_print: load time = 4129.60 ms
llama_perf_context_print: prompt eval time = 1357.26 ms / 20 tokens ( 67.86 ms per token, 14.74 tokens per second)
llama_perf_context_print: eval time = 210115.88 ms / 789 runs ( 266.31 ms per token, 3.76 tokens per second)
llama_perf_context_print: total time = 1401353.05 ms / 809 tokens
Interrupted by user
print_info: model type = 70B
print_info: model params = 70.55 B
print_info: general.name = DeepSeek R1 Distill Llama 70B
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin▁of▁sentence|>'
print_info: EOS token = 128001 '<|end▁of▁sentence|>'
print_info: EOT token = 128001 '<|end▁of▁sentence|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end▁of▁sentence|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Metal_Mapped model buffer size = 40543.11 MiB
load_tensors: CPU_Mapped model buffer size = 563.62 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 77309.41 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'q8_0', type_v = 'f16', n_layer = 80, can_shift = 1
llama_kv_cache_init: Metal KV buffer size = 980.00 MiB
llama_init_from_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB
llama_init_from_model: CPU output buffer size = 0.49 MiB
llama_init_from_model: Metal compute buffer size = 584.00 MiB
llama_init_from_model: CPU compute buffer size = 24.01 MiB
llama_init_from_model: graph nodes = 2566
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | MATMUL_INT8 = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
sampler seed: 3589105319
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
What is 1+1?<think>
First, I recognize that the problem is asking for the sum of 1 and 1.
I recall that in basic arithmetic, adding two numbers combines their values to produce a total.
Therefore, when I add 1 and 1 together, the result is 2.
</think>
Certainly! Let's solve the problem step by step.
**Problem:** What is \(1 + 1\)?
**Solution:**
1. **Understand the Operation:**
We are asked to add two numbers: 1 and 1.
2. **Perform the Addition:**
\[
1 + 1 = 2
\]
3. **Conclusion:**
The sum of 1 and 1 is **2**.
**Final Answer:** \(\boxed{2}\) [end of text]
llama_perf_sampler_print: sampling time = 8.99 ms / 176 runs ( 0.05 ms per token, 19579.49 tokens per second)
llama_perf_context_print: load time = 25525.19 ms
llama_perf_context_print: prompt eval time = 673.78 ms / 10 tokens ( 67.38 ms per token, 14.84 tokens per second)
llama_perf_context_print: eval time = 34852.22 ms / 165 runs ( 211.23 ms per token, 4.73 tokens per second)
llama_perf_context_print: total time = 35570.00 ms / 175 tokens
ggml_metal_free: deallocating
ljubomir@macbook2(:):~$
LJ Wed 30 Apr 15:48:18 BST 2025
https://x.com/danielhanchen/status/1928278088951157116
https://unsloth.ai/blog/deepseek-r1-0528
Run DeepSeek-R1-0528 Dynamic 1-bit GGUFs
May 29, 2025 • By Daniel & Michael
DeepSeek-R1-0528 is DeepSeek's new update to their R1 reasoning model. R1-0528 is the world's most powerful open-source model, rivalling OpenAI's GPT-4.5, o3 and Google's Gemini 2.5 Pro.
DeepSeek also released a R1-0528 distilled version by fine-tuning Qwen3 (8B). The distill achieves the same performance as Qwen3 (235B). Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF
You can also fine-tune the Qwen3 model with Unsloth.
You can run the model using Unsloth's 1.78-bit Dynamic 2.0 GGUFs on your favorite inference frameworks. We quantized DeepSeek’s R1 671B parameter model from 720GB down to 185GB - a 75% size reduction.
Recommended: Read our Complete Guide for a walkthrough on how to run DeepSeek-R1-0528 locally.
To ensure the best tradeoff between accuracy and size, we do not quantize all layers uniformly; instead we selectively quantize, e.g. pushing the MoE layers to lower bits while leaving attention and other layers at 4 or 6 bits.
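The selective-quantization idea above can be sketched as a name-pattern lookup. This is illustrative only: the patterns and bit choices below are assumptions for the sketch, not Unsloth's actual recipe.

```python
import re

# Illustrative rules: route MoE expert tensors to the lowest-bit quant,
# keep attention tensors at higher precision. Patterns are assumptions.
RULES = [
    (re.compile(r"ffn_.*_exps"), "iq1_s"),  # MoE expert weights -> lowest bits
    (re.compile(r"attn_"), "q6_k"),         # attention projections -> higher bits
]

def pick_quant(tensor_name: str, default: str = "q4_k") -> str:
    """Return a quant type for a tensor based on its name."""
    for pattern, quant in RULES:
        if pattern.search(tensor_name):
            return quant
    return default

print(pick_quant("blk.0.ffn_down_exps.weight"))  # iq1_s
print(pick_quant("blk.0.attn_q.weight"))         # q6_k
```

The point is only that quant type is decided per tensor, keyed off the tensor name, rather than one bit-width for the whole model.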
And grab our full DeepSeek-R1-0528 GGUFs here
🐋How to Run DeepSeek-R1-0528
For DeepSeek-R1-0528-Qwen3-8B, the model can fit in pretty much any setup, even those with as little as 20GB of RAM. There is no need for any prep beforehand.
Qwen3 and the full R1-0528 model use the same settings and chat template.
According to DeepSeek, these are the recommended settings for R1 (R1-0528 should use the same settings) inference:
- Set the temperature to 0.6 to reduce repetition and incoherence.
- Set top_p to 0.95 (recommended)
- Run multiple tests and average results for reliable evaluation.
For optimal runtime performance, we recommend using the 2.71-bit Dynamic version and ensuring you have at least 80GB of combined VRAM and system RAM. While it's technically possible to run the model without a GPU, we advise against it, unless you're leveraging Apple's unified memory chips.
For the 1.78-bit quantization:
- On 1x 24GB GPU (with all layers offloaded), you can expect up to 20 tokens/second throughput and around 4 tokens/second for single-user inference.
- Try to have a combination of RAM + VRAM that adds up to the size of the quant you're downloading.
- A 24GB GPU like the RTX 4090 should achieve 3 tokens/second, depending on workload and configuration.
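The RAM+VRAM rule of thumb from the bullets above can be written as a quick check. The quant sizes come from this post (185GB for the 1-bit dynamic quant, 251GB for UD-Q2_K_XL); the helper name is mine.

```python
# Quant sizes (GB) quoted in the text.
QUANT_GB = {"UD-IQ1_S": 185, "UD-Q2_K_XL": 251}

def can_run(quant: str, vram_gb: float, ram_gb: float) -> bool:
    """Rule of thumb: combined RAM + VRAM should cover the quant size."""
    return vram_gb + ram_gb >= QUANT_GB[quant]

# e.g. a 24GB GPU plus 192GB of system RAM covers the 1-bit quant:
print(can_run("UD-IQ1_S", 24, 192))  # True
```

It is only a lower bound: KV cache and compute buffers add on top, so leaving headroom is sensible.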
🦙 How to Run R1-0528-Qwen3-8B in Ollama:
Install ollama if you haven't already! You can only run models up to 32B in size. To run the full 720GB R1-0528 model, see here.
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload.
ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
To disable thinking, use (or you can set it in the system prompt):
>>> Write your prompt here
✨ How to Run R1-0528 in llama.cpp:
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below. The :IQ1_S suffix selects the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.
export LLAMA_CACHE="unsloth/DeepSeek-R1-0528-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU"
Download the model via the snippet below (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78-bit quant) or other quantized versions like Q4_K_M. I recommend using our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/DeepSeek-R1-0528-GGUF",
local_dir = "unsloth/DeepSeek-R1-0528-GGUF",
allow_patterns = ["*UD-IQ1_S*"], # Dynamic 1bit (185GB) Use "*UD-Q2_K_XL*" for Dynamic 2bit (251GB)
)
Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.
Edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 2 to choose how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory; remove it entirely for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU" \
-no-cnv \
--prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
We also test our dynamic quants via the Heptagon test which tests the model on creating a basic physics engine to simulate balls rotating in a moving enclosed heptagon shape.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--threads -1 \
--n-gpu-layers 99 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--seed 3407 \
-ot ".ffn_.*_exps.=CPU" \
-no-cnv \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
💕 Thank you!
Thank you for the constant support. We hope to have some great news in the coming weeks! 🙏
As always, be sure to join our Reddit page and Discord server for help or just to show your support! You can also follow us on Twitter and newsletter.
Thank you for reading!
Daniel & Michael Han 🦥
29 May 2025
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF
unset CC
unset CXX
cmake -B build -DLLAMA_CURL=OFF
cmake --build build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
(torch) ljubomir@macbook2(:):~/llama.cpp$ l models/DeepSeek-R1-*
lrwx------@ 1 ljubomir staff 86B 30 May 17:20 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf -> /Users/ljubomir/Library/Caches/llama.cpp/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
lrwx------@ 1 ljubomir staff 86B 30 May 17:20 models/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf -> /Users/ljubomir/Library/Caches/llama.cpp/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf
lrwx------@ 1 ljubomir staff 86B 30 May 17:20 models/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf -> /Users/ljubomir/Library/Caches/llama.cpp/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf
lrwx------@ 1 ljubomir staff 86B 30 May 17:20 models/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf -> /Users/ljubomir/Library/Caches/llama.cpp/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf
Re-transfer incrementally; I suspect the downloads were interrupted and some files are incomplete.
md models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/
mviv ~/Downloads/DeepSeek-R1-0528-UD-IQ1_S-0000* models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S/
uv pip install huggingface_hub hf_transfer hf_xet
```python
from huggingface_hub import snapshot_download
import os
# Ensure hf_transfer is enabled
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# The base directory where your incomplete files are
# snapshot_download will create the full path like 'models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S'
# so ensure your 'models' directory is correctly specified.
target_base_dir = "models"
print(f"Attempting to download/resume the model snapshot to: {target_base_dir}")
local_dir = snapshot_download(
repo_id="unsloth/DeepSeek-R1-0528-GGUF",
allow_patterns="UD-IQ1_S/*",
local_dir=target_base_dir, # This is the base directory
local_dir_use_symlinks=False
)
print(f"Download/Resume complete for the model snapshot at: {local_dir}")
```
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
Git LFS Details
SHA256: 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
Pointer size: 136 Bytes
Size of remote file: 49.1 GB
Xet backed hash: 229375f805e68a1006bcdbd96cea8f23ebabe02f9c7bd6a27598ec0a40c1df0b
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
Length: 49094698368 (46G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf’
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf) = 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf
Git LFS Details
SHA256: 370f32bef60e3b6b074a7216bc2acb3401b79d81b9811742de47b2267068c6f2
Pointer size: 136 Bytes
Size of remote file: 49.8 GB
Xet backed hash: 7bc99aa451223d8366c1e0cf75545284a53cee097e888e8e4fd1d653ea1d73f3
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf'
Length: 49775793088 (46G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf’
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf) = 370f32bef60e3b6b074a7216bc2acb3401b79d81b9811742de47b2267068c6f2
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf
Git LFS Details
SHA256: 197f9e2e1e1ac30b3e6a3474a79483b923f17612394b66a4e71c7badacb4c3d0
Pointer size: 136 Bytes
Size of remote file: 50 GB
Xet backed hash: 5ced4b4f12ddb2a788248765e941a38d5e38a5e8619625c296ec8c05993130bb
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf'
Length: 49955298016 (47G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf’
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf) = 197f9e2e1e1ac30b3e6a3474a79483b923f17612394b66a4e71c7badacb4c3d0
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf
Git LFS Details
SHA256: dac55239e4de7a359d2ac40a0fd374f477f598a0f0d605e54dd1b1267a7401da
Pointer size: 136 Bytes
Size of remote file: 19.5 GB
Xet backed hash: 8c81e2e317c16d57d7990c5906a475f5c19dd706d0a8085b51608b9a37ebc28d
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf'
Length: 19455845600 (18G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf’
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf) = dac55239e4de7a359d2ac40a0fd374f477f598a0f0d605e54dd1b1267a7401da
# After snapshot_download, you might want to manually check each file's size
# as a final verification, especially if you encountered issues before.
build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
LJ Fri 30 May 2025 10:25:30 BST
Hi, thanks for all that, stellar work. I'm trying for the smallest R1 to see what tps I get on MBP M2 96GB RAM.
I'm following this
https://unsloth.ai/blog/deepseek-r1-0528
I run into this problem:
```bash
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
build: 5626 (bc1007a4) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.4.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 73727 MiB free
llama_model_load: error loading model: corrupted model: 1086 tensors expected but 978 found
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
main: error: unable to load model
```
I suspected some of the files didn't download correctly - they looked like this:
```bash
ljubomir@macbook2(:):~/llama.cpp$ ls -al models/DeepSeek-R1-0528-UD-IQ1_S-0000*
-rw-r--r--@ 1 ljubomir staff 49462945024 30 May 13:13 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 48568885664 30 May 14:14 models/DeepSeek-R1-0528-UD-IQ1_S-00002-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 49564076576 30 May 15:30 models/DeepSeek-R1-0528-UD-IQ1_S-00003-of-00004.gguf
-rw-r--r--@ 1 ljubomir staff 19455845600 30 May 16:46 models/DeepSeek-R1-0528-UD-IQ1_S-00004-of-00004.gguf
```
Is it possible to see the exact file sizes, to the byte, in the Hugging Face web UI? Or put the sizes, maybe even a checksum like an md5sum, in a separate file?
Then it got worse. I thought there must be some way to download incrementally - something smart enough to figure out which file is truncated, and maybe even download just the missing part, like rsync would. So I asked Gemini, and it suggested:
```python
from huggingface_hub import snapshot_download
# This will download the entire 'UD-IQ1_S' folder and its contents
# It will create a directory like 'models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S'
local_dir = snapshot_download(
    repo_id="unsloth/DeepSeek-R1-0528-GGUF",
    allow_patterns="UD-IQ1_S/*",  # Only download files within the UD-IQ1_S folder
    local_dir="models/unsloth/DeepSeek-R1-0528-GGUF",  # The base directory to download to
    local_dir_use_symlinks=False,  # Important for full copy
)
print(f"Downloaded model to: {local_dir}")
```
I moved the existing files into a newly created dir models/unsloth/DeepSeek-R1-0528-GGUF/UD-IQ1_S and ran the above in ipython.
Well - turns out it wiped the files completely and it's downloading from scratch now! :-) Haha - expected better than that tbh. We really do need AI, b/c atm our stuff is AS - Artificially Stupid, haha :-) No worries, it's chugging along now, will be done. But if you could provide the file sizes someplace, or even better their md5sums too, so we know when the big files are downloaded correctly, that would be stellar!
Thanks for everything you do guys! It's been great running stuff on localhost, been enjoying it immensely. :-)
LJ Sat 31 May 2025 07:21:47 BST
Ignore the previous comment, seems I can't edit nor delete it anymore? Previously had trouble downloading stuff and ensuring it's correctly downloaded. May help someone else - this worked for me:
1) Use wget to DL, it may restart a failed transfer
```bash
ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
Length: 49094698368 (46G)
Saving to: ‘DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf’
```
2) The checksum is in the "Git LFS Details" on the file page
```bash
https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/blob/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
Git LFS Details
SHA256: 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
Pointer size: 136 Bytes
Size of remote file: 49.1 GB
Xet backed hash: 229375f805e68a1006bcdbd96cea8f23ebabe02f9c7bd6a27598ec0a40c1df0b
```
3) Compute and compare
```bash
(torch) ljubomir@macbook2(:):~/llama.cpp$ sha256 models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf
SHA256 (models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf) = 19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8
```
LJ Sat 31 May 2025 12:45:36 BST
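The compute-and-compare step can be scripted so you don't have to eyeball two 64-character hashes. A small sketch (the file path and expected hash below are just the example values from the "Git LFS Details" above; chunked hashing keeps the 46GB shard from ever loading fully into RAM):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through hashlib in 1 MiB chunks so huge GGUF shards fit in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    """Compare the local file's digest against the hash shown on the HF file page."""
    return sha256_file(path) == expected.lower()

# Example usage (hash taken from the file's "Git LFS Details" page):
# verify("models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf",
#        "19891f28c27908e2ba0402ecc15c7aaa7e48ab5d9e1c6d49096c42e74e8b16b8")
```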
Run:
```bash
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
```
LJ Sat 31 May 2025 13:50:34 BST
Update - alas, it seems the 170GB of weights cannot be run on 96GB RAM (I imagine only 3/4 of it is usable as VRAM) on a MacBook - even when mmap-ed and READ ONLY. I don't see why MacOS would not simply un/re-load pages whenever something is in the address space but not in RAM. TBH I expected it to work even if super slow, so that I'd need to kill the process or (more likely) turn the computer off once it got too stuck.
I put the error into Gemini, but didn't learn anything about how to make it run. This is the output:
```bash
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf \
--cache-type-k q4_0 \
--prio 3 \
--temp 0.6 \
--top_p 0.95 \
--min_p 0.01 \
--ctx-size 16384 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
build: 5626 (bc1007a4) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.4.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 73727 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 62 key-value pairs and 1086 tensors from models/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528
llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 256x20B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 0528
llama_model_loader: - kv 10: general.base_model.0.version str = 0528
llama_model_loader: - kv 11: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 13: general.tags arr[str,1] = ["unsloth"]
llama_model_loader: - kv 14: deepseek2.block_count u32 = 61
llama_model_loader: - kv 15: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 16: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 17: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 18: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 19: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 20: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 21: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 23: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 24: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 25: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 26: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 27: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 28: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 29: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 30: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 31: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 32: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 33: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 34: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 35: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 36: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 37: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 39: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 40: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 41: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 42: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 43: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 44: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 45: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 46: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 47: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 48: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 49: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 50: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 51: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 52: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 53: general.quantization_version u32 = 2
llama_model_loader: - kv 54: general.file_type u32 = 24
llama_model_loader: - kv 55: quantize.imatrix.file str = DeepSeek-R1-0528-GGUF/imatrix_unsloth...
llama_model_loader: - kv 56: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
llama_model_loader: - kv 57: quantize.imatrix.entries_count i32 = 659
llama_model_loader: - kv 58: quantize.imatrix.chunks_count i32 = 720
llama_model_loader: - kv 59: split.no u16 = 0
llama_model_loader: - kv 60: split.tensors.count i32 = 1086
llama_model_loader: - kv 61: split.count u16 = 4
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q4_K: 56 tensors
llama_model_loader: - type q5_K: 36 tensors
llama_model_loader: - type q6_K: 17 tensors
llama_model_loader: - type iq2_xxs: 24 tensors
llama_model_loader: - type iq3_xxs: 49 tensors
llama_model_loader: - type iq1_s: 126 tensors
llama_model_loader: - type iq3_s: 154 tensors
llama_model_loader: - type iq4_xs: 141 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ1_S - 1.5625 bpw
print_info: file size = 156.72 GiB (2.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = Deepseek-R1-0528
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 2 '<|▁pad▁|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: Metal_Mapped model buffer size = 46815.35 MiB
load_tensors: Metal_Mapped model buffer size = 47469.88 MiB
load_tensors: Metal_Mapped model buffer size = 47641.07 MiB
load_tensors: Metal_Mapped model buffer size = 18554.54 MiB
load_tensors: CPU_Mapped model buffer size = 497.11 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 77309.41 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache_unified: Metal KV buffer size = 1284.81 MiB
llama_kv_cache_unified: size = 1284.81 MiB ( 16384 cells, 61 layers, 1 seqs), K (q4_0): 308.81 MiB, V (f16): 976.00 MiB
llama_context: Metal compute buffer size = 4522.00 MiB
llama_context: CPU compute buffer size = 46.01 MiB
llama_context: graph nodes = 4964
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3358851179
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 16384
top_k = 40, top_p = 0.950, min_p = 0.010, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 16384, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main : failed to eval
ggml_metal_free: deallocating
```
Thanks for everything you do guys! Top marks! Have been enjoying this. :-) Will try again in the future on a bigger box.
LJ Sat 31 May 2025 13:50:56 BST
Try running Qwen3-235B-A22B-GGUF
https://huggingface.co/Qwen/Qwen3-235B-A22B
By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses.
Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input
We provide a soft switch mechanism that allows users to dynamically control the model's behavior when enable_thinking=True. Specifically, you can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
For thinking mode (enable_thinking=True), use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed.
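The "No Thinking Content in History" rule above can be sketched in plain Python for frameworks that don't use the Jinja2 chat template (a hedged sketch, not the official template logic; it only assumes the <think>...</think> tag convention quoted above):

```python
import re

# Matches a complete <think>...</think> block plus trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(assistant_text: str) -> str:
    """Keep only the final answer: drop <think> blocks before storing a turn in history."""
    return THINK_RE.sub("", assistant_text).strip()

# Example: store a cleaned assistant turn in a multi-turn history.
history = []
raw = "<think>\nLet me count the r's...\n</think>\n\nThe word has 3 r's."
history.append({"role": "assistant", "content": strip_thinking(raw)})
print(history[0]["content"])  # -> The word has 3 r's.
```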
Via unsloth/Qwen3-235B-A22B-GGUF
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF
Files
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/blob/main/Q2_K/Qwen3-235B-A22B-Q2_K-00001-of-00002.gguf
Git LFS Details
SHA256: 06507d563a4bbbb5704c9ee84151d621bbcb52c88e2246959d6c2d04f08f76a2
Pointer size: 136 Bytes
Size of remote file: 49.9 GB
Xet backed hash: 29586f2ab3a6ff9e1ea6a4e80e001fe0461b866e4c1c3b2721f53feefbdb6ba4
```bash
wget 'https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/resolve/main/Q2_K/Qwen3-235B-A22B-Q2_K-00001-of-00002.gguf'
ljubomir@macbook2(:):~/llama.cpp$ sha256sum models/Qwen3-235B-A22B-Q2_K-00001-of-00002.gguf
06507d563a4bbbb5704c9ee84151d621bbcb52c88e2246959d6c2d04f08f76a2 models/Qwen3-235B-A22B-Q2_K-00001-of-00002.gguf
```
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/blob/main/Q2_K/Qwen3-235B-A22B-Q2_K-00002-of-00002.gguf
Git LFS Details
SHA256: fd2ce2d857b731bf598e1a207e7df6dc0c9d84e4970ee6f0857fb1700d4ab858
Pointer size: 136 Bytes
Size of remote file: 35.8 GB
Xet backed hash: 81134004b96abdaf2f7eba7b57606bf4dd857fe70688359da4382474070c0432
```bash
ljubomir@macbook2(:):~/llama.cpp$ sha256sum models/Qwen3-235B-A22B-Q2_K-00002-of-00002.gguf
fd2ce2d857b731bf598e1a207e7df6dc0c9d84e4970ee6f0857fb1700d4ab858 models/Qwen3-235B-A22B-Q2_K-00002-of-00002.gguf
```
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF
https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF
https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune
🌠 Qwen3: How to Run & Fine-tune
Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
All Qwen3 uploads use our new Unsloth Dynamic 2.0 methodology, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means, you can run and fine-tune quantized Qwen3 LLMs with minimal accuracy loss!
We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.
Unsloth also now supports fine-tuning and GRPO of Qwen3 and Qwen3 MOE models — 2x faster, with 70% less VRAM, and 8x longer context lengths. Fine-tune Qwen3 (14B) for free using our Colab notebook.
⚙️ Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
| Setting | Non-Thinking Mode | Thinking Mode |
|---|---|---|
| Temperature | 0.7 | 0.6 |
| Min_P | 0.0 (optional, but 0.01 works well; llama.cpp default is 0.1) | 0.0 |
| Top_P | 0.8 | 0.95 |
| TopK | 20 | 20 |
Chat template/prompt format:
```
<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n
```
For non-thinking mode, we purposely leave the <think> and </think> tags enclosing nothing:
```
<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n
```
For thinking mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
Switching Between Thinking and Non-Thinking Mode
Qwen3 models come with built-in "thinking mode" to boost reasoning and improve response quality - similar to how QwQ-32B worked. Instructions for switching will differ depending on the inference engine you're using so ensure you use the correct instructions.
Instructions for llama.cpp and Ollama:
You can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
```
> Who are you /no_think
<think>
</think>
I am Qwen, a large-scale language model developed by Alibaba Cloud. [...]

> How many 'r's are in 'strawberries'? /think
<think>
Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberries". [...]
</think>
The word strawberries contains 3 instances of the letter r. [...]
```
Instructions for transformers and vLLM:
Thinking mode:
enable_thinking=True
By default, Qwen3 has thinking enabled. When you call tokenizer.apply_chat_template, you don’t need to set anything manually.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Default is True
)
```
In thinking mode, the model will generate an extra <think>...</think> block before the final answer — this lets it "plan" and sharpen its responses.
Non-thinking mode:
enable_thinking=False
Enabling non-thinking mode makes Qwen3 skip all the thinking steps and behave like a normal LLM.
```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Disables thinking mode
)
```
This mode will provide final responses directly — no <think> blocks, no chain-of-thought.
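For llama.cpp's --prompt flag, the two raw prompt formats quoted above can also be built by hand, without transformers. A minimal sketch based only on the template strings shown earlier (not the official Jinja2 chat template):

```python
def qwen3_prompt(user_msg: str, thinking: bool = True) -> str:
    """Build a raw Qwen3 chat prompt; non-thinking mode appends an empty <think> block."""
    prompt = f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    if not thinking:
        # Per the non-thinking template: enclose <think>...</think> around nothing.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

print(qwen3_prompt("What is 2+2?"))
print(qwen3_prompt("What is 2+2?", thinking=False))
```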
📖 Llama.cpp: Run Qwen3 Tutorial
Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
Download the model via the snippet below (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions.
```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/Qwen3-14B-GGUF",
    local_dir="unsloth/Qwen3-14B-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],
)
```
Run the model and try any prompt. To disable thinking, append /nothink to your prompt (or set it in the system prompt):
```
>>> Write your prompt here /nothink
```
Running Qwen3-235B-A22B
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and its plethora of options.
We follow similar steps to the above, however this time we also need to perform a few extra steps because the model is so big.
Download the model via the snippet below (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/Qwen3-235B-A22B-GGUF",
    local_dir="unsloth/Qwen3-235B-A22B-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],
)
```
Run the model and try any prompt.
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU goes out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex to fit more layers on the GPU if you have more capacity.
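The -ot pattern above is an ordinary regex matched against tensor names. A quick sanity check of what it selects (the tensor names below are illustrative of llama.cpp's blk.N.ffn_*_exps GGUF naming for MoE expert tensors, not read from a real model):

```python
import re

# The -ot/--override-tensor pattern from the command below, minus the "=CPU" target:
pattern = re.compile(r".ffn_.*_exps.")

names = [
    "blk.10.ffn_gate_exps.weight",  # MoE expert tensor -> matches, goes to CPU
    "blk.10.ffn_down_exps.weight",  # MoE expert tensor -> matches
    "blk.10.attn_q.weight",         # attention tensor -> no match, stays on GPU
]
for n in names:
    print(n, "->", "CPU" if pattern.search(n) else "GPU")
```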
Copy
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.6 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 20 \
-no-cnv \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
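To see which tensors the -ot pattern actually targets (a sketch, assuming tensor names follow llama.cpp's usual blk.N.ffn_*_exps naming for MoE expert weights; the example names are illustrative), the regex can be tested in Python:

```python
import re

# The regex half of the -ot ".ffn_.*_exps.=CPU" argument above.
pattern = re.compile(r".ffn_.*_exps.")

# Hypothetical tensor names in the usual llama.cpp layout.
names = [
    "blk.0.ffn_gate_exps.weight",   # MoE expert tensor -> CPU
    "blk.0.ffn_down_exps.weight",   # MoE expert tensor -> CPU
    "blk.0.attn_q.weight",          # attention tensor  -> stays on GPU
    "blk.0.ffn_norm.weight",        # not an expert     -> stays on GPU
]

for name in names:
    target = "CPU" if pattern.search(name) else "GPU"
    print(f"{name} -> {target}")
```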
LJ test 1 - fails
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/Qwen3-235B-A22B-Q2_K-00001-of-00002.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
--override-tensor ".ffn_.*_exps.=CPU" \
--cache-type-k q4_0 \
--flash-attn \
--cache-type-v q4_0 \
--prio 3 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
Model quants Q2_K_L
https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF/tree/main/Q2_K_L
https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF/blob/main/Q2_K_L/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf
Git LFS Details
SHA256: 1c59efa5d4160400ba7c4f40b75bb7e51583af3af49850673ad13310e3c23373
Pointer size: 136 Bytes
Size of remote file: 49.7 GB
Xet backed hash: 7170f71ecbcf40d7b890178c6a4576606f8fec26a6a84d9a6518a469043559c7
wget 'https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF/resolve/main/Q2_K_L/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf'
ljubomir@macbook2(:):~/llama.cpp$ sha256sum models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf
1c59efa5d4160400ba7c4f40b75bb7e51583af3af49850673ad13310e3c23373 models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf
https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF/blob/main/Q2_K_L/Qwen3-235B-A22B-128K-Q2_K_L-00002-of-00002.gguf
Git LFS Details
SHA256: 4f2e979fc4e927e01a030e4e94d583ed53c99d769536722e71a25ea915d36009
Pointer size: 136 Bytes
Size of remote file: 36.1 GB
Xet backed hash: 9e70ef94c94ccc8e4dabded88270e900d1f23e84139c76cab38a37bee1531062
ljubomir@macbook2(:):~/llama.cpp$ sha256sum models/Qwen3-235B-A22B-128K-Q2_K_L-00002-of-00002.gguf
4f2e979fc4e927e01a030e4e94d583ed53c99d769536722e71a25ea915d36009 models/Qwen3-235B-A22B-128K-Q2_K_L-00002-of-00002.gguf
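The same checksum verification can be scripted. A minimal sketch that streams a large file through sha256 without loading it into RAM (demonstrated on a small temporary file; for the real check, point it at the split GGUF paths and compare against the SHA256 values from the Git LFS details above):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB GGUFs don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Demonstrate on a throwaway file; for the real files, call e.g.
# sha256_of("models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello gguf")
digest = sha256_of(tmp.name)
os.unlink(tmp.name)
print(digest)
```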
LJ test 2
ljubomir@macbook2(:):~/llama.cpp$ sudo sysctl iogpu.wired_limit_mb=90000
iogpu.wired_limit_mb: 0 -> 90000
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
--override-tensor ".ffn_.*_exps.=CPU" \
--cache-type-k q4_0 \
--flash-attn \
--cache-type-v q4_0 \
--prio 3 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
FAIL -
ljubomir@macbook2(:):~/llama.cpp$ sudo sysctl iogpu.wired_limit_mb=90000
iogpu.wired_limit_mb: 90000 -> 90000
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
--model models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
--override-tensor ".ffn_.*_exps.=CPU" \
--cache-type-k q4_0 \
--flash-attn \
--cache-type-v q4_0 \
--prio 3 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
build: 5626 (bc1007a4) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.4.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 89999 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 49 key-value pairs and 1131 tensors from models/Qwen3-235B-A22B-128K-Q2_K_L-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-128K
llama_model_loader: - kv 3: general.finetune str = 128k
llama_model_loader: - kv 4: general.basename str = Qwen3-235B-A22B-128K
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 235B-A22B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 235B A22B
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 16: qwen3moe.context_length u32 = 131072
llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 28: qwen3moe.rope.scaling.type str = yarn
llama_model_loader: - kv 29: qwen3moe.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 30: qwen3moe.rope.scaling.original_context_length u32 = 32768
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 39: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 40: general.quantization_version u32 = 2
llama_model_loader: - kv 41: general.file_type u32 = 10
llama_model_loader: - kv 42: quantize.imatrix.file str = Qwen3-235B-A22B-128K-GGUF/imatrix_uns...
llama_model_loader: - kv 43: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-1...
llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 744
llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 685
llama_model_loader: - kv 46: split.no u16 = 0
llama_model_loader: - kv 47: split.tensors.count i32 = 1131
llama_model_loader: - kv 48: split.count u16 = 2
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q2_K: 376 tensors
llama_model_loader: - type q3_K: 188 tensors
llama_model_loader: - type q4_K: 95 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 79.94 GiB (2.92 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-128K
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: Metal_Mapped model buffer size = 47398.20 MiB
load_tensors: Metal_Mapped model buffer size = 33622.50 MiB
load_tensors: CPU_Mapped model buffer size = 46885.27 MiB
load_tensors: CPU_Mapped model buffer size = 34456.49 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.25
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 94371.84 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: Metal KV buffer size = 846.00 MiB
llama_kv_cache_unified: size = 846.00 MiB ( 16384 cells, 94 layers, 1 seqs), K (q4_0): 423.00 MiB, V (q4_0): 423.00 MiB
llama_context: Metal compute buffer size = 377.25 MiB
llama_context: CPU compute buffer size = 96.01 MiB
llama_context: graph nodes = 5929
llama_context: graph splits = 190
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main: llama threadpool init, n_threads = 32
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 32 (n_threads_batch = 32) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 2185882232
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 16384
top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 16384, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
user
user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.
assistant
assistant
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
main : failed to eval
ggml_metal_free: deallocating
ljubomir@macbook2(:):~/llama.cpp$
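A rough back-of-envelope from the buffer sizes in the log above hints at why the warm-up run hit kIOGPUCommandBufferCallbackErrorOutOfMemory: the Metal-side buffers alone stay under the 90 GB wired limit, but the CPU-mapped expert weights share the same unified memory (assuming a 96 GB machine, per the ~94 GB recommendedMaxWorkingSetSize). This is an interpretation of the numbers, not a definitive diagnosis:

```python
# Buffer sizes (MiB) copied from the llama.cpp log above.
metal_model = 47398.20 + 33622.50   # Metal_Mapped model buffers
metal_kv    = 846.00                # Metal KV buffer (q4_0 K and V)
metal_comp  = 377.25                # Metal compute buffer
cpu_model   = 46885.27 + 34456.49   # CPU_Mapped model buffers

gpu_total = metal_model + metal_kv + metal_comp
print(f"GPU-side total:     {gpu_total:.2f} MiB vs wired limit 89999 MiB")
print(f"All mapped buffers: {gpu_total + cpu_model:.2f} MiB vs 98304 MiB (96 GB) RAM")
```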
LJ Sun 1 Jun 2025 08:55:50 BST
ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli --help
----- common params -----
-h, --help, --usage print usage and exit
--version show version and build info
--completion-bash print source-able bash completion script for llama.cpp
--verbose-prompt print a verbose prompt before generation (default: false)
-t, --threads N number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 4096, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until
context filled)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
--swa-full use full-size SWA cache (default: false)
[(more
info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
(env: LLAMA_ARG_SWA_FULL)
-fa, --flash-attn enable Flash Attention (default: disabled)
(env: LLAMA_ARG_FLASH_ATTN)
-p, --prompt PROMPT prompt to start generation with; for system message, use -sys
--no-perf disable internal libllama performance timings (default: false)
(env: LLAMA_ARG_NO_PERF)
-f, --file FNAME a file containing the prompt (default: none)
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-e, --escape process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape do not process escape sequences
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from
model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
(env: LLAMA_ARG_YARN_BETA_FAST)
-nkvo, --no-kv-offload disable KV offload
(env: LLAMA_ARG_NO_KV_OFFLOAD)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (default: 0.1, < 0 - disabled)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N number of parallel sequences to decode (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--no-mmap do not memory-map model (slower load but may reduce pageouts if not
using mlock)
(env: LLAMA_ARG_NO_MMAP)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't
offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
--override-tensor, -ot <tensor name pattern>=<buffer type>,...
override tensor buffer type
-ngl, --gpu-layers, --n-gpu-layers N number of layers to store in VRAM
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE advanced option to override model metadata by key. may be specified
multiple times.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false
--no-op-offload disable offloading host tensor operations to device (default: false)
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE path to LoRA adapter with user defined scaling (can be repeated to use
multiple adapters)
--control-vector FNAME add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE add a control vector with user defined scaling SCALE
note: this argument can be repeated to add multiple scaled control
vectors
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path (default: `models/$filename` with filename from `--hf-file`
or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-hf, -hfr, --hf-repo <user>/<model>[:quant]
Hugging Face model repository; quant is optional, case-insensitive,
default to Q4_K_M, or falls back to the first file in the repo if
Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add
--no-mmproj
example: unsloth/phi-4-GGUF:q4_k_m
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in
--hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]
Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
--log-colors Enable colored logging
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for
debugging)
--offline Offline mode: forces use of cache, prevents network access
(env: LLAMA_OFFLINE)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be
ignored.
(env: LLAMA_LOG_VERBOSITY)
--log-prefix Enable prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps Enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)
----- sampling params -----
--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default:
penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampling-seq, --sampler-seq SEQUENCE
simplified sequence for samplers that will be used (default:
edskypmxt)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--temp N temperature (default: 0.8)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--top-nsigma N top-n-sigma sampling (default: -1.0, -1.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers
('\n', ':', '"', '*') in the process; use "none" to not use any
sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir) (default: '')
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
`{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations
(https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
----- example-specific params -----
--no-display-prompt don't print prompt at generation (default: false)
-co, --color colorise output to distinguish prompt and user input from generations
(default: false)
--no-context-shift disables context shift on infinite text generation (default: disabled)
(env: LLAMA_ARG_NO_CONTEXT_SHIFT)
-sys, --system-prompt PROMPT system prompt to use with model (if applicable, depending on chat
template)
-sysf, --system-prompt-file FNAME a file containing the system prompt (default: none)
-ptc, --print-token-count N print token count every N tokens (default: -1)
--prompt-cache FNAME file to cache prompt state for faster startup (default: none)
--prompt-cache-all if specified, saves user input and generations to cache as well
--prompt-cache-ro if specified, uses the prompt cache but does not update it
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-sp, --special special tokens output enabled (default: false)
-cnv, --conversation run in conversation mode:
- does not print special tokens and suffix/prefix
- interactive mode is also enabled
(default: auto enabled if chat template is available)
-no-cnv, --no-conversation force disable conversation mode (default: false)
-st, --single-turn run conversation for a single turn only, then exit when done
will not be interactive if first turn is predefined with --prompt
(default: false)
-i, --interactive run in interactive mode (default: false)
-if, --interactive-first run in interactive mode and wait for input right away (default: false)
-mli, --multiline-input allows you to write or paste multiple lines without ending each in '\'
--in-prefix-bos prefix BOS to user inputs, preceding the `--in-prefix` string
--in-prefix STRING string to prefix user inputs with (default: empty)
--in-suffix STRING string to suffix after user inputs with (default: empty)
--no-warmup skip warming up the model with an empty run
-gan, --grp-attn-n N group-attention factor (default: 1)
(env: LLAMA_ARG_GRP_ATTN_N)
-gaw, --grp-attn-w N group-attention width (default: 512)
(env: LLAMA_ARG_GRP_ATTN_W)
--jinja use jinja template for chat (default: disabled)
(env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT controls whether thought tags are allowed and/or extracted from the
response, and in which format they're returned; one of:
- none: leaves thoughts unparsed in `message.content`
- deepseek: puts thoughts in `message.reasoning_content` (except in
streaming mode, which behaves as `none`)
(default: deepseek)
(env: LLAMA_ARG_THINK)
--reasoning-budget N controls the amount of thinking allowed; currently only one of: -1 for
unrestricted thinking budget, or 0 to disable thinking (default: -1)
(env: LLAMA_ARG_THINK_BUDGET)
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's
metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4,
rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4,
rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--simple-io use basic IO for better compatibility in subprocesses and limited
consoles
example usage:
text generation: build/bin/llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
chat (conversation): build/bin/llama-cli -m your_model.gguf -sys "You are a helpful assistant"
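For reference, a typical invocation combining several of the sampler flags documented above (model path and sampler values are placeholders, not recommendations):

```shell
# Sketch: text generation with explicit sampling settings
build/bin/llama-cli -m your_model.gguf -n 128 -no-cnv \
  -p "I believe the meaning of life is" \
  --temp 0.7 --top-k 40 --top-p 0.9 \
  --repeat-last-n 64 --repeat-penalty 1.1
```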
LJ Sun 1 Jun 2025 07:14:27 BST
M1/M2/M3: increase VRAM allocation with `sudo sysctl iogpu.wired_limit_mb=12345` (i.e. amount in mb to allocate)
https://www.reddit.com/r/LocalLLaMA/comments/186phti/m1m2m3_increase_vram_allocation_with_sudo_sysctl/
r/LocalLLaMA • farkinga • 2 yr. ago • Tutorial | Guide
If you're using Metal to run your llms, you may have noticed the amount of VRAM available is around 60%-70% of the total RAM - despite Apple's unique architecture for sharing the same high-speed RAM between CPU and GPU.
It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345
See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315
Previously, it was believed this could only be done with a kernel patch - and that required disabling a macos security feature ... And tbh that wasn't that great.
Will this make your system less stable? Probably. The OS will need some RAM - and if you allocate 100% to VRAM, I predict you'll encounter a hard lockup, spinning Beachball, or just a system reset. So be careful to not get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!
EDIT: if you have a 192gb m1/m2/m3 system, can you confirm whether this trick can be used to recover approx 40gb VRAM? A boost of 40gb is a pretty big deal IMO.
farkinga OP • 2y ago
One note on this ... All macos systems would be happiest to have at least 8gb available for OS stuff.
For a 32gb system, the math looks like this: 32gb-8gb=24gb. For me, I can gain 2.2gb this way. Not bad!
For those with 192gb - WOW. You go from having ~140gb VRAM to 184gb. That's a HUGE increase. As long as you keep the rest of your system utilization under control, this trick just massively increased the utility of those high-end Metal systems.
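The arithmetic above can be sketched as a tiny shell helper (the 8GB headroom figure is farkinga's rule of thumb, not an Apple-documented number):

```shell
# Rule-of-thumb wired-limit calculator: total RAM minus OS headroom, in MiB.
# The headroom default of 8GB is an assumption from this thread, not official.
gb_to_limit_mb() {
  ram_gb=$1; headroom_gb=${2:-8}
  echo $(( (ram_gb - headroom_gb) * 1024 ))
}

gb_to_limit_mb 32    # 24576 -> sudo sysctl iogpu.wired_limit_mb=24576
gb_to_limit_mb 192   # 188416
```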
FlishFlashman • 2y ago
I looked at what wired memory (memory that can't be swapped) was without having an LLM loaded/running and then added a margin to that. I ended up allocating 26.5GB, up from 22.8GB default.
It worked, but it didn't work great because I still had a bunch of other stuff running on my Mac, so (not surprisingly) swapping slowed it down. For anything more than a proof of concept test I'd be shutting all the unnecessary stuff down.
fallingdowndizzyvr • 2y ago
I ended up allocating 26.5GB, up from 22.8GB default.
On my 32GB Mac, I allocate 30GB.
It worked, but it didn't work great because I still had a bunch of other stuff running on my Mac, so (not surprisingly) swapping slowed it down. For anything more than a proof of concept test I'd be shutting all the unnecessary stuff down.
That's what I do, and I have no swapping at all. I listed the two big things to turn off to save RAM - look for "I also do these couple of things to save RAM." about halfway down the post. I'm able to run without any swapping, with some headroom to spare. Max RAM usage is 31.02GB.
https://www.reddit.com/r/LocalLLaMA/comments/18674zd/macs_with_32gb_of_memory_can_run_70b_models_with/
bebopkim1372 • 2y ago
My M1 Max Mac Studio has 64GB of RAM. By running sudo sysctl iogpu.wired_limit_mb=57344, it did magic!
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/Users/****/****/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 57344.00 MiB
ggml_metal_init: maxTransferRate = built-in GPU
Yay!
farkinga OP • 2y ago
Yeah! That's what I'm talking about. Would you happen to remember what it was reporting before? If it's like the rest, I'm assuming it said something like 40 or 45gb, right?
bebopkim1372 • 2y ago
It was 48GB and now I can use 12GB more!
farkinga OP • 2y ago
wow, this is wild. It's basically adding another GPU ... and that GPU is actually pretty good, great bus speeds... for free!
CheatCodesOfLife • 2y ago • Edited 2y ago
64GB M1 Max here. Before running the command, if I tried to load up goliath-120b: (47536.00 / 49152.00) - fails
And after sudo sysctl iogpu.wired_limit_mb=57344 : (47536.00 / 57344.00)
So I guess the default is: 49152
fallingdowndizzyvr • 2y ago • Edited 2y ago
64GB M1 Max here. Before running the command, if I tried to load up goliath-120b: (47536.00 / 49152.00) - fails
I wonder why that failed. Your limit is higher than the RAM needed. I run with a tighter gap and it loads and runs, (28738.98 / 30146.00).
So I guess the default is: 49152
It is. To be clearer, llama.cpp tells you what the recommendedMaxWorkingSetSize is, which should match that number.
bebopkim1372 • 2y ago
Maybe 47536MB is the net model size. For LLM inference, memory for context and optional context cache memory are also needed.
fallingdowndizzyvr • 2y ago
They are. If you look at what llama.cpp prints out, it prints out all the buffers that it's trying to allocate. And successively updates the ( X/Y ) it needs. Was the one you posted just the first one? The very last one before it exits out with an error will be the most informative one. That one should have an X that's bigger than Y.
FlishFlashman • 2y ago
≥64GB allows 75% to be used by the GPU. ≤32GB it's ~66%. Not sure about the 36GB machines.
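Those observed defaults can be approximated with a small helper (the percentages come from this comment's observations, not from Apple documentation, and the 36GB tier is unknown):

```shell
# Approximate default Metal working-set limit, per the percentages reported above
default_vram_mb() {
  ram_gb=$1
  if [ "$ram_gb" -ge 64 ]; then pct=75; else pct=66; fi
  echo $(( ram_gb * 1024 * pct / 100 ))
}

default_vram_mb 64   # 49152, matching the 48GB default reported in this thread
```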
fallingdowndizzyvr • 2y ago
As per the latest developments in that discussion, "iogpu.wired_limit_mb" only works on Sonoma. So if you are on an older version of Mac OS, try "debug.iogpu.wired_limit" instead.
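A hedged sketch of picking the right sysctl key by OS version (Sonoma is macOS 14; the version variable is a placeholder - on a real Mac you'd read it from `sw_vers -productVersion`):

```shell
macos_major=14    # placeholder; real value: sw_vers -productVersion | cut -d. -f1
if [ "$macos_major" -ge 14 ]; then
  key="iogpu.wired_limit_mb"       # Sonoma and later
else
  key="debug.iogpu.wired_limit"    # older macOS versions
fi
echo "$key"   # iogpu.wired_limit_mb
```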
CheatCodesOfLife • 2y ago
That totally worked. I can run goliath 120b on my m1 max laptop now. Thanks a lot.
Zestyclose_Yak_3174 • 2y ago
Which quant did you use and how was your experience?
CheatCodesOfLife • 2y ago
46G goliath-120b.Q2_K
So the smallest one I found (I didn't quantize this one myself, found it on HF somewhere)
And it was very slow. about 13t/s prompt_eval and then 2.5t/s generating text, so only really useful for me when I need to run it on my laptop (I get like 15t/s with 120b model on my 2x3090 rig at 3bpw exl2)
As for the model itself, I like it a lot and use it frequently.
TBH, this ram thing is more helpful for me because it lets me run Q5 70b models instead of just Q4 now.
ArthurAardvark • 1y ago
Oo. Then you'll like to see this.
https://www.reddit.com/r/LocalLLaMA/comments/1al58xw/yet_another_state_of_the_art_in_llm_quantization/
And TY for aware-ing me to the fact that I can run 120B lol
bladeolson26 • 1y ago
u/farkinga Thanks for this post. I have an M2 Ultra with 192GB. I will give this a try and share the results.
bladeolson26 • 1y ago
My first test: I set it to 64GB
sudo sysctl iogpu.wired_limit_mb=65536
I loaded Dolphin Mixtral 8X 7B Q5 ( 34GB model )
I gave it my test prompt and it seems fast to me :
time to first token: 1.99s
gen t: 43.24s
speed: 37.00 tok/s
stop reason: completed
gpu layers: 1
cpu threads: 22
mlock: false
token count: 1661/1500
Next I tried 128GB
sudo sysctl iogpu.wired_limit_mb=131072
I loaded Goliath 120b Q4 ( 70GB model)
I gave it my test prompt and it was slower to display
time to first token: 3.88s
gen t: 128.31s
speed: 7.00 tok/s
stop reason: completed
gpu layers: 1
cpu threads: 20
mlock: false
token count: 1072/1500
Third test: I tried 144GB (leaving 48GB, 25%, for OS operation)
sudo sysctl iogpu.wired_limit_mb=147456
as expected similar results. no crashes.
Finally, 188GB, leaving just 8GB for the OS, etc.
It runs just fine. I did not have a model that big, though.
The prompt I used: Write a Game of Pac-Man in Swift.
The result from the last Goliath run at 188GB:
time to first token: 4.25s
gen t: 167.94s
speed: 7.00 tok/s
stop reason: completed
gpu layers: 1
cpu threads: 20
mlock: false
token count: 1275/1500
import UIKit
class GameView: UIView {
var pacmanImageView: UIImageView!
var dotViews: [DotView] = []
var ghostImages: [UIImage]? = nil
var scoreLabel: UILabel!
override init(frame: CGRect) {
super.init(frame: frame)
pacmanImageView = UIImageView(image: #imageLiteral(resourceName: "Pacman"))
addSubview(pacmanImageView)
scoreLabel = UILabel(frame: .zero)
scoreLabel.textAlignment = .center
scoreLabel.font = .systemFont(ofSize: 24, weight: .bold)
scoreLabel.textColor = .white
addSubview(scoreLabel)
setupGame()
}
private func setupGame() {
for i in 0..<9 {
let dotView = DotView(tag: i)
addSubview(dotView)
dotViews.append(dotView)
if i % 3 == 0 && i != 8 {
let spaceView = UIView()
spaceView.frame.size.height = frame.height / 15 * 2
addSubview(spaceView)
}
}
scoreLabel.text = "Score: 0"
}
private func movePacman(_ direction: Direction) {
var newFrame = pacmanImageView.frame
switch direction {
case .up:
if newFrame.minY > frame.origin.y {
newFrame.origin.y -= newFrame.height / 2
}
case .down:
if newFrame.maxY < frame.size.height - frame.origin.y - newFrame.height / 2 {
newFrame.origin.y += newFrame.height / 2
}
case .left:
if newFrame.minX > frame.origin.x {
newFrame.origin.x -= newFrame.width / 2
}
case .right:
if newFrame.maxX < frame.size.width - frame.origin.x - newFrame.width / 2 {
newFrame.origin.x += newFrame.width / 2
}
}
pacmanImageView.frame = newFrame
}
func gameLogic() {
// Implement your game logic here:
// - Detect collisions with dots and ghosts
// - Update score
// - Move Pac-Man and ghosts
// - Generate new dots
}
}
class DotView: UIView {
var isEaten = false
override init(frame: CGRect) {
super.init(frame: frame)
backgroundColor = .systemGreen
layer.cornerRadius = 10
isUserInteractionEnabled = true
let tapGesture = UITapGestureRecognizer(target: self, action: #selector(eatDot))
addGestureRecognizer(tapGesture)
}
@objc func eatDot() {
if !isEaten {
isEaten = true
backgroundColor = .systemOrange
// Decrease score and update label
// Check for game over conditions
}
}
required init?(coder: NSCoder) {
super.init(coder: coder)
}
}
enum Direction {
case up, down, left, right
}
farkinga OP • 1y ago
Omg, I am legit excited it ran with just 8gb reserved for os. That's so much extra VRAM - for free!
Thanks for trying it at different levels. I doubt it will be seen here; consider posting as a new thread.
krishnakaasyap • 1y ago
This is awesome, fellow Redditor! But what would be the stats if you used all the GPU layers and NPU cores? Would it improve the time to first token and tokens per second? I would love to learn more about the M2 Ultra 192GB Mac Studio as a server for inferencing large language models (LLMs). Where can I find more informative stuff, like your comment?
kkb294 • 2mo ago
I increased the VRAM allocation to 40GB from the default 36GB. Thanks for the post 😀!
hakyim • 16d ago
how do you get that system resources information?
kkb294 • 16d ago
It is available in LM Studio UI.
Zugzwang_CYOA • 2y ago • Edited 2y ago
How is the prompt processing time on a mac? If I were to work with a prompt that is 8k in size for RP, with big frequent changes in the prompt, would it be able to read my ever-changing prompt in a timely manner and respond?
I would like to use Sillytavern as my front end, and that can result in big prompt changes between replies.
bebopkim1372 • 2y ago
For M1, when prompt evaluations occur, a BLAS operation is used and the speed is terrible. I also have a PC with a 4060 Ti 16GB, and cuBLAS is the speed of light compared with BLAS speed on my M1 Max. BLAS speeds under 30B models are acceptable, but beyond 30B it is really slow.
Zugzwang_CYOA • 2y ago
Good to know. It sounds like macs are great at asking simple questions of powerful LLMs, but not so great at roleplaying with large context stories. I had hoped that an M2 Max would be viable for RP at 70b or 120b, but I guess not.
bebopkim1372 • 2y ago
I am using koboldcpp and it caches the prompt evaluation result, so if your prompt changes add new content at the end of the previous prompt, it will be okay, because koboldcpp performs prompt evaluation only for the newly added content (though it is still slow for 30B or bigger models). If your prompt change amends the middle of the context, many parts of the cache will be useless and more prompt evaluation will be needed, so it will be very slow.
Zugzwang_CYOA • 2y ago • Edited 2y ago
The way I use Sillytavern for roleplaying involves a lot of world entry information. World entries are inserted into the prompt when they are triggered through key words, and I use many such entries for world building. Those same world entries disappear from the prompt when they are not being talked about. I also sometimes run group chats with multiple characters. In such cases, the entire character card of the previous character would disappear from the prompt, and a new character card would appear in its place when the next character speaks. That's why my prompts tend to be ever-changing.
So, unless the cache keeps information from previous prompts, it sounds like I would be continuously re-evaluating with every response.
I suppose it would be different if it did store information from previous prompts, as that would let me swap between speaking characters or trigger a previously used world entry without having to re-evaluate every time.
But with my current 12gb 3060, quantized 13b models interface so quickly that I never even bothered to note prompt evaluation time, even with 6-8k context, and it sounds like the M2 max studio with 96gb won't be able to allow for that kind of thing at 70b as I originally hoped.
Thank you for your responses. They have been helpful.
bebopkim1372 • 2y ago
For heavy RP users like you, I think multiple used 3090s will be best for very large LLMs.
guymadison42 • 8mo ago
I am compiling LLVM with a 32 GB system, wired memory is at roughly 8 GB. That's 8 GB my system cannot reach, this has always been an issue with Metal since 2012.
I am really surprised it's never been fixed.
Fun_Huckleberry_7781 • 10mo ago
How can I check if the change worked? I ran the initial command and it said it was initially set to 0.
farkinga OP • 10mo ago
One way I've verified is through the llama.cpp diagnostic output. It reports the available vram as well as the size of the model and how much vram it requires.
I've got 32gb total and I think the default availability is approx 22gb. So I can easily increase to 26gb and I see the difference immediately when I launch llama.cpp - the available vram will be reported as 26gb.
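One way to script that check (a sample log line is hardcoded here; on a real run you'd pipe llama.cpp's startup output through the same extraction):

```shell
# Pull the MiB figure out of the Metal init log line (sample line hardcoded)
line="ggml_metal_init: recommendedMaxWorkingSetSize = 26624.00 MiB"
vram=$(echo "$line" | awk '{print $(NF-1)}')
echo "$vram"   # 26624.00
```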
Gold_Bee2694 • 2mo ago
I got a MacBook Pro with an M4 Pro (14-core CPU, 20-core GPU) and 24GB of RAM, and I want to run some coding models in LM Studio, so I'm wondering if it's a good idea to change the VRAM from 16GB to 18 or maybe 20GB.
farkinga OP • 2mo ago
24gb RAM isn't a typical configuration for apple hardware but it is a plausible VRAM allocation for a 32gb RAM system. Check again; you might have 32gb ram.
On a 32gb M1 setup, I've allocated up to 26gb to VRAM and used that to run LLMs - but 24gb is even safer.
Gold_Bee2694 • 2mo ago
I don’t know why they changed it to 24gb of ram but it’s a thing now: https://www.apple.com/de/shop/buy-mac/macbook-pro/16%22-space-schwarz-standard-display-apple-m4-pro-mit-14-core-cpu-und-20-core-gpu-24-gb-arbeitsspeicher-512gb#
farkinga OP • 2mo ago
Hmmm... fair enough - thanks for the link. Well, then it's tight but two things that will help are:
make sure nothing else is running; close everything in the dock and menubar; there are guides that explain how to do this
consider llama.cpp instead of LM Studio to reduce RAM requirements a bit (still need to run terminal.app but this is pretty light)
I bet you could run your system on 4gb if you just ran terminal and llama.cpp. That means you could try allocating 20gb to VRAM. And you might as well try LMStudio first since it's easier.
Hey, worst that happens is you reboot; this is not a permanent change and it automatically resets on reboot.
shahmeer5 • 1mo ago
does it affect gaming performance, using Crossover or native ones?
PS: to get the current value use: sudo sysctl iogpu
Jelegend • 2y ago
I am getting the following error on running this command on Mac Studio M2 Max 64GB RAM
sysctl: unknown oid 'iogpu.wired_limit_mb'
Can someone help me out on what to do here?
LoSboccacc • 2y ago
Older OSes use debug.iogpu.wired_limit
bebopkim1372 • 2y ago
Do you use macOS Sonoma? Mine is Sonoma 14.1.1 - Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000 arm64.
spiffco7 • 2y ago
This is lifesaving news. Thank you.
https://github.com/ggml-org/llama.cpp/discussions/2182#discussioncomment-7698315
Adjust VRAM/RAM split on Apple Silicon #2182
dr3murr started this conversation in General on Jul 11, 2023 · 15 comments · 31 replies
dr3murr · Jul 11, 2023
// this tool allows you to change the VRAM/RAM split on Unified Memory on Apple Silicon to whatever you want, allowing for more VRAM for inference
// c++ -std=c++17 -framework CoreFoundation -o vram_patcher vram_patcher.cpp
// credits to @asentientbot for helping me with some stuff because I never owned a Mac before
// please read the code, if you don't understand what it does don't use it
// tested on macos ventura beta 3, pattern might be different on other versions
// usage: ./vram_patcher <desired percentage of VRAM>
#include <algorithm> // needed for std::search and std::copy below
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sys/sysctl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <CoreFoundation/CoreFoundation.h>
// A helper function to get the sysctl value for a given name
std::string get_sysctl(const std::string& name) {
size_t len = 0;
// Get the size of the value
if (sysctlbyname(name.c_str(), nullptr, &len, nullptr, 0) == -1) {
std::cerr << "Failed to get sysctl size for " << name << std::endl;
return "";
}
// Allocate a buffer for the value
char* buf = new char[len];
// Get the value
if (sysctlbyname(name.c_str(), buf, &len, nullptr, 0) == -1) {
std::cerr << "Failed to get sysctl value for " << name << std::endl;
delete[] buf;
return "";
}
// Convert the value to a string
std::string value(buf, len);
delete[] buf;
return value;
}
// A helper function to convert a CFString to a std::string
std::string CFStringToStdString(CFStringRef cfstr) {
if (cfstr == nullptr) return "";
CFIndex length = CFStringGetLength(cfstr);
CFIndex maxSize = CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8) + 1;
char* buffer = new char[maxSize];
if (CFStringGetCString(cfstr, buffer, maxSize, kCFStringEncodingUTF8)) {
std::string result(buffer);
delete[] buffer;
return result;
} else {
delete[] buffer;
return "";
}
}
// A function to get the name of a disk from a path
std::string get_volume_name(const std::string& path) {
// Create a CFURL from the path
CFURLRef url = CFURLCreateFromFileSystemRepresentation(nullptr, (const UInt8*)path.c_str(), path.length(), false);
if (url == nullptr) return "";
// Get the volume URL from the path URL
CFURLRef volumeURL = CFURLCreateCopyDeletingLastPathComponent(nullptr, url);
CFRelease(url);
if (volumeURL == nullptr) return "";
// Get the volume name from the volume URL
CFStringRef volumeName = nullptr;
if (CFURLCopyResourcePropertyForKey(volumeURL, kCFURLVolumeNameKey, &volumeName, nullptr)) {
CFRelease(volumeURL);
if (volumeName == nullptr) return "";
// Convert the volume name to a std::string
std::string result = CFStringToStdString(volumeName);
CFRelease(volumeName);
return result;
} else {
CFRelease(volumeURL);
return "";
}
}
void change_float_constant(std::vector<uint8_t>& code, float new_value) {
// Check that the code vector has enough bytes for the two instructions
if (code.size() < 8) {
throw std::invalid_argument("code vector is too small");
}
// Convert the new float value to a 32-bit unsigned integer representation
uint32_t new_bits = *reinterpret_cast<uint32_t*>(&new_value);
// Extract the lower and upper 16 bits of the new value
uint16_t low_bits = new_bits & 0xffff;
uint16_t high_bits = new_bits >> 16;
// Encode the new value as two mov instructions
// The first instruction is movz w8, #low_bits
// The second instruction is movk w8, #high_bits, lsl #16
// The opcode format is:
// | 31 30 29 28 | 27 26 25 24 | 23 22 21 20 | 19 18 17 16 | 15 14 13 12 | 11 10 9 8 | 7 6 5 4 | 3 2 1 0 |
// | sf 0 0 1 | opc 1 0 0 | 0 0 0 0 | hw | imm16 | 0 0 0 0 | Rd | 0 0 0 0 |
// where sf = 0 for 32-bit register, opc = 00 for movz, 01 for movn, 10 for movk, hw = 00 for lsl #0, 01 for lsl #16, 10 for lsl #32, 11 for lsl #48, imm16 = 16-bit immediate value, Rd = destination register
// The first instruction has sf = 0, opc = 00, hw = 00, imm16 = low_bits, Rd = 8
uint32_t first_opcode = 0x52800000 | (low_bits << 5) | 8;
// The second instruction has sf = 0, opc = 10, hw = 01, imm16 = high_bits, Rd = 8
uint32_t second_opcode = 0x72800000 | (high_bits << 5) | (1 << 21) | 8;
// Convert the opcodes to little endian bytes and overwrite the code vector
code[0] = first_opcode & 0xff;
code[1] = (first_opcode >> 8) & 0xff;
code[2] = (first_opcode >> 16) & 0xff;
code[3] = (first_opcode >> 24) & 0xff;
code[4] = second_opcode & 0xff;
code[5] = (second_opcode >> 8) & 0xff;
code[6] = (second_opcode >> 16) & 0xff;
code[7] = (second_opcode >> 24) & 0xff;
}
int main(int argc, char** argv) {
// Check if the program is run as root
if (getuid() != 0) {
std::cerr << "Sorry, this program must be run as root" << std::endl;
return 1;
}
// Check if the program has one argument
if (argc != 2) {
std::cerr << "Usage: " << argv[0] << " vram_percentage" << std::endl;
return 1;
}
// Check if the argument is a valid percentage
int percentage = std::stoi(argv[1]);
if (percentage < 10 || percentage > 95) {
std::cerr << "Invalid percentage: " << percentage << std::endl;
return 1;
}
// Check if the CPU is Apple Silicon
std::string cpu_brand = get_sysctl("machdep.cpu.brand_string");
if (cpu_brand.find("Apple") == std::string::npos) {
std::cerr << "Sorry, this program only works on Apple Silicon" << std::endl;
return 1;
}
// Define the paths for the original and patched kernel collections
std::string original_kc = "/private/var/db/KernelExtensionManagement/KernelCollections/BootKernelCollection.kc";
std::string patched_kc = "/tmp/BootKernelCollection.kc";
std::string final_kc = "/Library/KernelCollections/vram_patch.kc";
std::string script_kc = "/Library/KernelCollections/complete_patch.sh";
// Copy the original kernel collection to a temporary directory
std::ifstream src(original_kc, std::ios::binary);
std::ofstream dst(patched_kc, std::ios::binary);
if (!src || !dst) {
std::cerr << "Failed to copy the kernel collection" << std::endl;
return 1;
}
dst << src.rdbuf();
src.close();
dst.close();
// Read the patched kernel collection into a vector of bytes
std::ifstream in(patched_kc, std::ios::binary);
if (!in) {
std::cerr << "Failed to read the patched kernel collection" << std::endl;
return 1;
}
std::vector<uint8_t> data((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
in.close();
// Define the byte sequence to search for
std::vector<uint8_t> target = {0x08, 0x01, 0xC0, 0xD2, 0x3F, 0x00, 0x08, 0xEB, 0xA8, 0xAA, 0x8A, 0x52, 0xA8, 0x40, 0xA8, 0x72, 0x00, 0x01, 0x27, 0x1E, 0x01, 0x30, 0x27, 0x1E, 0x28, 0x8C, 0x20, 0x1E};
// Define the byte sequence to replace with
std::vector<uint8_t> replacement = {0x68, 0x73, 0x89, 0x52, 0xA8, 0x94, 0xA8, 0x72, 0x08, 0x01, 0x27, 0x1E, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5};
// Subtract the percentage from 100
percentage = 100 - percentage;
// Change the floating point constant in the replacement byte sequence
change_float_constant(replacement, percentage);
    // Find the first occurrence of the target byte sequence in the data vector
    auto it = std::search(data.begin(), data.end(), target.begin(), target.end());
    // If the target byte sequence is found, replace it with the replacement byte sequence
    if (it != data.end()) {
        std::copy(replacement.begin(), replacement.end(), it);
    } else {
        std::cerr << "Failed to find the target byte sequence in the kernel collection" << std::endl;
        return 1;
    }
    // Write the modified data vector to the patched kernel collection
    std::ofstream out(patched_kc, std::ios::binary);
    if (!out) {
        std::cerr << "Failed to write the patched kernel collection" << std::endl;
        return 1;
    }
    out.write(reinterpret_cast<char*>(data.data()), data.size());
    out.close();
    // Copy the patched kernel collection to the final destination
    std::ifstream src2(patched_kc, std::ios::binary);
    std::ofstream dst2(final_kc, std::ios::binary);
    if (!src2 || !dst2) {
        std::cerr << "Failed to copy the patched kernel collection; please disable SIP and try again" << std::endl;
        return 1;
    }
    dst2 << src2.rdbuf();
    src2.close();
    dst2.close();
    // Change the ownership of the final kernel collection to root:wheel
    if (chown(final_kc.c_str(), 0, 0) == -1) {
        std::cerr << "Failed to change the ownership of the final kernel collection" << std::endl;
        return 1;
    }
    // Get the volume name for the /System path
    std::string volume_name = get_volume_name("/System");
    if (volume_name.empty()) {
        std::cerr << "Failed to get the volume name for /System" << std::endl;
        return 1;
    }
    // Create a shell script to configure the boot with the final kernel collection
    std::ofstream script(script_kc);
    if (!script) {
        std::cerr << "Failed to create the shell script" << std::endl;
        return 1;
    }
    script << "#!/bin/bash\n";
    script << "# Disable System Integrity Protection\n";
    script << "csrutil disable\n";
    script << "# Disable Apple Mobile File Integrity\n";
    script << "nvram boot-args=\"ipc_control_port_options=0 amfi_get_out_of_my_way=1\"\n";
    script << "# Get the absolute path of this script\n";
    script << "SCRIPT=$(cd \"$(dirname \"${BASH_SOURCE[0]}\")\" && pwd)/$(basename \"${BASH_SOURCE[0]}\")\n";
    script << "# Append the kernel collection name to the script path\n";
    script << "KC=\"${SCRIPT%/*}/vram_patch.kc\"\n"; // This line replaces the original one
    script << "# Get the volume name from the script path\n";
    script << "VOLUME=\"${SCRIPT%%/Library*}\"\n";
    script << "# Run the kmutil command with the absolute paths\n";
    script << "kmutil configure-boot -C -c \"$KC\" -v \"$VOLUME\"\n";
    script << "sync\n";
    script << "echo Finished. You may reboot your system now.\n";
    script.close();
    // Change the ownership and permissions of the shell script to root:wheel and 755
    if (chown(script_kc.c_str(), 0, 0) == -1) {
        std::cerr << "Failed to change the ownership of the shell script" << std::endl;
        return 1;
    }
    if (chmod(script_kc.c_str(), 0755) == -1) {
        std::cerr << "Failed to change the permissions of the shell script" << std::endl;
        return 1;
    }
    // Output the instructions to run the shell script in RecoveryOS
    std::cout << "The shell script to configure the boot with the patched kernel collection has been created at " << script_kc << std::endl;
    std::cout << "Now reboot to RecoveryOS and run: \"/Volumes/" << volume_name << "/Library/KernelCollections/complete_patch.sh\"" << std::endl;
    std::cout << "If your boot fails after running the script or after an update, you need to go to Utilities->Startup Security Tool in RecoveryOS and pick Full Security" << std::endl;
    return 0;
}
Replies:15 comments · 31 replies
dr3murr
on Jul 12, 2023
Author
[screenshot of the macOS recovery prompt]
if you see this screen (after an update or otherwise), click "Recovery", sign in, go to Utilities->Startup Security Tool, and click "Reduced Security"
then open terminal and run "csrutil disable" and reboot
then you can patch again
0 replies
JasonOSX
on Jul 13, 2023
Is this supposed to fix the shortcoming that only 2/3 or 3/4 of the unified memory can be used for Metal inference? If so, I am very eager to test it out and would like to thank you for your contribution! I am wondering if "reduced security" is only temporary. Can you enable full security after the patch? Did you also ping @ggerganov - I guess he will also be interested in this concept
2 replies
@ggerganov
ggerganov
on Jul 18, 2023
Maintainer
Thanks - interesting approach. I'll wait on some more reports as I don't have a spare machine to test on. Would be great if this solves the memory limit issue
@dr3murr
dr3murr
on Jul 22, 2023
Author
it def does on my machine
dr3murr
on Jul 13, 2023
Author
Yes, this fixes that. You need relaxed security or else it will boot the unpatched kernel. This is why changing the security level undoes the patch.
0 replies
JasonOSX
on Jul 13, 2023
Yes, this fixes that. You need relaxed security or else it will boot the unpatched kernel. This is why changing the security level undoes the patch.
Thanks for your work!
I understand that it will not boot a custom kernel with full protection, comparable to a locked (secure) bootloader on a phone.
However, I would advise against disabling or loosening security unless you understand all the associated risks. It can open a path to sophisticated and persistent threats, especially when working in AI research as some of us are.
More info: https://support.apple.com/guide/security/startup-disk-security-policy-control-sec7d92dc49f/web
I know someone at the Apple software engineering team. I will try to ask him if they can consider upping the limit a bit or if there is a way to use a similar approach without sacrificing on security. (I guess it will be difficult but I can at least give it a shot)
1 reply
@dr3murr
dr3murr
on Jul 13, 2023
Author
vm_size_t AGXAccelerator::calcMaxGPUPhysicalMemoryBytes(uint64_t unifiedMemoryBytes) {
    float reservePercent = 33.333f;
    // Use a lower percentage if the unified memory is larger than 32 GB
    if (unifiedMemoryBytes > 0x800000000LL) {
        reservePercent = 25.0f;
    }
    const OSMetaClassBase *prop = this->device->getProperty("gpu-sysmem-reserve-percent");
    if (prop) {
        OSData *data = OSDynamicCast(OSData, prop);
        if (data) {
            reservePercent = *(float *)data->getBytesNoCopy();
        }
    }
    return (vm_size_t)((unifiedMemoryBytes * (100.0f - reservePercent) / 100.0f + page_size - 1) & -page_size);
}
that would be great
the code already has a case to check for an override %, but I can find no way to set gpu-sysmem-reserve-percent without a patch
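For reference, the reserve arithmetic in that decompiled method can be sketched in Python. This is an illustration only: the function name, the 16 KiB page size, and the override parameter (standing in for the gpu-sysmem-reserve-percent device property) are assumptions, not part of the kernel.

```python
def max_gpu_physical_memory_bytes(unified_memory_bytes, override_percent=None,
                                  page_size=16384):
    """Sketch of AGXAccelerator::calcMaxGPUPhysicalMemoryBytes as decompiled above."""
    # Default reserve: one third, or one quarter above 32 GB of unified memory
    reserve = 25.0 if unified_memory_bytes > 0x800000000 else 33.333
    # A gpu-sysmem-reserve-percent property, if present, overrides the default
    if override_percent is not None:
        reserve = override_percent
    usable = unified_memory_bytes * (100.0 - reserve) / 100.0
    # Round up to a whole page, as the kernel does with (x + page_size - 1) & -page_size
    return (int(usable) + page_size - 1) & -page_size
```

So a 64 GB machine gets 48 GB of GPU-usable memory by default, and a hypothetical 12.5 % reserve would raise that to 56 GB.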
JasonOSX
on Jul 13, 2023
vm_size_t AGXAccelerator::calcMaxGPUPhysicalMemoryBytes(uint64_t unifiedMemoryBytes) {
    float reservePercent = 33.333f;
    // Use a lower percentage if the unified memory is larger than 32 GB
    if (unifiedMemoryBytes > 0x800000000LL) {
        reservePercent = 25.0f;
    }
    const OSMetaClassBase *prop = this->device->getProperty("gpu-sysmem-reserve-percent");
    if (prop) {
        OSData *data = OSDynamicCast(OSData, prop);
        if (data) {
            reservePercent = *(float *)data->getBytesNoCopy();
        }
    }
    return (vm_size_t)((unifiedMemoryBytes * (100.0f - reservePercent) / 100.0f + page_size - 1) & -page_size);
}
that would be great the code already has a case to check for an override %, but I can find no way to set gpu-sysmem-reserve-percent without a patch
That's an interesting find! But with your current approach we need both SIP disabled (SIP protects the entire system by preventing the execution of unauthorized code) AND a reduced-security boot policy? What happens if you re-enable SIP but keep the reduced-security boot policy?
0 replies
dr3murr
on Jul 13, 2023
Author
Not quite sure; you need SIP disabled to write to /Library/KernelCollections,
or at least you need it to delete from there, which is necessary if you patch again.
Maybe you could avoid it by writing to another path that's not SIP protected, but I don't know where.
FWIW, the physical file on disk isn't the actual kernel that gets booted, because you can rm or corrupt the entire file you set-boot-object'd and it will still boot.
0 replies
WiseFarAI
on Jul 15, 2023
This is great progress! I agree that permanently disabling SIP and Full Security is a risk some might not want to take too lightly. I think it would be good if some more eyes could look at this. I will also think about another way of implementing this change.
// this tool allows you to change the VRAM/RAM split on Unified Memory on Apple Silicon to whatever you want, allowing for more VRAM for inference
// c++ -std=c++17 -framework CoreFoundation -o vram_patcher vram_patcher.cpp
// credits to @asentientbot for helping me with some stuff because I never owned a Mac before
// please read the code, if you don't understand what it does don't use it
// tested on macos ventura beta 3, pattern might be different on other versions
// usage: ./vram_patcher <desired percentage of VRAM>
#include <iostream>
#include <fstream>
#include <iterator>   // std::istreambuf_iterator
#include <stdexcept>  // std::invalid_argument
#include <string>
#include <vector>
#include <algorithm>  // std::search, std::copy
#include <cstdint>    // uint8_t, uint16_t, uint32_t
#include <sys/sysctl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <CoreFoundation/CoreFoundation.h>
// A helper function to get the sysctl value for a given name
std::string get_sysctl(const std::string& name) {
    size_t len = 0;
    // Get the size of the value
    if (sysctlbyname(name.c_str(), nullptr, &len, nullptr, 0) == -1) {
        std::cerr << "Failed to get sysctl size for " << name << std::endl;
        return "";
    }
    // Allocate a buffer for the value
    char* buf = new char[len];
    // Get the value
    if (sysctlbyname(name.c_str(), buf, &len, nullptr, 0) == -1) {
        std::cerr << "Failed to get sysctl value for " << name << std::endl;
        delete[] buf;
        return "";
    }
    // Convert the value to a string
    std::string value(buf, len);
    delete[] buf;
    return value;
}
// A helper function to convert a CFString to a std::string
std::string CFStringToStdString(CFStringRef cfstr) {
    if (cfstr == nullptr) return "";
    CFIndex length = CFStringGetLength(cfstr);
    CFIndex maxSize = CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8) + 1;
    char* buffer = new char[maxSize];
    if (CFStringGetCString(cfstr, buffer, maxSize, kCFStringEncodingUTF8)) {
        std::string result(buffer);
        delete[] buffer;
        return result;
    } else {
        delete[] buffer;
        return "";
    }
}
// A function to get the name of a disk from a path
std::string get_volume_name(const std::string& path) {
    // Create a CFURL from the path
    CFURLRef url = CFURLCreateFromFileSystemRepresentation(nullptr, (const UInt8*)path.c_str(), path.length(), false);
    if (url == nullptr) return "";
    // Get the volume URL from the path URL
    CFURLRef volumeURL = CFURLCreateCopyDeletingLastPathComponent(nullptr, url);
    CFRelease(url);
    if (volumeURL == nullptr) return "";
    // Get the volume name from the volume URL
    CFStringRef volumeName = nullptr;
    if (CFURLCopyResourcePropertyForKey(volumeURL, kCFURLVolumeNameKey, &volumeName, nullptr)) {
        CFRelease(volumeURL);
        if (volumeName == nullptr) return "";
        // Convert the volume name to a std::string
        std::string result = CFStringToStdString(volumeName);
        CFRelease(volumeName);
        return result;
    } else {
        CFRelease(volumeURL);
        return "";
    }
}
void change_float_constant(std::vector<uint8_t>& code, float new_value) {
    // Check that the code vector has enough bytes for the two instructions
    if (code.size() < 8) {
        throw std::invalid_argument("code vector is too small");
    }
    // Convert the new float value to a 32-bit unsigned integer representation
    uint32_t new_bits = *reinterpret_cast<uint32_t*>(&new_value);
    // Extract the lower and upper 16 bits of the new value
    uint16_t low_bits = new_bits & 0xffff;
    uint16_t high_bits = new_bits >> 16;
    // Encode the new value as two mov instructions
    // The first instruction is movz w8, #low_bits
    // The second instruction is movk w8, #high_bits, lsl #16
    // The opcode format is:
    // | 31 30 29 28 | 27 26 25 24 | 23 22 21 20 | 19 18 17 16 | 15 14 13 12 | 11 10 9 8 | 7 6 5 4 | 3 2 1 0 |
    // | sf 0 0 1 | opc 1 0 0 | 0 0 0 0 | hw | imm16 | 0 0 0 0 | Rd | 0 0 0 0 |
    // where sf = 0 for 32-bit register, opc = 00 for movz, 01 for movn, 10 for movk, hw = 00 for lsl #0, 01 for lsl #16, 10 for lsl #32, 11 for lsl #48, imm16 = 16-bit immediate value, Rd = destination register
    // The first instruction has sf = 0, opc = 00, hw = 00, imm16 = low_bits, Rd = 8
    uint32_t first_opcode = 0x52800000 | (low_bits << 5) | 8;
    // The second instruction has sf = 0, opc = 10, hw = 01, imm16 = high_bits, Rd = 8
    uint32_t second_opcode = 0x72800000 | (high_bits << 5) | (1 << 21) | 8;
    // Convert the opcodes to little endian bytes and overwrite the code vector
    code[0] = first_opcode & 0xff;
    code[1] = (first_opcode >> 8) & 0xff;
    code[2] = (first_opcode >> 16) & 0xff;
    code[3] = (first_opcode >> 24) & 0xff;
    code[4] = second_opcode & 0xff;
    code[5] = (second_opcode >> 8) & 0xff;
    code[6] = (second_opcode >> 16) & 0xff;
    code[7] = (second_opcode >> 24) & 0xff;
}
int main(int argc, char** argv) {
    // Check if the program is run as root
    if (getuid() != 0) {
        std::cerr << "Sorry, this program must be run as root" << std::endl;
        return 1;
    }
    // Check if the program has one argument
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " vram_percentage" << std::endl;
        return 1;
    }
    // Check if the argument is a valid percentage
    int percentage = std::stoi(argv[1]);
    if (percentage < 10 || percentage > 95) {
        std::cerr << "Invalid percentage: " << percentage << std::endl;
        return 1;
    }
    // Check if the CPU is Apple Silicon
    std::string cpu_brand = get_sysctl("machdep.cpu.brand_string");
    if (cpu_brand.find("Apple") == std::string::npos) {
        std::cerr << "Sorry, this program only works on Apple Silicon" << std::endl;
        return 1;
    }
    // Define the paths for the original and patched kernel collections
    std::string original_kc = "/private/var/db/KernelExtensionManagement/KernelCollections/BootKernelCollection.kc";
    std::string patched_kc = "/tmp/BootKernelCollection.kc";
    std::string final_kc = "/Library/KernelCollections/vram_patch.kc";
    std::string script_kc = "/Library/KernelCollections/complete_patch.sh";
    // Copy the original kernel collection to a temporary directory
    std::ifstream src(original_kc, std::ios::binary);
    std::ofstream dst(patched_kc, std::ios::binary);
    if (!src || !dst) {
        std::cerr << "Failed to copy the kernel collection" << std::endl;
        return 1;
    }
    dst << src.rdbuf();
    src.close();
    dst.close();
    // Read the patched kernel collection into a vector of bytes
    std::ifstream in(patched_kc, std::ios::binary);
    if (!in) {
        std::cerr << "Failed to read the patched kernel collection" << std::endl;
        return 1;
    }
    std::vector<uint8_t> data((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    in.close();
    // Define the byte sequence to search for
    std::vector<uint8_t> target = {0x08, 0x01, 0xC0, 0xD2, 0x3F, 0x00, 0x08, 0xEB, 0xA8, 0xAA, 0x8A, 0x52, 0xA8, 0x40, 0xA8, 0x72, 0x00, 0x01, 0x27, 0x1E, 0x01, 0x30, 0x27, 0x1E, 0x28, 0x8C, 0x20, 0x1E};
    // Define the byte sequence to replace with
    std::vector<uint8_t> replacement = {0x68, 0x73, 0x89, 0x52, 0xA8, 0x94, 0xA8, 0x72, 0x08, 0x01, 0x27, 0x1E, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5, 0x1F, 0x20, 0x03, 0xD5};
    // Subtract the percentage from 100 to get the reserve percentage
    percentage = 100 - percentage;
    // Change the floating point constant in the replacement byte sequence
    change_float_constant(replacement, percentage);
    // Find the first occurrence of the target byte sequence in the data vector
    auto it = std::search(data.begin(), data.end(), target.begin(), target.end());
    // If the target byte sequence is found, replace it with the replacement byte sequence
    if (it != data.end()) {
        std::copy(replacement.begin(), replacement.end(), it);
    } else {
        std::cerr << "Failed to find the target byte sequence in the kernel collection" << std::endl;
        return 1;
    }
    // Write the modified data vector to the patched kernel collection
    std::ofstream out(patched_kc, std::ios::binary);
    if (!out) {
        std::cerr << "Failed to write the patched kernel collection" << std::endl;
        return 1;
    }
    out.write(reinterpret_cast<char*>(data.data()), data.size());
    out.close();
    // Copy the patched kernel collection to the final destination
    std::ifstream src2(patched_kc, std::ios::binary);
    std::ofstream dst2(final_kc, std::ios::binary);
    if (!src2 || !dst2) {
        std::cerr << "Failed to copy the patched kernel collection" << std::endl;
        return 1;
    }
    dst2 << src2.rdbuf();
    src2.close();
    dst2.close();
    // Change the ownership of the final kernel collection to root:wheel
    if (chown(final_kc.c_str(), 0, 0) == -1) {
        std::cerr << "Failed to change the ownership of the final kernel collection" << std::endl;
        return 1;
    }
    // Get the volume name for the /System path
    std::string volume_name = get_volume_name("/System");
    if (volume_name.empty()) {
        std::cerr << "Failed to get the volume name for /System" << std::endl;
        return 1;
    }
    // Create a shell script to configure the boot with the final kernel collection
    std::ofstream script(script_kc);
    if (!script) {
        std::cerr << "Failed to create the shell script" << std::endl;
        return 1;
    }
    script << "#!/bin/bash\n";
    script << "# Disable System Integrity Protection\n";
    script << "csrutil disable\n";
    script << "# Disable Apple Mobile File Integrity\n";
    script << "nvram boot-args=\"ipc_control_port_options=0 amfi_get_out_of_my_way=1\"\n";
    script << "# Get the absolute path of this script\n";
    script << "SCRIPT=$(cd \"$(dirname \"${BASH_SOURCE[0]}\")\" && pwd)/$(basename \"${BASH_SOURCE[0]}\")\n";
    script << "# Append the kernel collection name to the script path\n";
    script << "KC=\"${SCRIPT%/*}/vram_patch.kc\"\n"; // This line replaces the original one
    script << "# Get the volume name from the script path\n";
    script << "VOLUME=\"${SCRIPT%%/Library*}\"\n";
    script << "# Run the kmutil command with the absolute paths\n";
    script << "kmutil configure-boot -C -c \"$KC\" -v \"$VOLUME\"\n";
    script << "sync\n";
    script << "echo Finished. You may reboot your system now.\n";
    script.close();
    // Change the ownership and permissions of the shell script to root:wheel and 755
    if (chown(script_kc.c_str(), 0, 0) == -1) {
        std::cerr << "Failed to change the ownership of the shell script" << std::endl;
        return 1;
    }
    if (chmod(script_kc.c_str(), 0755) == -1) {
        std::cerr << "Failed to change the permissions of the shell script" << std::endl;
        return 1;
    }
    // Output the instructions to run the shell script in RecoveryOS
    std::cout << "The shell script to configure the boot with the patched kernel collection has been created at " << script_kc << std::endl;
    std::cout << "Now reboot to RecoveryOS and run: \"/Volumes/" << volume_name << "/Library/KernelCollections/complete_patch.sh\"" << std::endl;
    std::cout << "If your boot fails after running the script or after an update, you need to go to Utilities->Startup Security Tool in RecoveryOS and pick Full Security" << std::endl;
    return 0;
}
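As a sanity check on change_float_constant, the movz/movk immediate encoding it emits can be reproduced in a few lines of Python. This is an illustrative sketch, not part of the patcher; the function name is made up.

```python
import struct

def encode_float_as_movz_movk(value, rd=8):
    """Encode `movz wRd, #lo16` + `movk wRd, #hi16, lsl #16` so that wRd ends up
    holding the IEEE-754 bit pattern of `value`; returns 8 little-endian bytes."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    lo, hi = bits & 0xFFFF, bits >> 16
    movz = 0x52800000 | (lo << 5) | rd                # movz, hw=00 (lsl #0)
    movk = 0x72800000 | (1 << 21) | (hi << 5) | rd    # movk, hw=01 (lsl #16)
    return struct.pack("<II", movz, movk)
```

For example, requesting 75 % VRAM leads to `change_float_constant(replacement, 25.0)` after the `100 - percentage` step, and the first eight replacement bytes become this movz/movk pair loading the bits of 25.0f into w8.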
0 replies
CyborgArmy83
on Aug 14, 2023
Not quite sure; you need SIP disabled to write to /Library/KernelCollections, or at least you need it to delete from there, which is necessary if you patch again. Maybe you could avoid it by writing to another path that's not SIP protected, but I don't know where. FWIW, the physical file on disk isn't the actual kernel that gets booted, because you can rm or corrupt the entire file you set-boot-object'd and it will still boot.
Hey @r3muxd
As you might know the new MacOS 14 (Sonoma) Public Beta 3 has been released and includes some updated CoreML and other libraries which are used in AI development. I had to update to the 14.0 beta to work on some AI projects that required the updated frameworks.
Do you think this script works on MacOS 14 Beta or will this code need an update? I will be happy to help/test and figure it out with you! However, I can use the help and you already have some expertise from creating this in the first place :)
I tried the script before on my older Macbook Pro M1 Pro and it seemed to work but I am just a bit on the cautious side with my new M1 Max MBP.
0 replies
dr3murr
on Aug 20, 2023
Author
Just try it; worst comes to worst you'll need to flip security back on when you get booted into recoveryOS if there's an error.
If the pattern is wrong for the new macOS, then you'll get an error running the script and nothing will happen.
…
0 replies
crasm
on Oct 25, 2023
@r3muxd I’m having some trouble following the instructions for my system on the latest (stable) macOS Ventura. When I boot into recovery, the directory /Volumes/Macintosh HD/Library is empty.
The KernelCollections directory on my system is available at /Volumes/Data/Library/KernelCollections after mounting the Macintosh HD - Data volume. However, reading the script, it seems unlikely the system will boot if it’s not on the main volume? I am not familiar with the macOS boot process.
Did you need to do any additional steps after booting into recovery, before running the script?
Thanks for any help! I’m looking forward to running Q4_K_M falcon 180B if I can get this to work.
6 replies
@dr3murr
dr3murr
on Oct 31, 2023
Author
In any case, it should be fine to just attempt to run the script from the other location. If MacOS fails to boot, you will get a prompt telling you to remove the custom kernel, which you can do trivially by clicking Recovery->Utilities->Startup Security Tool and resetting it to default.
@crasm
crasm
on Nov 1, 2023
Success! I was able to load and run inference on Q4_K Falcon 180B with my 128GB Mac Studio.
For anybody else trying to get this to work:
1. Run diskutil list and look for a line like 5: APFS Volume Data 1.4 TB disk3s5
2. Mount that volume with diskutil apfs unlock /dev/<volume>
3. Run /Volumes/Data/Library/KernelCollections/complete_patch.sh
@r3muxd Thank you very much for your help and assurances
Notes:
The output of your program is Macintosh HD normally and macOS Base System in recovery.
There are some warnings and errors compiling vram_patcher.cpp, but it works fine
./complete_patch.sh in recovery gave an error, but it works fine
Error setting variable – 'boot-args': (iokit/common) not permitted.
@hlo-world
hlo-world
on Nov 21, 2023
Thanks crasm, can confirm it works on Sonoma
@JasonOSX
JasonOSX
on Nov 21, 2023
Did you patch it through recovery or else? and did you disable SIP or more? asking because I just upgraded to the latest Sonoma
@hlo-world
hlo-world
on Nov 22, 2023
Yea you have to reduce the security and disable SIP in Recovery.
ddh0
on Nov 3, 2023
Hi, thanks for this. I'm getting this error: Failed to copy the patched kernel collection. Reading the code I see it's just copying a file so I'm not sure what's up. I've checked to make sure both the parent directories exist, and I'm running the program as root.
I'm on macOS Sonoma 14.1, M2 Air 24GB. Trying to set percent to 87.5. Any help would be appreciated. Thanks.
3 replies
@crasm
crasm
on Nov 3, 2023
Did you disable SIP?
@ddh0
ddh0
on Nov 3, 2023
Ah, ok. That wasn't mentioned. Thank you!
@ddh0
ddh0
on Nov 3, 2023
Works great now! Running mistral 7b FP16 on metal which I couldn't quite do before.
kdyke
on Nov 29, 2023
Folks... just do sudo sysctl iogpu.wired_limit_mb=<mb> from Terminal. You'd have to do it every boot as it's not sticky, but that seems better than patching your OS and disabling SIP. I would not recommend going to 100%: the OS needs a reasonable amount of memory for everything else that isn't wired down by the GPU driver stack, and things will start to go seriously wrong if you don't leave enough space.
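To turn that advice into a number, one simple rule of thumb is to subtract a fixed OS reserve from total RAM and convert to MiB. The 8 GB default reserve below is an arbitrary illustration, not a figure anyone in this thread recommends; size it to your own workload.

```python
def wired_limit_mb(total_ram_gb, os_reserve_gb=8):
    """Candidate value for iogpu.wired_limit_mb: total RAM minus an OS reserve, in MiB."""
    assert total_ram_gb > os_reserve_gb, "leave some memory for the OS"
    return (total_ram_gb - os_reserve_gb) * 1024
```

So a 128 GB machine with an 8 GB reserve would use sudo sysctl iogpu.wired_limit_mb=122880.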
14 replies
@sammcj
sammcj
on Dec 2, 2023
Ohhhh I see, thank you! :)
@ingochris
ingochris
on Dec 4, 2023
Thank you so much!
@crasm
crasm
on Dec 4, 2023
For nix-darwin, you can use:
launchd.daemons."sysctl-vram-limit" = {
  command = "/usr/sbin/sysctl iogpu.wired_limit_mb=115200";
  serviceConfig.RunAtLoad = true;
};
@tusing
tusing
on Dec 25, 2023
Thanks for the Nix instructions, did not realize this is why it wasn't working!
@beginor
beginor
on May 29, 2024
indeed, it works!
kdyke
on Nov 29, 2023
It’s expecting a value in megabytes. So if you’re using RAM math (base 2) instead of HDD/SSD math (base 10), for 60GB you’d want 60*1024 = 61440
…
1 reply
@ddh0
ddh0
on Nov 29, 2023
mine was base 10 math, where 20000 was 20GB
marcingomulkiewicz
on Apr 5, 2024
(Sorry for necrothreading, but I've spent some serious time looking for the solution for this particular problem, and I think I found the easiest one)
What worked for me (Sonoma) is to create sysctl.conf, and add appropriate parameter to it:
sudo nano /etc/sysctl.conf
and inside:
# change default CPU/GPU RAM split
iogpu.wired_limit_mb=28672
(as already mentioned, the number is max RAM in MB that GPU can have; select some value reasonable for your machine).
I'm not sure why /etc/sysctl.conf does not exist by default (good old unix habit is to put some template with commented out defaults), but as far as I know sysctl is not deprecated, so it should keep working.
4 replies
@Summit1122
Summit1122
on Aug 8, 2024
how can I check if this worked?
@beginor
beginor
on Aug 9, 2024
Yes, it works! Just run any model with llama-cli or llama-server and check the ggml_metal_init: recommendedMaxWorkingSetSize value in the console output, which is affected by iogpu.wired_limit_mb.
@trevcodner
trevcodner
on Feb 14
Morning, can I confirm that this works instead of the script above, or does it require that script too? I suspect it's just the command line, but wanted to check. Thanks
@jared-krauss
jared-krauss
on Apr 13
(Sorry for necrothreading, but I've spent some serious time looking for the solution for this particular problem, and I think I found the easiest one)
What worked for me (Sonoma) is to create sysctl.conf, and add appropriate parameter to it:
sudo nano /etc/sysctl.conf
and inside:
# change default CPU/GPU RAM split
iogpu.wired_limit_mb=28672
(as already mentioned, the number is max RAM in MB that GPU can have; select some value reasonable for your machine).
I'm not sure why /etc/sysctl.conf does not exist by default (good old unix habit is to put some template with commented out defaults), but as far as I know sysctl is not deprecated, so it should keep working.
No no, it's appreciated, because I too am bringing this back a year later to say this is what I needed to make bigger Gaussian Splats; I was running out of space and getting killed with SIGKILL.
Azirine
3 weeks ago
I found that even after doing sudo nano /etc/sysctl.conf, going slightly above the original limit (¾ × 128 = 96 GiB in my case) makes models load much slower with a long swap/unswap process. #13361
0 replies
https://huggingface.co/rednote-hilab/dots.llm1.inst
ln -s ~/.lmstudio/models/rednote-hilab/dots.llm1.inst/dots.llm1.inst-UD-TQ1_0.gguf
cmake . -B ./build
cmake --build build --config Release -j
build/bin/llama-cli \
  --model models/dots.llm1.inst-UD-TQ1_0.gguf \
  --temp 0 \
  --top_p 0.95 \
  --min_p 0 \
  --ctx-size 4096
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-cli --model models/dots.llm1.inst-UD-TQ1_0.gguf --temp 0 --top_p 0.95 --min_p 0 --ctx-size 4096
(torch311) ljubomir@macbook2(:):~/llama.cpp$ sudo sysctl iogpu.wired_limit_mb=80000
(torch311) ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli --model models/dots.llm1.inst-UD-TQ1_0.gguf --temp 0 --top_p 0.95 --min_p 0 --ctx-size 4096
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-server --model models/dots.llm1.inst-UD-TQ1_0.gguf --temp 0 --top_p 0.95 --min_p 0 --ctx-size 4096 &
# access on http://127.0.0.1:8080
https://huggingface.co/unsloth/dots.llm1.inst-GGUF/discussions/1
Hm, I see it's working in llama.cpp. And I see LMStudio got an upgrade today:
You are all up to date! The current version is 0.3.17
But still, the llama.cpp runtime bundled with LM Studio on my macOS machine is from one month ago:
Metal llama.cpp
v1.33.0
Engine
Apple Metal accelerated llama.cpp engine
Latest Version Installed
- Llama 4 vision support
- Enable any LLM to be used as an embedding model (requires LM Studio 0.3.16-b6)
- Fixed prompt processing bugs when chats exceed context length
- llama.cpp updated to b5459 (commit 8a1d206)
I see the 8a1d206 commit in llama.cpp is from 2025-05-22:
8a1d206f ggerganov@gmail.com 2025-05-22 22:21:07 +0300 : tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
I see a later commit from 2025-06-15 that looks important to have:
9ae4143b mikjuo@gmail.com 2025-06-15 00:52:06 -0700 : model : add dots.llm1 architecture support (#14044) (#14118)
I guess I'll just wait some more for LM Studio to move to a newer llama.cpp.
On trying to load dots.llm1 I'm still getting
error loading model: error loading model architecture: unknown model architecture: 'dots1'
and that kind of makes sense.
LJ Thu 26 Jun 2025 00:04:46 BST
Run with larger context and flash attention:
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-server --model models/dots.llm1.inst-UD-TQ1_0.gguf --temp 0 --top_p 0.95 --min_p 0 --ctx-size 32768 --flash-attn --cache-type-k f16 --cache-type-v f16 &
# access on http://127.0.0.1:8080
The caches with 32K context and f16 take ~32GB:
llama_kv_cache_unified: size = 31744.00 MiB ( 32768 cells, 62 layers, 1 seqs), K (f16): 15872.00 MiB, V (f16): 15872.00 MiB
So together with the mmap-ed model this may take ~80GB (32GB + 45GB = 77GB):
ljubomir@macbook2(:):~/llama.cpp$ l models/dots.llm1.inst-UD-TQ1_0.gguf
lrwx------@ 1 ljubomir staff 90B 25 Jun 23:14 models/dots.llm1.inst-UD-TQ1_0.gguf -> /Users/ljubomir/.lmstudio/models/rednote-hilab/dots.llm1.inst/dots.llm1.inst-UD-TQ1_0.gguf
ljubomir@macbook2(:):~/llama.cpp$ l /Users/ljubomir/.lmstudio/models/rednote-hilab/dots.llm1.inst/dots.llm1.inst-UD-TQ1_0.gguf
-rw-r--r--@ 1 ljubomir staff 45G 25 Jun 22:17 /Users/ljubomir/.lmstudio/models/rednote-hilab/dots.llm1.inst/dots.llm1.inst-UD-TQ1_0.gguf
Reduce the 32GB of caches to 16GB with the q8_0 cache type, for a total of <75GB RAM and 16 tps. Added --jinja after a comment here: https://www.reddit.com/r/LocalLLaMA/comments/1ljrwrq/comment/mztn9xo/?context=3
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-server --model models/dots.llm1.inst-UD-TQ1_0.gguf --temp 0 --top_p 0.95 --min_p 0 --ctx-size 32768 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
LJ Thu 26 Jun 2025 01:38:40 BST
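The 31744.00 MiB figure in that log line can be checked from first principles. The per-layer K (or V) width of 4096 elements below is the value implied by the numbers in the log (15872 MiB / (32768 cells × 62 layers × 2 bytes)), used here only to verify the arithmetic:

```python
def kv_cache_mib(n_ctx, n_layers, kv_width, bytes_per_element=2):
    """Total K+V cache in MiB: two tensors of n_ctx x n_layers x kv_width elements."""
    return 2 * n_ctx * n_layers * kv_width * bytes_per_element / 1024**2
```

This matches the f16 case exactly; switching the cache type to q8_0 roughly halves the per-element cost, which is the ~32GB-to-~16GB reduction noted above.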
Gemma 3n is here! 🎉
🔊Multimodal (text/audio/image/video) understanding
🤯Runs with as little as 2GB of RAM
🏆First model under 10B with @lmarena_ai score of 1300+
https://x.com/osanseviero/status/1938277414687121531
https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
git pull
mviv build{,.1}
cmake . -B ./build
cmake --build build --config Release -j
mviv ~/Downloads/gemma-3n-E4B-it-UD-Q8_K_XL.gguf models/
build/bin/llama-server --model models/gemma-3n-E4B-it-UD-Q8_K_XL.gguf --temp 1.0 --top_k 64 --top_p 0.95 --min_p 0 --ctx-size 32768 &
# access on http://127.0.0.1:8080
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF/tree/main
Set temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0
Gemma 3n max tokens (context length): 32K. Gemma 3n chat template:
<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\nHey there!<end_of_turn>\n<start_of_turn>user\nWhat is 1+1?<end_of_turn>\n<start_of_turn>model\n
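That template can be rendered with a small helper; this is an illustrative sketch only, since llama-server applies the model's real template itself (e.g. via --jinja):

```python
def gemma3n_prompt(turns):
    """Render (role, text) pairs with the Gemma 3n chat template; roles are 'user' or 'model'."""
    prompt = "<bos>"
    for role, text in turns:
        prompt += f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
    # End with an open model turn so the model generates the next reply
    return prompt + "<start_of_turn>model\n"
```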
Add --jinja?
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$ build/bin/llama-server --model models/gemma-3n-E4B-it-UD-Q8_K_XL.gguf --temp 1.0 --top_k 64 --top_p 0.95 --min_p 0 --ctx-size 32768 --jinja &
Works!
LJ Thu 26 Jun 18:47:28 BST 2025
cd llama.cpp/
cd models/
ln -s ~/.lmstudio/models/DevQuasar/tencent.Hunyuan-A13B-Instruct-GGUF/tencent.Hunyuan-A13B-Instruct.Q3_K_M.gguf
cd ..
git pull
mviv build{,.1}
unset CC CXX
unset LDFLAGS
unset CPPFLAGS
env |egrep 'CC|CXX|FLAGS'
brew install libomp
rm -rf build
cmake . -B ./build \
-DCMAKE_C_FLAGS="-Xpreprocessor -fopenmp -I/opt/homebrew/opt/libomp/include" \
-DCMAKE_CXX_FLAGS="-Xpreprocessor -fopenmp -I/opt/homebrew/opt/libomp/include" \
-DOpenMP_C_LIB_NAMES="libomp" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="/opt/homebrew/opt/libomp/lib/libomp.dylib" \
-DCMAKE_EXE_LINKER_FLAGS="-L/opt/homebrew/opt/libomp/lib"
cmake --build build --config Release -j
Originally: tencent.Hunyuan-A13B-Instruct.Q3_K_M.gguf by DevQuasar
ljubomir@macbook2(:):~/llama.cpp/models$ l tencent.Hunyuan-A13B-Instruct.Q3_K_M.gguf
lrwx------@ 1 ljubomir staff 119B 6 Jul 01:06 tencent.Hunyuan-A13B-Instruct.Q3_K_M.gguf -> /Users/ljubomir/.lmstudio/models/DevQuasar/tencent.Hunyuan-A13B-Instruct-GGUF/tencent.Hunyuan-A13B-Instruct.Q3_K_M.gguf
Later: Hunyuan-A13B-Instruct-IQ4_NL.gguf by Unsloth, from https://huggingface.co/unsloth/Hunyuan-A13B-Instruct-GGUF
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l models/Hunyuan-A13B-Instruct-IQ4_NL.gguf
-rw-r--r--@ 1 ljubomir staff 42G 10 Jul 22:59 models/Hunyuan-A13B-Instruct-IQ4_NL.gguf
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l ~/.lmstudio/models/unsloth/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ4_NL.gguf
lrwx------@ 1 ljubomir staff 66B 10 Jul 23:24 /Users/ljubomir/.lmstudio/models/unsloth/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ4_NL.gguf -> /Users/ljubomir/llama.cpp/models/Hunyuan-A13B-Instruct-IQ4_NL.gguf
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-server --model models/Hunyuan-A13B-Instruct-IQ4_NL.gguf --temp 0.8 --top_p 0.95 --min_p 0.05 --top_k 40 --repeat-penalty 1.1 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
LJ Sun 6 Jul 2025 09:51:26 BST
Baidu ERNIE 4.5
https://huggingface.co/collections/baidu/ernie-45-6861cd4c9be84540645f35c9
https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT
https://huggingface.co/unsloth/ERNIE-4.5-21B-A3B-PT-GGUF
ERNIE-4.5-21B-A3B-PT-UD-Q6_K_XL.gguf
build/bin/llama-server --model models/ERNIE-4.5-21B-A3B-PT-UD-Q6_K_XL.gguf --temp 0.8 --top_p 0.8 --min_p 0.05 --top_k 40 --repeat-penalty 1.1 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
prompt eval time = 3371.18 ms / 2700 tokens ( 1.25 ms per token, 800.91 tokens per second)
eval time = 21118.88 ms / 1039 tokens ( 20.33 ms per token, 49.20 tokens per second)
total time = 24490.06 ms / 3739 tokens
Gets ~50 tps
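The ~50 tps figure can be recomputed from the server's timing lines; a small sketch parsing the two log lines quoted above:

```python
import re

# Recompute tokens/sec from llama-server's timing output.
log = """\
prompt eval time = 3371.18 ms / 2700 tokens
eval time = 21118.88 ms / 1039 tokens
"""
rates = {}
for line in log.strip().splitlines():
    ms, tokens = re.search(r"([\d.]+) ms / (\d+) tokens", line).groups()
    name = "prompt" if line.startswith("prompt") else "eval"
    rates[name] = float(tokens) / (float(ms) / 1000.0)
    print(f"{name}: {rates[name]:.2f} tokens per second")
```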
TODO try <77GB 300B-A47B TQ1_0 model
https://huggingface.co/unsloth/ERNIE-4.5-300B-A47B-PT-GGUF
LJ Sat 19 Jul 2025 14:43:56 BST
Use local code agent with aider - devstral
https://github.com/Aider-AI/aider/issues/4058
Benchmark for Devstral? #4058
https://github.com/Aider-AI/aider/issues/4058#issuecomment-2960920629
psymonryan (last month, edited):
My tips:
Use the largest quant you can fit in memory
Use 'diff' format rather than 'whole'
Here are my llama-server settings:
/Users/simon/models/bin-arm64/llama-server
--host 0.0.0.0 --port 8284 --flash-attn --slots
--ctx-size 24576
--model /Users/simon/models/localmodels/mistralai_Devstral-Small-2505-Q8_0.gguf
-ngl 99 -ngld 99
--cache-type-k q8_0 --cache-type-v q8_0
--seed "-1"
--temp 0.15
--repeat-penalty 1.0
--min-p 0.01
--top-k 64
--top-p 0.95
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
--dry-multiplier 0.5
And my .aider.conf.yml:
## Specify the api base urls (/v1 needed for llama-swap)
openai-api-base: http://m3macbook.mylocal.lan:8012/v1
alias:
- "thinking:openai/QwQ"
- "thinker:openai/QwQ"
- "QwQ:openai/QwQ"
- "coding:openai/Qwen2.5-Coder-32B"
- "coder:openai/Qwen2.5-Coder-32B"
- "codersmall:openai/Qwen2.5-Coder-14B"
- "gemma:openai/Gemma-3-27B-qat-Q4"
- "mistral:openai/mistral"
- "glm:openai/glm"
- "qwen3:openai/Qwen3-32B"
- "qwen3nt:openai/Qwen3-32B-nt"
- "cogito:openai/cogito"
- "devstral:openai/devstral"
- "devstral_long:openai/devstral_long"
auto-commits: false
dark-mode: true
model: qwen3
editor-model: devstral
# Don't bother with a separate weak model, just re-use the editor-model (since it's already loaded in llama-swap)
weak-model: devstral
show-model-warnings: false
multiline: false
watch-files: true
# whole is slower but better for small models (but it messes up your comments)
# udiff seems to mostly work with Qwen2.5-Coder-32B
# diff format is all you need with devstral as it is tuned for agent use
edit-format: diff
editor-edit-format: diff
# map-tokens: 0
read: .aider.conventions.md
I'm using it daily for work and for my personal mindmap project
It does occasionally fail to edit on the first or second attempt, but mostly self-corrects, whereas Qwen2.5-Coder-32B at the quant I am using does not.
I found it is way better with the Q8 quant and with KV quant size set to Q8 also.
I do also occasionally run it in 'long context' mode with the Bartowski Q6_K_L quant and the KV cache set to q4_0; it works but is less reliable.
Also it depends on how complex your instructions are. If you are using it in architect mode, then it should get good instructions from Qwen3 or QwQ. (i.e. don't expect it to think or understand too deeply; use instructions like: "In function X add code to add Y functionality")
Hope this helps!
https://aider.chat/docs/config/aider_conf.html
Configuration: YAML config file
Most of aider’s options can be set in an .aider.conf.yml file. Aider will look for this file in these locations:
Your home directory.
The root of your git repo.
The current directory.
If the files above exist, they will be loaded in that order. Files loaded last will take priority.
You can also specify the --config <filename> parameter, which will only load the one config file.
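The lookup order above amounts to a last-wins merge; a tiny sketch (the keys are illustrative, not aider's full option set):

```python
# Sketch of aider's config precedence: files load home -> repo root ->
# current dir, and later files override earlier keys (last wins).
home = {"model": "qwen3", "dark-mode": True}
repo_root = {"model": "devstral"}
cwd = {"auto-commits": False}

merged = {}
for cfg in (home, repo_root, cwd):  # loaded in this order
    merged.update(cfg)              # keys from later files take priority

print(merged)  # {'model': 'devstral', 'dark-mode': True, 'auto-commits': False}
```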
LJ Thu 10 Jul 2025 21:11:04 BST
https://huggingface.co/unsloth/Hunyuan-A13B-Instruct-GGUF
If you are using llama.cpp, use --jinja:
./llama.cpp/llama-cli -hf unsloth/Hunyuan-A13B-Instruct-GGUF:Q4_K_XL -ngl 99 --jinja --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.05 --repeat-penalty 1.05
sudo sysctl iogpu.wired_limit_mb=80000
build/bin/llama-server --model models/Hunyuan-A13B-Instruct-IQ4_NL.gguf --temp 0.7 --top_p 0.8 --min_p 0.05 --top_k 20 --repeat-penalty 1.05 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
Start aider in your working directory:
aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key --model hunyuan-a13b
# .aider.conf.yml
# sudo sysctl iogpu.wired_limit_mb=80000
# build/bin/llama-server --model models/Hunyuan-A13B-Instruct-IQ4_NL.gguf --temp 0.7 --top_p 0.8 --min_p 0.05 --top_k 20 --repeat-penalty 1.05 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
# Start aider in your working directory
# aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key --model hunyuan-a13b
# Specifies the base URL for the OpenAI-compatible API provided by llama.cpp.
# The /v1 endpoint is standard for OpenAI API compatibility.
openai-api-base: http://127.0.0.1:8080/v1
# A dummy API key is required by aider for the OpenAI API, but not used by llama.cpp.
openai-api-key: dummy-key
# Define aliases for your local model.
# The model name 'hunyuan-a13b' is derived from your GGUF file name
# (Hunyuan-A13B-Instruct-IQ4_NL.gguf), which llama.cpp typically exposes.
alias:
  - "hunyuan:openai/hunyuan-a13b" # A convenient alias for your model
  - "local:openai/hunyuan-a13b"   # Another general alias for your local model
# Set your primary model to the one served by llama.cpp.
model: hunyuan-a13b
# For editor and weak models, if you are only serving one model via llama.cpp,
# it's best to point them to the same model.
editor-model: hunyuan-a13b
weak-model: hunyuan-a13b
# General Aider settings (you can adjust these as needed):
auto-commits: true
dark-mode: false
multiline: false
watch-files: true
show-model-warnings: false
# Edit format for how aider applies changes. 'diff' is generally robust.
edit-format: diff
editor-edit-format: diff
# If you have a conventions file, specify it here.
# read: .aider.conventions.md
But aider doesn't know about that endpoint. So use the Mistral coding model next.
$ cat ~/z/itrade/src/.aider.conf.yml
# .aider.conf.yml
# 1) Run with hunyuan-a13b
# sudo sysctl iogpu.wired_limit_mb=80000
# build/bin/llama-server --model models/Hunyuan-A13B-Instruct-IQ4_NL.gguf --temp 0.7 --top_p 0.8 --min_p 0.05 --top_k 20 --repeat-penalty 1.05 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
# Start aider in your working directory
# aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key --model hunyuan-a13b
# 2) Run with mistral devstral
# build/bin/llama-server --model models/mistralai_Devstral-Small-2507-Q6_K_L.gguf --temp 0.7 --top_p 0.8 --min_p 0.05 --top_k 20 --repeat-penalty 1.05 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
# Start aider in your working directory
# aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key
# .aider.conf.yml
# Specifies the base URL for the OpenAI-compatible API provided by llama.cpp.
# The /v1 endpoint is standard for OpenAI API compatibility.
openai-api-base: http://127.0.0.1:8080/v1
# A dummy API key is required by aider for the OpenAI API, but not used by llama.cpp.
openai-api-key: dummy-key
# Define aliases for your local models.
# IMPORTANT: Prefix the model name with "openai/" to correctly route through LiteLLM.
# This tells LiteLLM that even though it's a local server, it's mimicking the OpenAI API.
# Alias for your Hunyuan model
alias:
  - "hunyuan:openai/hunyuan-a13b"
  - "local-hunyuan:openai/hunyuan-a13b"
  # Alias for your new Devstral Mistral model
  # Assuming llama.cpp exposes this model as 'mistralai_Devstral-Small-2507-Q6_K_L'
  # (which is typically derived from the GGUF filename without the .gguf extension).
  - "devstral:openai/mistralai_Devstral-Small-2507-Q6_K_L"
  - "local-devstral:openai/mistralai_Devstral-Small-2507-Q6_K_L"
# Set your primary model. You can switch this to 'devstral' if you want it as default.
# For now, let's keep it as hunyuan-a13b as that was your last active model.
# Remember to use the 'openai/' prefix here too.
model: openai/devstral
# For editor and weak models, point them to your desired local model with the prefix.
editor-model: openai/devstral
weak-model: openai/devstral
# General Aider settings
auto-commits: true
dark-mode: false
multiline: false
watch-files: true
show-model-warnings: false
# Edit format for how aider applies changes. 'diff' is generally robust.
edit-format: diff
editor-edit-format: diff
# If you have a conventions file, specify it here.
# read: .aider.conventions.md
LJ Fri 11 Jul 2025 20:37:40 BST
https://huggingface.co/ByteDance-Seed/Seed-X-PPO-7B
ByteDance-Seed/Seed-X-PPO-7B
Seed-X-PPO-7B
Introduction
We are excited to introduce Seed-X, a powerful series of open-source multilingual translation language models, including an instruction model, a reinforcement learning model, and a reward model. It pushes the boundaries of translation capabilities within 7 billion parameters. We develop Seed-X as an accessible, off-the-shelf tool to support the community in advancing translation research and applications.
https://huggingface.co/mradermacher/Seed-X-PPO-7B-GGUF
$ l ~/llama.cpp/models/Seed-X-PPO-7B.Q8_0.gguf
-rw-r--r--@ 1 ljubomir staff 7.4G 21 Jul 09:10 /Users/ljubomir/llama.cpp/models/Seed-X-PPO-7B.Q8_0.gguf
https://huggingface.co/collections/DavidAU/coder-and-programming-models-moe-reg-imatrix-686357166b1f0d4322ad3e2c
https://huggingface.co/DavidAU/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B
DavidAU/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B
Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B
This repo contains the full precision source code, in "safe tensors" format to generate GGUFs, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly.
Coder MOE with 2 top coder models in a Mixture of Experts config, using the full power of each model to code in a 19B model.
Included:
Qwen/Qwen2.5-Coder-7B-Instruct (500+ likes; all major + many minor programming languages)
open-r1/OlympicCoder-7B (179+ likes; all major + many minor programming languages)
TWO models all working together to code, with Qwen2.5-Coder-7B-Instruct as a shared expert too.
Default config is 2 experts activated.
NOTE: All experts help with coding, regardless of how many you have activated.
SETTINGS:
Temp .5 to .7 (or lower)
Max Context is 32k
topk: 20, topp: .8, minp: .05
rep pen: 1.05-1.1 (can be lower)
Jinja Template (embedded) or CHATML template.
A System Prompt is not required. (ran tests with blank system prompt)
MODELS in THIS MOE - see each for more information, benchmarks and how they operate:
https://huggingface.co/open-r1/OlympicCoder-7B
https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct
For more information / other Qwen/Mistral Coders / additional settings see:
[ https://huggingface.co/DavidAU/Qwen2.5-MOE-2x-4x-6x-8x__7B__Power-CODER__19B-30B-42B-53B-gguf ]
Help, Adjustments, Samplers, Parameters and More
CHANGE THE NUMBER OF ACTIVE EXPERTS:
See this document:
https://huggingface.co/DavidAU/How-To-Set-and-Manage-MOE-Mix-of-Experts-Model-Activation-of-Experts
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:
In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;
Set the "Smoothing_factor" to 1.5
: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"
: in text-generation-webui -> parameters -> lower right.
: In Silly Tavern this is called: "Smoothing"
NOTE: For "text-generation-webui"
-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)
Source versions (and config files) of my models are here:
https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be
OTHER OPTIONS:
Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")
If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This is a "Class 1" model:
For all settings used for this model (including specifics for its "class"), example generations, and an advanced settings guide (which often addresses model issues and covers ways to improve performance for all use cases, including chat and roleplay), please see:
[ https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]
https://huggingface.co/mradermacher/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B-i1-GGUF
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l models/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B.i1-Q5_K_M.gguf
-rw-r--r--@ 1 ljubomir staff 13G 21 Jul 08:53 models/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B.i1-Q5_K_M.gguf
Moved from .bashrc, TODO: try running
# Running aider locally - MLX:
# 1a) Start mlx server: $ uvx --from mlx-lm mlx_lm.server --model mlx-community/Qwen3-8B-8bit
# 1b) Then start Aider in your working directory: $ uvx aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key --model mlx-community/Qwen3-8B-8bit
# Running aider locally - GGUF:
# 2a) Start llama.cpp server http://127.0.0.1:8080. NB to extend 32K->128K context add "--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768".
# $ llama-server --ctx-size 40960 --model ~/.lmstudio/models/unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL.gguf --model-draft ~/.lmstudio/models/unsloth/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf --top-p 0.95 --top-k 20 --min-p 0 --repeat-penalty 1.1 --draft-max 10 &
# $ llama-server --ctx-size 131072 --model ~/.lmstudio/models/x0000001/Qwen3-30B-A6B-16-Extreme-128k-context-Q6_K-GGUF/qwen3-30b-a6b-16-extreme-128k-context-q6_k.gguf --top-p 0.95 --top-k 100 --min-p 0.05 --repeat-penalty 64 --override-kv qwen3moe.expert_used_count=int:16 &
# 2b) Start aider in your working directory: $ aider --openai-api-base http://localhost:8080 --openai-api-key dummy-key --model qwen3-30b-a3b --model-metadata-file qwen3-30b-a3b.json
# 2c) The config file is: $ cat qwen3-30b-a3b.json
# {
# "model": "qwen3-30b-a3b",
# "max_input_tokens": 32768,
# "max_output_tokens": 8192,
# "input_cost_per_token": 0.0,
# "output_cost_per_token": 0.0,
# "show_warnings": true
# }
# https://www.reddit.com/r/LocalLLaMA/comments/1kmlu2y/comment/mser4ff/
#
# model_settings-yaml:
#
# - name: openai/Qwen3-30B-A3B-UD-Q4_K_XL.gguf
# edit_format: whole
# use_repo_map: true
# use_temperature: 0.7
# streaming: false
# system_prompt_prefix: "/no_think"
# extra_params:
# top_p: 0.8
# max_tokens: 24000
# top_k: 20
# min_p: 0.0
# temperature: 0.7
# enable_thinking: false
#
LJ Sat 12 Jul 2025 11:24:17 BST
Make Devstral-Small-2507 work with Aider localhost
mistralai/Devstral-Small-2507
https://huggingface.co/mistralai/Devstral-Small-2507
./llama-server -m mistralai/Devstral-Small-2507_gguf/Devstral-Small-2507-Q4_K_M.gguf -c 0
# -c configures the context size; 0 means the model's default, here 128k.
LJ Sat 12 Jul 2025 11:37:41 BST
Make Kimi-Dev-72B work with Aider localhost
Kimi-Dev https://github.com/MoonshotAI/Kimi-Dev
moonshotai/Kimi-Dev-72B https://huggingface.co/moonshotai/Kimi-Dev-72B
unsloth/Kimi-Dev-72B-GGUF https://huggingface.co/unsloth/Kimi-Dev-72B-GGUF
bullerwins/Kimi-Dev-72B-GGUF https://huggingface.co/bullerwins/Kimi-Dev-72B-GGUF
Kimi-Dev-72B LOCAL Test (RooCode + LM Studio Coding & Debugging) https://www.youtube.com/watch?v=KUghIvUdvu4
LJ Sat 12 Jul 2025 11:37:51 BST
https://www.reddit.com/r/LocalLLaMA/comments/1jtwcdo/guide_for_quickly_setting_up_aider_qwq_and_qwen/
r/LocalLLaMA • 3 mo. ago • No-Statement-0001 (llama.cpp)
Guide for quickly setting up aider, QwQ and Qwen Coder
I wrote a guide for setting up a 100% local coding co-pilot with QwQ as the architect model and Qwen Coder as the editor. The focus of the guide is on the trickiest part, which is configuring everything to work together.
This guide uses QwQ and Qwen Coder 32B as those can fit in a 24GB GPU. It uses llama-swap so QwQ and Qwen Coder are swapped in and out during aider's architect and editing phases. The guide also has settings for dual 24GB GPUs where both models can be used without swapping.
The original version is here: https://github.com/mostlygeek/llama-swap/tree/main/examples/aider-qwq-coder.
Here's what you need:
aider - installation docs
llama-server - download latest release
llama-swap - download latest release
QwQ 32B and Qwen Coder 2.5 32B models
24GB VRAM video card
Running aider
The goal is getting this command line to work:
aider --architect \
--no-show-model-warnings \
--model openai/QwQ \
--editor-model openai/qwen-coder-32B \
--model-settings-file aider.model.settings.yml \
--openai-api-key "sk-na" \
--openai-api-base "http://10.0.1.24:8080/v1" \
Set --openai-api-base to the IP and port where your llama-swap is running.
Create an aider model settings file
# aider.model.settings.yml
#
# !!! important: model names must match llama-swap configuration names !!!
#
- name: "openai/QwQ"
edit_format: diff
extra_params:
max_tokens: 16384
top_p: 0.95
top_k: 40
presence_penalty: 0.1
repetition_penalty: 1
num_ctx: 16384
use_temperature: 0.6
reasoning_tag: think
weak_model_name: "openai/qwen-coder-32B"
editor_model_name: "openai/qwen-coder-32B"
- name: "openai/qwen-coder-32B"
edit_format: diff
extra_params:
max_tokens: 16384
top_p: 0.8
top_k: 20
repetition_penalty: 1.05
use_temperature: 0.6
reasoning_tag: think
editor_edit_format: editor-diff
editor_model_name: "openai/qwen-coder-32B"
llama-swap configuration
# config.yaml
# The parameters are tweaked to fit model+context into 24GB VRAM GPUs
models:
"qwen-coder-32B":
proxy: "http://127.0.0.1:8999"
cmd: >
/path/to/llama-server
--host 127.0.0.1 --port 8999 --flash-attn --slots
--ctx-size 16000
--cache-type-k q8_0 --cache-type-v q8_0
-ngl 99
--model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
"QwQ":
proxy: "http://127.0.0.1:9503"
cmd: >
/path/to/llama-server
--host 127.0.0.1 --port 9503 --flash-attn --metrics --slots
--cache-type-k q8_0 --cache-type-v q8_0
--ctx-size 32000
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
--temp 0.6 --repeat-penalty 1.1 --dry-multiplier 0.5
--min-p 0.01 --top-k 40 --top-p 0.95
-ngl 99
--model /mnt/nvme/models/bartowski/Qwen_QwQ-32B-Q4_K_M.gguf
Advanced, Dual GPU Configuration
If you have dual 24GB GPUs you can use llama-swap profiles to avoid swapping between QwQ and Qwen Coder.
In llama-swap's configuration file:
add a profiles section with aider as the profile name
using the env field to specify the GPU IDs for each model
# config.yaml
# Add a profile for aider
profiles:
aider:
- qwen-coder-32B
- QwQ
models:
"qwen-coder-32B":
# manually set the GPU to run on
env:
- "CUDA_VISIBLE_DEVICES=0"
proxy: "http://127.0.0.1:8999"
cmd: /path/to/llama-server ...
"QwQ":
# manually set the GPU to run on
env:
- "CUDA_VISIBLE_DEVICES=1"
proxy: "http://127.0.0.1:9503"
cmd: /path/to/llama-server ...
Append the profile tag, aider:, to the model names in the model settings file
# aider.model.settings.yml
- name: "openai/aider:QwQ"
weak_model_name: "openai/aider:qwen-coder-32B-aider"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
- name: "openai/aider:qwen-coder-32B"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
Run aider with:
$ aider --architect \
--no-show-model-warnings \
--model openai/aider:QwQ \
--editor-model openai/aider:qwen-coder-32B \
--config aider.conf.yml \
--model-settings-file aider.model.settings.yml
--openai-api-key "sk-na" \
--openai-api-base "http://10.0.1.24:8080/v1"
https://www.reddit.com/r/LocalLLaMA/comments/1m7ci3s/howto_use_qwen3coder_or_any_other_llm_with_claude/
Here's a simple way for Claude Code users to switch from the costly Claude models to the newly released SOTA open-source/weights coding model, Qwen3-Coder, via OpenRouter using LiteLLM on your local machine.
This process is quite universal and can be easily adapted to suit your needs. Feel free to explore other models (including local ones) as well as different providers and coding agents.
I'm sharing what works for me. This guide is set up so you can just copy and paste the commands into your terminal.
1. Clone the official LiteLLM repo:
git clone https://github.com/BerriAI/litellm.git
cd litellm
2. Create an .env file with your OpenRouter API key (make sure to insert your own API key!):
cat <<\EOF >.env
LITELLM_MASTER_KEY = "sk-1234"
# OpenRouter
OPENROUTER_API_KEY = "sk-or-v1-…" # 🚩
EOF
3. Create a config.yaml file that replaces Anthropic models with Qwen3-Coder (with all the recommended parameters):
cat <<\EOF >config.yaml
model_list:
- model_name: "anthropic/*"
litellm_params:
model: "openrouter/qwen/qwen3-coder" # Qwen/Qwen3-Coder-480B-A35B-Instruct
max_tokens: 65536
repetition_penalty: 1.05
temperature: 0.7
top_k: 20
top_p: 0.8
EOF
4. Create a docker-compose.yml file that loads config.yaml (it's easier to just create a finished one with all the required changes than to edit the original file):
cat <<\EOF >docker-compose.yml
services:
litellm:
build:
context: .
args:
target: runtime
############################################################################
command:
- "--config=/app/config.yaml"
container_name: litellm
hostname: litellm
image: ghcr.io/berriai/litellm:main-stable
restart: unless-stopped
volumes:
- ./config.yaml:/app/config.yaml
############################################################################
ports:
- "4000:4000" # Map the container port to the host, change the host port if necessary
environment:
DATABASE_URL: "postgresql://llmproxy:dbpassword9090@db:5432/litellm"
STORE_MODEL_IN_DB: "True" # allows adding models to proxy via UI
env_file:
- .env # Load local .env file
depends_on:
- db # Indicates that this service depends on the 'db' service, ensuring 'db' starts first
healthcheck: # Defines the health check configuration for the container
test: [ "CMD-SHELL", "wget --no-verbose --tries=1 http://localhost:4000/health/liveliness || exit 1" ] # Command to execute for health check
interval: 30s # Perform health check every 30 seconds
timeout: 10s # Health check command times out after 10 seconds
retries: 3 # Retry up to 3 times if health check fails
start_period: 40s # Wait 40 seconds after container start before beginning health checks
db:
image: postgres:16
restart: always
container_name: litellm_db
environment:
POSTGRES_DB: litellm
POSTGRES_USER: llmproxy
POSTGRES_PASSWORD: dbpassword9090
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data # Persists Postgres data across container restarts
healthcheck:
test: ["CMD-SHELL", "pg_isready -d litellm -U llmproxy"]
interval: 1s
timeout: 5s
retries: 10
volumes:
postgres_data:
name: litellm_postgres_data # Named volume for Postgres data persistence
EOF
5. Build and run LiteLLM (this is important, as some required fixes are not yet in the published image as of 2025-07-23):
docker compose up -d --build
6. Export environment variables that make Claude Code use Qwen3-Coder via LiteLLM (remember to execute this before starting Claude Code or include it in your shell profile (.zshrc, .bashrc, etc.) for persistence):
export ANTHROPIC_AUTH_TOKEN=sk-1234
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_MODEL=openrouter/qwen/qwen3-coder
export ANTHROPIC_SMALL_FAST_MODEL=openrouter/qwen/qwen3-coder
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 # Optional: Disables telemetry, error reporting, and auto-updates
7. Start Claude Code and it'll use Qwen3-Coder via OpenRouter instead of the expensive Claude models (you can check with the /model command that it's using a custom model):
claude
8. Optional: Add an alias to your shell profile (.zshrc, .bashrc, etc.) to make it easier to use (e.g. qlaude for "Claude with Qwen"):
alias qlaude='ANTHROPIC_AUTH_TOKEN=sk-1234 ANTHROPIC_BASE_URL=http://localhost:4000 ANTHROPIC_MODEL=openrouter/qwen/qwen3-coder ANTHROPIC_SMALL_FAST_MODEL=openrouter/qwen/qwen3-coder claude'
Have fun and happy coding!
PS: There are other ways to do this using dedicated Claude Code proxies, of which there are quite a few on GitHub. Before implementing this with LiteLLM, I reviewed some of them, but they all had issues, such as not handling the recommended inference parameters. I prefer using established projects with a solid track record and a large user base, which is why I chose LiteLLM. Open Source offers many options, so feel free to explore other projects and find what works best for you.
https://gist.github.com/ivanfioravanti/44b4284be930b3c340cc1696d60c6143
@ivanfioravanti
ivanfioravanti/mlx_memory.sh
Created 6 months ago
Script to set MLX memory limits
mlx_memory.sh
#!/usr/bin/env bash
# Default values for percentages
DEFAULT_WIRED_LIMIT_PERCENT=85
DEFAULT_WIRED_LWM_PERCENT=75
# Read input parameters or use default values
WIRED_LIMIT_PERCENT=${1:-$DEFAULT_WIRED_LIMIT_PERCENT}
WIRED_LWM_PERCENT=${2:-$DEFAULT_WIRED_LWM_PERCENT}
# Validate inputs are within 0-100
if [[ $WIRED_LIMIT_PERCENT -lt 0 || $WIRED_LIMIT_PERCENT -gt 100 || $WIRED_LWM_PERCENT -lt 0 || $WIRED_LWM_PERCENT -gt 100 ]]; then
echo "Error: Percentages must be between 0 and 100."
exit 1
fi
# Get the total memory in MB
TOTAL_MEM_MB=$(($(sysctl -n hw.memsize) / 1024 / 1024))
# Calculate the memory limits
WIRED_LIMIT_MB=$(($TOTAL_MEM_MB * $WIRED_LIMIT_PERCENT / 100))
WIRED_LWM_MB=$(($TOTAL_MEM_MB * $WIRED_LWM_PERCENT / 100))
# Display the calculated values
echo "Total memory: $TOTAL_MEM_MB MB"
echo "Maximum limit (iogpu.wired_limit_mb): $WIRED_LIMIT_MB MB ($WIRED_LIMIT_PERCENT%)"
echo "Lower bound (iogpu.wired_lwm_mb): $WIRED_LWM_MB MB ($WIRED_LWM_PERCENT%)"
# Apply the values with sysctl
sudo sysctl -w iogpu.wired_limit_mb=$WIRED_LIMIT_MB
sudo sysctl -w iogpu.wired_lwm_mb=$WIRED_LWM_MB
ivanfioravanti (author) commented on Jan 4:
Note: use at your own risk! I push it even more when needed.
Usage Examples:
Use default values (85 and 75):
./mlx_memory.sh
Provide custom percentages (e.g., 90 and 80):
./mlx_memory.sh 90 80
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed.
https://huggingface.co/Intel/Qwen3-235B-A22B-Thinking-2507-gguf-q2ks-mixed-AutoRound
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l models/Qwen3-235B-A22B-Thinking-2507-128x10B-Q2_K_S-0000*
-rw-r--r--@ 1 ljubomir staff 46G 27 Jul 19:19 models/Qwen3-235B-A22B-Thinking-2507-128x10B-Q2_K_S-00001-of-00002.gguf
-rw-r--r--@ 1 ljubomir staff 28G 27 Jul 21:16 models/Qwen3-235B-A22B-Thinking-2507-128x10B-Q2_K_S-00002-of-00002.gguf
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
git pull
mviv build{,.1}
unset CC CXX
unset LDFLAGS
unset CPPFLAGS
env |egrep 'CC|CXX|FLAGS'
brew install libomp
cmake . -B ./build
cmake --build build --config Release -j
sudo sysctl iogpu.wired_limit_mb=88000
build/bin/llama-server --model models/Qwen3-235B-A22B-Thinking-2507-128x10B-Q2_K_S-00001-of-00002.gguf --temp 0.6 --top_k 20 --top_p 0.95 --min_p 0 --ctx-size 32768 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# access on http://127.0.0.1:8080
LJ Mon 28 Jul 2025 08:41:59 BST
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF
Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05.
Adequate Output Length: We recommend using an output length of 65,536 tokens for most queries, which is adequate for instruct models.
https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally#how-to-fit-long-context-256k-to-1m
📐How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, albeit slightly slower.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
We also uploaded 1 million context length GGUFs via YaRN scaling here.
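A back-of-envelope for why KV-cache quantization matters at these context lengths (a sketch; the layer/head dimensions below are illustrative placeholders, not the actual Qwen3-Coder-30B-A3B config):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim *
# bytes_per_elem * context_tokens. Dimensions below are illustrative.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx

ctx = 262144                                  # 256K context
f16 = kv_cache_bytes(48, 4, 128, ctx, 2)      # f16 = 2 bytes/elem
q8 = kv_cache_bytes(48, 4, 128, ctx, 1)       # q8_0 ~ 1 byte/elem (approx,
                                              # ignoring block-scale overhead)
print(f"f16: {f16 / 2**30:.1f} GiB, q8_0: ~{q8 / 2**30:.1f} GiB")
```

So at these hypothetical dims, q8_0 KV cache halves the cache footprint, which is what makes 256K+ contexts fit at all.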
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf
-rw-r--r--@ 1 ljubomir staff 16G 31 Jul 19:48 models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
git pull
mviv build{,.1}
unset CC CXX
unset LDFLAGS
unset CPPFLAGS
env |egrep 'CC|CXX|FLAGS'
brew install libomp
cmake . -B ./build
cmake --build build --config Release -j
sudo sysctl iogpu.wired_limit_mb=88000
# extend context 256K (262144) -> 1M (1048576), flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --ctx-size 1048576 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
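Sanity check on the YaRN arithmetic in that command line (nothing llama.cpp-specific, just the multiplication):

```python
# Effective context under YaRN = rope-scale * original training context.
rope_scale = 4
yarn_orig_ctx = 262144            # 256K native context
ctx_size = rope_scale * yarn_orig_ctx
print(ctx_size)  # 1048576 == 1M, matching --ctx-size above
```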
LJ Sat 2 Aug 2025 06:22:24 BST
MetaStoneTec/XBai-o4
https://huggingface.co/MetaStoneTec/XBai-o4
https://huggingface.co/mradermacher/XBai-o4-GGUF
https://x.com/JacksonAtkinsX/status/1951133405271257355
https://github.com/MetaStone-AI/XBai-o4/blob/main/test/task.py
class Infer_Task:
def __init__(self, model_dir, score_api_url, response_api_url, branch=3, temperature=0.7, max_tokens=1024*32):
sudo sysctl iogpu.wired_limit_mb=88000
# flash attention cached; access on http://127.0.0.1:8081
build/bin/llama-server --port 8081 --model models/XBai-o4.Q6_K.gguf --temp 0.7 --ctx-size 65536 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# extend context 256K (262144) -> 512K (524288), flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --ctx-size 524288 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
Wang Magic @WangMagic_
Thanks for your interest in our work. For an optimal experience during inference, we recommend setting the temperature to 0.6, top_p to 0.95, and the output token limit to 32768 for better performance (which can also be extended to 64k with yarn).
# flash attention cached; access on http://127.0.0.1:8081
build/bin/llama-server --port 8081 --model models/XBai-o4.Q6_K.gguf --temp 0.6 --top_p 0.95 --ctx-size 32768 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# extend context 256K (262144) -> 512K (524288) via YaRN; flash attention with q8_0 KV cache; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --ctx-size 524288 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
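llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so the XBai-o4 authors' recommended settings can also be applied per request rather than on the command line. A minimal sketch (payload fields follow the standard chat-completions API; the prompt and helper names are illustrative):

```python
import json
import urllib.request

# Sampling settings recommended by the XBai-o4 authors:
# temperature 0.6, top_p 0.95, output limit 32768 tokens.
def build_payload(prompt: str) -> dict:
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 32768,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8081") -> dict:
    # POST to the llama-server instance started above on port 8081.
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Per-request parameters override whatever defaults the server was launched with, which is handy when two models with different recommended samplers run on ports 8080 and 8081.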
(torch311) ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-server --help
----- common params -----
-h, --help, --usage print usage and exit
--version show version and build info
--completion-bash print source-able bash completion script for llama.cpp
--verbose-prompt print a verbose prompt before generation (default: false)
-t, --threads N number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C, --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr, --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--prio N set process/thread priority : low(-1), normal(0), medium(1), high(2),
realtime(3) (default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb, --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--prio-batch N set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
(default: 0)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-c, --ctx-size N size of the prompt context (default: 4096, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
--swa-full use full-size SWA cache (default: false)
[(more
info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
(env: LLAMA_ARG_SWA_FULL)
--kv-unified, -kvu use single unified KV buffer for the KV cache of all sequences
(default: false)
[(more info)](https://github.com/ggml-org/llama.cpp/pull/14363)
(env: LLAMA_ARG_KV_SPLIT)
-fa, --flash-attn enable Flash Attention (default: disabled)
(env: LLAMA_ARG_FLASH_ATTN)
--no-perf disable internal libllama performance timings (default: false)
(env: LLAMA_ARG_NO_PERF)
-e, --escape process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape do not process escape sequences
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N RoPE context scaling factor, expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from
model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
(env: LLAMA_ARG_YARN_BETA_FAST)
-nkvo, --no-kv-offload disable KV offload
(env: LLAMA_ARG_NO_KV_OFFLOAD)
-nr, --no-repack disable weight repacking
(env: LLAMA_ARG_NO_REPACK)
-ctk, --cache-type-k TYPE KV cache data type for K
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE KV cache data type for V
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N KV cache defragmentation threshold (default: 0.1, < 0 - disabled)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N number of parallel sequences to decode (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK)
--no-mmap do not memory-map model (slower load but may reduce pageouts if not
using mlock)
(env: LLAMA_ARG_NO_MMAP)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-dev, --device <dev1,dev2,..> comma-separated list of devices to use for offloading (none = don't
offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE)
--list-devices print list of available devices and exit
--override-tensor, -ot <tensor name pattern>=<buffer type>,...
override tensor buffer type
--cpu-moe use CPU for Mixture of Experts (MoE) weights
(env: LLAMA_ARG_CPU_MOE)
-ngl, --gpu-layers, --n-gpu-layers N number of layers to store in VRAM
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row} how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE advanced option to override model metadata by key. may be specified
multiple times.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false
--no-op-offload disable offloading host tensor operations to device (default: false)
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE path to LoRA adapter with user defined scaling (can be repeated to use
multiple adapters)
--control-vector FNAME add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE add a control vector with user defined scaling SCALE
note: this argument can be repeated to add multiple scaled control
vectors
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME model path (default: `models/$filename` with filename from `--hf-file`
or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-hf, -hfr, --hf-repo <user>/<model>[:quant]
Hugging Face model repository; quant is optional, case-insensitive,
default to Q4_K_M, or falls back to the first file in the repo if
Q4_K_M doesn't exist.
mmproj is also downloaded automatically if available. to disable, add
--no-mmproj
example: unsloth/phi-4-GGUF:q4_k_m
(default: unused)
(env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
Same as --hf-repo, but for the draft model (default: unused)
(env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE Hugging Face model file. If specified, it will override the quant in
--hf-repo (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]
Hugging Face model repository for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE Hugging Face model file for the vocoder model (default: unused)
(env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
--log-disable Log disable
--log-file FNAME Log to file
--log-colors Enable colored logging
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose Set verbosity level to infinity (i.e. log all messages, useful for
debugging)
--offline Offline mode: forces use of cache, prevents network access
(env: LLAMA_OFFLINE)
-lv, --verbosity, --log-verbosity N Set the verbosity threshold. Messages with a higher verbosity will be
ignored.
(env: LLAMA_LOG_VERBOSITY)
--log-prefix Enable prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps Enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE KV cache data type for K for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE KV cache data type for V for the draft model
allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
(default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)
----- sampling params -----
--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default:
penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s, --seed SEED RNG seed (default: -1, use random seed for -1)
--sampling-seq, --sampler-seq SEQUENCE
simplified sequence for samplers that will be used (default:
edskypmxt)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--temp N temperature (default: 0.8)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--xtc-probability N xtc probability (default: 0.0, 0.0 = disabled)
--xtc-threshold N xtc threshold (default: 0.1, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dry-multiplier N set DRY sampling multiplier (default: 0.0, 0.0 = disabled)
--dry-base N set DRY sampling base value (default: 1.75)
--dry-allowed-length N set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
context size)
--dry-sequence-breaker STRING add sequence breaker for DRY sampling, clearing out default breakers
('\n', ':', '"', '*') in the process; use "none" to not use any
sequence breakers
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling.
Top K, Nucleus and Locally Typical samplers are ignored if used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l, --logit-bias TOKEN_ID(+/-)BIAS modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir) (default: '')
--grammar-file FNAME file to read grammar from
-j, --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
`{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
-jf, --json-schema-file FILE File containing a JSON schema to constrain generations
(https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
----- example-specific params -----
--no-context-shift disables context shift on infinite text generation (default: disabled)
(env: LLAMA_ARG_NO_CONTEXT_SHIFT)
-r, --reverse-prompt PROMPT halt generation at PROMPT, return control in interactive mode
-sp, --special special tokens output enabled (default: false)
--no-warmup skip warming up the model with an empty run
--spm-infill use Suffix/Prefix/Middle pattern for infill (instead of
Prefix/Suffix/Middle) as some models prefer this. (default: disabled)
--pooling {none,mean,cls,last,rank} pooling type for embeddings, use model default if unspecified
(env: LLAMA_ARG_POOLING)
-cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: enabled)
(env: LLAMA_ARG_CONT_BATCHING)
-nocb, --no-cont-batching disable continuous batching
(env: LLAMA_ARG_NO_CONT_BATCHING)
--mmproj FILE path to a multimodal projector file. see tools/mtmd/README.md
note: if -hf is used, this argument can be omitted
(env: LLAMA_ARG_MMPROJ)
--mmproj-url URL URL to a multimodal projector file. see tools/mtmd/README.md
(env: LLAMA_ARG_MMPROJ_URL)
--no-mmproj explicitly disable multimodal projector, useful when using -hf
(env: LLAMA_ARG_NO_MMPROJ)
--no-mmproj-offload do not offload multimodal projector to GPU
(env: LLAMA_ARG_NO_MMPROJ_OFFLOAD)
-a, --alias STRING set alias for model name (to be used by REST API)
(env: LLAMA_ARG_ALIAS)
--host HOST ip address to listen, or bind to an UNIX socket if the address ends
with .sock (default: 127.0.0.1)
(env: LLAMA_ARG_HOST)
--port PORT port to listen (default: 8080)
(env: LLAMA_ARG_PORT)
--path PATH path to serve static files from (default: )
(env: LLAMA_ARG_STATIC_PATH)
--api-prefix PREFIX prefix path the server serves from, without the trailing slash
(default: )
(env: LLAMA_ARG_API_PREFIX)
--no-webui Disable the Web UI (default: enabled)
(env: LLAMA_ARG_NO_WEBUI)
--embedding, --embeddings restrict to only support embedding use case; use only with dedicated
embedding models (default: disabled)
(env: LLAMA_ARG_EMBEDDINGS)
--reranking, --rerank enable reranking endpoint on server (default: disabled)
(env: LLAMA_ARG_RERANKING)
--api-key KEY API key to use for authentication (default: none)
(env: LLAMA_API_KEY)
--api-key-file FNAME path to file containing API keys (default: none)
--ssl-key-file FNAME path to file a PEM-encoded SSL private key
(env: LLAMA_ARG_SSL_KEY_FILE)
--ssl-cert-file FNAME path to file a PEM-encoded SSL certificate
(env: LLAMA_ARG_SSL_CERT_FILE)
--chat-template-kwargs STRING sets additional params for the json template parser
(env: LLAMA_CHAT_TEMPLATE_KWARGS)
-to, --timeout N server read/write timeout in seconds (default: 600)
(env: LLAMA_ARG_TIMEOUT)
--threads-http N number of threads used to process HTTP requests (default: -1)
(env: LLAMA_ARG_THREADS_HTTP)
--cache-reuse N min chunk size to attempt reusing from the cache via KV shifting
(default: 0)
[(card)](https://ggml.ai/f0.png)
(env: LLAMA_ARG_CACHE_REUSE)
--metrics enable prometheus compatible metrics endpoint (default: disabled)
(env: LLAMA_ARG_ENDPOINT_METRICS)
--slots enable slots monitoring endpoint (default: disabled)
(env: LLAMA_ARG_ENDPOINT_SLOTS)
--props enable changing global properties via POST /props (default: disabled)
(env: LLAMA_ARG_ENDPOINT_PROPS)
--no-slots disables slots monitoring endpoint
(env: LLAMA_ARG_NO_ENDPOINT_SLOTS)
--slot-save-path PATH path to save slot kv cache (default: disabled)
--jinja use jinja template for chat (default: disabled)
(env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT controls whether thought tags are allowed and/or extracted from the
response, and in which format they're returned; one of:
- none: leaves thoughts unparsed in `message.content`
- deepseek: puts thoughts in `message.reasoning_content` (except in
streaming mode, which behaves as `none`)
(default: deepseek)
(env: LLAMA_ARG_THINK)
--reasoning-budget N controls the amount of thinking allowed; currently only one of: -1 for
unrestricted thinking budget, or 0 to disable thinking (default: -1)
(env: LLAMA_ARG_THINK_BUDGET)
--chat-template JINJA_TEMPLATE set custom jinja chat template (default: template taken from model's
metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge,
granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys,
llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm,
mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7,
mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world,
smolvlm, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2,
deepseek3, exaone3, exaone4, falcon3, gemma, gigachat, glmedge,
granite, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys,
llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm,
mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7,
mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world,
smolvlm, vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--no-prefill-assistant whether to prefill the assistant's response if the last message is an
assistant message (default: prefill enabled)
when this flag is set, if the last message is an assistant message
then it will be treated as a full message and not prefilled
(env: LLAMA_ARG_NO_PREFILL_ASSISTANT)
-sps, --slot-prompt-similarity SIMILARITY
how much the prompt of a request must match the prompt of a slot in
order to use that slot (default: 0.50, 0.0 = disabled)
--lora-init-without-apply load LoRA adapters without applying them (apply later via POST
/lora-adapters) (default: disabled)
--draft-max, --draft, --draft-n N number of tokens to draft for speculative decoding (default: 16)
(env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding
(default: 0)
(env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P minimum speculative decoding probability (greedy) (default: 0.8)
(env: LLAMA_ARG_DRAFT_P_MIN)
-cd, --ctx-size-draft N size of the prompt context for the draft model (default: 0, 0 = loaded
from model)
(env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..> comma-separated list of devices to use for offloading the draft model
(none = don't offload)
use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
number of layers to store in VRAM for the draft model
(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md, --model-draft FNAME draft model for speculative decoding (default: unused)
(env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT translate the string in TARGET into DRAFT if the draft model and main
model are not compatible
-mv, --model-vocoder FNAME vocoder model for audio generation (default: unused)
--tts-use-guide-tokens Use guide tokens to improve TTS word recall
--embd-bge-small-en-default use default bge-small-en-v1.5 model (note: can download weights from
the internet)
--embd-e5-small-en-default use default e5-small-v2 model (note: can download weights from the
internet)
--embd-gte-small-default use default gte-small model (note: can download weights from the
internet)
--fim-qwen-1.5b-default use default Qwen 2.5 Coder 1.5B (note: can download weights from the
internet)
--fim-qwen-3b-default use default Qwen 2.5 Coder 3B (note: can download weights from the
internet)
--fim-qwen-7b-default use default Qwen 2.5 Coder 7B (note: can download weights from the
internet)
--fim-qwen-7b-spec use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can
download weights from the internet)
--fim-qwen-14b-spec use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note:
can download weights from the internet)
Kwaipilot/KAT-V1-40B
https://huggingface.co/Kwaipilot/KAT-V1-40B
mradermacher/KAT-V1-40B-GGUF
https://huggingface.co/mradermacher/KAT-V1-40B-GGUF
https://huggingface.co/forestliutc/UloRL
Inference Parameters
128k setting:
temperature=0.85
top_p=0.95
top_k=20
max_tokens=131072
140k setting (with YaRN):
temperature=0.85
top_p=0.95
top_k=20
max_tokens=143360
rope_scaling: {
"rope_type": "yarn",
"factor": 1.5,
"original_max_position_embeddings": 95232
}
Download the GGUF file:
https://huggingface.co/mradermacher/UloRL-GGUF
Rename it to the full model name:
(torch311) ljubomir@macbook2(:):~/llama.cpp$ mviv ~/Downloads/UloRL.Q6_K.gguf models/Qwen3-30B-A3B-UloRL.Q6_K.gguf
/Users/ljubomir/Downloads/UloRL.Q6_K.gguf -> models/Qwen3-30B-A3B-UloRL.Q6_K.gguf
(torch311) ljubomir@macbook2(:):~/llama.cpp$ ls -t ~/LJ-books-papers/
UloRL-An_Ultra-Long_Output_Reinforcement_Learning_Approach_for_Advancing_Large_Language_Models_Reasoning_Abilities-jul2025-arxiv-2507.19766v1.pdf
(torch311) ljubomir@macbook2(:):~/llama.cpp$ ls -lt ~/LJ-books-papers/
-rw-r--r--@ 1 ljubomir staff 1588325 3 Aug 21:29 UloRL-An_Ultra-Long_Output_Reinforcement_Learning_Approach_for_Advancing_Large_Language_Models_Reasoning_Abilities-jul2025-arxiv-2507.19766v1.pdf
# flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-30B-A3B-UloRL.Q6_K.gguf --temp 0.8 --top_k 20 --top_p 0.95 --min_p 0.05 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
# extend context via YaRN per the model card: 95232 -> ~143K (factor 1.5); flash attention with q8_0 KV cache; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-30B-A3B-UloRL.Q6_K.gguf --temp 0.8 --top_k 20 --top_p 0.85 --ctx-size 131072 --rope-scaling yarn --rope-scale 1.5 --yarn-orig-ctx 95232 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
https://z.ai/blog/glm-4.5
https://github.com/zai-org/GLM-4.5
https://huggingface.co/zai-org/GLM-4.5-Air
https://huggingface.co/unsloth/GLM-4.5-Air-GGUF
(torch311) ljubomir@macbook2(:):~/llama.cpp$ l models/GLM-*
-rw-r--r--@ 1 ljubomir staff 46G 6 Aug 20:48 models/GLM-4.5-Air-IQ4_NL-00001-of-00002.gguf
-rw-r--r--@ 1 ljubomir staff 12G 6 Aug 19:10 models/GLM-4.5-Air-IQ4_NL-00002-of-00002.gguf
# flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/GLM-4.5-Air-IQ4_NL-00001-of-00002.gguf --temp 0.8 --top_k 40 --top_p 0.95 --min_p 0.05 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
(torch311) ljubomir@macbook2(:):~/llama.cpp$ cat ~/LJ-what-next-job-everything-q.txt |xclip_put
(torch311) ljubomir@macbook2(:):~/llama.cpp$ srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 205586
slot update_slots: id 0 | task 0 | input truncated, n_ctx = 131072, n_keep = 0, n_left = 131072, n_prompt_tokens = 74514
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.027485
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.054970
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
[... kv cache rm / prompt processing progress lines repeat in 2048-token steps, n_past = 6144 ... 71680 ...]
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 73728, n_tokens = 2048, progress = 0.989452
slot update_slots: id 0 | task 0 | kv cache rm [73728, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 74514, n_tokens = 786, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 74514, n_tokens = 786
slot release: id 0 | task 0 | stop processing: n_past = 77593, truncated = 1
slot print_timing: id 0 | task 0 |
prompt eval time = 4040281.84 ms / 74514 tokens ( 54.22 ms per token, 18.44 tokens per second)
eval time = 912694.36 ms / 3080 tokens ( 296.33 ms per token, 3.37 tokens per second)
total time = 4952976.21 ms / 77594 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
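The tokens-per-second figures in the timing log are just token counts divided by elapsed seconds; reproducing them from the numbers above:

```python
# Throughput as reported by llama-server: tokens / elapsed seconds.
def tps(tokens: int, elapsed_ms: float) -> float:
    return tokens / (elapsed_ms / 1000.0)

# Numbers from the print_timing log above:
print(round(tps(74514, 4040281.84), 2))  # prompt eval -> 18.44 tok/s
print(round(tps(3080, 912694.36), 2))    # generation  -> 3.37 tok/s
```

So a 74K-token prompt took about 67 minutes of prefill at 18.44 tok/s, while decode ran at 3.37 tok/s.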
https://x.com/Zai_org/status/1953340512963825732
Z.ai @Zai_org
In the spirit of full transparency, we are officially sharing the default sampling configurations used for Z.ai.
Z.ai Chat: This configuration is optimized to encourage more creative and diverse responses, using a temperature of 0.95 and a top_p of 0.7.
API & Leaderboard Testing: This setup is calibrated for more focused and predictable outputs with a temperature of 0.6 and a top_p of 0.95.
# flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/GLM-4.5-Air-IQ4_NL-00001-of-00002.gguf --temp 0.95 --top_k 40 --top_p 0.7 --min_p 0.05 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
(torch311) ljubomir@macbook2(:):~/llama.cpp$ cat ~/LJ-what-next-job-everything-q.txt |xclip_put
LJ Thu 7 Aug 2025 07:28:17 BST
https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
Processing Ultra-Long Texts
To support ultra-long context processing (up to 1 million tokens), we integrate two key techniques:
Dual Chunk Attention (DCA): A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
https://arxiv.org/abs/2402.17463
[Submitted on 27 Feb 2024 (v1), last revised 29 May 2024 (this version, v2)]
Training-Free Long-Context Scaling of Large Language Models
MInference: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.
Together, these innovations significantly improve both generation quality and inference efficiency for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a 3× speedup compared to standard attention implementations.
https://arxiv.org/abs/2407.02490
[Submitted on 2 Jul 2024 (v1), last revised 30 Oct 2024 (this version, v2)]
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
For full technical details, see the Qwen2.5-1M Technical Report.
https://arxiv.org/abs/2501.15383
[Submitted on 26 Jan 2025]
Qwen2.5-1M Technical Report
How to Enable 1M Token Context
To effectively process a 1 million token context, users will require approximately 240 GB of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
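As a rough sanity check on that ~240 GB figure, the KV cache alone can be estimated from the model shape. Layers (48) and KV heads (4) match the Qwen3-30B-A3B card further down; head_dim=128 and a bf16 (2-byte) cache are my assumptions:

```python
# Rough KV-cache sizing for Qwen3-30B-A3B at ~1M context.
# 48 layers and 4 KV heads are from the model card; head_dim=128 and a
# bf16 (2-byte) cache are assumptions for illustration only.
layers, kv_heads, head_dim, dtype_bytes = 48, 4, 128, 2
tokens = 1_010_000  # matches --max-model-len below

# K and V each store (kv_heads * head_dim) values per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
kv_cache_gb = kv_bytes_per_token * tokens / 1e9

weights_gb = 30.5e9 * 2 / 1e9  # ~61 GB for bf16 weights
print(f"KV cache ~ {kv_cache_gb:.0f} GB, weights ~ {weights_gb:.0f} GB")
```

With roughly 100 GB of KV cache, 61 GB of bf16 weights, and peak activations on top, the ~240 GB guidance looks plausible.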
Step 1: Update Configuration File
Download the model and replace the content of your config.json with config_1m.json, which includes the config for length extrapolation and sparse attention.
export MODELNAME=Qwen3-30B-A3B-Thinking-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
Step 2: Launch Model Server
After updating the config, proceed with either vLLM or SGLang for serving the model.
Option 1: Using vLLM
To run Qwen with 1M context support:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Then launch the server with Dual Chunk Flash Attention enabled:
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-30B-A3B-Thinking-2507 \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill \
--max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1 \
--gpu-memory-utilization 0.85 \
--enable-reasoning --reasoning-parser deepseek_r1
Key Parameters
Parameter Purpose
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN Enables the custom attention kernel for long-context efficiency
--max-model-len 1010000 Sets maximum context length to ~1M tokens
--enable-chunked-prefill Allows chunked prefill for very long inputs (avoids OOM)
--max-num-batched-tokens 131072 Controls batch size during prefill; balances throughput and memory
--enforce-eager Disables CUDA graph capture (required for dual chunk attention)
--max-num-seqs 1 Limits concurrent sequences due to extreme memory usage
--gpu-memory-utilization 0.85 Set the fraction of GPU memory to be used for the model executor
Option 2: Using SGLang
First, clone and install the specialized branch:
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
Launch the server with DCA support:
python3 -m sglang.launch_server \
--model-path ./Qwen3-30B-A3B-Thinking-2507 \
--context-length 1010000 \
--mem-frac 0.75 \
--attention-backend dual_chunk_flash_attn \
--tp 4 \
--chunked-prefill-size 131072 \
--reasoning-parser deepseek-r1
Key Parameters
Parameter Purpose
--attention-backend dual_chunk_flash_attn Activates Dual Chunk Flash Attention
--context-length 1010000 Defines max input length
--mem-frac 0.75 The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors.
--tp 4 Tensor parallelism size (matches model sharding)
--chunked-prefill-size 131072 Prefill chunk size for handling long inputs without OOM
Troubleshooting:
Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
The VRAM reserved for the KV cache is insufficient.
vLLM: Consider reducing the max_model_len or increasing the tensor_parallel_size and gpu_memory_utilization. Alternatively, you can reduce max_num_batched_tokens, although this may significantly slow down inference.
SGLang: Consider reducing the context-length or increasing the tp and mem-frac. Alternatively, you can reduce chunked-prefill-size, although this may significantly slow down inference.
Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
The VRAM reserved for activation weights is insufficient. You can try lowering gpu_memory_utilization or mem-frac, but be aware that this might reduce the VRAM available for the KV cache.
Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxx xtokens) is longer than the model's context length (xxx tokens)."
The input is too lengthy. Consider using a shorter sequence or increasing the max_model_len or context-length.
Long-Context Performance
We test the model on a 1M version of the RULER benchmark.
Model Name Acc avg 4k 8k 16k 32k 64k 96k 128k 192k 256k 384k 512k 640k 768k 896k 1000k
Qwen3-30B-A3B (Thinking) 70.6 96.7 94.4 94.5 93.4 82.6 78.4 74.5 70.6 63.1 60.0 56.3 51.0 48.4 47.2 48.2
Qwen3-30B-A3B-Thinking-2507 (Full Attention) 91.4 99.6 100.0 99.8 99.2 97.4 96.8 96.8 94.8 89.4 90.2 84.0 82.6 81.9 80.1 77.5
Qwen3-30B-A3B-Thinking-2507 (Sparse Attention) 91.5 100.0 99.2 99.1 98.5 97.3 97.1 96.9 95.8 89.0 89.3 85.5 84.8 80.0 79.9 79.6
All models are evaluated with Dual Chunk Attention enabled.
Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).
To avoid overly verbose reasoning, we set the thinking budget to 8,192 tokens.
Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. It is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed.
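The two practices above (a standardized answer field, and no thinking content in history) can be sketched in a few lines. The regex and message shapes are my own assumptions about the output format:

```python
import re

def strip_thinking(history: list) -> list:
    """Drop <think>...</think> spans from assistant turns, per the
    'no thinking content in history' practice above."""
    out = []
    for msg in history:
        content = msg["content"]
        if msg["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.S).strip()
        out.append({**msg, "content": content})
    return out

def extract_choice(text: str):
    """Pull the letter from the standardized '"answer": "C"' fragment."""
    m = re.search(r'"answer"\s*:\s*"([A-Z])"', text)
    return m.group(1) if m else None

history = [
    {"role": "user", "content": "Pick one."},
    {"role": "assistant", "content": '<think>hmm</think>{"answer": "C"}'},
]
cleaned = strip_thinking(history)
print(cleaned[1]["content"], extract_choice(cleaned[1]["content"]))
```

Note the provided Jinja2 chat template already does the stripping; this is only needed for frameworks that bypass it.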
Citation
If you find our work helpful, feel free to cite us.
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507
Qwen / Qwen3-30B-A3B-Instruct-2507
Highlights
We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
Substantial gains in long-tail knowledge coverage across multiple languages.
Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
Enhanced capabilities in 256K long-context understanding.
image/jpeg
Model Overview
Qwen3-30B-A3B-Instruct-2507 has the following features:
Type: Causal Language Models
Training Stage: Pretraining & Post-training
Number of Parameters: 30.5B in total and 3.3B activated
Number of Parameters (Non-Embedding): 29.9B
Number of Layers: 48
Number of Attention Heads (GQA): 32 for Q and 4 for KV
Number of Experts: 128
Number of Activated Experts: 8
Context Length: 262,144 natively.
NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output. Meanwhile, specifying enable_thinking=False is no longer required.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
Performance
Deepseek-V3-0324 GPT-4o-0327 Gemini-2.5-Flash Non-Thinking Qwen3-235B-A22B Non-Thinking Qwen3-30B-A3B Non-Thinking Qwen3-30B-A3B-Instruct-2507
Knowledge
MMLU-Pro 81.2 79.8 81.1 75.2 69.1 78.4
MMLU-Redux 90.4 91.3 90.6 89.2 84.1 89.3
GPQA 68.4 66.9 78.3 62.9 54.8 70.4
SuperGPQA 57.3 51.0 54.6 48.2 42.2 53.4
Reasoning
AIME25 46.6 26.7 61.6 24.7 21.6 61.3
HMMT25 27.5 7.9 45.8 10.0 12.0 43.0
ZebraLogic 83.4 52.6 57.9 37.7 33.2 90.0
LiveBench 20241125 66.9 63.7 69.1 62.5 59.4 69.0
Coding
LiveCodeBench v6 (25.02-25.05) 45.2 35.8 40.1 32.9 29.0 43.2
MultiPL-E 82.2 82.7 77.7 79.3 74.6 83.8
Aider-Polyglot 55.1 45.3 44.0 59.6 24.4 35.6
Alignment
IFEval 82.3 83.9 84.3 83.2 83.7 84.7
Arena-Hard v2* 45.6 61.9 58.3 52.0 24.8 69.0
Creative Writing v3 81.6 84.9 84.6 80.4 68.1 86.0
WritingBench 74.5 75.5 80.5 77.0 72.2 85.5
Agent
BFCL-v3 64.7 66.5 66.1 68.0 58.6 65.1
TAU1-Retail 49.6 60.3# 65.2 65.2 38.3 59.1
TAU1-Airline 32.0 42.8# 48.0 32.0 18.0 40.0
TAU2-Retail 71.1 66.7# 64.3 64.9 31.6 57.0
TAU2-Airline 36.0 42.0# 42.5 36.0 18.0 38.0
TAU2-Telecom 34.0 29.8# 16.9 24.6 18.4 12.3
Multilingualism
MultiIF 66.5 70.4 69.4 70.2 70.8 67.9
MMLU-ProX 75.8 76.2 78.3 73.2 65.1 72.0
INCLUDE 80.1 82.1 83.8 75.6 67.8 71.9
PolyMATH 32.2 25.5 41.9 27.0 23.3 43.1
*: For reproducibility, we report the win rates evaluated by GPT-4.1.
#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.
Quickstart
The code for Qwen3-MoE has been merged into the latest Hugging Face transformers, and we advise you to use the latest version of transformers.
With transformers<4.51.0, you will encounter the following error:
KeyError: 'qwen3_moe'
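A quick guard against that KeyError is to compare the installed transformers version against 4.51.0 before loading. The helper below is a minimal sketch using plain tuple comparison (the 4.51.0 threshold comes from the note above):

```python
def version_tuple(v: str) -> tuple:
    # "4.51.0" -> (4, 51, 0); ignores non-numeric suffixes like ".dev0"
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

MIN_TRANSFORMERS = (4, 51, 0)  # first release with the qwen3_moe architecture

def supports_qwen3_moe(installed: str) -> bool:
    """True if the given transformers version can load Qwen3-MoE models."""
    return version_tuple(installed) >= MIN_TRANSFORMERS

print(supports_qwen3_moe("4.50.3"), supports_qwen3_moe("4.51.0"))
```

In practice you would pass `transformers.__version__` to `supports_qwen3_moe`.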
The following code snippet illustrates how to use the model to generate content from given inputs.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:
SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --context-length 262144
vLLM:
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --max-model-len 262144
Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32,768.
For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
Agentic Use
Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of Qwen3's agentic abilities. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.
from qwen_agent.agents import Assistant
# Define LLM
llm_cfg = {
'model': 'Qwen3-30B-A3B-Instruct-2507',
# Use a custom endpoint compatible with OpenAI API:
'model_server': 'http://localhost:8000/v1', # api_base
'api_key': 'EMPTY',
}
# Define Tools
tools = [
{'mcpServers': { # You can specify the MCP configuration file
'time': {
'command': 'uvx',
'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
},
"fetch": {
"command": "uvx",
"args": ["mcp-server-fetch"]
}
}
},
'code_interpreter', # Built-in tools
]
# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)
# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
pass
print(responses)
Processing Ultra-Long Texts
To support ultra-long context processing (up to 1 million tokens), we integrate two key techniques:
Dual Chunk Attention (DCA): A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
https://arxiv.org/abs/2402.17463
[Submitted on 27 Feb 2024 (v1), last revised 29 May 2024 (this version, v2)]
Training-Free Long-Context Scaling of Large Language Models
MInference: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.
Together, these innovations significantly improve both generation quality and inference efficiency for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a 3× speedup compared to standard attention implementations.
https://arxiv.org/abs/2407.02490
[Submitted on 2 Jul 2024 (v1), last revised 30 Oct 2024 (this version, v2)]
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
For full technical details, see the Qwen2.5-1M Technical Report.
How to Enable 1M Token Context
To effectively process a 1 million token context, users will require approximately 240 GB of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
Step 1: Update Configuration File
Download the model and replace the content of your config.json with config_1m.json, which includes the config for length extrapolation and sparse attention.
export MODELNAME=Qwen3-30B-A3B-Instruct-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
Step 2: Launch Model Server
After updating the config, proceed with either vLLM or SGLang for serving the model.
Option 1: Using vLLM
To run Qwen with 1M context support:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
Then launch the server with Dual Chunk Flash Attention enabled:
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-30B-A3B-Instruct-2507 \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill \
--max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1 \
--gpu-memory-utilization 0.85
Key Parameters
Parameter Purpose
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN Enables the custom attention kernel for long-context efficiency
--max-model-len 1010000 Sets maximum context length to ~1M tokens
--enable-chunked-prefill Allows chunked prefill for very long inputs (avoids OOM)
--max-num-batched-tokens 131072 Controls batch size during prefill; balances throughput and memory
--enforce-eager Disables CUDA graph capture (required for dual chunk attention)
--max-num-seqs 1 Limits concurrent sequences due to extreme memory usage
--gpu-memory-utilization 0.85 Set the fraction of GPU memory to be used for the model executor
Option 2: Using SGLang
First, clone and install the specialized branch:
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
Launch the server with DCA support:
python3 -m sglang.launch_server \
--model-path ./Qwen3-30B-A3B-Instruct-2507 \
--context-length 1010000 \
--mem-frac 0.75 \
--attention-backend dual_chunk_flash_attn \
--tp 4 \
--chunked-prefill-size 131072
Key Parameters
Parameter Purpose
--attention-backend dual_chunk_flash_attn Activates Dual Chunk Flash Attention
--context-length 1010000 Defines max input length
--mem-frac 0.75 The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors.
--tp 4 Tensor parallelism size (matches model sharding)
--chunked-prefill-size 131072 Prefill chunk size for handling long inputs without OOM
Troubleshooting:
Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase --mem-fraction-static."
The VRAM reserved for the KV cache is insufficient.
vLLM: Consider reducing the max_model_len or increasing the tensor_parallel_size and gpu_memory_utilization. Alternatively, you can reduce max_num_batched_tokens, although this may significantly slow down inference.
SGLang: Consider reducing the context-length or increasing the tp and mem-frac. Alternatively, you can reduce chunked-prefill-size, although this may significantly slow down inference.
Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
The VRAM reserved for activation weights is insufficient. You can try lowering gpu_memory_utilization or mem-frac, but be aware that this might reduce the VRAM available for the KV cache.
Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxx xtokens) is longer than the model's context length (xxx tokens)."
The input is too lengthy. Consider using a shorter sequence or increasing the max_model_len or context-length.
Long-Context Performance
We test the model on a 1M version of the RULER benchmark.
Model Name Acc avg 4k 8k 16k 32k 64k 96k 128k 192k 256k 384k 512k 640k 768k 896k 1000k
Qwen3-30B-A3B (Non-Thinking) 72.0 97.1 96.1 95.0 92.2 82.6 79.7 76.9 70.2 66.3 61.9 55.4 52.6 51.5 52.0 50.9
Qwen3-30B-A3B-Instruct-2507 (Full Attention) 86.8 98.0 96.7 96.9 97.2 93.4 91.0 89.1 89.8 82.5 83.6 78.4 79.7 77.6 75.7 72.8
Qwen3-30B-A3B-Instruct-2507 (Sparse Attention) 86.8 98.0 97.1 96.3 95.1 93.6 92.5 88.1 87.7 82.9 85.7 80.7 80.0 76.9 75.5 72.2
All models are evaluated with Dual Chunk Attention enabled.
Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).
Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
Citation
If you find our work helpful, feel free to cite us.
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}
Qwen/Qwen3-30B-A3B-Thinking-2507
Clone this model repository
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
# If you want to clone without large files - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
# Make sure hf CLI is installed: pip install -U "huggingface_hub[cli]"
hf download Qwen/Qwen3-30B-A3B-Thinking-2507
https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507/discussions/5
How does this compare to other "1M context" efforts, UloRL and Unsloth?
ljupco 1 minute ago • edited less than a minute ago
Thanks for OSS-ing this - you are making life so much fun! I've only got 88 GB of VRAM (a share of 96 GB RAM) to play with this on a MacBook. Curious to see what transpires... But before that, what's similar and what's different between this, and -
"An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities" at https://huggingface.co/forestliutc/UloRL; or
This by the Unsloth guys https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF
? Thanks for your help - LJ
LJ Sat 9 Aug 2025 07:49:03 BST
https://www.reddit.com/r/LocalLLaMA/comments/1mfzzt4/experience_with_glm45air_claude_code/?chainedPosts=t3_1mkw4ug
r/LocalLLaMA
• 7 days ago Leflakk
Experience with GLM-4.5-Air + claude code?
Discussion
Hi guys,
I am actually running GLM-4.5-Air with vllm (4x3090) and even if it's quite early, I'm quite impressed: the model isn't "lost" and can handle some tasks through cc (python code modifications). There are some errors during execution and the model needs to retry, but I need to do more tests to better understand the limits. I also encounter some context-limit errors, unfortunately.
What is your experience? Any tip is welcome
For info, I use AWQ with the latest (nightly) version of vllm with following cmd:
vllm serve cpatonn/GLM-4.5-Air-AWQ --reasoning-parser glm45 -tp 2 -pp 2 --dtype float16 --max-model-len 70000 --enable-auto-tool-choice --tool-call-parser glm45 --host 127.0.0.1 --port 8123 --api-key xxxx
Then claude-code-router with following config:
{
"LOG": true,
"Providers": [
{
"name": "openai",
"api_base_url": "http://localhost:8123/v1/chat/completions",
"api_key": "xxxx",
"models": ["cpatonn/GLM-4.5-Air-AWQ"]
}
],
"Router": {
"default": "openai,cpatonn/GLM-4.5-Air-AWQ",
"background": "openai,cpatonn/GLM-4.5-Air-AWQ",
"think": "openai,cpatonn/GLM-4.5-Air-AWQ",
"longContext": "openai,cpatonn/GLM-4.5-Air-AWQ",
"longContextThreshold": 64000,
"webSearch": "openai,cpatonn/GLM-4.5-Air-AWQ"
}
}
Use llama.cpp instead of vllm
(torch311) ljubomir@macbook2(:):~/z/itrade/src$ l ~/llama.cpp/models/GLM-*
-rw-r--r--@ 1 ljubomir staff 46G 6 Aug 20:48 /Users/ljubomir/llama.cpp/models/GLM-4.5-Air-IQ4_NL-00001-of-00002.gguf
-rw-r--r--@ 1 ljubomir staff 12G 6 Aug 19:10 /Users/ljubomir/llama.cpp/models/GLM-4.5-Air-IQ4_NL-00002-of-00002.gguf
build/bin/llama-server --port 8080 --model models/GLM-4.5-Air-IQ4_NL-00001-of-00002.gguf --temp 0.95 --top_k 40 --top_p 0.7 --min_p 0.05 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
$ cat ~/claude-code-router
{
"LOG": true,
"Providers": [
{
"name": "openai",
"api_base_url": "http://localhost:8080/v1/chat/completions",
"api_key": "xxxx",
"models": ["unsloth/GLM-4.5-Air-GGUF"]
}
],
"Router": {
"default": "openai,unsloth/GLM-4.5-Air-GGUF",
"background": "openai,unsloth/GLM-4.5-Air-GGUF",
"think": "openai,unsloth/GLM-4.5-Air-GGUF",
"longContext": "openai,unsloth/GLM-4.5-Air-GGUF",
"longContextThreshold": 64000,
"webSearch": "openai,unsloth/GLM-4.5-Air-GGUF"
}
}
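A quick consistency check for claude-code-router configs like the one above: every string-valued Router entry should name a declared provider and one of its models. The validation logic is my own sketch, assuming the "provider,model" convention shown:

```python
import json

def validate_router(cfg: dict) -> list:
    """Return a list of problems; empty means the config looks consistent."""
    models = {p["name"]: set(p["models"]) for p in cfg["Providers"]}
    errors = []
    for route, target in cfg["Router"].items():
        if not isinstance(target, str):  # e.g. longContextThreshold is an int
            continue
        provider, _, model = target.partition(",")
        if provider not in models:
            errors.append(f"{route}: unknown provider {provider!r}")
        elif model not in models[provider]:
            errors.append(f"{route}: model {model!r} not declared for {provider!r}")
    return errors

cfg = json.loads('''{
  "Providers": [{"name": "openai",
                 "api_base_url": "http://localhost:8080/v1/chat/completions",
                 "api_key": "xxxx",
                 "models": ["unsloth/GLM-4.5-Air-GGUF"]}],
  "Router": {"default": "openai,unsloth/GLM-4.5-Air-GGUF",
             "longContextThreshold": 64000}
}''')
print(validate_router(cfg))
```

Running this over ~/claude-code-router before launching saves a round of silent misrouting.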
https://huggingface.co/ServiceNow-AI/Apriel-Nemotron-15b-Thinker
Summary
Apriel-Nemotron-15b-Thinker is a 15 billion‑parameter reasoning model in ServiceNow’s Apriel SLM series which achieves competitive performance against similarly sized state-of-the-art models like o1‑mini, QWQ‑32b, and EXAONE‑Deep‑32b, all while maintaining only half the memory footprint of those alternatives. It builds upon the Apriel‑15b‑base checkpoint through a three‑stage training pipeline (CPT, SFT and GRPO).
Highlights
Half the size of SOTA models like QWQ-32b and EXAONE-32b, and hence memory efficient.
It consumes 40% fewer tokens compared to QWQ-32b, making it super efficient in production. 🚀🚀🚀
On par with or outperforming them on tasks like MBPP, BFCL, Enterprise RAG, MT Bench, MixEval, IFEval and Multi-Challenge, making it great for agentic / enterprise tasks.
Competitive performance on academic benchmarks like AIME-24, AIME-25, AMC-23, MATH-500 and GPQA considering model size.
https://arxiv.org/abs/2508.10948
[Submitted on 13 Aug 2025]
Apriel-Nemotron-15B-Thinker
While large language models (LLMs) have achieved remarkable reasoning capabilities across domains like code, math and other enterprise tasks, their significant memory and computational costs often preclude their use in practical enterprise settings. To this end, we introduce Apriel-Nemotron-15B-Thinker, a 15-billion-parameter model in the ServiceNow Apriel SLM series that achieves competitive performance against medium-sized state-of-the-art models such as o1-mini, QWQ32B, and EXAONE-Deep-32B while maintaining only half the memory footprint of those alternatives. The Apriel-Nemotron-15B-Thinker model is trained in a four-stage training pipeline including 1) Base Model upscaling, 2) Continual Pre-training, 3) Supervised Fine-tuning (SFT) and 4) Reinforcement Learning using GRPO. Comprehensive evaluations across a diverse suite of benchmarks consistently demonstrate that our Apriel-Nemotron-15B-Thinker model matches or exceeds the performance of its 32-billion-parameter counterparts, despite being less than half their size.
https://huggingface.co/bartowski/ServiceNow-AI_Apriel-Nemotron-15b-Thinker-GGUF
https://huggingface.co/bartowski/ServiceNow-AI_Apriel-Nemotron-15b-Thinker-GGUF/blob/main/ServiceNow-AI_Apriel-Nemotron-15b-Thinker-Q6_K_L.gguf
build/bin/llama-server --port 8080 --model models/ServiceNow-AI_Apriel-Nemotron-15b-Thinker-Q6_K_L.gguf --temp 0.6 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
Seems using the Homebrew compiler doesn't work; one must use the macOS compiler and libraries.
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
brew install libomp
==> Caveats
==> libomp
libomp is keg-only, which means it was not symlinked into /opt/homebrew,
because it can override GCC headers and result in broken builds.
For compilers to find libomp you may need to set:
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
export CC=gcc-15
export CXX=g++-15
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
env |egrep 'CC|CXX|FLAGS'
or use clang from macOS, pointing to the general libomp from Homebrew
unset CC CXX LDFLAGS CPPFLAGS
export OpenMP_C_FLAGS="-I/opt/homebrew/opt/libomp/include"
export OpenMP_CXX_FLAGS="-I/opt/homebrew/opt/libomp/include"
export OpenMP_C_LIBRARIES="-L/opt/homebrew/opt/libomp/lib -lomp"
export OpenMP_CXX_LIBRARIES="-L/opt/homebrew/opt/libomp/lib -lomp"
env |egrep 'CC|CXX|FLAGS|OpenMP'
unset CC CXX LDFLAGS CPPFLAGS OpenMP_C_FLAGS OpenMP_CXX_FLAGS OpenMP_C_LIBRARIES OpenMP_CXX_LIBRARIES CMAKE_OPENMP_C_FLAGS CMAKE_OPENMP_CXX_FLAGS
# Set variables to point directly to the libomp Cellar location
export CPPFLAGS="-I/opt/homebrew/Cellar/libomp/20.1.8/include"
export LDFLAGS="-L/opt/homebrew/Cellar/libomp/20.1.8/lib -lomp"
# Use -fopenmp as the standard flag for AppleClang to enable OpenMP
export CMAKE_OPENMP_C_FLAGS="-fopenmp"
export CMAKE_OPENMP_CXX_FLAGS="-fopenmp"
export OpenMP_C_FLAGS="-I/opt/homebrew/Cellar/libomp/20.1.8/include"
export OpenMP_CXX_FLAGS="-I/opt/homebrew/Cellar/libomp/20.1.8/include"
export OpenMP_C_LIBRARIES="-L/opt/homebrew/Cellar/libomp/20.1.8/lib -lomp"
export OpenMP_CXX_LIBRARIES="-L/opt/homebrew/Cellar/libomp/20.1.8/lib -lomp"
env |egrep 'CC|CXX|FLAGS|OpenMP|OPENMP'
Results in many, many errors.
Seems using the Homebrew compiler doesn't work; one must use the macOS compiler and libraries.
ljubomir@macbook2(:):~/llama.cpp$ brew reinstall libomp
==> Fetching downloads for: libomp
==> Downloading https://ghcr.io/v2/homebrew/core/libomp/manifests/20.1.8
Already downloaded: /Users/ljubomir/Library/Caches/Homebrew/downloads/009dd6eb448288b6172c6eadf901a907fd0e3971359a6190eec3ce22273cdef0--libomp-20.1.8.bottle_manifest.json
==> Fetching libomp
==> Downloading https://ghcr.io/v2/homebrew/core/libomp/blobs/sha256:7e2e7b434187fffff654343b29a1d108a901d19eb5de892bb3ead7d72fdd0ddf
Already downloaded: /Users/ljubomir/Library/Caches/Homebrew/downloads/fe1cce4f5e781ff276363e66d6692c8a266b2acd9a56aad3288824c5cbcc9f9a--libomp--20.1.8.arm64_sequoia.bottle.tar.gz
==> Reinstalling libomp
==> Pouring libomp--20.1.8.arm64_sequoia.bottle.tar.gz
==> Caveats
libomp is keg-only, which means it was not symlinked into /opt/homebrew,
because it can override GCC headers and result in broken builds.
For compilers to find libomp you may need to set:
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
ljubomir@macbook2(:):~/llama.cpp$ find /opt/homebrew |grep -i libomp
/opt/homebrew/opt/libomp
/opt/homebrew/Cellar/llvm/19.1.7_1/lib/libomp.dylib
/opt/homebrew/Cellar/open-mpi/5.0.8/lib/openmpi/libompi_dbg_msgq.so
/opt/homebrew/Cellar/open-mpi/5.0.7/lib/openmpi/libompi_dbg_msgq.so
/opt/homebrew/Cellar/libomp
/opt/homebrew/Cellar/libomp/20.1.8
/opt/homebrew/Cellar/libomp/20.1.8/INSTALL_RECEIPT.json
/opt/homebrew/Cellar/libomp/20.1.8/.brew
/opt/homebrew/Cellar/libomp/20.1.8/.brew/libomp.rb
/opt/homebrew/Cellar/libomp/20.1.8/include
/opt/homebrew/Cellar/libomp/20.1.8/include/ompx.h
/opt/homebrew/Cellar/libomp/20.1.8/include/ompt.h
/opt/homebrew/Cellar/libomp/20.1.8/include/omp.h
/opt/homebrew/Cellar/libomp/20.1.8/include/omp-tools.h
/opt/homebrew/Cellar/libomp/20.1.8/sbom.spdx.json
/opt/homebrew/Cellar/libomp/20.1.8/lib
/opt/homebrew/Cellar/libomp/20.1.8/lib/libomp.dylib
/opt/homebrew/Cellar/libomp/20.1.8/lib/libomp.a
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
git pull
mviv build{,.1}
# Unset the variables you set earlier to avoid conflicts
unset CXXFLAGS CFLAGS LDFLAGS CPPFLAGS OpenMP_C_FLAGS OpenMP_CXX_FLAGS OpenMP_C_LIBRARIES OpenMP_CXX_LIBRARIES CMAKE_OPENMP_C_FLAGS CMAKE_OPENMP_CXX_FLAGS
# Tell CMake where to find the libomp package.
# The Recommended Solution: Use CMAKE_PREFIX_PATH
# The most reliable and modern way to tell CMake where to find dependencies installed by Homebrew is to use the CMAKE_PREFIX_PATH. This variable points CMake to the root directory of the installed package.
# Point CMake to your libomp installation and run it. The brew --prefix libomp command dynamically finds the correct path, so you don't have to hardcode the version number.
# This works because the FindOpenMP module will search within the CMAKE_PREFIX_PATH for the necessary include/omp.h header and lib/libomp.dylib library, automatically configuring everything it needs.
export CMAKE_PREFIX_PATH=$(brew --prefix libomp)
env |egrep 'CC|CXX|FLAGS|OpenMP|OPENMP|CMAKE'
# Remove the old build directory
rm -rf build
cmake . -B ./build
cmake --build build --config Release -j
# http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/ServiceNow-AI_Apriel-Nemotron-15b-Thinker-Q6_K_L.gguf --temp 0.6 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
LJ Mon 18 Aug 2025 10:39:21 BST
https://huggingface.co/collections/nvidia/openreasoning-nemotron-687730dae0170059860f1f01
https://huggingface.co/nvidia/OpenReasoning-Nemotron-32B
https://huggingface.co/bartowski/nvidia_OpenReasoning-Nemotron-32B-GGUF/blob/main/nvidia_OpenReasoning-Nemotron-32B-Q6_K_L.gguf
https://www.reddit.com/r/LocalLLaMA/comments/1m394zh/new_models_from_nvidia_openreasoningnemotron/
# http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/nvidia_OpenReasoning-Nemotron-32B-Q6_K_L.gguf --model-draft models/nvidia_OpenReasoning-Nemotron-1.5B-Q8_0.gguf --temp 0.1 --top-p 0.9 --top-k 50 --ctx-size 65536 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
LJ Thu 21 Aug 2025 01:05:30 BST
Update qwen-code
$ npm install -g @qwen-code/qwen-code
Check version
$ qwen --version
# extend context (262144) 256K->1M (1048576), flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --ctx-size 1048576 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
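The YaRN flags in the command above encode a simple relation (a sketch of the arithmetic, using the values from the command):

```python
# YaRN context-extension arithmetic behind --rope-scale / --yarn-orig-ctx:
# extended context = original training context * rope scale factor.
orig_ctx = 262144      # --yarn-orig-ctx (256K native)
target_ctx = 1048576   # --ctx-size (1M)
rope_scale = target_ctx // orig_ctx
print(rope_scale)  # 4, matching --rope-scale 4
```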
https://github.com/QwenLM/qwen-code
export OPENAI_API_KEY="your_api_key_here"
export OPENAI_BASE_URL="your_api_endpoint"
export OPENAI_MODEL="your_model_choice"
NB: to run claude against the DeepSeek API:
env -u ANTHROPIC_API_KEY ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic" ANTHROPIC_AUTH_TOKEN="${DEEPSEEK_API_KEY}" ANTHROPIC_MODEL="deepseek-chat" ANTHROPIC_SMALL_FAST_MODEL="deepseek-chat" claude
OPENAI_API_KEY=123
OPENAI_BASE_URL=http://localhost:[port]/v1
OPENAI_MODEL=qwen/qwen3-coder-30b
Run qwen-code
env OPENAI_API_KEY=123 OPENAI_BASE_URL=http://localhost:8080/v1 OPENAI_MODEL=qwen/qwen3-coder-30ba-3b qwen
Use OpenAI auth.
# access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --repeat-penalty 1.05 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF
Re-downloaded the quant from Unsloth as the files differ:
ljubomir@macbook2(:):~/llama.cpp$ ls -la models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf*
-rw-r--r--@ 1 ljubomir staff 17310784864 24 Aug 06:14 models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf
-rw-r--r--@ 1 ljubomir staff 17310784928 31 Jul 19:48 models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf.1
OK - the new file works where the old one didn't, so the old file was deleted.
This works:
ljubomir@macbook2(:):~/z/itrade/contrib/crm/src$ env OPENAI_API_KEY=123 OPENAI_BASE_URL=http://localhost:8080/v1 OPENAI_MODEL=qwen/qwen3-coder-30ba-3b qwen
The first run asks you to log in, with options 1) Qwen 2) OpenAI - choose #2 OpenAI
# access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --repeat-penalty 1.05 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
sudo sysctl iogpu.wired_limit_mb=88000
# extend context (262144) 256K->1M (1048576), flash attention cached; access on http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf --temp 0.7 --top_k 20 --top_p 0.8 --min_p 0 --repeat-penalty 1.05 --ctx-size 1048576 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
qwen-code-localhost-8080() {
env OPENAI_API_KEY=123 OPENAI_BASE_URL=http://localhost:8080/v1 OPENAI_MODEL=qwen/qwen3-coder-30ba-3b qwen
}
https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally
Qwen3-Coder: How to Run Locally
Run Qwen3-Coder-30B-A3B-Instruct and 480B-A35B locally with Unsloth Dynamic quants.
Qwen3-Coder is Qwen’s new series of coding agent models, available in 30B (Qwen3-Coder-Flash) and 480B parameters. Qwen3-480B-A35B-Instruct achieves SOTA coding performance rivalling Claude Sonnet-4, GPT-4.1, and Kimi K2, with 61.8% on Aider Polyglot and support for 256K (extendable to 1M) token context.
We also uploaded Qwen3-Coder with native 1M context length extended by YaRN and full-precision 8bit and 16bit versions. Unsloth also now supports fine-tuning and RL of Qwen3-Coder.
UPDATE: We fixed tool-calling for Qwen3-Coder! You can now use tool-calling seamlessly in llama.cpp, Ollama, LMStudio, Open WebUI, Jan etc. This issue was universal and affected all uploads (not just Unsloth), and we've communicated with the Qwen team about our fixes! Read more
Does Unsloth Dynamic Quants work? Yes, and very well. In third-party testing on the Aider Polyglot benchmark, the UD-Q4_K_XL (276GB) dynamic quant nearly matched the full bf16 (960GB) Qwen3-coder model, scoring 60.9% vs 61.8%. More details here.
Qwen3 Coder - Unsloth Dynamic 2.0 GGUFs:
Dynamic 2.0 GGUF (to run): 30B-A3B-Instruct, 480B-A35B-Instruct
1M Context Dynamic 2.0 GGUF: 30B-A3B-Instruct, 480B-A35B-Instruct
🖥️ Running Qwen3-Coder
Below are guides for the 30B-A3B and 480B-A35B variants of the model.
⚙️ Recommended Settings
Qwen recommends these inference settings for both models:
temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.05
Temperature of 0.7
Top_K of 20
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Top_P of 0.8
Repetition Penalty of 1.05
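For intuition, the top-k / min-p / top-p filters above can be sketched in plain Python (a toy reimplementation over a 4-token vocabulary, not llama.cpp's actual sampler):

```python
import math

def filter_logits(logits, top_k=20, top_p=0.8, min_p=0.0):
    """Apply top-k, then min-p, then top-p filtering; return surviving
    tokens with renormalized probabilities."""
    # softmax over the logits
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # top-k: keep only the k most probable tokens
    kept = sorted(probs, key=probs.get, reverse=True)[:top_k]
    # min-p: drop tokens below min_p * (max probability)
    pmax = probs[kept[0]]
    kept = [t for t in kept if probs[t] >= min_p * pmax]
    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    out, cum = [], 0.0
    for t in kept:
        out.append(t)
        cum += probs[t]
        if cum >= top_p:
            break
    z2 = sum(probs[t] for t in out)
    return {t: probs[t] / z2 for t in out}

# With logits 3 > 2 > 1 > 0, top_p=0.8 keeps only the two most likely tokens.
dist = filter_logits({"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}, top_k=3, top_p=0.8)
```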
Chat template:
<|im_start|>user
Hey there!<|im_end|>
<|im_start|>assistant
What is 1+1?<|im_end|>
<|im_start|>user
2<|im_end|>
<|im_start|>assistant
Recommended context output: 65,536 tokens (can be increased). Details here.
Chat template/prompt format with newlines un-rendered
<|im_start|>user\nHey there!<|im_end|>\n<|im_start|>assistant\nWhat is 1+1?<|im_end|>\n<|im_start|>user\n2<|im_end|>\n<|im_start|>assistant\n
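The ChatML format above is simple enough to render by hand (a minimal sketch; the canonical Jinja template ships inside the GGUF / tokenizer config and is what --jinja actually uses):

```python
def chatml(messages, add_generation_prompt=True):
    """Render a messages list into the Qwen ChatML prompt format shown above."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # leave the prompt open for the assistant's reply
        out += "<|im_start|>assistant\n"
    return out

prompt = chatml([{"role": "user", "content": "Hey there!"}])
print(prompt)
```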
Chat template for tool calling (Getting the current temperature for San Francisco). More details here for how to format tool calls.
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>\n<function=get_current_temperature>\n<parameter=location>\nSan Francisco, CA, USA
</parameter>\n</function>\n</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>\n<|im_end|>
Reminder that this model supports only non-thinking mode and does not generate <think></think> blocks in its output. Additionally, specifying enable_thinking=False is no longer required.
Run Qwen3-Coder-30B-A3B-Instruct:
To achieve inference speeds of 6+ tokens per second for our Dynamic 4-bit quant, you should have at least 18GB of unified memory (combined VRAM and RAM) or 18GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. E.g. the UD_Q8_K_XL quant (full precision), which is 32.5GB, will require at least 33GB of unified memory (VRAM + RAM) or 33GB of RAM for optimal performance.
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
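The rule of thumb above in code: the on-disk quant size sets the memory floor (using the IQ4_NL file's byte count from the ls listing elsewhere in these notes):

```python
# Memory floor ≈ on-disk model size; KV cache comes on top of this.
size_bytes = 17310784864  # Qwen3-Coder-30B-A3B IQ4_NL .gguf (from ls -la)
size_gib = size_bytes / 2**30
print(round(size_gib, 1))  # ~16.1 GiB, so the 18GB guidance above has headroom
```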
Given that this is a non-thinking model, there is no need to set thinking=False, and the model does not generate <think></think> blocks.
Follow the best practices above. They're the same as the 480B model.
🦙 Ollama: Run Qwen3-Coder-30B-A3B-Instruct Tutorial
Install ollama if you haven't already! You can only run models up to 32B in size.
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in params in our Hugging Face upload!
ollama run hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
✨ Llama.cpp: Run Qwen3-Coder-30B-A3B-Instruct Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can directly pull from HuggingFace via:
./llama.cpp/llama-cli \
-hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
--jinja -ngl 99 --threads -1 --ctx-size 32768 \
--temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05
Download the model via huggingface_hub (after pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],
)
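The allow_patterns above are shell-style globs matched against repo file paths; a quick check of what a pattern selects (the repo paths below are illustrative, not a real listing):

```python
from fnmatch import fnmatch

# Hypothetical repo paths; only the UD-Q4_K_XL file should match.
files = [
    "UD-Q4_K_XL/Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL-00001-of-00001.gguf",
    "Q2_K/Qwen3-Coder-30B-A3B-Instruct-Q2_K.gguf",
]
picked = [f for f in files if fnmatch(f, "*UD-Q4_K_XL*")]
print(picked)
```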
Run Qwen3-Coder-480B-A35B-Instruct:
To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 150GB of unified memory (combined VRAM and RAM) or 150GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. E.g. the Q2_K_XL quant, which is 180GB, will require at least 180GB of unified memory (VRAM + RAM) or 180GB of RAM for optimal performance.
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
Follow the best practices above. They're the same as the 30B model.
📖 Llama.cpp: Run Qwen3-Coder-480B-A35B-Instruct Tutorial
For Coder-480B-A35B, we will specifically use Llama.cpp for optimized inference and a plethora of options.
If you want a full precision unquantized version, use our Q8_K_XL, Q8_0 or BF16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can directly use llama.cpp to download the model, but I normally suggest using huggingface_hub. To use llama.cpp directly, do:
./llama.cpp/llama-cli \
-hf unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF:Q2_K_XL \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05
Or, download the model via huggingface_hub (after pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir = "unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
Run the model in conversation mode and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory; remove it for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options discussed here.
./llama.cpp/llama-cli \
--model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05
Also don't forget about the new Qwen3 update. Run Qwen3-235B-A22B-Instruct-2507 locally with llama.cpp.
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU" This offloads up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.
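What the -ot patterns above actually match can be checked with Python's re (the tensor names below are illustrative, following llama.cpp's blk.N.ffn_*_exps naming style, not read from a real GGUF):

```python
import re

# A few sample tensor names: one attention tensor, three MoE expert tensors.
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]
# ".ffn_.*_exps." pins every MoE expert tensor to CPU...
all_moe = [t for t in tensors if re.search(r"\.ffn_.*_exps\.", t)]
# ...while ".ffn_(up|down)_exps." keeps the gate experts on GPU.
up_down = [t for t in tensors if re.search(r"\.ffn_(up|down)_exps\.", t)]
print(all_moe, up_down)
```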
The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4bits for example to reduce VRAM / RAM movement, which can also make the generation process faster.
📐How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, albeit slightly slower, e.g. q4_1, q5_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
We also uploaded 1 million context length GGUFs via YaRN scaling here.
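A back-of-envelope for why KV-cache quantization matters at long context (the layer/head dimensions below are illustrative, not read from a specific GGUF; q8_0 is treated as roughly 1 byte per element, ignoring block scales):

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_el):
    # K and V caches each hold ctx * n_kv_heads * head_dim elements per layer
    return 2 * ctx * n_layers * n_kv_heads * head_dim * bytes_per_el

f16 = kv_cache_bytes(262144, 48, 4, 128, 2)  # f16 cache at 256K context
q8 = kv_cache_bytes(262144, 48, 4, 128, 1)   # q8_0 cache, roughly half
print(f16 / 2**30, q8 / 2**30)  # 24.0 GiB vs 12.0 GiB
```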
🧰 Tool Calling Fixes
We managed to fix tool calling via llama.cpp --jinja specifically for serving through llama-server! If you’re downloading our 30B-A3B quants, no need to worry as these already include our fixes. For the 480B-A35B model, please:
Download the first file at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/UD-Q2_K_XL for UD-Q2_K_XL, and replace your current file
Use snapshot_download as usual as in https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally#llama.cpp-run-qwen3-tutorial which will auto override the old files
Use the new chat template via --chat-template-file. See GGUF chat template or chat_template.jinja
As an extra, we also made 1 single 150GB UD-IQ1_M file (so Ollama works) at https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/blob/main/Qwen3-Coder-480B-A35B-Instruct-UD-IQ1_M.gguf
This should solve issues like: https://github.com/ggml-org/llama.cpp/issues/14915
Using Tool Calling
To format the prompts for tool calling, let's showcase it with an example.
As an example, I created a Python function called get_current_temperature which should get the current temperature for a location. For now it is a placeholder that always returns 26.1 degrees Celsius; you should change this to a real implementation!
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,  # PRE_CONFIGURED -> you change this!
        "location": location,
        "unit": unit,
    }
Then use the tokenizer to create the entire prompt:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'content': "", 'role': 'assistant', 'function_call': None, 'tool_calls': [
        {'id': 'ID', 'function': {'arguments': {"location": "San Francisco, CA, USA"}, 'name': 'get_current_temperature'}, 'type': 'function'},
    ]},
    {'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'ID'},
]
prompt = tokenizer.apply_chat_template(messages, tokenize = False)
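Closing the loop, a minimal dispatch step for the tool call above (a sketch: the tool_call dict stands in for what you would parse out of the model's <tool_call> output, and the function body is the same placeholder as earlier):

```python
import json

def get_current_temperature(location, unit="celsius"):
    # placeholder, as above -> replace with a real temperature lookup
    return {"temperature": 26.1, "location": location, "unit": unit}

# Registry of callable tools, keyed by the function name the model emits.
TOOLS = {"get_current_temperature": get_current_temperature}

tool_call = {"name": "get_current_temperature",
             "arguments": {"location": "San Francisco, CA, USA"}}
result = TOOLS[tool_call["name"]](**tool_call["arguments"])
# JSON-encode the result for the <tool_response> message back to the model.
tool_response = json.dumps(result)
print(tool_response)
```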
https://www.reddit.com/r/LocalLLaMA/comments/1mi9i1g/advice_on_running_qwen3coder30ba3b_locally/
https://github.com/ggml-org/llama.cpp/discussions/15396
gpt-oss-120b
build/bin/llama-server --port 8080 --model models/gpt-oss-120b-MXFP4-00001-of-00002.gguf --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
openai/gpt-oss-20b
https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Reasoning levels
You can adjust the reasoning level that suits your task across three levels:
Low: Fast responses for general dialogue.
Medium: Balanced speed and detail.
High: Deep and detailed analysis.
The reasoning level can be set in the system prompts, e.g., "Reasoning: high".
https://docs.unsloth.ai/basics/gpt-oss-how-to-run-and-fine-tune
The gpt-oss models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), determined by the number of tokens the model will use to think.
The gpt-oss models offer three distinct levels of reasoning effort you can choose from:
Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
Medium: A balance between performance and speed.
High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
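Since the reasoning level is plain text in the system prompt (per the note above), a tiny helper suffices (the surrounding system-prompt wording here is an assumption for illustration, not the official harmony template):

```python
def gpt_oss_system(effort="medium"):
    """Build a system prompt carrying the gpt-oss reasoning level."""
    assert effort in ("low", "medium", "high"), "unknown reasoning level"
    return f"You are a helpful assistant.\nReasoning: {effort}"

print(gpt_oss_system("high"))
```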
https://www.reddit.com/r/LocalLLaMA/comments/1mlomlb/my_thoughts_on_gptoss120b/
Still testing, but here's my early efforts:
./llama.cpp/llama-cli \
--model gpt-oss-120b-UD-Q4_K_XL-00001-of-00002.gguf \
--n-cpu-moe 27 \
--n-gpu-layers 999 \
--ctx-size 120000 \
--flash-attn \
--threads 10 \
--temp 1.0 \
--min-p 0.0 \
--top-p 1.0 \
--no-mmap \
--top-k 40 \
--interactive
If you have the GGUF downloaded, just pass --chat-template-kwargs '{"reasoning_effort": "high"}' as a parameter
build/bin/llama-server --port 8080 --model models/gpt-oss-120b-MXFP4-00001-of-00002.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
⚙️ Recommended Settings
OpenAI recommends these inference settings for both models:
temperature=1.0, top_p=1.0, top_k=0
Temperature of 1.0
Top_K = 0 (or experiment with 100 for possible better results)
Top_P = 1.0
Recommended minimum context: 16,384
Maximum context length window: 131,072
Chat template:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-05\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there!<|end|><|start|>user<|message|>What is 1+1?<|end|><|start|>assistant
The end-of-sentence/generation (EOS) token is <|return|>
https://huggingface.co/unsloth/gpt-oss-120b-GGUF
gpt-oss-20b-Q8_0.gguf
build/bin/llama-server --port 8080 --model models/gpt-oss-20b-Q8_0.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
https://huggingface.co/unsloth/gpt-oss-120b-GGUF
openai/gpt-oss-120b
https://github.com/openai/codex/blob/main/codex-rs/config.md
https://platform.openai.com/docs/guides/tools-local-shell
https://www.reddit.com/r/LocalLLaMA/comments/1miq7sp/gptoss_support_merged_into_codex/
Finally! From the README.md:
Codex can run fully locally against an OpenAI-compatible OSS host (like Ollama) using the --oss flag:
Interactive UI:
codex --oss
Non-interactive (programmatic) mode:
echo "Refactor utils" | codex exec --oss
Model selection when using --oss:
If you omit -m/--model, Codex defaults to -m gpt-oss:20b and will verify it exists locally (downloading if needed).
To pick a different size, pass one of:
-m "gpt-oss:20b"
-m "gpt-oss:120b"
Point Codex at your own OSS host:
By default, --oss talks to http://localhost:11434/v1.
To use a different host, set one of these environment variables before running Codex:
CODEX_OSS_BASE_URL, for example:
CODEX_OSS_BASE_URL="http://my-ollama.example.com:11434/v1" codex --oss -m gpt-oss:20b
or CODEX_OSS_PORT (when the host is localhost):
CODEX_OSS_PORT=11434 codex --oss
https://github.com/openai/gpt-oss
Codex
We support codex as a client for gpt-oss. To run the 20b version, add this to ~/.codex/config.toml:
disable_response_storage = true
show_reasoning_content = true
[model_providers.local]
name = "local"
base_url = "http://localhost:11434/v1"
[profiles.oss]
model = "gpt-oss:20b"
model_provider = "local"
Still doesn't work: --oss is for Ollama, not for llama.cpp.
# The reasoning level can be set in the system prompts, "Reasoning: low", "Reasoning: medium", or "Reasoning: high".
# build/bin/llama-server --port 8080 --model models/gpt-oss-120b-MXFP4-00001-of-00002.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
codex-localhost-8080-gpt-oss-120b() {
env OPENAI_API_KEY="123" CODEX_OSS_BASE_URL="http://localhost:8080/v1" codex --oss -m "gpt-oss:120b" -c model_reasoning_effort="high"
}
# build/bin/llama-server --port 8080 --model models/gpt-oss-20b-Q8_0.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
codex-localhost-8080-gpt-oss-20b() {
env OPENAI_API_KEY="123" CODEX_OSS_BASE_URL="http://localhost:8080/v1" codex --oss -m "gpt-oss:20b" -c model_reasoning_effort="high"
}
But qwen-code works:
# build/bin/llama-server --port 8080 --model models/gpt-oss-120b-MXFP4-00001-of-00002.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
qwen-code-localhost-8080-gpt-oss-120b() {
env OPENAI_API_KEY="123" OPENAI_BASE_URL="http://localhost:8080/v1" OPENAI_MODEL="openai/gpt-oss-120b" qwen
}
# build/bin/llama-server --port 8080 --model models/gpt-oss-20b-Q8_0.gguf --temp 1 --top_k 40 --top_p 1 --min_p 0 --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --chat-template-kwargs '{"reasoning_effort": "high"}' --jinja &
qwen-code-localhost-8080-gpt-oss-20b() {
env OPENAI_API_KEY="123" OPENAI_BASE_URL="http://localhost:8080/v1" OPENAI_MODEL="openai/gpt-oss-20b" qwen
}
LJ Sun 24 Aug 2025 18:17:34 BST
unsloth/Seed-OSS-36B-Instruct-GGUF
https://huggingface.co/unsloth/Seed-OSS-36B-Instruct-GGUF
We recommend sampling with temperature=1.1 and top_p=0.95.
Native Long Context: Trained with up-to-512K long context natively.
ljubomir@gigul2(422663.llama.cpp:0):~/llama.cpp$
git pull
mv -iv build{,.1}   # back up the old build dir
# Unset the variables you set earlier to avoid conflicts
unset CC CXX CXXFLAGS CFLAGS LDFLAGS CPPFLAGS OpenMP_C_FLAGS OpenMP_CXX_FLAGS OpenMP_C_LIBRARIES OpenMP_CXX_LIBRARIES CMAKE_OPENMP_C_FLAGS CMAKE_OPENMP_CXX_FLAGS
# Tell CMake where to find the libomp package.
# The Recommended Solution: Use CMAKE_PREFIX_PATH
# The most reliable and modern way to tell CMake where to find dependencies installed by Homebrew is to use the CMAKE_PREFIX_PATH. This variable points CMake to the root directory of the installed package.
# Point CMake to your libomp installation and run it. The brew --prefix libomp command dynamically finds the correct path, so you don't have to hardcode the version number.
# This works because the FindOpenMP module will search within the CMAKE_PREFIX_PATH for the necessary include/omp.h header and lib/libomp.dylib library, automatically configuring everything it needs.
export CMAKE_PREFIX_PATH=$(brew --prefix libomp)
env |egrep 'CC|CXX|FLAGS|OpenMP|OPENMP|CMAKE'
# Remove the old build directory
rm -rf build
cmake . -B ./build
cmake --build build --config Release -j
(torch313) ljubomir@macbook2(:):~/llama.cpp$ l models/Seed-OSS-36B-Instruct-IQ4_NL.gguf
-rw-r--r--@ 1 ljubomir staff 19G 25 Aug 21:40 models/Seed-OSS-36B-Instruct-IQ4_NL.gguf
sudo sysctl iogpu.wired_limit_mb=88000
# http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Seed-OSS-36B-Instruct-IQ4_NL.gguf --temp 1.1 --top_p 1 --ctx-size 524288 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
build/bin/llama-server --port 8080 --model models/Seed-OSS-36B-Instruct-IQ4_NL.gguf --temp 1.1 --top_p 1 --ctx-size 100000 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
https://huggingface.co/inclusionAI/Rubicon-Preview
inclusionAI/Rubicon-Preview
Rubicon
📄 Paper • 🤗 Model
This is the model card for Rubicon-preview, a 30B-A3B parameter model trained with a novel reinforcement learning framework using "rubric anchors" to excel at open-ended, creative, and humanities-centric tasks.
Highlights
We introduce Rubicon, a novel framework using rubric anchors for reinforcement learning. Our model, Rubicon-preview, demonstrates the following key highlights:
Token-Efficient Performance: Achieves a +5.2% absolute improvement on subjective, humanities-centric tasks with only 5K training samples, outperforming a 671B DeepSeek-V3 model.
Stylistic Controllability: Leverages rubric anchors to precisely guide output style, producing responses that are more human-like, emotionally expressive, and less formulaic.
Preservation of General Abilities: Avoids performance degradation on general tasks—a common side effect of specialized RL—while delivering additional gains on reasoning benchmarks like AIME 2024 (+4.1%).
https://arxiv.org/abs/2508.12790
Computer Science > Artificial Intelligence
[Submitted on 18 Aug 2025]
Reinforcement Learning with Rubric Anchors
Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, Junbo Zhao
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI's o-series. In RLVR, rewards are derived from verifiable signals-such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the "AI-like" tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
Trained on top of https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507, context length 256K
ljubomir@macbook2(:):~/llama.cpp$ l models/Rubicon-Preview.i1-Q6_K.gguf
-rw-r--r--@ 1 ljubomir staff 23G 26 Aug 09:14 models/Rubicon-Preview.i1-Q6_K.gguf
sudo sysctl iogpu.wired_limit_mb=88000
# http://127.0.0.1:8080
build/bin/llama-server --port 8080 --model models/Rubicon-Preview.i1-Q6_K.gguf --temp 0.6 --top_p 0.95 --ctx-size 262144 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --jinja &
LJ Tue 26 Aug 2025 11:50:26 BST