huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir models/ --local-dir-use-symlinks False
Clone llama.cpp, cd to the dir, build:
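The clone/build steps can be sketched as below. This is a sketch, not the exact invocation used later: it assumes git, CMake, and a C++ toolchain, and the commands are printed for review rather than run directly.

```shell
# Collected clone-and-build commands for llama.cpp (plain CPU build), printed
# here for review -- run them as shown. Add -DGGML_CUDA=ON (newer CMake trees)
# or LLAMA_CUDA=1 (older Makefile builds, as used further down) for NVIDIA GPUs.
repo='https://github.com/ggerganov/llama.cpp.git'
steps="git clone $repo
cd llama.cpp
cmake -B build
cmake --build build --config Release -j"
echo "$steps"
```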
Check examples/server/README.md
ljubomir@thinkpad2(:):~/llama.cpp$ curl -L https://huggingface.co/TheBloke/dolphin-2.2-yi-34b-200k-GGUF/resolve/main/dolphin-2.2-yi-34b-200k.Q5_K_M.gguf --output models/dolphin-2.2-yi-34b-200k.Q5_K_M.gguf
ljubomir@thinkpad2(:):~/llama.cpp$ curl -L https://huggingface.co/TheBloke/stablelm-zephyr-3b-GGUF/resolve/main/stablelm-zephyr-3b.Q4_K_M.gguf --output models/stablelm-zephyr-3b.Q4_K_M.gguf
https://simonwillison.net/2023/Nov/29/llamafile/
1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...:
https://huggingface.co/TheBloke/Orca-2-13B-GGUF
./main -ngl 10 -m models/goliath-120b.Q4_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuickBooks for …
huggingface-cli login
./main -ngl 10 -m models/nous-capybara-34b.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins -p "Write an email to my accountant that I have been late updating my company transactions in QuickBooks…
Check the openchat_3.5 model, compare with llama-2
./main -m models/llama-2-13b.Q4_K_M.gguf -t 6 -p "Llamas are"
Llama 70b model
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
https://replicate.com/blog/run-llama-locally
https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF
curl -L "https://huggingface.co/TheBloke/CodeLlama-13B-Python-GGUF/resolve/main/codellama-13b-python.Q4_K_M.gguf" -o models/codellama-13b-python.Q4_K_M.gguf
llama-cli --hf-repo ggml-org/models --model mistral-7b-v0.2-iq3_s-imat.gguf -p "I like big" -r "."
Llama-3
Llama-3-8B gguf from
make LLAMA_CUDA=1 -j
Llama 3.1
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git tag -m 'LJ save before LLama 3.2 trying Llama-3.2-1B-Instruct-GGUF' tag_20241016_LJ_before_Llama_3.2_merge_rc1
(torch) ljubomir@gigul2(2251797.torch:0):~/llama.cpp$ git tag -m 'After updating to run QwQ-32B-Preview-GGUF' tag_20241129_LJ_after_QwQ_32B-preview_merge_rc1
ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ make clean
(torch) ljubomir@gigul2(797966.llama.cpp:0):~/llama.cpp$ git pull origin master
LJ Fri 31 Jan 2025 00:08:33 GMT — brew install llama.cpp
LJ Fri 31 Jan 2025 10:42:28 GMT — 1M context models
LJ Sat 1 Feb 2025 09:19:30 GMT — VSCode llama auto-complete llama-vscode addon
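The llama-vscode extension expects a local llama-server endpoint serving FIM completions; a minimal sketch of starting one follows. The model path is an assumption (any FIM-capable GGUF works), and 8012 is, I believe, the extension's default port — check its settings.

```shell
# Hypothetical model path; swap in whatever completion model is on disk.
model=models/qwen2.5-coder-1.5b-q8_0.gguf
cmd="build/bin/llama-server -m $model --port 8012"
echo "$cmd"   # run this, then point llama-vscode at http://127.0.0.1:8012
```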
LJ Sun 2 Feb 2025 13:16:11 GMT — git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF/blob/main/README.md
./llama.cpp/build/bin/llama-cli \
https://unsloth.ai/blog/deepseek-r1-0528
LJ Fri 30 May 2025 10:25:30 BST — cmake -B build -DLLAMA_CURL=OFF
LJ Sat 31 May 2025 07:21:47 BST — https://unsloth.ai/blog/deepseek-r1-0528
LJ Sat 31 May 2025 12:45:36 BST — ljubomir@macbook2(:):~/llama.cpp/models$ wget 'https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf'
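The wget above fetches only shard 1 of 4; the split GGUF needs all parts before llama.cpp can load it. A sketch that lists the four shard URLs (names taken from the file above) — pipe the output to `wget -c -i -` to download them all with resume support:

```shell
# Generate the four shard URLs for the split DeepSeek-R1-0528 GGUF.
base='https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/resolve/main/UD-IQ1_S'
for i in 1 2 3 4; do
  printf '%s/DeepSeek-R1-0528-UD-IQ1_S-%05d-of-00004.gguf\n' "$base" "$i"
done
```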
LJ Sat 31 May 2025 13:50:34 BST — ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
LJ Sat 31 May 2025 13:50:56 BST — ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli \
LJ Sun 1 Jun 2025 08:55:50 BST — Try running Qwen3-235B-A22B-GGUF
LJ Sun 1 Jun 2025 07:14:27 BST — ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-cli --help
https://www.reddit.com/r/LocalLLaMA/comments/186phti/m1m2m3_increase_vram_allocation_with_sudo_sysctl/
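The thread above is about raising the GPU (wired) memory cap on Apple Silicon via `sysctl`. A sketch — the limit value here is an assumption (roughly 75% of RAM is a common choice), and the setting resets at reboot:

```shell
limit_mb=27000   # e.g. ~27 GB on a 36 GB machine; pick a value for your RAM
cmd="sudo sysctl iogpu.wired_limit_mb=$limit_mb"
echo "$cmd"      # review, then run manually (needs sudo; macOS 14+ knob)
```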
https://github.com/ggml-org/llama.cpp/discussions/2182#discussioncomment-7698315
LJ Thu 26 Jun 2025 00:04:46 BST — ln -s ~/.lmstudio/models/rednote-hilab/dots.llm1.inst/dots.llm1.inst-UD-TQ1_0.gguf
LJ Thu 26 Jun 2025 01:38:40 BST — Run with larger context with flash attention
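A sketch of that larger-context + flash-attention run, assuming the dots.llm1 symlink from the previous entry; `-c` sets the context length and `-fa` enables flash attention, and the 32K figure is an assumption:

```shell
model=models/dots.llm1.inst-UD-TQ1_0.gguf   # symlinked above
cmd="build/bin/llama-cli -m $model -c 32768 -fa"
echo "$cmd"   # raise/lower -c to fit available memory
```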
Gemma 3n is here! 🎉
LJ Sun 6 Jul 2025 09:51:26 BST — cd llama.cpp/
LJ Sat 19 Jul 2025 14:43:56 BST — https://huggingface.co/unsloth/ERNIE-4.5-21B-A3B-PT-GGUF
LJ Thu 10 Jul 2025 21:11:04 BST — Benchmark for Devstral? #4058
LJ Fri 11 Jul 2025 20:37:40 BST — https://huggingface.co/unsloth/Hunyuan-A13B-Instruct-GGUF
https://huggingface.co/mradermacher/Seed-X-PPO-7B-GGUF
https://huggingface.co/DavidAU/Qwen2.5-2X7B-Coder-Instruct-OlympicCoder-19B
LJ Sat 12 Jul 2025 11:24:17 BST — # 1a) Start mlx server: $ uvx --from mlx-lm mlx_lm.server --model mlx-community/Qwen3-8B-8bit
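Once mlx_lm.server is up it speaks an OpenAI-style chat API; a sketch of querying it (port 8080 is the default unless `--port` was passed — an assumption here, as is the payload):

```shell
url='http://127.0.0.1:8080/v1/chat/completions'
payload='{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
echo "curl -s $url -H 'Content-Type: application/json' -d '$payload'"
```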
LJ Sat 12 Jul 2025 11:37:41 BST — mistralai/Devstral-Small-2507
LJ Sat 12 Jul 2025 11:37:51 BST — unsloth/Kimi-Dev-72B-GGUF
https://www.reddit.com/r/LocalLLaMA/comments/1jtwcdo/guide_for_quickly_setting_up_aider_qwq_and_qwen/
https://www.reddit.com/r/LocalLLaMA/comments/1m7ci3s/howto_use_qwen3coder_or_any_other_llm_with_claude/
https://gist.github.com/ivanfioravanti/44b4284be930b3c340cc1696d60c6143
LJ Mon 28 Jul 2025 08:41:59 BST — https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
LJ Sat 2 Aug 2025 06:22:24 BST — https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-1M-GGUF
https://huggingface.co/mradermacher/XBai-o4-GGUF
(torch311) ljubomir@macbook2(:):~/llama.cpp$ build/bin/llama-server --help
mradermacher/KAT-V1-40B-GGUF
rope_scaling: {
LJ Thu 7 Aug 2025 07:28:17 BST — https://huggingface.co/unsloth/GLM-4.5-Air-GGUF
LJ Sat 9 Aug 2025 07:49:03 BST — https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
https://www.reddit.com/r/LocalLLaMA/comments/1mfzzt4/experience_with_glm45air_claude_code/?chainedPosts=t3_1mkw4ug
LJ Mon 18 Aug 2025 10:39:21 BST — Competitive performance on academic benchmarks like AIME-24, AIME-25, AMC-23, MATH-500, and GPQA, considering the model size.
LJ Thu 21 Aug 2025 01:05:30 BST — https://huggingface.co/bartowski/nvidia_OpenReasoning-Nemotron-32B-GGUF/blob/main/nvidia_OpenReasoning-Nemotron-32B-Q6_K_L.gguf
Update qwen-code
https://docs.unsloth.ai/basics/qwen3-coder-how-to-run-locally
LJ Sun 24 Aug 2025 18:17:34 BST — https://github.com/ggml-org/llama.cpp/discussions/15396
unsloth/Seed-OSS-36B-Instruct-GGUF
LJ Tue 26 Aug 2025 11:50:26 BST — Token-Efficient Performance: Achieves a +5.2% absolute improvement on subjective, humanities-centric tasks with only 5K training samples, outperforming a 671B DeepSeek-V3 model.