Discussion:
Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++
Petter Reinholdtsen
2024-02-13 07:40:01 UTC
Permalink
I tried building the CPU edition on one machine and running it on
another, and experienced illegal instruction exceptions. I suspect this
means one needs to be careful when selecting a build profile to ensure
it works on all supported Debian platforms.

I would be happy to help get this up and running. Please let me
know when you have published a git repo with the packaging rules.
--
Happy hacking
Petter Reinholdtsen
Petter Reinholdtsen
2024-03-08 19:30:01 UTC
Permalink
[Christian Kastner 2024-02-13]
I'll push a first draft soon, though it will definitely not be
upload-ready for the above reasons.
Where can I find the first draft?
--
Happy hacking
Petter Reinholdtsen
Petter Reinholdtsen
2024-03-10 10:10:01 UTC
Permalink
[Christian Kastner]
I'm open for better ideas, though.
I find in general that programs written with run-time selection of
optimizations are far superior to per-host compilations, at least from
a system administration viewpoint. I guess such an approach would
require rewriting llama.cpp, and I have no idea how much work that
would be.

I look forward to having a look at your git repo to see if there is
something there I can learn from for the whisper.cpp packaging.
--
Happy hacking
Petter Reinholdtsen
Cordell Bloor
2024-12-15 08:00:01 UTC
Permalink
Hi Christian and Petter,
I've discarded the simple package and now plan another approach: a
package that ships a helper to rebuild the utility when needed, similar
* Continuously developed upstream, no build suited for stable
* Build optimized for the current host's hardware, which is a key
feature. Building for our amd64 ISA standard would be absurd.
I'm open for better ideas, though.
Perhaps we are letting the perfect be the enemy of the good?

There are lots of fast-moving projects that get frozen at some version
for stable. While that can be annoying for maintenance, it is also
something that provides value. It's hard to build on top of something
that keeps changing.

I would also argue that you're taking on too much responsibility trying
to enable -march=native optimizations. It's true that you can get
significantly more performance using AVX instructions available on most
modern computers, but if llama.cpp really wanted they could implement
dynamic dispatch themselves. The CPU instruction set is also irrelevant
for the GPU-accelerated version of the package.

Why not deliver the basics before we try to do something fancy? In the
time that passed between the creation of this issue and now, Fedora
created their own llama.cpp package [1]. I think they had the right
idea. There's value in providing a working package to users today, even
if it's imperfect.

Sincerely,
Cory Bloor

[1]: https://packages.fedoraproject.org/pkgs/llama-cpp/llama-cpp/
Mo Zhou
2024-12-22 16:50:01 UTC
Permalink
Hi Christian,

Did you have a chance to test int8 and int4? They rely heavily on newer
SIMD instructions, especially things like AVX512, so they may face a
larger performance impact without -march=native. BTW, for recent large
language models, int4 in fact does not lose much performance[1], and
should be the default precision for running locally, since it ought to
be faster than CPU floating point anyway.

If llama.cpp really loses a lot of int4 performance without SIMD, that
would be even more demotivating, to be honest.
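
In case it helps, the comparison should be easy to script with the
llama-bench tool that ships with llama.cpp, running the same model at
different quantization levels, for example (hypothetical file names):

$ build/bin/llama-bench --threads 16 --model model-Q8_0.gguf
$ build/bin/llama-bench --threads 16 --model model-Q4_0.gguf

Doing that once on a -march=native build and once on a baseline build
should show how much the int8 and int4 paths lose without the newer
SIMD instructions.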

I'm also a llama.cpp user, through Ollama[2]'s convenient wrapping
work. Ollama itself is too complicated to consider for packaging -- I
mention it here to give you a better idea of how the ecosystem uses
llama.cpp, in case you had not seen it before.

From my point of view, llama.cpp is more suitable for source-based
distributions like Gentoo. In the past I proposed something similar for
Debian, but the community was not interested.

As for the BLAS/MKL-like approach to SIMD capability dispatching
... I bet focusing on something else is more worthwhile.

Thanks for the update!

[1] https://arxiv.org/html/2402.16775v1
[2] https://ollama.com/
Hi Cory,
Post by Cordell Bloor
I would also argue that you're taking on too much responsibility trying
to enable -march=native optimizations. It's true that you can get
significantly more performance using AVX instructions available on most
modern computers
I just tested a 3.2B model at f16, with AVX and the other SIMD features
turned off, and tokens/s went down by a factor of 25.
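
For reference, a build with those features turned off can be configured
roughly like this -- the GGML_* CMake switches below are my reading of
upstream's build options, so the exact names may differ between
releases:

$ cmake -B build -DGGML_NATIVE=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF \
      -DGGML_FMA=OFF -DGGML_F16C=OFF
$ cmake --build build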
Post by Cordell Bloor
but if llama.cpp really wanted they could implement dynamic dispatch themselves.
Upstream seems to want people to just clone, configure, and build
locally. I don't think we can infer much regarding other design choices.
Post by Cordell Bloor
Why not deliver the basics before we try to do something fancy?
[...] There's value in providing a working package to users today, even
if it's imperfect.
Performance is crippled too badly for any practical use. We can't ship
this, especially since it is so easy to use upstream.
llama.cpp is intentionally designed to be trivial to deploy: no
dependencies by default, and the simplest of all build processes. It
doesn't benefit that much from packaging, compared to other software.
The approach I plan to look into is to build and ship all backends and
make them user-selectable, similar to what Mo does for MKL.
Best,
Christian
Cordell Bloor
2024-12-28 02:40:01 UTC
Permalink
Hi Mo,
Apart from a source-based alternative distribution for Debian, "bumping
the amd64 baseline for selected packages" is another project I proposed
long ago:
https://github.com/SIMDebian/SIMDebian
Software like Eigen3 and TensorFlow can benefit heavily from the
baseline bump.
At that time PyTorch did not have dispatch, but now it does.
You probably already know this, but Ubuntu is exploring the possibility
of an x86-64-v3 variant [1][2]. That would include AVX2, FMA, and F16C
instructions (among others), essentially setting Intel Haswell (2013)
and AMD Excavator (2015) as the baseline.
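
As an aside, on systems with glibc 2.33 or newer you can ask the
dynamic loader which of these ISA levels the local CPU qualifies for,
e.g. on Debian amd64 (path and output format are glibc's, so adjust as
needed):

$ /lib64/ld-linux-x86-64.so.2 --help | grep -E 'x86-64-v[234]'

It lists each level together with whether it is supported and searched
for glibc-hwcaps libraries, which is the same mechanism a per-ISA-level
build of llama.cpp could hook into.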

Sincerely,
Cory Bloor

[1]:
https://ubuntu.com/blog/optimising-ubuntu-performance-on-amd64-architecture
[2]:
https://ubuntu.com/blog/profile-workloads-on-x86-64-v3-to-enable-future-performance-gains
Cordell Bloor
2025-01-27 09:00:01 UTC
Permalink
Hi Christian,

Could we just sidestep this whole question of native instructions by
building llama.cpp with the BLAS backend? The OpenBLAS library will do
CPU feature detection, so the parts of llama.cpp that call out to BLAS
will make good use of available vector instructions. My benchmarking
suggests that this may be sufficient to achieve reasonable CPU
performance (albeit still imperfect). To prove this, I've included some
benchmarks on my Ryzen 5950X workstation (with 64GB of DDR4 RAM @ 3600
MHz) running Debian Unstable.
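
For reproducibility: the BLAS-backend numbers below come from a build
configured along the lines of upstream's BLAS instructions, roughly
(option names as I understand them from docs/build.md):

$ cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
$ cmake --build build --config Release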

First the results of the OpenMP backend with -march=native:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |     48.63 ± 0.04 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      9.73 ± 0.05 |

The above results set the baseline for CPU performance. If we disable
all vector instructions beyond those available in x86_64, we get:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |      3.51 ± 0.00 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      3.34 ± 0.01 |

However, if we enable the BLAS backend and
install libopenblas-pthread-dev and libopenblas64-pthread-dev, it improves:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     54.64 ± 0.64 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.34 ± 0.01 |

pp512 is the prompt processing benchmark, while tg128 is the text
generation benchmark. So, you can see that this greatly improves the
prompt processing while leaving token generation unchanged.

In my opinion, this may be sufficient. When using llama.cpp as a chat
server, the entire conversation history is passed as the prompt for each
server response, so the prompt processing speed is very important.
While a 3x slowdown in text generation is not ideal, this at least
brings the model into the realm of usable. For modestly long
conversations, PP: 54 t/s and TG: 3.3 t/s may very well be faster than
PP: 48 t/s and TG: 9.7 t/s. After a single message, the prompt is going to
be at least as long as a message, and I think the 6 t/s gain in PP will
offset the 6 t/s loss in TG. From that point on, the tradeoff is a
complete win.

With that all said, a GPU implementation blows the CPU implementation
out of the water. With all host vector instructions disabled, but
hipBLAS enabled, this is what I get on my Radeon RX 6800 XT:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         pp512 |   1196.90 ± 1.27 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         tg128 |     60.28 ± 0.05 |

When compared to the CPU implementation with vector instructions, the
prompt processing is >20x faster on the GPU and the text generation is
6x faster.
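
For completeness, the hipBLAS build above was configured roughly as
follows; the option was renamed from GGML_HIPBLAS to GGML_HIP at some
point, gfx1030 is the target for the RX 6800 XT, and a ROCm clang
toolchain is assumed, so treat this as a sketch rather than the exact
invocation:

$ cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
$ cmake --build build --config Release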

Still, I know it's usable without the GPU. I set up llama-server and
spent hours chatting with it on an adventure through a fantasy world.
Only afterwards did I realize that I'd started the server without
assigning any layers to the GPU! I must have been getting PP: 48 t/s
and TG: 9.7 t/s. It was kind of slow, but still enjoyable and totally
usable.

The full suite of benchmark data is attached.

Sincerely,
Cory Bloor
Petter Reinholdtsen
2025-01-27 09:10:01 UTC
Permalink
[Cordell Bloor]
Post by Cordell Bloor
Could we just sidestep this whole question of native instructions by
building llama.cpp with the BLAS backend?
I like the idea. Perhaps something for whisper.cpp too.
--
Happy hacking
Petter Reinholdtsen
Cordell Bloor
2025-01-27 09:20:02 UTC
Permalink
Minor correction.
After a single message, the prompt is going to be at least as long as
a message, and I think the 6 t/s gain in PP will offset the 6 t/s loss
in TG. From that point on, the tradeoff is a complete win.
My brain is clearly a little fried. The implied math is nonsense. It's
not until PP and TG take the same amount of _time_ that trading 6 t/s
from TG to PP becomes a net benefit. Since PP is so much faster than TG,
that won't happen until probably 10-20 messages into the conversation.

And, frankly, I'm probably extrapolating a bit too much from a 10%
performance difference on one workstation. In any case, the important
bit was really that OpenBLAS brings the Prompt Processing step back up
to rough parity. Maybe that's enough.

Sincerely,
Cory Bloor
M. Zhou
2025-01-27 14:40:01 UTC
Permalink
Hi Cory,
Post by Cordell Bloor
Could we just sidestep this whole question of native instructions by
building llama.cpp with the BLAS backend?
I was going to ship BLAS as one of the backends, but you do raise an
interesting point: why ship the "regular" backend at all if we have BLAS
guaranteed on Debian?
BLAS itself only handles the float32, float64, complex float32, and
complex float64 datatypes, which typically carry the "s", "d", "c", and
"z" prefixes in the API. Quantized neural networks are not likely to run
in floating-point mode, but rather in integer modes like int4 and int8.

Quoted from llama.cpp documentation:
"""
Building the program with BLAS support may lead to some performance
improvements in prompt processing using batch sizes higher than 32
(the default is 512). Using BLAS doesn't affect the generation performance.
"""
https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#blas-build
And not just one, but many implementations of BLAS that can easily be
switched, thanks to Mo's work with the alternatives subsystem.
You may take libtorch2.5 as a reference: while building against
libblas-dev, we can manually recommend high-performance BLAS
implementations for the user to install:

Recommends: libopenblas0 | libblis4 | libmkl-rt | libblas3

The actual Recommends line in libtorch2.5 is outdated, so don't copy it;
please use the one above.
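
For reference, with the alternatives subsystem the administrator can
switch the BLAS provider system-wide at run time, without rebuilding
anything, roughly like this (alternative name as used by Debian's BLAS
packages on amd64):

$ sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu

A llama.cpp BLAS backend linked against the generic libblas would then
pick up whichever implementation is selected.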
Petter Reinholdtsen
2025-02-05 21:00:01 UTC
Permalink
Where can I find the draft packaging for llama.cpp now? Is there a
public git repo somewhere?
--
Happy hacking
Petter Reinholdtsen
Petter Reinholdtsen
2025-02-06 00:40:01 UTC
Permalink
[Christian Kastner]
Repo is here [1].
Very good. Built just fine here. I checked in a few minor fixes.

I noticed llama.cpp depends on llama.cpp-backend with no concrete
dependency listed first. This leads to unpredictable behaviour, and I
suggest depending on, for example, 'llama.cpp-cpu | llama.cpp-backend'
to make sure 'apt install llama.cpp' behaves predictably.
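
For illustration, the relevant control stanza could then look roughly
like this (a sketch using the package names discussed here):

Package: llama.cpp
Depends: llama.cpp-cpu | llama.cpp-backend, ${misc:Depends}

With such an alternation, 'apt install llama.cpp' pulls in the first
alternative (the CPU backend) by default, while any other package
providing llama.cpp-backend can still satisfy the dependency.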

I was sad to discover the server example is missing, as it is the
llama.cpp program I use the most. Without it, I will have to continue
using my own build.
I thought it best to upload now and fix the remaining issues above while
the package sits in NEW.
Very good.

I hope to get whisper.cpp to the same state, so it can have a fighting
chance to get into testing before the freeze.
--
Happy hacking
Petter Reinholdtsen
M. Zhou
2025-02-06 01:50:02 UTC
Permalink
Post by Petter Reinholdtsen
I was sad to discover the server example is missing, as it is the
llama.cpp program I use the most.  Without it, I will have to continue
using my own build.
I second this. llama-server is also the service endpoint for DebGPT.

I pushed a fix for ppc64el. The hwcaps mechanism works correctly for power9, given that the baseline is power8.

(chroot:unstable-ppc64el-sbuild) ***@debian-project-1 /h/d/llama.cpp.pkg [1]# ldd (which llama-cli)
linux-vdso64.so.1 (0x00007fffa4810000)
libeatmydata.so => /lib/powerpc64le-linux-gnu/libeatmydata.so (0x00007fffa4600000)
libllama.so => /usr/lib/powerpc64le-linux-gnu/llama.cpp/glibc-hwcaps/power9/libllama.so (0x00007fffa4450000)
libggml.so => /usr/lib/powerpc64le-linux-gnu/llama.cpp/glibc-hwcaps/power9/libggml.so (0x00007fffa4420000)
libggml-base.so => /usr/lib/powerpc64le-linux-gnu/llama.cpp/glibc-hwcaps/power9/libggml-base.so (0x00007fffa4330000)
libstdc++.so.6 => /lib/powerpc64le-linux-gnu/libstdc++.so.6 (0x00007fffa3fd0000)
libm.so.6 => /lib/powerpc64le-linux-gnu/libm.so.6 (0x00007fffa3e90000)
libgcc_s.so.1 => /lib/powerpc64le-linux-gnu/libgcc_s.so.1 (0x00007fffa3e50000)
libc.so.6 => /lib/powerpc64le-linux-gnu/libc.so.6 (0x00007fffa3be0000)
/lib64/ld64.so.2 (0x00007fffa4820000)
libggml-cpu.so => /usr/lib/powerpc64le-linux-gnu/llama.cpp/glibc-hwcaps/power9/libggml-cpu.so (0x00007fffa3b20000)
libgomp.so.1 => /lib/powerpc64le-linux-gnu/libgomp.so.1 (0x00007fffa3a90000)
Petter Reinholdtsen
2025-02-06 08:00:01 UTC
Permalink
[M. Zhou]
Post by M. Zhou
I second this. llama-server is also the service endpoint for DebGPT.
So, what exactly needs to happen for llama-server to be included in the
package?

I found this in d/copyright:

DFSG compliance
---------------
The server example contains a number of minified and generated files in the
frontend. These seem to be essential to the example, so the server example
has been removed entirely, for now.

I guess some build mechanics need to be included to build the minified
and generated files, but I do not know which ones are the problem.
According to examples/server/README.md, the "Web UI" is built using 'npm
run build', so I guess some nodejs dependencies are needed. Sadly, I do
not know how to convince npm not to download random stuff from the
Internet.
--
Happy hacking
Petter Reinholdtsen
M. Zhou
2025-02-06 14:20:01 UTC
Permalink
I meant to ask anyway: performance-wise, is it comparable to your local
build? I mean, I wouldn't know what in the code would alter this, but I
built and tested this on platti.d.o and performance was poor, so another
data point would be useful.
For ppc64el, the llama.cpp-blas backend is way slower than the -cpu backend.
I did not test on amd64, but on ppc64el the package does not feel different
from a local build.

The CPU is slow anyway. How does HIP perform?

phi-4-q4.gguf | power9, cpu (8-threads) | 0.62 tokens/s
phi-4-q4.gguf | amd64, 13900H | 6.7 tokens/s

GPU is way faster than this, but the phi-4 model does not fit in my
NVIDIA GPU, so no GPU numbers this time.

Petter Reinholdtsen
2025-02-06 08:30:01 UTC
Permalink
[Christian Kastner]
Look fine, though I deliberately skipped the poetry dependency for now
as it looked more like a false positive.
Aha. I just trusted lintian-brush on this one, did not investigate.
This was my intention, but I initially wasn't sure what the default
would be (-cpu or -blas). Looks like I forgot to add one before
upload.
Given that every machine it can be installed on has a CPU, but not all
of them have a supported GPU, I believe -cpu is the most sensible default.
It'll be re-enabled soon. There were a few generated and minified files
in that example, so I just opted to skip those for now and focus on
the build process.
Great to hear. :)
Seeing as how closely llama.cpp and whisper.cpp are related, in the
ideal case, you should be able to just carry over some patches, and
mostly just copy d/rules, as llama.cpp and whisper.cpp share the ggml
library on a source basis.
I hope so too, but I guess we will soon find out. My initial draft on
<URL: https://salsa.debian.org/deeplearning-team/whisper.cpp > will need
a lot of updates to bring it in line with this new approach. :)
--
Happy hacking
Petter Reinholdtsen