Implementation of Phi-3-mini-4k-instruct (Q8_0 or Q4_0). #6
Conversation
Hey Sascha, that's awesome! To avoid duplicated work, I also ported Mistral/Codestral, Gemma 1 and 2 (no sliding window attention yet), Qwen2 and Phi3.
I see you also implemented the right RoPE. Nice!
That's great! I didn't like the copies.
I could take a look at Q6_K, if there's no objection. I was planning on looking into that part anyway.
Great! I found the graphical representation of the quantized blocks quite understandable here: https://www.modular.com/blog/whats-new-in-max-24-4-max-on-macos-fast-local-llama3-native-quantization-and-gguf-support Also, a small note on performance: in the current implementation, if you mix several quantization formats, e.g. Q4_0 and Q6_K, the compiler will generate a not-so-good version of the matmul. This can be fixed with minor adjustments; it's been in my backlog for some time.
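One possible reading of that note (an assumption, not necessarily the adjustment meant above): the matmul loop calls dot(...) through a common tensor interface, so a model mixing Q4_0 and Q6_K weights makes that call site polymorphic and C2 stops inlining the hot dot kernel. A minimal sketch of this idea, with hypothetical class names, dispatches on the concrete type once so each loop body stays monomorphic:

```java
// Hypothetical sketch, not the repository's actual classes: specialize the
// matmul loop per concrete quantization type so the dot call site stays
// monomorphic and inlinable.
interface QuantTensor {
    float dot(int row, float[] x, int size);
}

final class Q4_0Tensor implements QuantTensor {
    public float dot(int row, float[] x, int size) { /* placeholder body for the sketch */ return 0f; }
}

final class Q6_KTensor implements QuantTensor {
    public float dot(int row, float[] x, int size) { /* placeholder body for the sketch */ return 0f; }
}

final class MatMul {
    // Generic variant: the dot call sees several receiver types at runtime.
    static void matmulGeneric(QuantTensor w, float[] x, float[] out, int dim, int size) {
        for (int row = 0; row < dim; row++) {
            out[row] = w.dot(row, x, size);
        }
    }

    // Specialized variant: one loop per concrete (final) type.
    static void matmul(QuantTensor w, float[] x, float[] out, int dim, int size) {
        if (w instanceof Q4_0Tensor q4) {
            for (int row = 0; row < dim; row++) out[row] = q4.dot(row, x, size);
        } else if (w instanceof Q6_KTensor q6) {
            for (int row = 0; row < dim; row++) out[row] = q6.dot(row, x, size);
        } else {
            matmulGeneric(w, x, out, dim, size);
        }
    }
}
```

The instanceof dispatch runs once per matmul call, outside the per-row loop, so its cost is negligible.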
@mukel I have a Q6_K implementation and first measurements, but I'm not really happy with them. Q8_0 with 512 bits runs at 8.2 tokens/s and Q8_0 with 256 bits at 6.1 tokens/s, but Q6_K with 256 bits runs at only 1.5 tokens/s and Q6_K with 512 bits at only 0.34 tokens/s. Initially, I assumed that 512 bits would be ideal for Q6_K quantization. Q6_K is more complicated than Q8_0 or Q4_0, so I expected it to be a bit slower, but not to this extent. The current dot-method with 256 bits:
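(The vectorDot256 code itself is not quoted above. As a reference for the bit layout such a method has to implement, here is a scalar sketch of a Q6_K x float32 dot over one 256-value super-block, following the block_q6_K layout in ggml: ql holds the low 4 bits, qh the high 2 bits, plus 16 int8 sub-block scales and a super-block scale d. The names and signature are illustrative, not the PR's code.)

```java
// Scalar reference sketch of one Q6_K super-block (256 values) dotted against
// float32 activations. Layout as in ggml's block_q6_K: ql[128] low nibbles,
// qh[64] high 2-bit pairs, scales[16] signed sub-block scales, d super-block scale.
static float dotQ6K(byte[] ql, byte[] qh, byte[] scales, float d, float[] x, int xOff) {
    float sum = 0f;
    int qlOff = 0, qhOff = 0, scOff = 0;
    for (int chunk = 0; chunk < 256; chunk += 128) {      // two chunks of 128 values
        for (int l = 0; l < 32; ++l) {
            int is = l / 16;                              // sub-block index within this group of 32
            int q0 = ((ql[qlOff + l]      & 0x0F) | (((qh[qhOff + l] >> 0) & 3) << 4)) - 32;
            int q1 = ((ql[qlOff + l + 32] & 0x0F) | (((qh[qhOff + l] >> 2) & 3) << 4)) - 32;
            int q2 = (((ql[qlOff + l]      & 0xFF) >> 4) | (((qh[qhOff + l] >> 4) & 3) << 4)) - 32;
            int q3 = (((ql[qlOff + l + 32] & 0xFF) >> 4) | (((qh[qhOff + l] >> 6) & 3) << 4)) - 32;
            sum += d * scales[scOff + is    ] * q0 * x[xOff + chunk + l];
            sum += d * scales[scOff + is + 2] * q1 * x[xOff + chunk + l + 32];
            sum += d * scales[scOff + is + 4] * q2 * x[xOff + chunk + l + 64];
            sum += d * scales[scOff + is + 6] * q3 * x[xOff + chunk + l + 96];
        }
        qlOff += 64; qhOff += 32; scOff += 8;
    }
    return sum;
}
```

Compared to Q8_0, every value needs an extra load from qh plus shift/mask/or work before the multiply, which already accounts for part of the gap.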
But perhaps there is a better layout of the bit operations :-).
Are you on Apple silicon or Intel?
You should not store vectors in arrays or fields, otherwise they get materialized and thus slow. It may work, but I wouldn't trust C2 escape analysis here.
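A hypothetical before/after illustration of that advice (made-up names, not code from the PR): keep the FloatVector accumulators in plain locals so C2 can keep them in registers, instead of in a FloatVector[] where they may be materialized as heap objects.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class AccumulatorStyle {
    private static final VectorSpecies<Float> F = FloatVector.SPECIES_256; // 8 float lanes

    // Risky pattern: vector accumulators held in an array may be materialized.
    static float dotWithArray(float[] a, float[] b) {       // assumes a.length % 16 == 0
        FloatVector[] acc = { FloatVector.zero(F), FloatVector.zero(F) };
        for (int i = 0; i < a.length; i += 2 * F.length()) {
            acc[0] = FloatVector.fromArray(F, a, i).fma(FloatVector.fromArray(F, b, i), acc[0]);
            acc[1] = FloatVector.fromArray(F, a, i + F.length())
                                .fma(FloatVector.fromArray(F, b, i + F.length()), acc[1]);
        }
        return acc[0].add(acc[1]).reduceLanes(VectorOperators.ADD);
    }

    // Preferred pattern: plain locals, no array or field indirection.
    static float dotWithLocals(float[] a, float[] b) {       // assumes a.length % 16 == 0
        FloatVector acc0 = FloatVector.zero(F);
        FloatVector acc1 = FloatVector.zero(F);
        for (int i = 0; i < a.length; i += 2 * F.length()) {
            acc0 = FloatVector.fromArray(F, a, i).fma(FloatVector.fromArray(F, b, i), acc0);
            acc1 = FloatVector.fromArray(F, a, i + F.length())
                              .fma(FloatVector.fromArray(F, b, i + F.length()), acc1);
        }
        return acc0.add(acc1).reduceLanes(VectorOperators.ADD);
    }
}
```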
I'm on Intel (i7-11800H, openjdk 21.0.2 2024-01-16) and AMD (Ryzen 9 7900X, openjdk 22 2024-03-19).
The vectorDot512 method didn't use arrays, but vectorDot256 and vectorDot128 of Q6_K used arrays. I will try without arrays. The JVM reported S_512_BIT as the preferred size. I'll have to check the code to see if I missed something.
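For reference, the preferred vector width can be queried directly; a tiny standalone check (assuming the incubator module is added at launch) looks like this:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.FloatVector;

// Prints the species the JVM prefers on this machine, e.g. "Species[float, 16, S_512_BIT]"
// on an AVX-512-capable CPU.
// Run with: java --add-modules jdk.incubator.vector PreferredSpecies.java
public class PreferredSpecies {
    public static void main(String[] args) {
        System.out.println("float: " + FloatVector.SPECIES_PREFERRED);
        System.out.println("byte : " + ByteVector.SPECIES_PREFERRED);
    }
}
```

The reported width depends on the CPU and on JVM flags (on x86, e.g. -XX:UseAVX), so the preferred species alone doesn't guarantee that 512-bit kernels are the fastest choice.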
@mukel Do you plan to distinguish between llama3.java (simple) and llama3.java (extended)? The first one would be a nice 2,000-liner. The second one could have some extensions (Q6_K, server, ...).
The goal is to promote Llama3.java into a larger effort, e.g. https://github.com/llama4j, to implement more models in the same place and share common parts.
Hi @srogmann, can you share the Q6_K implementations (all the variants)?
Hi @mukel, "can you share the Q6_K implementations (all the variants)?" Yes, I will share the Q6_K implementation(s) (it's on another device). Some remarks: I had a look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml.c because I wasn't satisfied with the performance of my Q6_K implementation. In the architecture-specific AVX2 section of the function ggml_vec_dot_q6_K_q8_K in https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-quants.c there is only a 256-bit version.
So I'm not disappointed that my 512-bit implementation was not as fast as I hoped. But I was surprised to see that the second factor of the dot product in ggml_vec_dot_q6_K_q8_K is Q8_K, not FLOAT32. This gives the Q6_K dot implementation more performance, and its tail is more compact.
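For context, a minimal sketch of the idea behind a Q8_K second factor (an illustration, not ggml's code: the float activation row is quantized on the fly into 256-value int8 blocks with one float scale each; ggml's block_q8_K additionally stores per-16-value sums, omitted here):

```java
// Quantize one 256-value block of float activations to int8 with a single
// float scale, so the subsequent Q6_K dot can run on integer products and
// apply the float scales once per block instead of once per element.
static float quantizeBlockQ8(float[] x, int off, byte[] q8, int q8Off) {
    float amax = 0f;
    for (int i = 0; i < 256; i++) {
        amax = Math.max(amax, Math.abs(x[off + i]));
    }
    float d = amax / 127f;                  // block scale
    float id = (d == 0f) ? 0f : 1f / d;     // inverse scale, guard against an all-zero block
    for (int i = 0; i < 256; i++) {
        q8[q8Off + i] = (byte) Math.round(x[off + i] * id);
    }
    return d;                               // caller stores d next to the int8 block
}
```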
I'm very surprised as well. EDIT: Confirmed: ggml-org/llama.cpp#3477
I remember reading somewhere that tensors with dimensions % 256 != 0 were problematic; this may be an explanation.
This is a PR to run Phi-3-Mini-4K. It only includes Phi3.java. I wrote this file because Phi-3 is faster at simple tasks. I intentionally left Llama3.java unchanged, even though some synergies could have been achieved. The current Llama3.java is a beautiful, complete example of a transformer model; it would be a shame to clutter it with additions for Phi-3, even though they would be beneficial for reusability. The nice thing is that, thanks to the roughly 2,000 lines in Llama3.java, Phi-3 can be added with only about 800 lines (or a bit less if the debug lines were removed). But Phi-3 is not Llama-3, so it would be understandable if you decided that Phi-3 does not belong here. Because who knows where this ends (Gemma-2 is also interesting ;-) ).