Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants #14903

Draft: 0cc4m wants to merge 1 commit into master

Conversation

@0cc4m (Collaborator) commented Jul 27, 2025

Here's an initial version of an Integer Dot mul_mat_vec shader. So far it seems to improve performance with q4_1 and q5_1, but reduce it with q4_0, q5_0 and q8_0. My guess is that this is because q4_1 and q5_1 use 32-bit loads, while the rest use 16-bit loads.
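
The shader quantizes the f32 activations to q8_1 and then accumulates the dot product with packed integer dot instructions instead of unpacking quants to floats. Roughly, the inner loop looks like this (a simplified sketch with illustrative names, not the exact shader code):

```glsl
#extension GL_EXT_integer_dot_product : require
#extension GL_EXT_control_flow_attributes : require

// One 32-quant sub-block: qs_a and qs_b each pack 32 int8 values into
// 8 x 32-bit words. The float scales are applied once per block by the caller.
int block_dot(const int qs_a[8], const int qs_b[8]) {
    int acc = 0;
    [[unroll]] for (uint k = 0; k < 8; k++) {
        acc = dotPacked4x8AccSatEXT(qs_a[k], qs_b[k], acc); // 4 int8 MACs per call
    }
    return acc;
}
```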

@jeffbolznv Would you mind taking a look and letting me know if I have any obvious performance issues in the shader?

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 27, 2025
@0cc4m (Collaborator Author) commented Jul 27, 2025

Here are performance results from my tests:

AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.01 us/run - 134.48 MFLOP/run - 412.51 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.52 us/run - 134.48 MFLOP/run - 489.87 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    95.15 us/run - 117.44 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   114.44 us/run - 117.44 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   136.38 us/run - 117.44 MFLOP/run - 861.11 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   149.87 us/run - 117.44 MFLOP/run - 783.61 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.03 us/run - 117.44 MFLOP/run - 782.80 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   121.87 us/run - 234.88 MFLOP/run -   1.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.40 us/run - 234.88 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   166.30 us/run - 234.88 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   206.09 us/run - 234.88 MFLOP/run -   1.14 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.76 us/run - 234.88 MFLOP/run -   1.19 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.56 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4544 runs -   229.63 us/run - 352.32 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5396 runs -   189.94 us/run - 352.32 MFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   259.13 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   258.81 us/run - 352.32 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.43 us/run - 469.76 MFLOP/run -   2.52 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3621 runs -   278.23 us/run - 469.76 MFLOP/run -   1.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   218.20 us/run - 469.76 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   307.29 us/run - 469.76 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2769 runs -   382.97 us/run - 469.76 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4617 runs -   224.90 us/run - 587.20 MFLOP/run -   2.61 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3078 runs -   330.95 us/run - 587.20 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4104 runs -   250.29 us/run - 587.20 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   365.23 us/run - 587.20 MFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   452.07 us/run - 587.20 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.45 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   682.41 us/run - 939.52 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   335.38 us/run - 939.52 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1391 runs -   725.50 us/run - 939.52 MFLOP/run -   1.30 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1498 runs -   677.66 us/run - 939.52 MFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7371.35 us/run -  60.13 GFLOP/run -   8.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7697.38 us/run -  60.13 GFLOP/run -   7.81 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7584.95 us/run -  60.13 GFLOP/run -   7.93 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7931.54 us/run -  60.13 GFLOP/run -   7.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8015.00 us/run -  60.13 GFLOP/run -   7.50 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 3720 runs -   326.21 us/run - 134.48 MFLOP/run - 412.25 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   274.08 us/run - 134.48 MFLOP/run - 490.66 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   129.72 us/run - 117.44 MFLOP/run - 905.32 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    62.43 us/run - 117.44 MFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.69 us/run - 117.44 MFLOP/run - 754.32 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    83.28 us/run - 117.44 MFLOP/run -   1.41 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   216.83 us/run - 117.44 MFLOP/run - 541.62 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6390 runs -   165.83 us/run - 234.88 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.15 us/run - 234.88 MFLOP/run -   3.35 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   200.41 us/run - 234.88 MFLOP/run -   1.17 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    92.60 us/run - 234.88 MFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4686 runs -   232.55 us/run - 234.88 MFLOP/run -   1.01 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   150.32 us/run - 352.32 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11360 runs -    89.56 us/run - 352.32 MFLOP/run -   3.93 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   196.72 us/run - 352.32 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   111.35 us/run - 352.32 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3976 runs -   254.72 us/run - 352.32 MFLOP/run -   1.38 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5751 runs -   175.38 us/run - 469.76 MFLOP/run -   2.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.33 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4899 runs -   206.11 us/run - 469.76 MFLOP/run -   2.28 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   133.48 us/run - 469.76 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   267.06 us/run - 469.76 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5130 runs -   199.10 us/run - 587.20 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6840 runs -   147.29 us/run - 587.20 MFLOP/run -   3.99 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4446 runs -   228.99 us/run - 587.20 MFLOP/run -   2.56 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   186.59 us/run - 587.20 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3420 runs -   296.54 us/run - 587.20 MFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   205.31 us/run - 939.52 MFLOP/run -   4.58 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7276 runs -   138.46 us/run - 939.52 MFLOP/run -   6.79 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4173 runs -   245.35 us/run - 939.52 MFLOP/run -   3.83 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6313 runs -   160.81 us/run - 939.52 MFLOP/run -   5.84 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3210 runs -   318.22 us/run - 939.52 MFLOP/run -   2.95 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      136 runs -  7386.12 us/run -  60.13 GFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      130 runs -  7693.49 us/run -  60.13 GFLOP/run -   7.82 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      132 runs -  7594.42 us/run -  60.13 GFLOP/run -   7.92 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      128 runs -  7918.03 us/run -  60.13 GFLOP/run -   7.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      126 runs -  8004.06 us/run -  60.13 GFLOP/run -   7.51 TFLOPS

Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   106.14 us/run - 134.48 MFLOP/run -   1.27 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   297.67 us/run - 134.48 MFLOP/run - 451.77 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   147.62 us/run - 117.44 MFLOP/run - 795.55 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   158.42 us/run - 117.44 MFLOP/run - 741.31 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2556 runs -   559.94 us/run - 117.44 MFLOP/run - 209.74 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   198.08 us/run - 117.44 MFLOP/run - 592.89 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   816.05 us/run - 117.44 MFLOP/run - 143.91 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   155.66 us/run - 234.88 MFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   185.73 us/run - 234.88 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   483.76 us/run - 234.88 MFLOP/run - 485.54 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   201.83 us/run - 234.88 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1278 runs -   953.98 us/run - 234.88 MFLOP/run - 246.21 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6248 runs -   165.98 us/run - 352.32 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   210.20 us/run - 352.32 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1988 runs -   513.99 us/run - 352.32 MFLOP/run - 685.46 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4828 runs -   218.03 us/run - 352.32 MFLOP/run -   1.62 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1704 runs -   648.93 us/run - 352.32 MFLOP/run - 542.93 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   186.04 us/run - 469.76 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   265.17 us/run - 469.76 MFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2130 runs -   505.40 us/run - 469.76 MFLOP/run - 929.49 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4047 runs -   258.71 us/run - 469.76 MFLOP/run -   1.82 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1491 runs -   673.07 us/run - 469.76 MFLOP/run - 697.94 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3249 runs -   308.76 us/run - 587.20 MFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   465.28 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1710 runs -   619.83 us/run - 587.20 MFLOP/run - 947.36 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2223 runs -   477.48 us/run - 587.20 MFLOP/run -   1.23 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1197 runs -   931.89 us/run - 587.20 MFLOP/run - 630.12 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3103 runs -   330.52 us/run - 939.52 MFLOP/run -   2.84 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2247 runs -   462.68 us/run - 939.52 MFLOP/run -   2.03 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               1712 runs -   589.40 us/run - 939.52 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2140 runs -   470.27 us/run - 939.52 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                963 runs -  1085.13 us/run - 939.52 MFLOP/run - 865.81 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5539.21 us/run -  60.13 GFLOP/run -  10.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      184 runs -  5460.43 us/run -  60.13 GFLOP/run -  11.01 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5796.34 us/run -  60.13 GFLOP/run -  10.37 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      172 runs -  5816.45 us/run -  60.13 GFLOP/run -  10.34 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6317.52 us/run -  60.13 GFLOP/run -   9.52 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 9672 runs -   105.39 us/run - 134.48 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 3720 runs -   300.54 us/run - 134.48 MFLOP/run - 447.46 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5112 runs -   232.85 us/run - 117.44 MFLOP/run - 504.37 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   127.81 us/run - 117.44 MFLOP/run - 918.88 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   252.01 us/run - 117.44 MFLOP/run - 466.01 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   153.16 us/run - 117.44 MFLOP/run - 766.79 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4260 runs -   253.84 us/run - 117.44 MFLOP/run - 462.65 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   288.94 us/run - 234.88 MFLOP/run - 812.90 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9372 runs -   110.96 us/run - 234.88 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   317.45 us/run - 234.88 MFLOP/run - 739.90 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   135.61 us/run - 234.88 MFLOP/run -   1.73 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3834 runs -   264.55 us/run - 234.88 MFLOP/run - 887.85 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   297.55 us/run - 352.32 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7668 runs -   132.35 us/run - 352.32 MFLOP/run -   2.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3124 runs -   339.23 us/run - 352.32 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6532 runs -   154.97 us/run - 352.32 MFLOP/run -   2.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3692 runs -   275.87 us/run - 352.32 MFLOP/run -   1.28 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3195 runs -   316.93 us/run - 469.76 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6816 runs -   146.76 us/run - 469.76 MFLOP/run -   3.20 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2982 runs -   352.12 us/run - 469.76 MFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5538 runs -   181.20 us/run - 469.76 MFLOP/run -   2.59 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3408 runs -   305.57 us/run - 469.76 MFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3762 runs -   273.06 us/run - 587.20 MFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5643 runs -   179.14 us/run - 587.20 MFLOP/run -   3.28 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2736 runs -   369.60 us/run - 587.20 MFLOP/run -   1.59 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   212.93 us/run - 587.20 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2907 runs -   361.02 us/run - 587.20 MFLOP/run -   1.63 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2568 runs -   400.11 us/run - 939.52 MFLOP/run -   2.35 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3424 runs -   300.82 us/run - 939.52 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2354 runs -   435.22 us/run - 939.52 MFLOP/run -   2.16 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2996 runs -   337.42 us/run - 939.52 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2782 runs -   371.29 us/run - 939.52 MFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5502.12 us/run -  60.13 GFLOP/run -  10.93 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      182 runs -  5522.41 us/run -  60.13 GFLOP/run -  10.89 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      174 runs -  5776.55 us/run -  60.13 GFLOP/run -  10.41 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      166 runs -  6064.83 us/run -  60.13 GFLOP/run -   9.91 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      160 runs -  6308.83 us/run -  60.13 GFLOP/run -   9.53 TFLOPS

Nvidia RTX 3090
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.56 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 7440 runs -   134.50 us/run - 134.48 MFLOP/run - 999.84 GFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.24 us/run - 117.44 MFLOP/run -   2.38 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    54.12 us/run - 117.44 MFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    69.91 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.77 us/run - 117.44 MFLOP/run -   1.66 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    82.06 us/run - 117.44 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.82 us/run - 234.88 MFLOP/run -   3.80 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13206 runs -    77.28 us/run - 234.88 MFLOP/run -   3.04 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12354 runs -    82.16 us/run - 234.88 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    94.23 us/run - 234.88 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10650 runs -    95.96 us/run - 234.88 MFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    77.12 us/run - 352.32 MFLOP/run -   4.57 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10508 runs -    96.38 us/run - 352.32 MFLOP/run -   3.66 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10792 runs -    94.85 us/run - 352.32 MFLOP/run -   3.71 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9088 runs -   112.82 us/run - 352.32 MFLOP/run -   3.12 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7952 runs -   126.59 us/run - 352.32 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10863 runs -    93.34 us/run - 469.76 MFLOP/run -   5.03 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8733 runs -   115.35 us/run - 469.76 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8946 runs -   112.26 us/run - 469.76 MFLOP/run -   4.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7455 runs -   136.60 us/run - 469.76 MFLOP/run -   3.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6603 runs -   156.48 us/run - 469.76 MFLOP/run -   3.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9063 runs -   111.42 us/run - 587.20 MFLOP/run -   5.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7353 runs -   138.83 us/run - 587.20 MFLOP/run -   4.23 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   127.26 us/run - 587.20 MFLOP/run -   4.61 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6498 runs -   156.34 us/run - 587.20 MFLOP/run -   3.76 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5472 runs -   185.98 us/run - 587.20 MFLOP/run -   3.16 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6099 runs -   165.53 us/run - 939.52 MFLOP/run -   5.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4708 runs -   213.55 us/run - 939.52 MFLOP/run -   4.40 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5671 runs -   179.37 us/run - 939.52 MFLOP/run -   5.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4387 runs -   229.11 us/run - 939.52 MFLOP/run -   4.10 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               3745 runs -   274.08 us/run - 939.52 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      904 runs -  1108.01 us/run -  60.13 GFLOP/run -  54.27 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      860 runs -  1164.53 us/run -  60.13 GFLOP/run -  51.63 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1361.15 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      736 runs -  1360.98 us/run -  60.13 GFLOP/run -  44.18 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      912 runs -  1097.27 us/run -  60.13 GFLOP/run -  54.80 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                11160 runs -    94.68 us/run - 134.48 MFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                 8184 runs -   130.28 us/run - 134.48 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    50.12 us/run - 117.44 MFLOP/run -   2.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    48.13 us/run - 117.44 MFLOP/run -   2.44 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.03 us/run - 117.44 MFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.74 us/run - 117.44 MFLOP/run -   2.07 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.46 us/run - 117.44 MFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.08 us/run - 234.88 MFLOP/run -   4.99 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20448 runs -    49.93 us/run - 234.88 MFLOP/run -   4.70 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.08 us/run - 234.88 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    58.47 us/run - 234.88 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    88.02 us/run - 234.88 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.74 us/run - 352.32 MFLOP/run -   6.94 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19596 runs -    51.30 us/run - 352.32 MFLOP/run -   6.87 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15904 runs -    63.94 us/run - 352.32 MFLOP/run -   5.51 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16472 runs -    61.01 us/run - 352.32 MFLOP/run -   5.77 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11076 runs -    91.62 us/run - 352.32 MFLOP/run -   3.85 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17892 runs -    56.33 us/run - 469.76 MFLOP/run -   8.34 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17466 runs -    57.69 us/run - 469.76 MFLOP/run -   8.14 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15123 runs -    66.30 us/run - 469.76 MFLOP/run -   7.09 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.62 us/run - 469.76 MFLOP/run -   7.27 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10437 runs -    97.62 us/run - 469.76 MFLOP/run -   4.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15732 runs -    63.62 us/run - 587.20 MFLOP/run -   9.23 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.62 us/run - 587.20 MFLOP/run -   9.53 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.60 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14535 runs -    69.57 us/run - 587.20 MFLOP/run -   8.44 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9576 runs -   104.78 us/run - 587.20 MFLOP/run -   5.60 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12947 runs -    77.25 us/run - 939.52 MFLOP/run -  12.16 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.66 us/run - 939.52 MFLOP/run -  11.10 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11877 runs -    84.27 us/run - 939.52 MFLOP/run -  11.15 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11342 runs -    88.87 us/run - 939.52 MFLOP/run -  10.57 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7597 runs -   133.14 us/run - 939.52 MFLOP/run -   7.06 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      842 runs -  1187.83 us/run -  60.13 GFLOP/run -  50.62 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      784 runs -  1277.27 us/run -  60.13 GFLOP/run -  47.08 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      762 runs -  1313.98 us/run -  60.13 GFLOP/run -  45.76 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      738 runs -  1355.59 us/run -  60.13 GFLOP/run -  44.36 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      924 runs -  1083.58 us/run -  60.13 GFLOP/run -  55.49 TFLOPS

AMD RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared 

Master:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   145.62 us/run - 134.48 MFLOP/run - 923.47 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20088 runs -    50.37 us/run - 134.48 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21300 runs -    47.14 us/run - 117.44 MFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    55.37 us/run - 117.44 MFLOP/run -   2.12 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14484 runs -    70.00 us/run - 117.44 MFLOP/run -   1.68 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13632 runs -    74.29 us/run - 117.44 MFLOP/run -   1.58 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.72 us/run - 117.44 MFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16188 runs -    61.98 us/run - 234.88 MFLOP/run -   3.79 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12780 runs -    78.87 us/run - 234.88 MFLOP/run -   2.98 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11928 runs -    86.15 us/run - 234.88 MFLOP/run -   2.73 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    98.12 us/run - 234.88 MFLOP/run -   2.39 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11502 runs -    89.74 us/run - 234.88 MFLOP/run -   2.62 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13064 runs -    76.56 us/run - 352.32 MFLOP/run -   4.60 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9940 runs -   102.12 us/run - 352.32 MFLOP/run -   3.45 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -   100.07 us/run - 352.32 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   123.05 us/run - 352.32 MFLOP/run -   2.86 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8236 runs -   122.62 us/run - 352.32 MFLOP/run -   2.87 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10224 runs -    99.78 us/run - 469.76 MFLOP/run -   4.71 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               8520 runs -   119.36 us/run - 469.76 MFLOP/run -   3.94 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9159 runs -   110.68 us/run - 469.76 MFLOP/run -   4.24 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7242 runs -   139.27 us/run - 469.76 MFLOP/run -   3.37 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5964 runs -   167.74 us/run - 469.76 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7866 runs -   128.65 us/run - 587.20 MFLOP/run -   4.56 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               7011 runs -   144.22 us/run - 587.20 MFLOP/run -   4.07 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6669 runs -   150.20 us/run - 587.20 MFLOP/run -   3.91 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               6327 runs -   161.58 us/run - 587.20 MFLOP/run -   3.63 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4788 runs -   211.00 us/run - 587.20 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               5029 runs -   200.80 us/run - 939.52 MFLOP/run -   4.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4922 runs -   206.88 us/run - 939.52 MFLOP/run -   4.54 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4280 runs -   233.96 us/run - 939.52 MFLOP/run -   4.02 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               4494 runs -   225.62 us/run - 939.52 MFLOP/run -   4.16 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               2675 runs -   386.25 us/run - 939.52 MFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      348 runs -  2882.03 us/run -  60.13 GFLOP/run -  20.86 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      354 runs -  2837.71 us/run -  60.13 GFLOP/run -  21.19 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      342 runs -  2934.56 us/run -  60.13 GFLOP/run -  20.49 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      336 runs -  2993.35 us/run -  60.13 GFLOP/run -  20.09 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      306 runs -  3282.89 us/run -  60.13 GFLOP/run -  18.32 TFLOPS

PR:
  MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0):                 7440 runs -   142.46 us/run - 134.48 MFLOP/run - 943.97 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1):                20832 runs -    48.66 us/run - 134.48 MFLOP/run -   2.76 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    33.86 us/run - 117.44 MFLOP/run -   3.47 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              36636 runs -    27.51 us/run - 117.44 MFLOP/run -   4.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    41.87 us/run - 117.44 MFLOP/run -   2.80 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              29820 runs -    34.24 us/run - 117.44 MFLOP/run -   3.43 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              23004 runs -    44.41 us/run - 117.44 MFLOP/run -   2.64 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              21726 runs -    46.39 us/run - 234.88 MFLOP/run -   5.06 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              35358 runs -    28.40 us/run - 234.88 MFLOP/run -   8.27 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18744 runs -    53.68 us/run - 234.88 MFLOP/run -   4.38 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              26412 runs -    38.09 us/run - 234.88 MFLOP/run -   6.17 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.58 us/run - 234.88 MFLOP/run -   4.83 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19028 runs -    52.74 us/run - 352.32 MFLOP/run -   6.68 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              24708 runs -    40.71 us/run - 352.32 MFLOP/run -   8.65 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              17040 runs -    58.88 us/run - 352.32 MFLOP/run -   5.98 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              19880 runs -    50.48 us/run - 352.32 MFLOP/run -   6.98 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              18176 runs -    55.56 us/run - 352.32 MFLOP/run -   6.34 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16401 runs -    61.21 us/run - 469.76 MFLOP/run -   7.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              20874 runs -    48.35 us/run - 469.76 MFLOP/run -   9.72 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14271 runs -    70.92 us/run - 469.76 MFLOP/run -   6.62 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15549 runs -    64.84 us/run - 469.76 MFLOP/run -   7.24 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              15762 runs -    63.88 us/run - 469.76 MFLOP/run -   7.35 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              16245 runs -    61.77 us/run - 587.20 MFLOP/run -   9.51 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14877 runs -    67.57 us/run - 587.20 MFLOP/run -   8.69 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              14022 runs -    71.52 us/run - 587.20 MFLOP/run -   8.21 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              12312 runs -    81.39 us/run - 587.20 MFLOP/run -   7.21 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              13680 runs -    73.56 us/run - 587.20 MFLOP/run -   7.98 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9844 runs -   102.64 us/run - 939.52 MFLOP/run -   9.15 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              11021 runs -    91.60 us/run - 939.52 MFLOP/run -  10.26 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9202 runs -   108.77 us/run - 939.52 MFLOP/run -   8.64 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):               9095 runs -   110.57 us/run - 939.52 MFLOP/run -   8.50 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):              10486 runs -    95.77 us/run - 939.52 MFLOP/run -   9.81 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      362 runs -  2774.96 us/run -  60.13 GFLOP/run -  21.67 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      356 runs -  2815.14 us/run -  60.13 GFLOP/run -  21.36 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      338 runs -  2968.24 us/run -  60.13 GFLOP/run -  20.26 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      326 runs -  3080.20 us/run -  60.13 GFLOP/run -  19.52 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      292 runs -  3442.73 us/run -  60.13 GFLOP/run -  17.47 TFLOPS

Review thread on this shader snippet:

```glsl
const uint b_block_idx = (j*p.batch_stride_b + col) / QUANT_K_Q8_1 + b_offset;
cache_b_ds = vec2(data_b[b_block_idx].ds);
[[unroll]] for (uint k = 0; k < 8; k++) {
    cache_b_qs[k] = data_b[b_block_idx].qs[k];
}
```

Collaborator:

You need a barrier after these shared memory stores, and either after the loads or before the stores for the next iteration.

Seems like you can cut down the loads by having the first 8 threads each do one of the iterations. And ds could just go straight to registers rather than the extra copy through shared memory.

Collaborator Author:

That's not shared memory.

Collaborator:

Oops. Maybe it's worth loading the qs values through shared memory? If the issue is with too many small loads like you suggested, then copying through shared memory ought to help.
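
Something like this is what I have in mind (rough sketch, assuming all invocations in the workgroup read the same b block):

```glsl
shared uint sh_b_qs[8];

// The first 8 invocations each fetch one 4-byte word of the q8_1 quants.
if (gl_LocalInvocationID.x < 8) {
    sh_b_qs[gl_LocalInvocationID.x] = data_b[b_block_idx].qs[gl_LocalInvocationID.x];
}
barrier(); // stores must be visible before any invocation reads sh_b_qs

// ... all invocations read sh_b_qs[k] instead of data_b[b_block_idx].qs[k] ...

barrier(); // and reads must finish before the next block overwrites sh_b_qs
```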

Collaborator:

Actually, I guess I can't tell if the b_block_idx value is shared between threads. So maybe this idea doesn't work.

Collaborator:

Another idea might be to add padding to the q8_1 struct so you can do 16B loads rather than 4B loads.
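
Roughly like this (a hypothetical layout, the exact packing is up for grabs): the plain q8_1 block is 36 bytes, 4 bytes of f16 scales plus 32 quants, so qs can only be fetched with 4-byte loads. Padding the header to 16 bytes makes the quants two aligned 16-byte loads, at the cost of a third more memory traffic for the quantized activations.

```glsl
// Hypothetical padded q8_1 block: 48 bytes instead of 36.
struct block_q8_1_padded {
    f16vec2 ds;      // d and s, as in the existing block
    uint    pad[3];  // pad the header from 4 to 16 bytes
    uvec4   qs[2];   // 32 int8 quants, now two aligned uvec4 loads
};
```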

Collaborator Author:

Yeah, that might be worth it. I know the CUDA backend stacks four q8_1 blocks in a struct for that reason.
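
In GLSL the equivalent packing might look something like this (a sketch of the concept only; the CUDA layout details may differ):

```glsl
// Hypothetical 4-block grouping: all 128 quants stay contiguous and 16-byte
// aligned, with no per-block header padding.
struct block_q8_1_x4 {
    uvec4   qs[8];  // 4 x 32 int8 quants as eight 16-byte vectors
    f16vec2 ds[4];  // (d, s) scale pair for each of the four blocks
};
```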

@jeffbolznv (Collaborator) commented:

I did a quick before/after on some Q4_0 models, and it looks like the quantization is pretty expensive:

master:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        365.51 ± 1.33 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        364.74 ± 3.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        236.24 ± 7.06 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        237.61 ± 1.79 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.41 ± 0.87 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         60.44 ± 0.15 |

PR:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        340.06 ± 1.73 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        339.06 ± 2.71 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |       224.50 ± 10.15 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        227.18 ± 1.44 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.65 ± 0.07 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         57.67 ± 0.11 |

PR with quantize call removed:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128 -r 10 --prio 1 -m c:\models\Llama-3.2-3B-Instruct-Q4_0.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\GLM-4-32B-0414-Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        372.26 ± 1.13 |
| llama 3B Q4_0                  |   1.78 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        370.48 ± 3.75 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        242.30 ± 3.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        243.00 ± 1.00 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.49 ± 0.16 |
| glm4 32B Q4_0                  |  17.35 GiB |    32.57 B | Vulkan     |  99 |  1 |           tg128 |         59.28 ± 0.14 |

I don't think there's anything particularly wrong with how the quantization is implemented; it's such a small amount of work that it doesn't fill the GPU, and the 5090 is just about the worst case for that. I don't have any great suggestions for what to do about this.

@0cc4m (Collaborator Author) commented Jul 28, 2025

Yeah, I also see that. We might have to pick a threshold above which the quantize + integer dot shader path is worth using. Even without further tuning, there are definitely cases where it helps, for example batch sizes 4 and 8 on the RX 6800 XT:

Master:

|  PP |  TG |  B |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |     T s |  S t/s |
|----:|----:|---:|------:|-------:|---------:|-------:|---------:|--------:|-------:|
| 512 | 512 |  1 |  1024 |  0.372 |  1378.10 |  6.499 |    78.78 |   6.871 | 149.04 |
| 512 | 512 |  2 |  2048 |  0.734 |  1394.93 | 11.341 |    90.29 |  12.075 | 169.60 |
| 512 | 512 |  4 |  4096 |  1.551 |  1320.62 | 18.337 |   111.69 |  19.887 | 205.96 |
| 512 | 512 |  8 |  8192 |  3.499 |  1170.69 | 34.641 |   118.24 |  38.139 | 214.79 |
| 512 | 512 | 16 | 16384 |  8.295 |   987.59 | 59.502 |   137.68 |  67.797 | 241.66 |
| 512 | 512 | 32 | 32768 | 21.548 |   760.35 | 85.820 |   190.91 | 107.368 | 305.19 |

PR:

|  PP |  TG |  B |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |     T s |  S t/s |
|----:|----:|---:|------:|-------:|---------:|-------:|---------:|--------:|-------:|
| 512 | 512 |  1 |  1024 |  0.372 |  1376.71 |  6.980 |    73.35 |   7.352 | 139.28 |
| 512 | 512 |  2 |  2048 |  0.721 |  1420.49 | 11.889 |    86.13 |  12.610 | 162.42 |
| 512 | 512 |  4 |  4096 |  1.562 |  1311.47 | 17.186 |   119.17 |  18.747 | 218.49 |
| 512 | 512 |  8 |  8192 |  3.482 |  1176.48 | 29.917 |   136.91 |  33.398 | 245.28 |
| 512 | 512 | 16 | 16384 |  8.253 |   992.55 | 59.530 |   137.61 |  67.783 | 241.71 |
| 512 | 512 | 32 | 32768 | 21.490 |   762.41 | 85.655 |   191.28 | 107.145 | 305.83 |
