Very nice speedups for people running CPU inference on supported hardware, but unfortunately does not help CPU+GPU split according to comment on one of the PRs… That person says that for prompt evaluation, where these kernels would make a difference, llama.cpp performs all the calculations on the GPU. And during token generation it is IO-bound, so the faster CPU calculation becomes negligible.
Very nice speedups for people running CPU inference on supported hardware, but unfortunately does not help CPU+GPU split according to comment on one of the PRs… That person says that for prompt evaluation, where these kernels would make a difference, llama.cpp performs all the calculations on the GPU. And during token generation it is IO-bound, so the faster CPU calculation becomes negligible.