“This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing.”

This means a major speed increase for people like me who rely on (slow) CPU inference (or big models). Consider a chatbot scenario with a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096-token) context size. Previously, the context had to be recomputed starting with the first changed or now-missing token. This feature detects that case, deletes the affected tokens from the KV cache, and shifts the subsequent tokens in the KV cache so they can be reused, avoiding a computationally expensive recalculation.
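To make the idea concrete, here is a rough toy sketch in Python. It only works on plain lists of token IDs and is not KoboldCPP's actual code; the function name `plan_context_shift` is made up for illustration. A real implementation also has to adjust the positions of the shifted cache entries, which this sketch ignores.

```python
def plan_context_shift(cached, new):
    """Toy model of context shifting (hypothetical helper, not a real API).

    cached: token IDs whose keys/values are already in the KV cache
    new:    token IDs of the prompt the frontend sends now
    Returns (kept, to_evaluate): tokens reused from the cache and tokens
    that still have to be run through the model.
    """
    # 1. The common prefix (system prompt, character card, ...) stays put.
    prefix = 0
    while prefix < min(len(cached), len(new)) and cached[prefix] == new[prefix]:
        prefix += 1

    # 2. Try evicting `gap` tokens right after the prefix and check whether
    #    the rest of the cache then lines up with the new prompt.
    for gap in range(len(cached) - prefix + 1):
        tail = cached[prefix + gap:]
        if new[prefix:prefix + len(tail)] == tail:
            kept = cached[:prefix] + tail  # tail "shifts" down by `gap` slots
            return kept, new[len(kept):]

    # Never reached: the largest gap leaves an empty tail, which always
    # matches. That case equals the old behaviour of recomputing everything
    # after the first changed token.
    return cached[:prefix], new[prefix:]


# Tokens 1, 2 = fixed prompt; 10..13 = chat history; 14 = newest message.
kept, to_evaluate = plan_context_shift([1, 2, 10, 11, 12, 13],
                                       [1, 2, 12, 13, 14])
print(kept)         # [1, 2, 12, 13]
print(to_evaluate)  # [14]
```

In the example, the frontend dropped the two oldest history tokens, so only the single new token needs processing instead of everything after the fixed prompt. If text is edited in the middle instead, the sketch falls back to re-evaluating everything after the first changed token.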

This is probably also more or less related to recent advancements like Streaming-LLM.

This won’t help once text gets inserted “in the middle” or the prompt gets changed in some other way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I’m able to have unlimited-length conversations without waiting excessively once the chat history hits max tokens and the frontend starts dropping text.

It’s just a clever way to re-use the KV cache in one specific case. But I’ve wished for this for quite some time.

    • rufus@discuss.tchncs.de (OP) · 8 months ago

      I wasn’t able to get good use out of the old ‘Smartcontext’ anyway, and it seems other people had the same problem. To me, this is a huge improvement. And it doesn’t even need extra memory or anything.

      I really like how the KoboldCPP dev(s(?)) and the llama.cpp community constantly implement all the crazy stuff.