You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, learn about sampling and chat formatting, perplexity and KL divergence, about quantization and MoEs and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.
how do you have the time to figure all these out and keep being up to date? do you do this at work?
Reading my own quote, I was being a bit dramatic. But at the very least it is super important to grasp some basic concepts (like MoE offloading and quantization), and watch for new releases on LocalLlama or wherever. You kinda do have to follow things, yes.
As a hobby mostly, but it's useful for work.