Local LLMs on my TrueNAS, and the frontend I had to build.
Three weekends ago I rebuilt my home server. The old box was a small TrueNAS NAS. The new one is the same role plus a local LLM inference machine, on an AMD Ryzen 7 8700G with 96 GB of DDR5. This post is the field log: which decisions actually mattered, where I lost time, and the moment I realized that once the hardware worked, the frontend layer was its own problem.
The hardware part is most of the words below. The frontend part is shorter but it's the reason this post lives on loomcycle.dev. I tried Open WebUI. I stopped using it after two days. The thing that's replacing it is a chat surface in a new loomboard repo, modelled on what Open WebUI's chat actually gets right and wired to the loomcycle substrate I already work in.
The short version. If you're planning a local-inference box: buy an APU, not a desktop chip with display graphics; the 8700G's Radeon 780M is the entry point, the 2-CU iGPUs on regular Ryzen chips are useless for this. Memory bandwidth is the bottleneck, not core count, so DDR5-6000 CL30 EXPO and lots of capacity beat more cores. The 780M's gfx1103 architecture is not officially supported by ROCm, but HSA_OVERRIDE_GFX_VERSION=11.0.2 plus prebuilt gfx1103 Tensile kernels gets it running at 24-48 tok/s on a small model. GTT memory lets the iGPU address tens of gigabytes regardless of the BIOS UMA cap. A PPT cap at 65 W drops a 90°C inference load to under 60°C with no measurable speed loss, since inference is memory-bound. Open WebUI feels like a wrapper, not a control surface, once you've been running agents with structured workspaces for a while.
The constraint: one box, three workloads, no DGX budget
This wasn't a clean-sheet build. I already had a small lab NAS running on an Intel N100 with 16 GB of DDR5. Fine as storage and a few small services, weak for everything else. Three workloads landed on me at the same time and I needed one box to host all of them:
- VMs for product pre-release testing. JobEmber.ai (loomcycle's production consumer) plus a sibling SaaS in stealth pre-release, plus the smaller experimental instances those workflows spawn. Each wants its own VM so changes don't bleed into each other; a few GB and a couple of vCPUs apiece.
- Loomcycle itself, running as a server. Not just the binary I push code into, but the multi-replica deployment I test against. Real memory and real cores.
- Local LLM inference. The thing I'd been doing through cloud providers and wanted to pull in-house, partly for cost and partly because some of the work involves untrusted-input agentic loops I'd rather not run through someone else's API quota.
The straightforward shape for "I need local inference at home" is a discrete-GPU rig, an NVIDIA DGX Spark, or one of the soldered-RAM Mac Studios / Strix Halo boxes. The Spark is $4,500-5,500. A Mac Studio with serious unified memory sits in the same band. Strix Halo (Ryzen AI MAX) is cheaper but everything is soldered: you commit to a fixed RAM amount and a fixed iGPU at purchase. I didn't have spare $4,500 for a Spark, and I didn't want to lock the spec at the chip and RAM I happened to pick this year.
So the question stopped being "what's the best inference box" and became "what's the cheapest single box that hosts all three workloads, without soldering me into a corner I'd regret in twelve months."
That reframe was load-bearing. It ruled out the Spark on price. It ruled out Strix Halo on rigidity. It ruled out a discrete-GPU build because the iGPU-plus-fast-system-RAM path is meaningfully cheaper for the model sizes I actually run, and a discrete card would have meant a bigger case, a bigger PSU, and a second thermal envelope to manage on a 24/7 box.
What was left: upgrade the existing NAS. AM5 socket, so the chip is socketed and I can swap it later. DDR5 in DIMMs, so I can add capacity or upgrade timing without rebuilding the system. An APU as the inference engine, because a single iGPU plus a generous pile of system RAM hits the memory-bandwidth-bound workload at the right price point. The three workloads coexist cleanly: the storage half is light enough that an 8-core APU handles it without strain when inference is idle, the VM workload sits in the middle, and the inference workload spikes the iGPU when it's needed.
The upgrade path mattered more than the absolute spec. AMD's next-generation APU drops into this socket. If a future Ryzen APU ships with a 16-CU or 20-CU iGPU on a stronger architecture, the swap is one chip and one BIOS flash. No motherboard, no RAM, no PSU, no case. That option costs nothing today and matters a lot in twelve to twenty-four months when the inference-per-watt floor moves. Strix Halo doesn't give you that option; the Mac Studio doesn't either; a discrete-GPU rig only does it if you commit to constant card-swapping and the corresponding case-and-PSU churn. AM5 plus DIMM DDR5 keeps everything socketed.
With that reframe locked, the rest of the build is a sequence of forced choices.
The hardware decision that shaped everything else
The single thing I wish I'd remembered earlier: an APU is not the same as a desktop CPU with "integrated graphics."
AMD's regular desktop chips (the Ryzen 7000 and 9000 series, including the 7900X) ship with a token 2-compute-unit iGPU. It exists to drive a display when you have no graphics card. For LLM inference it's useless. Intel's mainstream desktop chips are similar: the UHD 770 found on most LGA1700 CPUs is a 32-execution-unit part that technically works but is weak.
What I needed was a real APU. AMD's 8000G "Hawk Point" series pairs Zen 4 cores with a proper RDNA3 iGPU:
- Ryzen 7 8700G: Radeon 780M, 12 CUs, ~12.6 TFLOPS, plus an NPU
- Ryzen 5 8600G: Radeon 760M, 8 CUs
- Ryzen 5 8500G / Ryzen 3 8300G: Radeon 740M, 4 CUs (skip for LLM work)
The 780M has roughly three times the compute of the desktop UHD 770 and dwarfs the 2-CU display chips. Critically, there is no 12-core or 16-core APU with a strong iGPU in the AM5 socket. AMD caps the good-iGPU line at the 8-core 8700G. You can have many CPU cores OR a capable iGPU in one socketed chip, not both. That tradeoff is forced; pick which one you actually need.
The exotic tier worth knowing about is AMD's Strix Halo (Ryzen AI MAX) with the 40-CU Radeon 8060S and soldered LPDDR5X. Genuinely excellent for local LLMs, but more expensive and less flexible than a socketed APU. I didn't go there. For my use case the 8700G's headroom plus 96 GB of socketed DDR5 was the better economic shape.
Memory bandwidth, not core count, is the real lever
LLM inference is memory-bandwidth-bound, not compute-bound. This single fact reshapes every other hardware decision.
It means more CPU cores barely help past a point; they all wait on the same memory bus. It means a 12-core chip isn't meaningfully faster than an 8-core one for inference. And it means your memory configuration matters more than almost anything else on the board.
On AM5 the sweet spot is DDR5-6000 CL30 with an AMD EXPO profile. Not because faster kits don't exist, but because the 8700G's Phoenix memory controller realistically tops out around 6000-6400 MT/s with two DIMMs. A DDR5-8000 kit will simply downclock. Buy 6000 CL30, enable EXPO with one click in BIOS, done.
Buying trap worth flagging: memory kit suffixes encode their profile. Corsair kits ending in Z (e.g. CMK96GX5M2B6000Z30) are AMD EXPO; kits ending in C are Intel XMP only. G.Skill's "Neo" and "Flare X5" lines are EXPO; plain "Trident Z5 RGB" is XMP. XMP kits work on AM5 but lose the one-click profile. Match the profile to the platform.
How much RAM? More than you think. With 96 GB I can run 70B-class models, and the iGPU can address a serious chunk of that as I'll cover below.
The build, briefly
Compact always-on server, so Mini-ITX AM5 (170×170 mm). Looking across boards for an inference-plus-NAS box, the ones that mattered had: strong VRMs (so the iGPU's sustained load stays stable), reliable iGPU video outputs (some server-class W680 boards disable them entirely, dealbreaker), high memory OC headroom, and good networking (5GbE plus dual M.2 earned its keep for the NAS half).
Power: a 65 W APU plus a few drives idles around 120-160 W. I sized the PSU for hard-drive spin-up surge and 24/7 efficiency, not peak draw. Confirm the case's PSU form factor (SFX vs SFX-L vs Flex-ATX) before buying. That constraint ruled out more choices than wattage did.
BIOS settings that are easy to forget
Three settings make or break the build:
- Enable EXPO in the OC menu so your RAM runs at 6000 instead of the default 4800. Easy to forget; costs real performance.
- Set the integrated graphics frame buffer. The BIOS may only let you set 16 GB, and that's fine (GTT memory below explains why).
- Disable Secure Boot if you're installing Linux/TrueNAS. The classic
bad shim signatureerror at install is Secure Boot rejecting an unsigned bootloader. Turn it off and leave it off; it's the normal config for a headless server.
Migrating from the old TrueNAS: don't clone, restore
The instinct is to clone the boot drive. Resist it. Cloning a ZFS boot pool onto a different-sized disk fights the filesystem and usually breaks. The clean path is fresh install plus config restore:
- On the old system, download the configuration file (include the secret seed).
- Fresh-install the latest TrueNAS on the new boot drive.
- Upload the config. The system reapplies your users, shares, and settings.
Data pools are entirely separate from the boot drive. A dedicated ZFS pool on its own SSD is portable. Physically move the disk and zpool import it. There's no file-copy step; ZFS reads the existing pool and mounts it with all data intact. If you're moving data to a bigger disk, use ZFS replication (snapshot → send → receive) rather than copying files, because replication preserves dataset properties, permissions, and snapshots that a plain copy loses.
Two migration gotchas that cost me real time. First, anything outside the GUI doesn't transfer. Custom scripts, cron jobs, hand-edited config files live on the boot drive and vanish on a fresh install; inventory them first. Second, skip-version jumps can break app definitions. Your app data on the pools is safe, but the app wrappers may need redeploying; budget time to recreate a few containers pointing at their existing datasets.
When you mount the old pool under a new name, internal scripts that hardcoded the old pool name will silently break. A migration is a good moment to grep your scripts for those paths.
Getting the iGPU to actually do the work
Here's where most guides wave their hands. Getting the 780M (architecture gfx1103) to run inference is genuinely fiddly because gfx1103 is not officially supported by ROCm. Standard ROCm ships compute kernels for neighboring architectures and skips Phoenix. The result: Ollama detects the GPU, tries a matrix operation, the kernels don't exist, falls back silently to CPU. You see 100% CPU in ollama ps and wonder why the GPU is idle.
First thing to verify is device access. Wherever Ollama runs (host, container, or VM), it needs to see:
/dev/kfd # the ROCm compute interface
/dev/dri/renderD128 # the render node
If those aren't present, no amount of tuning helps. A VM without GPU passthrough is CPU-only by definition.
Once the devices are visible, the quick win is forcing ROCm to treat gfx1103 as a supported neighbor via environment variables:
HSA_OVERRIDE_GFX_VERSION=11.0.2
OLLAMA_IGPU_ENABLE=1
For some setups that's enough. For mine it wasn't. The next failure was a rocBLAS error: Cannot read TensileLibrary.dat for gfx1103, which means the override is accepted but the actual GPU kernels are missing. The reliable fix is installing prebuilt gfx1103 Tensile kernels (community builds exist that pull these from Fedora's ROCm packages). With those in place, Ollama reports library=ROCm compute=gfx1103 and runs at roughly 24-48 tokens/sec on a small model versus ~16 tok/s on CPU.
The trick that makes it shine: GTT memory
My BIOS capped the iGPU's dedicated frame buffer at 16 GB. Sounds like a hard ceiling on model size. It isn't, on Linux.
The iGPU can dynamically allocate far beyond the fixed UMA buffer through GTT (Graphics Translation Table) memory, typically up to about half your system RAM by default, and adjustable higher. On a 96 GB box that means the iGPU can address tens of gigabytes regardless of the BIOS setting. The payoff is dramatic: a 24 GB model running at 100% GPU on an integrated graphics core, with a 128K context window. That is the entire reason the APU-plus-lots-of-RAM combination is special. The iGPU isn't boxed into a tiny VRAM partition the way a discrete card is.
Tuning Ollama for more GPU and bigger context
Once acceleration works, a handful of settings push it further. Set these as environment variables on the Ollama service:
OLLAMA_FLASH_ATTENTION=1: cuts KV-cache memory 30-50%OLLAMA_KV_CACHE_TYPE=q8_0: roughly halves KV-cache memory with negligible quality loss
These two matter more than they look. At 32K+ context the KV cache can consume more memory than the model weights themselves. Shrinking it frees room to fit more model layers on the GPU.
The big per-model lever is num_gpu, the number of layers offloaded to the GPU (there is no OLLAMA_GPU_LAYERS env var; it's a model option). Set it high (e.g. 99) to push as many layers as possible onto the iGPU. Ollama fits what it can and spills the rest to CPU. Context length is set per-model with num_ctx, or globally with OLLAMA_CONTEXT_LENGTH. Some models default to a tiny 4K context regardless of capability; set it explicitly.
A reality check on model size: even with everything tuned, a 24 GB model leans hard on memory bandwidth and will be slower than a 7-10 GB model that fits comfortably. For responsive interactive use, smaller models in the 4-10 GB range hit 100% GPU and feel fast. Pick the smallest model that's good enough for the task.
Tools that are the wrong fit
Two course-corrections that save frustration:
- vLLM is built for datacenter GPUs (CUDA, or supported-ROCm cards). It does not support the 780M and isn't a real CPU engine. On consumer iGPU/CPU hardware it's the wrong tool. Use Ollama or llama.cpp; they expose the same OpenAI-compatible API endpoint most apps expect.
- Ollama generates text, not images. It runs language and vision-input models; it has no diffusion support. For local image generation you need a separate stack (ComfyUI, Automatic1111), and diffusion is more GPU-hungry than LLMs, so temper expectations on an iGPU.
The thermal surprise
After all this you may notice the CPU running 85-90°C during inference while the CPU is only 20-30% loaded. Looks alarming. The explanation is simple: on an APU, the iGPU shares the same physical package as the CPU cores, and there's one temperature sensor. When inference runs "100% GPU," the iGPU is heating the package, and that shows up as "CPU temperature." It's iGPU heat wearing a CPU label.
Within spec (the 8700G's limit is 95°C) but warm for a 24/7 box. The fix is free and lives in BIOS. Because the heat is package power, capping PPT (Package Power Tracking) constrains it directly:
AMD Overclocking → Precision Boost Overdrive → PBO Limits: Manual
PPT Limit [mW]: 65000 # 65 W (units are milliwatts)
TDC Limit [mA]: 75000
EDC Limit [mA]: 150000
Since inference is memory-bound, capping power costs almost no real-world speed. In my testing this dropped a 90°C load to under 60°C: a 25-30°C improvement that killed any need for a cooler upgrade or water cooling (which on an unattended 24/7 server you should avoid anyway; a pump is a failure point an air cooler doesn't have).
Optional extra margin: the GFX Curve Optimizer undervolts the iGPU specifically. Use the Negative sign (Positive raises voltage and heat), start at a small magnitude like 10, and test stability. Undervolting isn't guaranteed stable on every chip.
And then: the frontend layer
Once the hardware worked, I needed a chat surface. The default move is Open WebUI: ChatGPT-style interface, conversation history, document RAG, web search. I installed it. (One container-networking footgun in case you go this route: when both Open WebUI and Ollama run as containers, localhost inside one container doesn't reach the other. Point Open WebUI at the host's LAN IP, not localhost.)
I used Open WebUI for two days and uninstalled it. The thing I want to say up front: the chat surface itself is good. Message thread, conversation list, the in-thread rendering of model output, the keyboard shortcuts; all of that is well-built and I'd happily ship something with similar UX. The reasons I stopped using it sit underneath the chat surface, in the configuration and in the substrate it can reach.
Three concrete blockers:
- The configuration UI is weird. Settings live in places I had to hunt for, common operations require traversing menus that don't say what they do until you click into them, and the relationship between Workspace and Admin settings isn't documented in the UI itself. After two days I still wasn't sure which of several places held the "default model for new chats" setting.
- Providers and models have two unlinked configuration surfaces. One of them does nothing. There's a provider+model surface in Admin Settings, and there's a separate provider+model surface elsewhere in the app. They don't share state. After I edited what I thought was the canonical surface, the models I'd configured weren't showing up in the chat picker. The other surface was the one that mattered; the first one I'd been editing was, as far as I can tell, vestigial. Burning a configuration cycle on a UI that silently no-ops is the kind of thing that ends a "should I keep using this" evaluation early.
- Missing the loomcycle tools and primitives I'd built workflows around. Loomcycle's been my agentic substrate for the better part of a month of real work (the project itself started May 12, I've been on it daily since, and I'm still developing and fixing): Documents as structured workspaces, Channels for cross-agent handoffs, Interruption + mid-run steering on every interactive session (v1.1.1), per-principal MCP dispatch so the agent and I share the same per-scope SQLite file (v1.5.0). Open WebUI can't reach any of that. The chat is good; the substrate underneath it is the wrong one for me.
So I'm building the chat I wanted on top of the substrate I already use. The work lives in a new loomboard repo and follows the chat-first sequencing in RFC AC as it stands today.
The chat surface ships first. A standalone React + Vite SPA on the published @loomcycle/client, with the chat UX deliberately modelled on what Open WebUI gets right (clean thread, conversation list, in-place renderer) and the substrate hooks I missed:
- Each conversation is one loomcycle interactive session (RFC AI): the first message starts it, follow-ups steer it, reopening re-attaches by
run_idor replays the transcript. - The full tool loop renders inline: structured tool calls, structured tool results, the model's reasoning between them. Not a flat chat bubble; an actual record of what ran.
- Live token / throughput / context-window metrics on the conversation pane. A context-compaction button when the window starts to fill.
- Interruption answers in place. The agent asks a question; the option buttons or free-text box render right in the thread.
- Per-conversation model overrides. Pick a different provider / model / tier / thinking-depth on this conversation without mutating the shared
AgentDef. The runtime materializes a uniquely-named derivedAgentDeffor the conversation; the shared one is unaffected. - Reuses existing wire only: interactive sessions, Interruption,
compactRun,getTranscript,agentDef,listLibraryAgents,whoami. No new transports.
The board lands next, in the same app. A kanban view on top of loomcycle's Document and Path primitives. The chunks are the cards. Each chunk's status field is the column. Typed fields drive the chip rendering: a chunk of type publication shows its platform + date; a chunk of type review-finding shows its severity. State transitions go through agent teams defined as a directed graph (RFC AP). The kanban view emerges from the substrate; it isn't a separate product surface. The launch publishing plan I'd been hand-editing for three weeks is the first dogfood loop (v1.5.0 made co-authoring possible).
Chat is pre-alpha right now; the board has substrate plumbing in place but no UI yet. The point isn't that they're shipped. The point is the frontend was the bottleneck once the hardware worked, and the configuration mess + missing primitives in Open WebUI were the specific reasons to keep going on my own surface rather than absorb the cost of a not-quite-fitting third-party one.
In parallel, the substrate work continues. The two loomcycle pieces I'm head-down on right now are tenant authorization (a real multi-tenant trust boundary across the wire surfaces) and loomcycle running as a TrueNAS-dockerized application so the same machine that hosts the inference hosts the runtime cleanly. Both deserve their own writeup; that's the next blog topic.
What I'd tell someone starting fresh
The distilled lessons if you're planning a local-inference build:
- Buy an APU, not a CPU with display graphics. The 8700G's 780M is the entry point; the 2-CU iGPUs on regular chips are useless for this.
- Memory bandwidth is the bottleneck. Prioritize DDR5-6000 CL30 EXPO and lots of capacity over core count. More cores barely help.
- GTT memory breaks the VRAM ceiling. An iGPU with plenty of system RAM runs models a discrete card its size never could.
- Expect to fight ROCm on unofficial iGPUs. gfx1103 needs
HSA_OVERRIDE_GFX_VERSION=11.0.2and possibly added Tensile kernels. Verify/dev/kfdaccess first. - Tune the cache, not just the model. Flash attention plus
q8_0KV cache is what lets big context fit on a small GPU. - Use the right tool. Ollama or llama.cpp for consumer hardware; vLLM is for datacenter GPUs.
- Manage the package, not the cores. A PPT cap tames iGPU-driven heat for free, with negligible speed loss.
- The frontend is not free. A chat surface that's good at chat but can't reach your substrate's primitives still costs you every time you open the tab. If you've been building real agent workflows, the surface has to match the shape of the work, and that's often a reason to build the one that does.
None of this requires datacenter hardware or a discrete GPU. A single 8-core APU with generous, fast memory, tuned with an afternoon's patience, runs capable models locally, privately, and quietly. The hard parts are knowing which knobs exist and which marketing to ignore. Now you do.
Companion reading: v1.5.0 + co-authoring on the same chunks (why a chat surface that's only a chat surface stops fitting), 133 minutes on a local Qwen (the previous local-model field report, on the runtime side this time), Path + Document primitives (the substrate the loomboard MVP rides on).