A Heterogeneous Dual-GPU Setup for Faster Local LLM Inference
Recommendation score 72.0 · NO. 012 · 2026.06.14
Published 2026/06/13 · Score 127 · Comments 45
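Why it's worth reading
The author pairs an RTX 5080 (16GB) with an RTX 3090 (24GB) to run a Q8-quantized Qwen 3.6 27B model, reaching 80+ tok/s generation speed through llama.cpp's support for heterogeneous GPUs. For users who are short on VRAM but already own a consumer card, this is a low-cost way to add capacity without buying a single high-end card or replacing the whole machine.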
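Editor's take
Mixed consumer-GPU setups have long been overlooked because the CUDA ecosystem tends to assume identical cards linked over SLI/NVLink. llama.cpp's layer offloading breaks that restriction in practice, letting cards with different architectures and different VRAM sizes work together. The key bottleneck is not compute but PCIe bandwidth and memory-copy latency; the author's DDR4 + SSD buffering approach neatly sidesteps the VRAM wall.
This configuration offers far better value than buying a single RTX 4090 or moving to a cloud API: used 3090s have fallen below $500, so the added cost for someone who already owns a 5080 is very low. Note, however, that llama.cpp's tensor split still handles asymmetric VRAM imperfectly, and in large-batch workloads the smaller-VRAM card becomes the bottleneck. The setup suits individual developers running 30B-class models for prototyping; it is not recommended for high-concurrency serving.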
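To make the asymmetric-VRAM point concrete, here is a minimal sketch of splitting a model's layers across two cards in proportion to their VRAM. The 16GB and 24GB figures come from the post; the layer count of 64 is a hypothetical placeholder, not the real depth of the model, and the helper name `split_layers` is made up for illustration. The resulting ratio is the kind of value one would pass to llama.cpp's `--tensor-split` option, though a real setup would likely leave headroom on each card for the KV cache and activations.

```python
def split_layers(vram_gb, n_layers):
    """Assign n_layers to GPUs in proportion to VRAM.

    Uses largest-remainder rounding so the per-GPU counts always
    add up to n_layers exactly.
    """
    total = sum(vram_gb)
    exact = [n_layers * v / total for v in vram_gb]
    layers = [int(x) for x in exact]          # floor of each share
    leftover = n_layers - sum(layers)         # layers still unassigned
    # Hand the leftover layers to the GPUs with the largest fractional part.
    order = sorted(range(len(vram_gb)),
                   key=lambda i: exact[i] - layers[i],
                   reverse=True)
    for i in order[:leftover]:
        layers[i] += 1
    return layers


vram = [16, 24]     # RTX 5080, RTX 3090 (GB), as stated in the post
N_LAYERS = 64       # ASSUMPTION: hypothetical layer count for illustration

layers = split_layers(vram, N_LAYERS)
tensor_split_arg = ",".join(str(v) for v in vram)

print(layers)                                  # [26, 38]
print("--tensor-split", tensor_split_arg)      # --tensor-split 16,24
```

A purely proportional split is only a starting point: if the smaller card also drives a display or holds more of the context buffers, shifting a few layers toward the larger card may work better.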
Community feedback
Opinions divided · 45 comments
Core debate: local deployment vs. cloud APIs, weighing cost efficiency against autonomy and control
I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.
I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al, but less cohesive I've switched from using t
Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory. Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the a
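The bandwidth argument in the comment above can be worked through as a rough sketch. At batch size 1, each generated token requires reading roughly all the weights once, so memory bandwidth divided by the bytes read gives a theoretical ceiling on tokens per second. The 936GB/s and 24GB figures are the ones quoted in the comment; the function names are hypothetical, and the model ignores KV-cache reads, PCIe transfers, and kernel overhead, so real throughput would sit below these numbers.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on batch-1 decode speed for one GPU.

    Assumes every token reads all weights_gb from VRAM exactly once.
    """
    return bandwidth_gb_s / weights_gb


def layer_split_ceiling_tok_s(shards):
    """Upper bound when layers are split across GPUs that run in sequence.

    shards: list of (weights_gb, bandwidth_gb_s), one entry per GPU.
    Per-token time is the sum of each GPU's read time, because each GPU
    waits for the previous one to finish its layers.
    """
    seconds_per_token = sum(w / bw for w, bw in shards)
    return 1.0 / seconds_per_token


# Figures from the comment: RTX 3090 at 936 GB/s, weights filling 24 GB.
single = decode_ceiling_tok_s(936, 24)
print(single)  # 39.0
```

Under these assumptions a fully loaded 24GB card tops out at about 39 tok/s, which is why the commenter treats the reported speeds as something that needs explaining rather than just a recipe.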