Speeding Up Local LLMs with a Mismatched Dual-GPU Setup

#ARTICLE HackerNews 2026.06.14

Recommendation score 72.0 · NO. 012 · 2026.06.14

Published 2026/06/13 · Score 127 · Comments 45

Why it's worth reading

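The author pairs an RTX 5080 (16GB) with an RTX 3090 (24GB) to run a Q8-quantized Qwen 3.6 27B model, reaching 80+ tok/s generation speed through llama.cpp's support for heterogeneous GPUs. For users who are short on VRAM but already own a consumer card, this is a low-cost way to add capacity without buying a single high-end card or replacing the whole machine.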
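A quick back-of-the-envelope check shows why neither card can hold the model alone. This is a minimal sketch, assuming exactly 27e9 parameters and the Q8_0 block layout (32 int8 weights plus one fp16 scale, i.e. 8.5 bits per weight); it ignores the KV cache and runtime buffers, and the function name is illustrative.

```python
# Rough VRAM-fit estimate for a Q8-quantized 27B model.
# Assumptions (not from the original post): exactly 27e9 parameters,
# Q8_0 layout at 8.5 bits per weight, no KV cache or scratch buffers counted.

Q8_0_BITS_PER_WEIGHT = 8.5  # 32 int8 weights + one fp16 scale per 32-weight block


def weight_footprint_gb(n_params: float,
                        bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Return the size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


weights_gb = weight_footprint_gb(27e9)

# Compare against each card on its own and against the combined pool.
for label, vram_gb in [("RTX 5080", 16), ("RTX 3090", 24), ("both cards", 16 + 24)]:
    verdict = "fits" if weights_gb <= vram_gb else "does not fit"
    print(f"{label:>10}: {vram_gb} GB -> weights ({weights_gb:.1f} GB) {verdict}")
```

Under these assumptions the weights alone come to roughly 28.7 GB, which exceeds either card but leaves headroom in the combined 40 GB for context and buffers.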

Editor's take

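Mixed consumer GPU setups have long been overlooked because the CUDA ecosystem tends to assume identical cards linked over SLI/NVLink. llama.cpp's layer offloading effectively removes that restriction, letting cards with different architectures and different VRAM sizes work together. The key bottleneck is not compute but PCIe bandwidth and memory-copy latency, and the author's DDR4 + SSD buffering scheme neatly sidesteps the VRAM wall.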
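One simple way to think about layer offloading across unequal cards is to hand each GPU a share of the layers proportional to its VRAM. The sketch below illustrates that idea only; it is not llama.cpp's internal logic, the 64-layer count is a hypothetical placeholder, and the device order is an assumption.

```python
# Proportional layer split across GPUs with different VRAM sizes.
# Illustrative only: n_layers=64 is a hypothetical value, not the real
# layer count of the model, and this is not llama.cpp's own algorithm.


def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign whole layers to each GPU in proportion to its VRAM.

    Uses the largest-remainder method so the counts always sum to n_layers.
    """
    total = sum(vram_gb)
    exact = [n_layers * v / total for v in vram_gb]  # ideal fractional share
    counts = [int(x) for x in exact]                 # round every share down
    leftover = n_layers - sum(counts)
    # Give the remaining layers to the GPUs with the largest fractional parts.
    by_fraction = sorted(range(len(vram_gb)),
                         key=lambda i: exact[i] - counts[i],
                         reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Device 0 = RTX 5080 (16 GB), device 1 = RTX 3090 (24 GB); order is assumed.
layers_per_gpu = split_layers(64, [16, 24])
print(layers_per_gpu)
```

In llama.cpp itself the proportions are supplied through the `--tensor-split` option (for example `16,24`); the exact values the author used are not given in this summary. In practice the smaller card may need a lower share than pure proportionality suggests, since the KV cache and compute buffers also take VRAM.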

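This setup offers far better value than buying a single RTX 4090 or moving to a cloud API: used 3090s have dropped below 500 USD, so the added cost for someone who already owns a 5080 is very low. Note, however, that llama.cpp's tensor split still handles asymmetric VRAM imperfectly, and with large batches the card with less VRAM becomes the bottleneck. It suits individual developers running 30B-class models for prototyping; it is not recommended for high-concurrency serving.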

Community feedback

Opinions divided · 45 comments

Core debate: local deployment vs. cloud APIs, a trade-off between cost efficiency and self-hosted control

ComputerGuru

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.

verdverm

I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al, but less cohesive I've switched from using t

atq2119

Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory. Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the a
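The bandwidth argument in the comment above can be worked through numerically. A minimal sketch, assuming every weight byte is read once per generated token and using the comment's own figures (936 GB/s, a model filling 24 GB, 30 tok/s); the function name is illustrative and caches, KV reads, and overlap are ignored.

```python
# Theoretical batch-1 decode ceiling when generation is memory-bandwidth bound.
# Assumption: each token requires streaming all model weights from VRAM once.


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s = memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_gb


ceiling = decode_ceiling_tok_s(936, 24)  # RTX 3090, model filling all 24 GB
observed = 30                            # tok/s figure quoted in the comment
utilization = observed / ceiling         # fraction of the theoretical ceiling

print(f"ceiling: {ceiling:.1f} tok/s, "
      f"{observed} tok/s is {utilization:.0%} of it")
```

With these inputs the ceiling is 39 tok/s, so 30 tok/s would correspond to roughly 77% of the theoretical limit on that card.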

Alternatives: Openrouter, DGX Spark, SearXNG, llamacpp-vulkan
