32 GB of VRAM for less than $1k sounds like a steal these days, and I’m sure it’s not getting cheaper any time soon.

Does anyone here use this GPU? Or any recent Arc Pros? I basically want someone to talk me out of driving to the nearest place that has it in stock and getting $1k poorer.

  • certified_expert@lemmy.world · 13 hours ago

    I have been using an Arc Pro B50 (16 GiB) with PyTorch to train models myself.

    No issues other than a bug that makes the Intel drivers misbehave on Linux kernel 6.18, so I’m avoiding that kernel version.

  • afk_strats@lemmy.world · 1 day ago

    I’m going to be brutal with you. I spent a few thousand dollars on 176GB of AMD VRAM because I was happy to get VRAM for cheap and I hate Nvidia. It works, and it’s nice to be able to run bigger models at usable performance, but if you need serious concurrency or good support for diffusion, you NEED Nvidia. AMD (and likewise Intel) just doesn’t have the ecosystem support for non-server GPUs. Again, coming from someone who’s using this shit daily.

    If you understand this limitation, then yes, those B70s are cool, as is the AMD Pro 9700, which might have slightly better support rn. You could also consider Nvidia V100s, which are old and cheap. I always recommend people start with 3090s (as a general powerhouse) or a pair of 5060 Tis (for really good LLM support) though. It will make your life easier if you can live with the VRAM limitation.
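
    A back-of-envelope way to see what fits in how much VRAM (the quantization bit-widths and the 15% overhead factor below are my own ballpark assumptions, not measured numbers):

```python
def vram_needed_gb(params_b: float, bits_per_weight: float, overhead: float = 0.15) -> float:
    """Rough VRAM estimate: quantized weights plus a fudge factor for
    KV cache, activations, and runtime overhead (the 15% is a guess)."""
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb * (1 + overhead)

# A 32B model at ~4.5 bits/weight (Q4_K_M-ish) squeezes into a 24 GB 3090:
print(round(vram_needed_gb(32, 4.5), 1))  # 20.7
# The same model at Q8 wants a second card or a 32 GB GPU:
print(round(vram_needed_gb(32, 8.0), 1))  # 36.8
```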

    • Scipitie@lemmy.dbzer0.com · 1 day ago

      Thanks for your experience! I’m in a similar boat regarding NVIDIA - plus the budget …

      At least in Europe, V100s are only available from China and with a huge markup.

      Used 5090s go for 3.5k; even a used 3060 is still $250-plus.

      It’s crazy at the moment. I simply can’t afford to self-host LLMs, which is a new thing to say :D

    • fonix232@fedia.io · 1 day ago

      Wouldn’t using the Vulkan backend instead of ROCm help a ton with concurrency and diffusion, at a marginal (1-2%) performance loss?

      • afk_strats@lemmy.world · 1 day ago (edited)

        Vulkan helps with speed; most benchmarks bear that out. Concurrency is a mixed bag. You can get some with llama.cpp, but vLLM is the concurrency king.

        Just a couple of weeks ago llama.cpp released tensor parallelism, which helps, but it’s still an experimental feature.

        Unfortunately, I don’t know of any diffusion runners that work on Vulkan. If someone has expertise, let me know!

        • fonix232@fedia.io · 22 hours ago

          In my experience Vulkan actually drops performance ever so slightly, but it also improves compatibility (especially on the chips AMD advertises as “AI” and then promptly forgets to support via ROCm, like the gfx1101/02/03 family, gfx1150, etc.), which is why I recommend it. As you said, AMD is famous for doing as little in their AI toolkit for the average user as possible, leading to very limited support on consumer hardware.

    • pound_heap@lemmy.dbzer0.com (OP) · 1 day ago

      Thank you! This is really helpful. A 32 GB V100 or a pair of 5060 Tis looks very interesting, and they’re about the same price. Does running multiple GPUs require any special hardware, apart from a motherboard with 2+ PCIe x16 slots?

      • afk_strats@lemmy.world · 1 day ago (edited)

        This is something I learned the hard way.

        Consumer hardware is limited by multiple factors when it comes to PCIe connectivity:

        • Physical layout: how many slots you have to plug into, their size, and their configuration
        • Supported lanes from the CPU
        • Chipset (motherboard) limitations

        Your graphics card might be a 16-lane card (referred to as “x16”), but sometimes not all of those lanes are used. The aforementioned 5060 Ti, I believe, only uses x8. Some devices like graphics cards can use a physically smaller slot with an adapter, for a loss in performance (a few frames in gaming).

        Similarly, your motherboard might have an x16 slot at the top and another x16 at the bottom. That second slot might only function as x8 or even x4. Does this matter? Sort of. Inter-card communication, aka peer-to-peer communication, can affect performance, and that can compound with multiple cards.
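
        If you want to check what width your cards actually trained at, on Linux `lspci -vv` reports both the capability (LnkCap) and the negotiated link (LnkSta). A small parser sketch over illustrative output (the sample text is made up, not from my machine):

```python
import re

def link_widths(lspci_vv: str) -> list[tuple[int, int]]:
    """Pair each device's supported lane width (LnkCap) with the width
    it actually negotiated (LnkSta) from `lspci -vv` output."""
    caps = [int(w) for w in re.findall(r"LnkCap:.*?Width x(\d+)", lspci_vv)]
    stas = [int(w) for w in re.findall(r"LnkSta:.*?Width x(\d+)", lspci_vv)]
    return list(zip(caps, stas))

# Illustrative snippet: an x16-capable card that trained at x8,
# e.g. because it sits in a second slot that is only wired x8.
sample = """\
01:00.0 VGA compatible controller: ...
        LnkCap: Port #0, Speed 16GT/s, Width x16
        LnkSta: Speed 16GT/s, Width x8
"""
print(link_widths(sample))  # [(16, 8)]
```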

        Even worse, some motherboards may have all sorts of connectivity but come with limitations like only 2 out of the bottom 4 slots (PCIe and m.2) working at a time. ASK ME HOW I KNOW.

        Your CPU controls PCIe. It has a hard cap on how many PCIe devices it can handle and at what speed. AMD tends to be better here.

        Enterprise gear suffers from none of this BS. Enterprise CPUs have a ton of PCIe lanes, and enterprise motherboards usually match the physical size of their PCIe slots to their capacity and support full bifurcation.*

        PCIe lanes are consumed by m.2, MCIO, and OCuLink connectors, to name a few. That means you can connect a graphics card to any of those if you can figure out the wires and power.**

        ** Bonus: bifurcation, and how my $200 consumer motherboard runs 6 graphics cards.

        Bifurcation is a motherboard feature that lets you split PCIe capacity, so an x16 slot can support two x8 devices. My motherboard lets me do this only on the main slot, and in a strange x8/x4/x4 configuration. I have an MCIO adapter (google it) which plugs into that PCIe slot and gives me 3 PCIe connectors at those corresponding speeds.

        It also has 2 m.2 slots which connect to the CPU. One of them I use for an NVMe SSD like a normal person. The other holds an m.2-to-PCIe adapter, which gives me an x4 PCIe slot. For those keeping track, that’s 24 PCIe lanes so far, which is the maximum my processor (an Intel 265K) can handle.

        But wait! The motherboard also has a kind of PCIe router, and that thing can handle 8 more lanes! So I use the bottom 2 PCIe slots on my motherboard for 2 cards at x4 each. The thing that kills me is that there are more m.2 ports, but the mobo won’t run more than 2 of those devices at once. AND even though that bottom PCIe slot is sized at x16, electrically it’s x4.
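
        The lane accounting above, as a quick tally (numbers are for this particular build as described; your board will differ):

```python
# CPU-attached lanes (the Intel 265K caps out at 24 usable here)
cpu_lanes = {
    "main x16 slot, bifurcated x8/x4/x4 into MCIO": 8 + 4 + 4,
    "m.2 slot 1 (NVMe SSD)": 4,
    "m.2 slot 2 (m.2-to-PCIe adapter)": 4,
}
# Chipset-attached lanes behind the motherboard's "PCIe router"
chipset_lanes = {
    "bottom slot 1 (x4 electrical)": 4,
    "bottom slot 2 (x4 electrical)": 4,
}

print(sum(cpu_lanes.values()))      # 24 -> the CPU's hard cap
print(sum(chipset_lanes.values()))  # 8  -> extra capacity via the chipset
```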

        Do your research (level1techs is great) and read the manuals to really understand this stuff before you buy.

        My mobo, for reference: ASUS TUF GAMING Z890-PRO WIFI

        • lavember@programming.dev · 10 hours ago

          How reliable is this setup for local inference? For instance, how many tokens/sec do you get?

          I’m asking because I’d guess sharing bandwidth like that would cost some speed.

          • afk_strats@lemmy.world · 4 hours ago

            I find llama.cpp with Vulkan EXTREMELY reliable. I can have it running for days at a time without a problem. As for tokens/sec, that’s a complicated question because it depends on the model, quant, speculative decoding, KV quant, context length, and card distribution. Generally:

            Typical model speeds at deep context for agentic use; simple chats will be faster:

            | Model | Quant | Prompt Processing (tok/s) | Token Generation (tok/s) | Hardware | Quality |
            | --- | --- | --- | --- | --- | --- |
            | Qwen 3.5 397B | Q2_K_M | 100-120 | 18-22 | 2 x 7900 + 4 x MI50 | ★★★★★ |
            | Gemma4 31B or Qwen3.5 27B | Q8_0 | 400-800 | 20-25 | 2 x 7900 XTX | ★★★★ |
            | Qwen 3.6 35B | Q5_K_M | 1000-2500 | 60-100 | 2 x 7900 XTX | ★★★★ |
            | Qwen 3.5 122B | Q4_0 | 200-300 | 30-35 | 4 x MI50 | ★★★★ |
            | gpt-oss 120b | mxfp4 (native) | 500-800 | 50-60 | 3 x MI50 | ★★ |
            | Nemotron 3 Nano 30B | IQ3_K_XXS | 2500-3000 | 150-180 | 1 x 7900 XTX | |
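
            Those generation numbers roughly track a rule of thumb: single-stream decode is memory-bandwidth-bound, so tok/s is capped near effective VRAM bandwidth divided by model size. A back-of-envelope sketch where the bandwidth and efficiency figures are assumptions, not measurements:

```python
def decode_tok_s(model_gb: float, bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Ceiling on single-stream generation speed: each token reads roughly
    the whole quantized model from VRAM at some fraction of peak bandwidth."""
    return bandwidth_gb_s * efficiency / model_gb

# A ~35 GB Q8 model layer-split across two 7900 XTX: the cards run their
# layers serially, so effective bandwidth is about one card's (~960 GB/s):
print(round(decode_tok_s(35, 960)))  # 16
```

Same order of magnitude as the 20-25 tok/s in the table above.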
        • pound_heap@lemmy.dbzer0.com (OP) · 21 hours ago

          Wow, I didn’t think you were running 176GB worth of GPUs on a consumer board! I don’t have an extra board, and my gaming PC that has 9070XT is not a good basis for multi GPU build - it has a cheap mATX motherboard with too few slots and lanes. So it’s going to be a new build. Used EPYC boards look interesting for that.

          • afk_strats@lemmy.world · 16 hours ago

            I wish I’d bought an EPYC board last year instead of my rig. It would have been far fewer headaches and, with the price of RAM, it would have quintupled in value by now!

  • bloor@feddit.org · 1 day ago

    I have a B60 in my Proxmox machine explicitly for its SR-IOV support; I also got gaming to run in a Windows VM. Haven’t tried LLMs yet, but I plan to.

  • cecilkorik@piefed.ca · 1 day ago (edited)

    I’m not sure Intel has great drivers or compatibility for AI, it might be sort of janky and limiting. Even with an AMD card I’ve struggled a lot, there are still plenty of things that only support Nvidia, full stop.

    Edit to add: If you’re willing to consider this option, you might also be interested in, say, an Nvidia P40 or similar card. P40s have 24GB VRAM and you can pick them up cheap as can be, in the $100-$200 range. You need to 3D print or buy a fan shroud for them. They are janky and limiting in a different way: they are old datacenter-style cards, they don’t have fans or do video at all, they are a bit slow at AI tasks, and they only run on specific Nvidia drivers. But having plenty of VRAM for that low a price is nice, and you can tune them down to about 125W so you can run a few of them if you can find enough slots or risers. AI usage does NOT require PCIe x16 bandwidth; in my experience x4 is plenty and even x1 is probably okay with some penalty.

    Edit again: Ah, it looks like P40s have doubled in price now too; the VRAM scavengers finally got to them. Still, if you’re willing to go up to the $1k price point, they’re viable.

    • SuspciousCarrot78@lemmy.world · 18 hours ago

      I really like those Tesla cards. I picked up a Tesla P4 (it’s all I can fit into my 1L shoebox) for $40 USD. Yes, you have to do some jiggery-pokery with Noctua fans and zip ties, but… again… $40.

      The P40s are fantastic value for what they are.

    • pound_heap@lemmy.dbzer0.com (OP) · 1 day ago

      Thanks! I was running some models on my RX 9070 XT, but only Ollama works flawlessly. I couldn’t get llama.cpp to run Gemma4 or the newer Qwen; maybe I’m hitting that incompatibility, but it’s probably a skill issue.

      The P40 doesn’t look very appealing, though. A 32 GB V100 costs about the same as two P40s and has less VRAM in total, but it’s faster and will use less power.

      But I’m not sure if I follow you on the PCIe… If I run a model that spans multiple GPUs, doesn’t PCIe bandwidth matter?

      • cecilkorik@piefed.ca · 1 day ago

        It matters a bit, but not as much as you’d think, and it also depends on what model and what runner you’re using. There’s a lot of optimizing that can be done, but in practice you’re not going to notice a huge difference in speed. Again, remember that these are not speed-demon cards to begin with. The main feature is the large VRAM capacity that lets them run very large, powerful models without the speed penalty of spilling into system RAM, and that, at least as I understand it, is where PCIe starts to matter a lot more. That said, maybe I’m wrong; I’m not an expert at this stuff, and my experience is limited to what little hardware I have available personally, in limited configurations. Best of luck in your adventures. Anything we can do to democratize machine learning technology is worth pursuing, I think.
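
        For scale, here’s the back-of-envelope math on why the bus matters so little for layer-split inference: only one token’s activations cross the link at each layer boundary. The hidden size and link speed below are assumptions for illustration:

```python
def transfer_us(hidden_size: int, bytes_per_val: int = 2, link_gb_s: float = 3.9) -> float:
    """Microseconds to move one token's fp16 hidden state across a
    PCIe 3.0 x4 link (~3.9 GB/s theoretical)."""
    payload_bytes = hidden_size * bytes_per_val    # one layer-boundary crossing
    return payload_bytes / (link_gb_s * 1e9) * 1e6  # seconds -> microseconds

# An 8192-wide hidden state is only 16 KB: about 4 microseconds on x4,
# negligible next to the milliseconds each token takes to generate.
print(round(transfer_us(8192), 1))  # 4.2
```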

  • lemonhead2@lemmy.world · 1 day ago

    Buy a used one on eBay 🙂 It doesn’t have to be the same model… someone probably went and got themselves a shiny new one and is getting rid of the one they bought last year…