I’m doing a lot of coding, and what I’d ideally like is a long-context model (128k tokens) that I can throw my whole codebase into.

I’ve been experimenting with Claude, for example, and what usually works well is attaching the whole architecture of a CRUD app along with the most recent docs of the framework I’m using; that’s fine for menial tasks. But I’m very uncomfortable sending any kind of data to these providers.

Unfortunately I don’t have a lot of space, so I can’t build a proper desktop. My options are either renting a VPS or going for something small like a Mac Studio. I know speeds aren’t great, but I was wondering whether using RAG for documentation, for example, could help me get decent speeds.
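
To make the RAG idea concrete: embed the framework docs once, then only pull the few chunks relevant to the current question into the prompt instead of carrying the full 128k context around. A minimal sketch of what I have in mind, assuming sentence-transformers with a placeholder embedding model and made-up doc chunks:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Pretend these are chunks of the framework docs, split beforehand.
doc_chunks = [
    "To define a route, use the @app.get decorator ...",
    "Database sessions are created via SessionLocal() ...",
    "Migrations are run with the migrate command ...",
]
doc_vectors = embedder.encode(doc_chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k doc chunks most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    return [doc_chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Only the retrieved chunks go into the (much smaller) prompt.
context = "\n\n".join(retrieve("How do I add a new endpoint?"))
print(context)
```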

I’ve read that Macs become very slow, especially with larger contexts. I’m not fully convinced, but I could probably get a new one at 50% off as a business expense, so the Apple tax isn’t as much of an issue as the speed concern.

Any ideas? Are there other mini PCs with a better architecture for this? I tried researching but couldn’t find much.

Edit: I found some stats on GitHub on different models: https://github.com/ggerganov/llama.cpp/issues/10444

Based on that I also conclude that you’re gonna wait forever if you work with a large codebase.

  • 0x01@lemmy.ml

    I do this on my Ultra. Token speed is not great, depending on the model of course. A lot of source code is optimized for Nvidia and doesn’t even use the native Mac GPU without modifying the code, defaulting to CPU instead. I’ve had to modify about half of what I run.
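
    For PyTorch-based projects the modification is usually just not assuming CUDA. A rough sketch of the kind of change I mean (device selection only; every repo is a bit different):

    ```python
    import torch
    import torch.nn as nn

    def pick_device() -> torch.device:
        """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():  # Apple Silicon GPU
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()

    # Wherever a project hardcodes .cuda(), use .to(device) instead.
    model = nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    print(model(x).shape, "on", device)
    ```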

    YMMV, but I find it’s actually cheaper to just use a hosted service.

    If you want some specific numbers lmk

    • shaserlark@sh.itjust.worksOP

      Interesting. Is there any kind of model you can run at a reasonable speed?

      I guess over time it could amortize, but if the usability sucks that may make it not worth it. OTOH I really don’t want to send my data to any company.

      • Boomkop3@reddthat.com

        Then don’t go with an Apple chip. They’re impressive for how little power they consume, but any 50-watt chip will get absolutely destroyed by a 500-watt GPU; even one from almost a decade ago will beat it.

        And you’ll save money to boot, if you don’t count your power bill

        • GenderNeutralBro@lemmy.sdf.org

          But any 50-watt chip will get absolutely destroyed by a 500-watt GPU

          If you are memory-bound (and since OP’s talking about 192GB, it’s pretty safe to assume they are), then it’s hard to make a direct comparison here.

          You’d need 8 high-end consumer GPUs to get 192GB. Not only is that insanely expensive to buy and run, but you won’t even be able to support it on a standard residential electrical circuit, or any consumer-level motherboard. Even 4 GPUs (which would be great for 70B models) would cost more than a Mac.

          The speed advantage you get from discrete GPUs rapidly disappears as your memory requirements exceed VRAM capacity. Partial offloading to GPU is better than nothing, but if we’re talking about standard PC hardware, it’s not going to be as fast as Apple Silicon for anything that requires a lot of memory.
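
          To make “partial offloading” concrete: llama.cpp lets you put only some of the transformer layers on the GPU and keep the rest in system RAM. A rough sketch with llama-cpp-python; the model file, context size, and layer count are placeholders you’d tune to whatever actually fits in VRAM:

          ```python
          from llama_cpp import Llama

          llm = Llama(
              model_path="models/qwen2.5-coder-32b-q4_k_m.gguf",  # placeholder GGUF file
              n_ctx=32768,       # context window to reserve KV-cache memory for
              n_gpu_layers=30,   # how many layers go to the GPU; -1 means all of them
          )

          out = llm(
              "Explain what this function does:\n\ndef add(a, b):\n    return a + b\n",
              max_tokens=128,
          )
          print(out["choices"][0]["text"])
          ```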

          This might change in the near future as AMD and Intel catch up to Apple Silicon in terms of memory bandwidth and integrated NPU performance. Then you can sidestep the Apple tax, and perhaps you will be able to pair a discrete GPU and get a meaningful performance boost even with larger models.

            • shaserlark@sh.itjust.worksOP

              Yeah, I found some stats now and indeed you’re going to wait something like an hour for prompt processing if you throw 80-100k tokens at a powerful model. With APIs that works pretty much instantly; not surprising, but just for comparison. Bummer.
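
              Back-of-the-envelope version of that wait (the prompt-processing speeds below are assumed ballparks, not benchmarks):

              ```python
              # Time to ingest a large prompt at a given prompt-processing speed.
              prompt_tokens = 100_000  # roughly a big codebase plus docs

              # Speeds are assumed ballparks, not measurements.
              for label, tok_per_s in [("local, big model (assumed)", 30),
                                       ("hosted API (assumed)", 3_000)]:
                  minutes = prompt_tokens / tok_per_s / 60
                  print(f"{label:28s} ~{minutes:5.1f} min just to ingest the prompt")
              ```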

              • Boomkop3@reddthat.com

                “Application Programming Interface”? Are you talking about something on the internet? On a GPU driver? On your phone?

                Then also, what size model are you using? Weights in int32? fp4? Somewhere in between? That’s where the RAM requirements come in.

                I get that you’re trying to do a mic drop or something, but you’re not being very clear

              • Boomkop3@reddthat.com

                Anyways, the important thing is the “TOPS”, aka trillions of operations per second. Having enough RAM is important, but if you don’t have a fast processor then you’re wasting RAM when you could just stream it from a fast SSD.

                One such case is when your system can’t handle more than 50 TOPS, like the Apple M series. Try an old GPU, and enjoy thousands of TOPS.

  • tehnomad@lemm.ee

    The context cache doesn’t take up too much memory compared to the model. The main benefit of having a lot of VRAM is that you can run larger models. I think you’re better off buying a 24 GB Nvidia card from a cost and performance standpoint.
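
    If you want to sanity-check that, the KV cache scales roughly as 2 (K and V) × layers × KV heads × head dim × bytes per element × context length. A sketch with made-up but typical 70B-class dimensions (grouped-query attention assumed):

    ```python
    def kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128,
                     context_len=8_192, bytes_per_elem=2):  # fp16 cache
        """Approximate KV-cache size for a GQA transformer, in GiB."""
        total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
        return total / 1024**3

    print(f"~{kv_cache_gib(context_len=8_192):.1f} GiB at 8k context")
    print(f"~{kv_cache_gib(context_len=131_072):.1f} GiB at 128k context")
    ```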

    • shaserlark@sh.itjust.worksOP

      Yeah, I was thinking about running something like Code Qwen 72B, which apparently requires 145 GB of RAM for the full model. But if it’s super slow, especially with large context, and I can only run small models at acceptable speed anyway, it may be worth going with Nvidia for CUDA alone.
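
      That 145 GB is basically just parameter count times bytes per parameter; rough arithmetic below (weights only, no KV cache or runtime overhead, and the bytes-per-parameter values are approximations):

      ```python
      params = 72e9  # Code Qwen 72B

      # Weight memory = parameter count * bytes per parameter (overhead not included).
      for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("~4-bit", 0.55)]:
          print(f"{label:7s} ~{params * bytes_per_param / 1e9:5.0f} GB of weights")
      ```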

      • tehnomad@lemm.ee

        I found a VRAM calculator for LLMs here: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator

        Wow, it seems like for a 128K context size you do need a lot of VRAM (~55 GB). Qwen 72B will take up ~39 GB, so you would either need 4x 24 GB Nvidia cards or a Mac Pro with 192 GB of RAM. Probably the cheapest option would be to deploy GPU instances on a service like Runpod; I think you would have to do a lot of processing before you reach the break-even point of owning your own machine.
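
        Rough break-even math, with placeholder numbers rather than real quotes:

        ```python
        machine_cost = 6_000.0   # assumed: a 192 GB Mac or a multi-GPU box
        rental_per_hour = 2.0    # assumed: a large GPU instance on a cloud service

        hours = machine_cost / rental_per_hour
        months = hours / (8 * 22)  # 8 h/day, 22 working days/month
        print(f"~{hours:.0f} rental hours, i.e. ~{months:.0f} months of full workdays")
        ```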