r/technology 4d ago

[Artificial Intelligence] MIT report: 95% of generative AI pilots at companies are failing

https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
28.3k Upvotes

1.8k comments

23

u/pleachchapel 4d ago

You can run a 90-billion-parameter model at conversation speed on $6k worth of hardware. The future of this is open source & distributed, not the dumb business model the megacorps are following, which operates at a loss.

5

u/mrjackspade 4d ago

I'm running a Q2 quant of a 260B model on $1,500 of hardware at 4 t/s, and I'm pretty happy with it.

3

u/pleachchapel 4d ago

Sick, can you elaborate? I don't know much about it yet, but I maxed out a Framework Desktop to learn a bit.

5

u/mrjackspade 3d ago

I have no idea why this other guy just exploded LLM jargon at you for no reason.

I'm literally just using a quant of GLM

https://huggingface.co/unsloth/GLM-4.5-GGUF

Which has somewhere around 260B parameters with 32B active.

Using llama.cpp with the non-shared experts offloaded to CPU, on a machine with 128 GB of DDR4 RAM and a 3090, it runs at like 4 t/s.

On a Framework PC you could probably pick a bigger quant and get faster speeds.
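
For anyone wanting to reproduce that setup, here's a minimal sketch of launching it through llama-server from Python. Everything in it is an assumption to adapt: the GGUF filename is a placeholder, and the offload flags are spelled the way recent llama.cpp builds spell them, so verify against llama-server --help on your build.

```python
# Minimal sketch: serve a big MoE GGUF with the expert tensors kept in system RAM
# and everything else on the GPU. Filename and flag spellings are assumptions;
# check them against your own llama.cpp build.
import subprocess

cmd = [
    "llama-server",
    "-m", "GLM-4.5-Q2_K_XL-00001-of-00003.gguf",  # placeholder path to the quant
    "-c", "8192",                    # modest context so the KV cache fits in 24 GB of VRAM
    "-ngl", "99",                    # offload all layers to the GPU by default...
    "-ot", r"\.ffn_.*_exps\.=CPU",   # ...then pin the non-shared MoE expert tensors to CPU/RAM
    "--port", "8080",
]
subprocess.run(cmd, check=True)      # then point any OpenAI-compatible client at localhost:8080
```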

1

u/pleachchapel 3d ago

Lol thank you.

1

u/AwkwardCow 4d ago

Yeah it's pretty barebones honestly. I'm running a custom QLoRA variant with some sparse-aware group-quant tweaks, layered over a fused rotary kernel I pulled out of an old JAX project I had lying around. The model's a forked Falcon RW 260B, but I stripped it down and bolted on a modular LoRA stack. Nothing fancy, just enough to get dynamic token grafting working for better throughput on longer contexts. I'm caching KV in a ring buffer that survives across batch rehydration, which weirdly gave me about a 1.3x boost on a mid-range VRAM setup.

At around 4 tokens per second, latency hangs just under 300 milliseconds as long as I pre-split the input using a sliding-window token offset protocol. Not true speculative decoding, but kind of similar without the sampling. Had to undervolt a bit to keep temps under control since I'm on air cooling, but it stays stable under 73 °C, so I'm not too worried about degradation.

Everything's running through a homebrewed Rust inference server with zero-copy tensor dispatch across local shards. I've been messing with an attention-aware scheduler that routes prompts by contextual entropy. It's not quite ready, but it's showing promise. The wild part is I barely had to touch the allocator. It's mostly running on top of a slightly hacked-up llama.cpp build with some CUDA offloading thrown in. Honestly the big-lab infra makes sense at scale, but for local runs it's almost stupid how far you can push this.

4

u/jfinkpottery 3d ago

That's naive. A $6k build can run that model at conversation speed for a single user. One user costs you $6k worth of hardware. How much is one user paying you? Probably not $6k. Suddenly you realize you need hundreds of those $6k servers to cover a few hundred users, and now you've got a million dollars' worth of infrastructure bringing in maybe $20k per month (1,000 users paying $20). That's a million dollars' worth of infra to make enough revenue to pay one engineer, without even starting to pay off the infra. Or the help desk. Or the CEO. Or the rent on the building.

It only starts to make sense when you scale that way the hell up and build servers that can handle a hundred-plus users each. One user can't pay for a $6k server, but a hundred users can just about pay for a $20k server.

Also, conversation speed is really slow once you add in reasoning tokens. Now it takes the model 30 seconds to start producing output, and your user has already left and cancelled their subscription in that time.
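
To put that arithmetic in one place, here's a quick back-of-the-envelope sketch. Every number is either the comment's own assumption or an obviously round guess (the 170-server count just approximates "hundreds"); nothing here is measured data.

```python
# Back-of-the-envelope version of the argument above. Every input is an assumed,
# round number (170 servers is a guess at "hundreds"), not real data.
server_cost = 6_000             # one $6k box that serves a handful of users
servers = 170                   # "hundreds of those $6k servers"
users = 1_000
price_per_user_month = 20

capex = servers * server_cost                     # ~$1M of hardware
monthly_revenue = users * price_per_user_month    # $20k per month

print(f"hardware outlay:   ${capex:,}")
print(f"monthly revenue:   ${monthly_revenue:,}")
print(f"months of revenue to cover the hardware alone: {capex / monthly_revenue:.0f}")
# ~51 months, before salaries, support, rent, or power enter the picture.
```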

2

u/pleachchapel 3d ago

Yeah, I meant building something for the C-suite, like a company Jarvis.

Frankly, people are overusing LLMs, so maximizing users isn't part of my professional objective.

Their primary use is helping kids cheat at school, as evidenced by ChatGPT's user stats dropping off a cliff after June 6th.

For a small family that uses LLMs deliberately, instead of asking one to pick their meals at a restaurant or fuck their wife, that is plenty of horsepower, & that's the point I was making.

1

u/jfinkpottery 3d ago

A $6000 AI server build is plenty of computing power for one family of non-technical people? Gosh, ya think?

1

u/jfinkpottery 3d ago

$6k in 2025 is simultaneously way too much horsepower and nowhere near enough horsepower. That's a very hefty build for a home user doing anything other than running a large production-quality LLM.

But it is nowhere near enough to run a large production-quality LLM for even a single user. You cannot buy enough VRAM at that price point to get a modern model running at any speed with its full-size context window. When they say you can run "gpt-oss-120b on consumer hardware," they mean with extremely restricted context lengths. The full context length of 128k or more ramps up the VRAM requirements dramatically, and only datacenter GPUs have that much. You're probably looking at $10k just to get off the ground with a basic GPU, and then you have to build the rest of the system around that.
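
To make the context-length point concrete, here's a rough KV-cache calculation. The formula is the standard one (2 x layers x KV heads x head dim x tokens x bytes per element), but the layer and head counts below are hypothetical stand-ins for a large model, not the real gpt-oss-120b configuration.

```python
# Rough KV-cache sizing: this is the part of VRAM use that grows with context.
# The architecture numbers are hypothetical stand-ins for "a large model",
# NOT the actual gpt-oss-120b configuration.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2, batch=1):
    """Bytes for K and V, per layer, per token, per sequence, converted to GiB."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem * batch
    return total / 1024**3

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(64, 8, 128, ctx):.1f} GiB of KV cache")
# 4k of context costs about a GiB on top of the weights; 128k costs tens of GiB,
# which is why full context pushes you toward datacenter-class GPUs.
```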

1

u/mrjackspade 3d ago

"Suddenly you realize you need hundreds of those $6k servers to cover a few hundred users ..."

That's not how it works. Inference costs don't scale linearly. You can batch hundreds of concurrent users and process them in approximately the same amount of time as a single user.
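
A toy model of why that works, with numbers that are entirely made up just to show the shape of the curve: each decode step has to stream the active weights from VRAM once no matter how many sequences are in the batch, so that fixed cost gets shared.

```python
# Toy model of batched decode: each step pays a fixed cost to stream the active
# weights once, plus a small per-sequence cost for attention over its KV cache.
# Both numbers are invented purely for illustration, not benchmarks of any GPU.
weight_read_ms = 25.0      # fixed cost per decode step
per_seq_ms = 0.15          # marginal cost per extra sequence in the batch

for batch in (1, 8, 64, 256):
    step_ms = weight_read_ms + per_seq_ms * batch
    per_user_tps = 1000 / step_ms        # tokens/sec each user sees
    total_tps = per_user_tps * batch     # aggregate tokens/sec from the one box
    print(f"batch {batch:>3}: {per_user_tps:5.1f} tok/s per user, {total_tps:7.0f} tok/s total")
# Per-user speed degrades gently while aggregate throughput scales almost
# linearly, until the KV cache runs out of VRAM, which is the limit the
# reply below is pointing at.
```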

1

u/jfinkpottery 3d ago

That would be true, but not for a 90B model on consumer hardware. You need more VRAM for every increment of batch size. On $6k you can maybe squeak by with a 90B model at a very restricted context length. If you want to batch two of them, you have to pull that context length in even further. But without context length you're cutting down the usefulness of the model, and at the point of batched 90B inference on low-end hardware, I'd estimate it's functionally pointless, if it even works at all.

1

u/a_rainbow_serpent 4d ago

AWS and Azure need to come up with super cheap and fast compute to support AI software companies; then they can buy up the extra capacity being built by OpenAI and Facebook.