Nvidia Is About To Collapse The Price Of AI Models
- 14 Jan, 2025
At CES Nvidia showed off a few interesting new things. The biggest of which is Jensen Huang’s new leather jacket. I mean look at that thing: is this a tech keynote or a fashion show?
Honestly, I’m a little surprised more people are not commenting on this. It’s Jensen’s best leather jacket to date.
There are also some other things, like the new RTX 50 series. Surprisingly, the vast majority of the coverage around these cards has been positive. I particularly liked one comment on the Linus Tech Tips video pointing out that the 50 series is a good value compared to the 40 series, but not compared to the 30 series. I think Nvidia knows the demand for new GPUs is dropping. At the keynote they showed off Cyberpunk running at 240 FPS at 8K. Seriously? 8K? Is anyone gaming at 8K? Also, Cyberpunk was released in 2020. Is there honestly no better game to show off the 50 series’ power?
I don’t think there will be many people who want to buy these new cards. The old cards are good enough. Hence the price drop. And many of the improvements are in software, specifically DLSS 4’s multi frame generation, which they’ve purposely restricted to the new 50 series.
But I was reading the comments and there was one notable critique: the lineup caps out at 16GB of memory. Well, except for the 5090, if you want to pay two grand for a graphics card. For normal people, though, 16GB is the ceiling, and as commenters pointed out, that is not a lot of memory.
Especially compared to machines using unified memory, which lets the GPU and CPU share a single pool of memory. It’s a much more efficient system because less memory sits wasted. Both my M1 Air and my Steam Deck have 16GB of unified memory. The Steam Deck, an entire handheld computer, has as much memory as a thousand-dollar GPU that doesn’t even come with a computer attached. Now that is just sad.
Why be so frugal with the memory? Well, part of it is that Nvidia is overpriced. AMD’s cards offer 24GB of memory at the same price. I don’t know why people love Nvidia so much; I’ve always avoided them because I always seem to have problems with Nvidia drivers.
But I think there is another reason Nvidia crippled the VRAM: so you can’t run large language models (LLMs) on these cards. LLMs are what power the newest AI products, and they consume gigabytes of memory. Nvidia would rather sell you two products than one GPU that does everything. Normally I would complain, but their dedicated LLM machine is pretty impressive. Introducing Project Digits.
Project Digits is a compact Linux machine preloaded with all the fancy Nvidia AI software. Think of it like a Mac mini on steroids. They even showed this image of the computer.
Also interesting: if you zoom in, you can tell that this image is AI-generated. The most valuable company in the world using AI to replace someone’s job. That’s an idea only Jensen could love. How much work would it have been to put one of these on someone’s desk and take a picture? The other day I saw someone talking about their desk setup and then posting some random AI-generated picture. What are you doing? If you talk about your desk setup, I expect a picture of your desk setup to be the hero image, not some AI nonsense. The only reason you wouldn’t do that is if you’re not comfortable with your setup, and in that case, why should I read your article?
Anyways, Project Digits comes with 4TB of NVMe storage, 128GB of unified memory, and the latest Blackwell architecture. All for three thousand dollars. Not that much more than the RTX 5090. It really puts into perspective how bad of a value the 5090 is.
Also, just for shits and giggles: if you were to configure a Mac mini with an M4 Pro chip, 64GB of memory, and 4TB of storage, it would be more expensive than Project Digits.
Those upgrade prices are really quite expensive, huh?
Nvidia claims Project Digits can run 200 billion parameter models, and if you hook two of these things together you can run 405 billion parameter models. 405 billion: that’s an unusually exact number. Why that number? Because it’s the size of the largest Llama model, Llama 3.1 405B. Nvidia is basically saying, “Now you can run the latest and greatest Llama models at home without having to pay for a server, something that would have cost a lot of money before.”
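Here’s the back-of-the-envelope math, with the big caveat that I’m assuming the models run with 4-bit quantized weights (Nvidia hasn’t said exactly how they arrive at those parameter counts):

```python
# Rough sanity check on Nvidia's parameter-count claims, assuming 4-bit
# quantized weights. This ignores the KV cache and activations, which add
# some overhead on top of the raw weights.
def weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Approximate memory needed just for the weights, in GB."""
    return params_billion * bits_per_weight / 8

print(weights_gb(200))  # ~100GB -> fits in one 128GB Project Digits box
print(weights_gb(405))  # ~203GB -> needs two boxes (256GB combined)
```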
How much did it cost before? Well, the big labs’ numbers are proprietary, but what we do know is that AWS’ P5 instance costs $98/hour, which works out to roughly $2,350/day. Two Project Digits machines cost six thousand dollars. So if you purchased two Project Digits machines, you’d break even in under three days.
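If you want the arithmetic spelled out, using the rough figures above:

```python
# Break-even arithmetic for two Project Digits machines versus renting an
# AWS P5 instance at the ~$98/hour on-demand rate quoted above.
aws_hourly = 98                 # USD per hour for a P5 instance
aws_daily = aws_hourly * 24     # ~$2,352/day
digits_pair = 2 * 3_000         # two Project Digits machines at $3,000 each

print(digits_pair / aws_daily)  # ~2.6 days to break even
```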
Now you may think that the P5 is overkill. But judging from a thread I read on what people actually pay to run these models, maybe not: people routinely cite prices significantly higher than the cost of two Project Digits machines.
I think it’s safe to say that Project Digits is going to be a complete game changer for AI pricing. I expect companies to start buying racks of these machines and bring down the price of running AI models significantly, the Llama models in particular, because they’re open weight, meaning anyone can download and run them. And a lot of companies are already running them.
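To make the “open weight” point concrete, here’s a minimal sketch of downloading and running a Llama model with Hugging Face’s transformers library. It assumes you’ve been granted access to Meta’s gated weights, and it uses the 8B variant because that’s what actually fits on a consumer GPU; the 405B model is what boxes like Project Digits are for.

```python
# Minimal sketch of running an open-weight Llama model locally with
# Hugging Face transformers. Assumes you have access to the gated repo
# and the accelerate package installed for device_map="auto".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # 16-bit weights: roughly 16GB for 8B parameters
    device_map="auto",   # spread across whatever GPUs/CPU RAM you have
)

prompt = "Summarize why open-weight models change AI pricing."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```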
I was quite disappointed with Llama’s pricing initially; it wasn’t competitive with models from Anthropic, OpenAI, and Google. But we could see that change.
And once Llama’s price comes down, I could see the rest of the industry follow suit. It’s unknown how large some of these models are, except for Gemini Flash 8B: I’m pretty sure that one is only 8 billion parameters, meaning it can fit on a consumer GPU. Honestly, I’m a little disappointed that Flash 8B is only half the price of the full-fat Gemini Flash. If they wanted to, I bet they could drop the price even further. There’s just no need, because Gemini Flash is already the cheapest. Well, I think there’s some AWS model that’s technically cheaper, but it’s AWS so it sucks, just like all the other junk AWS releases.
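For what it’s worth, the napkin math on why 8 billion parameters fits on a consumer card (rounded, ignoring the KV cache and activations):

```python
# Approximate weight memory for an 8B-parameter model at different
# precisions. Even at 16-bit it squeezes into the 50 series' 16GB ceiling.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{8 * bits / 8:.0f} GB of weights")
```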
I’m currently working on an email app, Project Tejido, which will scan every single email using an LLM. I did some back-of-the-envelope calculations and it seemed like a really good idea because it would be incredibly cheap to do. Well, now that I’m actually building the app, it turns out my estimate of the number of tokens I’d need per email was off… by two orders of magnitude. So it is costing me significantly more than expected. It’s still viable, but only barely, unlike my initial calculation, which told me it would be insanely cheap. I’m rooting for the price of LLMs to go down even further. Fingers crossed LLM costs drop by two orders of magnitude again.
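For the curious, the estimate looked roughly like this. The token counts and per-million-token price here are made-up illustrative numbers, not the actual figures behind Project Tejido:

```python
# A rough cost model for the kind of back-of-the-envelope estimate described
# above. All of the inputs below are illustrative assumptions.
def monthly_cost(emails_per_month: int,
                 tokens_per_email: int,
                 price_per_million_tokens: float) -> float:
    """Estimated monthly LLM spend for scanning every email."""
    total_tokens = emails_per_month * tokens_per_email
    return total_tokens / 1_000_000 * price_per_million_tokens

# Original guess: ~100 tokens per email at $0.10 per million tokens.
print(monthly_cost(3_000, 100, 0.10))     # ~$0.03/month

# Reality: two orders of magnitude more tokens per email.
print(monthly_cost(3_000, 10_000, 0.10))  # ~$3.00/month
```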
Now, I’m not sure LLM prices will drop another two orders of magnitude, because that’s getting close to the cost of the electricity alone. But one order of magnitude? Maybe. Because what would it take to really bring down the price of LLMs? Competition. And we haven’t seen much of it recently. Sure, there’s GPT-4o Mini and Claude 3.5 Haiku, but GPT-4o Mini is getting old and Claude 3.5 Haiku is actually more expensive than Claude 3.0 Haiku. Anthropic justifies the price increase by saying it’s better than 3.0 Haiku.
And that’s exactly the problem: the low end of the market is competitive, but the high-end “frontier” models are not. What we need is for the frontier models to come down in price, and the only way to do that is to make compute ridiculously cheap. Nvidia’s Project Digits does just that, and for that reason it is about to collapse the price of AI models.
Update: Many people have been bringing up memory speeds. Nvidia has not disclosed the memory bandwidth of the device, but people have been estimating numbers between 273GB/s and 1TB/s. I don’t expect it to beat cards that cost 5x as much, but I assume it will still be fast enough for LLMs like Llama 405B, which Jensen alluded to, and therefore it will still be significantly cheaper than similarly specced hardware today.
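The reasoning, roughly: during generation, every token has to stream the (quantized) weights through memory once, so bandwidth divided by model size gives a crude upper bound on tokens per second. Using the community’s bandwidth guesses and assuming 4-bit weights:

```python
# Crude decode-speed bound from memory bandwidth alone: each generated token
# streams the full set of quantized weights through memory once. The
# bandwidth figures are community estimates, not Nvidia's spec.
def tokens_per_second(bandwidth_gb_s: float, params_b: float, bits: int = 4) -> float:
    model_gb = params_b * bits / 8  # 405B params at 4-bit is ~203GB
    return bandwidth_gb_s / model_gb

for bw in (273, 1000):
    print(f"{bw} GB/s -> ~{tokens_per_second(bw, 405):.1f} tok/s for Llama 405B (4-bit)")
# 273 GB/s -> ~1.3 tok/s; 1000 GB/s -> ~4.9 tok/s
```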