Thinking of a Mac Mini or Mac Pro for Open-Source LLMs? Cloud VMs Let You Test, Decide, and Save.

Are you considering investing in a Mac Mini or Mac Pro to experiment with open-source Large Language Models (LLMs)? Before making that significant investment, there’s a smarter way to test your needs: cloud virtual machines.

The Journey from API to Local LLMs

Like many developers, my AI journey started with ChatGPT and the OpenAI API. The API was great for integrating specific data, enabling fascinating use cases. But as my confidence grew and I wanted to experiment more, I began feeling constrained by API costs. Every experiment came with the anxiety of potential charges.

This went against one of the core reasons I fell in love with software development — the ability to tinker freely. Unlike physical crafting where mistakes can be costly, software development should feel more like playing with LEGO: invest once, then build and rebuild to your heart’s content.

Why Consider Local LLMs?

There are compelling reasons to explore running LLMs locally:

  • Cost control: Fixed hardware costs instead of per-query API charges
  • Privacy: Keep sensitive data under your control
  • Experimentation freedom: Test and iterate without usage constraints
  • Specialized use cases: Sometimes smaller models are sufficient

Enter Cloud VMs: The Perfect Testing Ground

Before investing in expensive hardware, cloud virtual machines offer an elegant solution:

  • Create and destroy different hardware configurations in minutes
  • Pay only for the time you use
  • Clear upfront costs versus unpredictable API charges
  • Test multiple models to find your sweet spot

There are many reasons, and many options, for starting to play with local LLMs. We'll start with an overview and then go into detail on one option: running them on a cloud VM, in particular on GCP, with an eye on expected costs.

In some interesting use cases we don't need complex reasoning capabilities or access to the vast knowledge base that the big models offer; we only want to leverage, for example, the language manipulation capabilities that small models can provide at lower, fixed costs. In other use cases we might be working with private data that we don't want to share under unclear data usage and storage policies.

These are good reasons to be interested in small LLMs that can be used and managed locally. To start playing, in many cases you can probably use the hardware you already have, if it isn't too old. But what if you would like to understand the capabilities of these small models at a reasonable price, experimenting with different models and hardware until you find a sweet spot?

There is a simple option that addresses this need: spin up a virtual machine with one of the main cloud providers. In a matter of minutes you can create and destroy machines with various configurations and prices. Moreover, with a cloud virtual machine you pay only for the minutes used, with more clarity on costs upfront compared to calling APIs, where costs can escalate with usage.

Hands-On Testing Guide

Let’s walk through the process of setting up and testing LLMs on a cloud VM. I’ll use Google Cloud Platform’s Compute Engine as an example, but the process is similar on other providers.

To quickly evaluate the capabilities of a specific virtual machine I will use an open-source project (https://llm.aidatatools.com/) that measures the performance of small LLMs in terms of average tokens per second after running a set of configurable sample prompts.

Here are the simple steps in detail.

1. Initial Virtual Machine Setup

Create a Linux VM (I started with 16GB RAM, e2 instance type)
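For reference, a VM along these lines can be created directly with the gcloud CLI. This is only a sketch: the instance name, zone, image and disk size are illustrative assumptions, not the exact configuration used below (e2-standard-4 has 4 vCPUs and 16GB of RAM).

# Sketch: create a Debian 12 VM with 16GB RAM (e2-standard-4); name and zone are placeholders
gcloud compute instances create llm-test-vm \
  --machine-type=e2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=100GB \
  --zone=us-central1-a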

2. Basic environment setup

Start with a simple setup: update and upgrade packages, install Python and venv, and activate a virtual environment:

sudo apt update
sudo apt upgrade
sudo apt install python3-venv
python3 --version  # Ensure Python 3 is installed
python3 -m venv --help  # Check if venv is available

sudo apt install python3.11-venv

python3 -m venv myenv
source myenv/bin/activate

3. Install Required Software

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
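If the installation went well you should be able to check the installed version; on most systemd-based distributions the install script also registers Ollama as a service, which you can inspect as a quick sanity check:

ollama --version
systemctl status ollama  # the install script sets up this service on most systemd-based Linux distros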

Install the benchmark software

pip install llm-benchmark

4. Running Benchmarks

The llm_benchmark tool automatically:

  • Detects available hardware
  • Identifies compatible models based on RAM
  • Downloads and tests models with standardized prompts
  • Measures performance in tokens/second

It can simply be run with:

llm_benchmark run

And here is part of an example output

-------Linux----------
No GPU detected.
Total memory size : 15.63 GB
cpu_info: AMD EPYC 7B12
gpu_info: no_gpu
os_version: Debian GNU/Linux 12 (bookworm)
ollama_version: 0.4.7
------
Checking and pulling the following LLM models
phi3:3.8b
qwen2:7b
gemma2:9b
mistral:7b
llama3.1:8b
llava:7b
llava:13b
----------
----------
model_name =    mistral:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch.
eval rate:            3.63 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game
eval rate:            3.79 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis
eval rate:            3.87 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story.
eval rate:            3.90 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S.
eval rate:            3.61 tokens/s
--------------------
Average of eval rate:  3.76  tokens/s
----------------------------------------
....

5. Qualitative Testing

Running the benchmark quickly provides a performance measurement in tokens per second, but to experience in practice what 5 tokens/s means versus 10 or 20, I suggest simply trying some Ollama commands. For example, let's choose the model phi3.

ollama run phi3

And ask a simple question

Can you help me write a python program that prints all the odd numbers up to 50 and all the even numbers from 50 up to 100?

This question, or smarter ones, can also help you understand whether the model you pick is adequate for the task you would like to accomplish.
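If you want to see the timing statistics for your own prompts as well, and not only the benchmark averages, Ollama can report them directly. A couple of options (the second assumes the Ollama server is listening on its default local port, 11434):

# Print load time, prompt eval rate and eval rate (tokens/s) after each answer
ollama run phi3 --verbose

# Or query the local REST API; the JSON response includes eval_count and
# eval_duration, from which tokens per second can be derived
curl http://localhost:11434/api/generate -d '{"model": "phi3", "prompt": "Why is the sky blue?", "stream": false}'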

6. Testing Different Configurations

Basic CPU Configuration (16GB RAM, e2 instance)

Results showed modest performance:

  • 3–6 tokens/second across models
  • Usable but noticeably slow for interactive use
  • Good for initial testing and model compatibility

Improved CPU (n2 instance)

  • No significant improvement over e2
  • Demonstrated that CPU type alone isn’t the bottleneck

Adding GPU Acceleration (Tesla T4)

Important note: you may need to request a GPU quota increase and choose a region where GPUs are available.
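For reference, here is a sketch of a GPU-enabled instance created with the gcloud CLI. The name, zone and machine type are illustrative (n1-standard-4 is close to the 16GB configuration used here), and --maintenance-policy=TERMINATE is required for instances with attached GPUs. You will also need the NVIDIA drivers, either installed manually or by starting from one of the Deep Learning VM images that bundle them.

# Sketch: n1 instance with an attached Tesla T4 (placeholder name and zone)
gcloud compute instances create llm-test-gpu \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=100GB \
  --zone=us-central1-a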

Results showed a dramatic improvement: roughly 10x the tokens per second of the CPU-only runs (see the cost-performance discussion below).

Bridging to Consumer Hardware

One of the most valuable aspects of this testing approach is the ability to compare results with consumer hardware options. The llm-benchmark tool used in our tests is the same one used by the community at llm.aidatatools.com, enabling direct comparisons.

Once you get a feeling for the performance at a specific tokens-per-second rate, you can compare it with the results for consumer hardware that you could buy and keep at home, looking at the results pages:

  • https://llm.aidatatools.com/results-macos.php
  • https://llm.aidatatools.com/results-windows.php
  • https://llm.aidatatools.com/results-linux.php

Wrapping up

Here is a summary of the results of my little experiments, together with results published by other users on the aidatatools.com website for Mac consumer hardware. I focus on Mac because the available options are limited and easier to compare, but you can obviously also consider Windows or Linux PCs, especially with an attached GPU.

In summary, it seems that relatively inexpensive hardware can run the open models, although quite slowly; the Apple M architecture is a good improvement, but for a really significant improvement you need to move to a GPU (even the basic T4) or to the Pro/Max versions of the M processors.

Notes on the importance of RAM

RAM plays a critical role in both the cost and capabilities of running open-source LLM models. While 16GB of RAM is sufficient for smaller models (3B–7B parameters), larger models may require 32GB or more. On cloud-based virtual machines (VMs), it is relatively easy to adjust RAM size based on your needs. However, on consumer hardware, especially devices with integrated RAM like those using Apple’s M-series architecture, upgrading RAM may be difficult or even impossible.

Although the amount of RAM does not directly affect performance in terms of tokens per second, it determines whether certain larger models can be loaded and used effectively. This, in turn, can impact the quality of the responses generated by the model. Techniques like quantization can reduce the memory requirements of larger models, enabling them to run on systems with less RAM. However, these techniques often come at the cost of some precision. To evaluate whether this tradeoff is acceptable, functional testing is essential. Cloud environments are particularly useful for experimenting with different configurations and determining the optimal balance.
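A rough back-of-the-envelope estimate of the memory needed just to hold a model's weights is parameters × bytes per parameter, plus some overhead for the runtime and the context (KV cache). For example, for a 7B-parameter model:

# Approximate weight memory for a 7B model at different precisions (overhead excluded)
awk 'BEGIN { p = 7e9;
  printf "fp16  (2 bytes/param): %.1f GB\n", p * 2   / 1e9;
  printf "8-bit (1 byte/param):  %.1f GB\n", p * 1   / 1e9;
  printf "4-bit (0.5 bytes):     %.1f GB\n", p * 0.5 / 1e9 }'

This is why 16GB of RAM is comfortable for 3B-7B models in their quantized form, while unquantized or much larger models quickly push you towards 32GB or more.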

On the other hand, RAM speed can influence performance, but this metric is not always readily available in reports such as those found on aidatatools.com.

Evaluating the costs of running in the cloud (December 2024 pricing)

Let’s break down the costs for different VM configurations and usage scenarios. All prices are in euros (€):

Basic CPU-Only Configurations (16GB RAM)

  • e2 standard instance: Iowa region: €0.22/hour, Milan region: €0.26/hour
  • Compute-optimized c2: Iowa region: €0.29/hour

GPU-Accelerated Configurations (16GB RAM + T4 GPU)

  • n1 standard instance: Iowa region: €0.60/hour, Frankfurt region: €0.74/hour

For different usage patterns, costs break down as follows:
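As a rough worked example, using the Iowa hourly rates listed above and assuming about 20 hours of experimentation per month (your own usage pattern will obviously differ):

# Illustrative monthly costs at 20 hours/month, Iowa rates quoted above
awk 'BEGIN { h = 20;
  printf "e2 CPU-only (16GB):    %d h x 0.22 EUR = %5.2f EUR/month\n", h, h * 0.22;
  printf "c2 compute-optimized:  %d h x 0.29 EUR = %5.2f EUR/month\n", h, h * 0.29;
  printf "n1 + T4 GPU:           %d h x 0.60 EUR = %5.2f EUR/month\n", h, h * 0.60 }'
# Add roughly 4-10 EUR/month if you keep a 100GB disk between sessions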

Key Insights:

  • Regional Price Variation: European regions (Milan, Frankfurt) are about 20–25% more expensive than US regions (Iowa)
  • GPU Impact: Adding a T4 GPU roughly triples the hourly cost
  • Cost-Performance Trade-off: While GPU instances are more expensive, they offer 10x better performance (as seen in our benchmarks)

Additional Considerations:

  • Storage costs (100GB): €4–10/month depending on type
  • Data transfer costs may apply for large datasets
  • Preemptible instances can offer up to 60% discount but may be interrupted
  • Consider keeping storage between sessions if you're doing regular experimentation

For optimal cost management:
  1. Right-size your RAM based on your target models
  2. Choose US regions when latency isn’t critical
  3. Use CPU-only instances for initial testing and setup
  4. Switch to GPU instances only when running performance-critical workloads
  5. Consider preemptible instances for non-time-critical experimentation

Alternative Testing Options to cheaply play with LLMs

Google Colab

Google Colab can be used for free, but in that case you cannot run arbitrary programs, i.e. you cannot rely on Ollama. You can, however, use the Huggingface Transformers library, and even get free access to a T4 GPU, within the context of a notebook.

Huggingface Endpoints

You can use preconfigured endpoints with already installed small LLM on Huggingface website: https://ui.endpoints.huggingface.co/

If we look for endpoints using an Nvidia T4 for the sake of comparison, we can find a Llama 3.2 model available at a competitive cost of $0.50/hour, although most of the endpoints use at least an L4 GPU at a slightly higher price.

These pre-configured LLM endpoints might be even easier to set up, but they might offer less flexibility.

Cloud Run with GPU (Preview)

Cloud Run is a serverless service provided by Google Cloud Platform (GCP) designed for running containerized applications. It offers several advantages, particularly for proof-of-concept projects or pilot implementations. One of its standout features is automatic scaling: Cloud Run dynamically adjusts resources based on usage, scaling up during high demand and down to zero when idle. This downscaling to zero means you only pay for the exact seconds your service is in use, making it a cost-efficient choice for experimentation and intermittent workloads.

Currently, the ability to add GPUs to Cloud Run is in public preview and is available in a limited number of regions.
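At the time of writing, deploying a GPU-backed Cloud Run service looks roughly like the sketch below. Treat it as an assumption-laden example: the flags come from the beta/preview surface and may change, the container image name is a placeholder, and the preview currently targets L4 GPUs with minimum CPU/memory requirements, so check the current documentation before relying on it.

# Sketch (preview): deploy a containerized LLM server to Cloud Run with one L4 GPU
gcloud beta run deploy ollama-service \
  --image=us-docker.pkg.dev/YOUR_PROJECT/your-repo/ollama:latest \
  --region=us-central1 \
  --gpu=1 --gpu-type=nvidia-l4 \
  --cpu=4 --memory=16Gi \
  --no-cpu-throttling \
  --max-instances=1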

Making Your Decision

Based on my testing, here’s a practical decision framework:

  1. Start with cloud VMs to test different models and configurations
  2. Monitor your usage patterns and costs
  3. Compare performance needs with consumer hardware options
  4. Consider hybrid approaches (cloud for testing, local for production or vice versa, according to your use case)

Remember that the specific performance numbers matter less than understanding your workflow requirements. Use the cloud VM as a testing ground to determine:

  • Which models you actually use
  • Required response speeds for your use case
  • RAM requirements for your preferred models
  • Whether GPU acceleration is worth the cost

Looking Forward

The field of local LLMs is rapidly evolving. Starting with cloud VMs lets you experiment and understand your needs before making significant hardware investments. Whether you ultimately choose cloud VMs, local hardware, or a hybrid approach, hands-on testing is invaluable for making informed decisions.

Want to dive deeper? Check out these resources:

Google Cloud Platform Pricing Calculator

Want to estimate the cost of a VM with various configurations and up-to-date prices? https://cloud.google.com/products/calculator/

Chrome Remote Desktop Setup Guide

The example illustrated here focuses on simple text-based interaction via SSH; however, a VM can also be set up with a graphical user experience. In that case the setup is slightly more complex, and, more importantly, you might need to pay attention to latency and pick a region near where you are, or limit the UI features. A useful resource that guides you through setting up a remote VM and interacting with it graphically through Chrome Remote Desktop is https://cloud.google.com/architecture/chrome-desktop-remote-on-compute-engine

