How to Set up and Run a Local LLM with Ollama and Llama 2

Last week I posted about coming off the cloud, and this week I’m looking at running an open source LLM locally on my Mac. If this feels like part of some “cloud repatriation” project, it isn’t: I’m just interested in tools I can control to add to any potential workflow chain.

Assuming your machine can spare the size and memory, what are the arguments for doing this? Apart from not having to pay the running costs of someone else’s server, you can run queries on your private data without any security concerns.

For this, I’m using Ollama. This is ”a tool that allows you to run open-source large language models (LLMs) locally on your machine”. They have access to a full list of open source models, which have different specializations — like bilingual models, compact-sized models, or code generation models. This started out as a Mac-based tool, but Windows is now available as a preview. It can also be used via Docker.

If you were looking for an LLM as part of a testing workflow, then this is where Ollama fits in:

A GenAI testing presentation from @patrickdubois

For testing, local LLMs controlled from Ollama are nicely self-contained, but their quality and speed suffer compared to the options you have on the cloud. Building a mock framework will result in much quicker tests, but setting these up — as the slide indicates — can be tedious.

I installed Ollama, opened my Warp terminal and was prompted to try the Llama 2 model (for now I’ll ignore the argument that this isn’t actually open source). I assumed I’d have to install the model first, but the run command took care of that:

Looking at the specs for the llama2 7b model, I was far from certain that my ancient pre-M1 Macbook with only 8 GB memory would even run it. But it did, just very slowly.

As you can see, there is already a terminal built in, so I made a quick test query:

This was not quick, but the model is clearly alive. Well, when I say “alive” I don’t quite mean that, as the model is trapped temporally at the point it was built:

If you were wondering, the correct answer to the arithmetic problem is actually 1,223,834,880. Even a cursory glance would show that it could not possibly end in six — and no doubt it would be different if I did it again. Remember, LLM’s are not intelligent, they are just extremely good at extracting linguistic meaning from their models. But you know this, of course.

The convenient console is nice, but I wanted to use the available API. Ollama sets itself up as a local server on port 11434. We can do a quick curl command to check that the API is responding. Here is a non-streaming (that is, not interactive) REST call via Warp with a JSON style payload:

> curl http://localhost:11434/api/generate -d ‘ { “model”: “llama2”, “prompt”: “Why is the sky blue?”, “stream”: false }’

> curl http://localhost:11434/api/generate –d ‘

{

“model”: “llama2”,

“prompt”: “Why is the sky blue?”,

“stream”: false

}’

The response was:

{ “model”:”llama2″, “created_at”:”2024-02-14T13:48:17.751003Z”, “response”: “nThe sky appears blue because of a phenomenon called Rayleigh..” “done”:true, “context”:[518,25580,29962,..], “total_duration”:347735712609, “load_duration”:6372308, “prompt_eval_duration”:6193512000, “eval_count”:368, “eval_duration”:341521220000 }

{

“model”:“llama2”,

“created_at”:“2024-02-14T13:48:17.751003Z”,

“response”: “nThe sky appears blue because of a phenomenon called Rayleigh..”

“done”:true,

“context”:[518,25580,29962,..],

“total_duration”:347735712609,

“load_duration”:6372308,

“prompt_eval_duration”:6193512000,

“eval_count”:368,

“eval_duration”:341521220000

}

The full response line — which covered Rayleigh scattering, light’s wavelength, and the sun’s angle — all looked correct to me.

The common route to gain programmatic control would be to use Python, and maybe a Jupyter Notebook. These are not my tools of choice, so I will try to use some C# bindings. I found some here. Fortunately, OllamaSharp is also available as a package via NuGet.

I’m not too keen on Visual Studio Code, but once you set up a C# console project with NuGet support, it is quick to get going. Here is the code to contact Ollama with a query:

using OllamaSharp; var uri = new Uri(“http://localhost:11434”); var ollama = new OllamaApiClient(uri); // select a model which should be used for further operations ollama. SelectedModel = “llama2”; ConversationContext context = null; context = await ollama.StreamCompletion( “How are you today?”, context, stream => Console.Write(stream.Response) );

using OllamaSharp;

var uri = new Uri(“http://localhost:11434”);

var ollama = new OllamaApiClient(uri);

// select a model which should be used for further operations ollama.

SelectedModel = “llama2”;

ConversationContext context = null;

context = await ollama.StreamCompletion(

“How are you today?”,

context, stream => Console.Write(stream.Response)

);

We eventually get the response directly in the debug console (the blue bit):

That’s nice.

OK, so now we are ready to ask something a little bit more specific. I’ve seen people asking for categorized summaries of their bank accounts, but before I would entrust it with that, let me try something more mundane. I’ll ask for a recipe based on the food in my fridge:

.. string question = “I have the following ingredients in my fridge: aubergine, milk, cheese, peppers. What food could I cook with this and other basic ingredients?”; context = await ollama.StreamCompletion( question, context, stream => Console.Write(stream.Response) );

string question =

“I have the following ingredients in my fridge:

aubergine, milk, cheese, peppers.

What food could I cook with this and other basic ingredients?”;

context = await ollama.StreamCompletion(

question,

context, stream => Console.Write(stream.Response)

);

It took a long time to do this (many hours in fact; I had time to go shopping!) and the Time To First Token (TTFT) was a good few minutes. The result is here:

Given that we did not train the LLM, and didn’t add any recipe texts via Retrieval-augmented generation (RAG) to improve the quality by supplementing the LLM’s internal representation, I think this answer is very impressive. It comprehended what “basic ingredients” meant, and each recipe covers a different style. It also intuited that I didn’t need every one of my ingredients to be used, and correctly figured the distinct ingredient was the aubergine.

I would certainly have the confidence to let this summarize a bank account with set categories, if that was a task I valued. The controllable nature of Ollama was impressive, even on my Macbook. As an added perspective, I talked to the historian/engineer Ian Miell about his use of the bigger Llama2 70b model on a somewhat heftier 128gb box to write a historical text from extracted sources. He also found it impressive, even with the odd ahistorical hallucination.

While things are still in flux with open source LLMs, especially around the issues of training data and bias, the maturity of the solutions is clearly improving, giving reasonable hope for future capability under considered conditions.

David has been a London-based professional software developer with Oracle Corp. and British Telecom, and a consultant helping teams work in a more agile fashion. He wrote a book on UI design and has been writing technical articles ever since….

Leave a Comment Cancel reply