# Local LLM for Coding with Ollama on macOS

> Published at 2025-08-04T16:43:39+03:00

```
      [::]
     _|  |_
   /  o  o  \                       |
  |    ∆    |  <-- Ollama          / \
  |  \___/  |                     /   \
   \_______/             LLM --> / 30B \
    |     |                     / Qwen3 \
   /|     |\                   /  Coder  \
  /_|     |_\_________________/ quantised \
```

## Table of Contents

* [⇢ Local LLM for Coding with Ollama on macOS](#local-llm-for-coding-with-ollama-on-macos)
* [⇢ ⇢ Why Local LLMs?](#why-local-llms)
* [⇢ ⇢ Hardware Considerations](#hardware-considerations)
* [⇢ ⇢ Basic Setup and Manual Code Prompting](#basic-setup-and-manual-code-prompting)
* [⇢ ⇢ ⇢ Installing Ollama and a Model](#installing-ollama-and-a-model)
* [⇢ ⇢ ⇢ Example Usage](#example-usage)
* [⇢ ⇢ Agentic Coding with Aider](#agentic-coding-with-aider)
* [⇢ ⇢ ⇢ Installation](#installation)
* [⇢ ⇢ ⇢ Agentic coding prompt](#agentic-coding-prompt)
* [⇢ ⇢ ⇢ Compilation & Execution](#compilation--execution)
* [⇢ ⇢ ⇢ The code](#the-code)
* [⇢ ⇢ In-Editor Code Completion](#in-editor-code-completion)
* [⇢ ⇢ ⇢ Installation of `lsp-ai`](#installation-of-lsp-ai)
* [⇢ ⇢ ⇢ Helix Configuration](#helix-configuration)
* [⇢ ⇢ ⇢ Code completion in action](#code-completion-in-action)
* [⇢ ⇢ Conclusion](#conclusion)

With all the AI buzz around coding assistants, and being a bit wary of depending on third-party cloud providers, I decided to explore the capabilities of local large language models (LLMs) using Ollama.

Ollama is a tool for running AI models directly on your own hardware, so you can enjoy the benefits of intelligent assistance without relying on cloud services. This document outlines my initial setup and experiences with Ollama, with a focus on coding tasks and agentic coding.

[https://ollama.com/](https://ollama.com/)  

## Why Local LLMs?

Using local AI models through Ollama offers several advantages:

* Data Privacy: Keep your code and data completely private by processing everything locally.
* Cost-Effective: Reduce reliance on expensive cloud API calls.
* Reliability: Works seamlessly even with spotty internet or offline.
* Speed: Avoid network latency while coding. That said, I mostly found Ollama slower than commercial LLM providers, although that may change as models and hardware evolve.

## Hardware Considerations

Running large language models locally is currently limited by consumer hardware capabilities:

* GPU Memory: Most consumer-grade GPUs (even in 2025) top out at 16–24GB of VRAM, making it challenging to run larger models such as those with 30B (30 billion) parameters (and models go up to 100 billion parameters and more).
* RAM Constraints: On my MacBook Pro with M3 CPU and 36GB RAM, I chose a 14B model (`qwen2.5-coder:14b-instruct`) as it represents a practical balance between capability and resource requirements.

For reference, here are some key points about running large LLMs locally:

* Models larger than 30B: I don't even think about running them locally. A model with several hundred billion parameters (e.g. from Qwen, DeepSeek or Kimi K2) could match the "performance" of commercial LLMs (Claude Sonnet 4, etc.). Still, for personal use, the hardware demands are just too high (unless you temporarily "rent" the hardware via the public cloud).
* 30B models: Require roughly 60GB of memory for the weights alone at 16-bit precision, i.e. without quantisation. Currently only feasible on high-end professional GPUs (or an Apple silicon Mac with enough unified RAM).
* 14B models: Can run with 16-24GB GPU memory (VRAM), suitable for consumer-grade hardware (or use a quantised larger model).
* 7B-13B models: Best fit for mainstream consumer hardware, requiring minimal VRAM and running smoothly on mid-range GPUs, but with limited capabilities compared to larger models and more hallucinations.
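
As a rough back-of-the-envelope check of the sizes above, the memory needed for the weights alone is approximately the parameter count times the bits per weight, divided by eight. The sketch below is only an estimate under that assumption: the helper name `weightGB` is made up for illustration, and the figures ignore the context (KV) cache, runtime overhead, and the extra metadata that real quantisation formats store.

```go
package main

import "fmt"

// weightGB estimates the memory needed for the model weights alone, in
// decimal gigabytes: (billions of parameters) * (bits per weight) / 8.
// It ignores the KV cache, activations, and runtime overhead.
func weightGB(paramsBillion, bitsPerWeight float64) float64 {
	return paramsBillion * bitsPerWeight / 8
}

func main() {
	for _, params := range []float64{7, 14, 30} {
		for _, bits := range []float64{16, 8, 4} {
			fmt.Printf("%2.0fB @ %2.0f-bit: ~%.1f GB\n", params, bits, weightGB(params, bits))
		}
	}
}
```

This is why a quantised 30B model can fit on a machine that could never hold the same model at 16-bit precision.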

The model I'll be mainly using in this blog post (`qwen2.5-coder:14b-instruct`) is particularly interesting as:

* `instruct`: Indicates this is the instruction-tuned variant, optimised for diverse tasks including coding
* `coder`: Tells me that this model was trained on a mix of code and text data, making it especially effective for programming assistance

[https://ollama.com/library/qwen2.5-coder](https://ollama.com/library/qwen2.5-coder)  
[https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct)  

For general thinking tasks, I found `deepseek-r1:14b` to be useful (in the future, I also want to try other `qwen` models here). For instance, I utilised `deepseek-r1:14b` to format this blog post and correct some English errors, demonstrating its effectiveness in natural language processing tasks. Additionally, it has proven invaluable for adding context and enhancing clarity in technical explanations, all while running locally on the MacBook Pro. Admittedly, it was a lot slower than "just using ChatGPT", but still within a minute or so. 

[https://ollama.com/library/deepseek-r1:14b](https://ollama.com/library/deepseek-r1:14b)  
[https://huggingface.co/deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)  

A quantised LLM (as mentioned above) is one whose weights have been converted from high-precision representations (typically 16- or 32-bit floating point) to lower-precision formats, such as 8-bit integers. This reduces the overall memory footprint of the model, making it significantly smaller and enabling it to run more efficiently on hardware with limited resources or to allow higher throughput on GPUs and CPUs. The benefits of quantisation include reduced storage and faster inference times due to simpler computations and better memory bandwidth utilisation. However, quantisation can introduce a drop in model accuracy because the lower numerical precision means the model cannot represent parameter values as precisely. In some cases, it may lead to instability or unexpected outputs in specific tasks or edge cases.

## Basic Setup and Manual Code Prompting

### Installing Ollama and a Model

To install Ollama, I performed these steps (this assumes that you have already installed Homebrew on your macOS system):

```sh
brew install ollama
rehash
ollama serve
```

This started up the Ollama server with output like the following (the screenshot already shows some requests being made):

[![Ollama serving](./local-coding-LLM-with-ollama/ollama-serve.png "Ollama serving")](./local-coding-LLM-with-ollama/ollama-serve.png)  

And then, in a new terminal, I pulled the model with:

```sh
ollama pull qwen2.5-coder:14b-instruct
```
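
To double-check from code that the server is up and the model has been pulled, one option is to query Ollama's HTTP API, which listens on `localhost:11434` by default and lists local models under `/api/tags`. The following is a small sketch rather than something I used in the original setup; the helper `parseModelNames` is a made-up name, and only the `name` field of the reply is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// tagsResponse mirrors only the part of the /api/tags reply used here.
type tagsResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// parseModelNames extracts the model names from an /api/tags JSON body.
func parseModelNames(body []byte) ([]string, error) {
	var tags tagsResponse
	if err := json.Unmarshal(body, &tags); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(tags.Models))
	for _, m := range tags.Models {
		names = append(names, m.Name)
	}
	return names, nil
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/tags")
	if err != nil {
		fmt.Fprintln(os.Stderr, "is `ollama serve` running?", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading response:", err)
		return
	}
	names, err := parseModelNames(body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "decoding response:", err)
		return
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

Running `ollama list` in the terminal gives the same information without any code.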

Now I was ready to go, and it wasn't difficult at all. Let's see how I used this model for coding tasks.

### Example Usage

I ran the following command to get a Go function for calculating Fibonacci numbers:

```sh
time echo "Write a function in golang to print out the Nth fibonacci number, \
  only the function without the boilerplate" | ollama run qwen2.5-coder:14b-instruct

Output:

func fibonacci(n int) int {
    if n <= 1 {
        return n
    }
    a, b := 0, 1
    for i := 2; i <= n; i++ {
        a, b = b, a+b
    }
    return b
}

Execution Metrics:

Executed in    4.90 secs      fish           external
   usr time   15.54 millis    0.31 millis   15.24 millis
   sys time   19.68 millis    1.02 millis   18.66 millis
```

> Note, after having written this blog post, I tried the same with the newer model `qwen3-coder:30b-a3b-q4_K_M` (which "just" came out, and it's a quantised 30B model), and it was much faster:

```
Executed in    1.83 secs      fish           external
   usr time   17.82 millis    4.40 millis   13.42 millis
   sys time   17.07 millis    1.57 millis   15.50 millis
```

[https://ollama.com/library/qwen3-coder:30b-a3b-q4_K_M](https://ollama.com/library/qwen3-coder:30b-a3b-q4_K_M)  
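
Instead of piping a prompt into `ollama run`, the same request can be sent to the local server's `/api/generate` endpoint. The sketch below assumes the default address `http://localhost:11434` and a non-streaming reply; the function names are illustrative, and only the `response` field of the reply is read.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// generateRequest holds the fields of an /api/generate request used here.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

// generateResponse holds the only reply field this sketch cares about.
type generateResponse struct {
	Response string `json:"response"`
}

// encodeRequest builds the JSON body for a non-streaming generate call.
func encodeRequest(model, prompt string) ([]byte, error) {
	return json.Marshal(generateRequest{Model: model, Prompt: prompt, Stream: false})
}

// generate sends the prompt to Ollama and returns the generated text.
func generate(baseURL, model, prompt string) (string, error) {
	payload, err := encodeRequest(model, prompt)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama returned %s: %s", resp.Status, body)
	}
	var out generateResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	return out.Response, nil
}

func main() {
	text, err := generate("http://localhost:11434", "qwen2.5-coder:14b-instruct",
		"Write a function in golang to print out the Nth fibonacci number, only the function without the boilerplate")
	if err != nil {
		fmt.Fprintln(os.Stderr, "generate failed:", err)
		return
	}
	fmt.Println(text)
}
```

Setting `stream` to `false` keeps the example short, because the server then returns one JSON object instead of a sequence of partial replies.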

## Agentic Coding with Aider

### Installation

Aider is a tool that enables agentic coding by leveraging AI models (including local ones, as in our case). Setting up OpenAI Codex and OpenCode with Ollama proved challenging: those tools either didn't know how to work with "tools" (for example, the capability to execute external commands or to edit files) or didn't connect to Ollama at all for some reason. Aider, by contrast, worked smoothly.

To get started, the only thing I had to do was to install it via Homebrew, initialise a Git repository, and then start Aider with the Ollama model `ollama_chat/qwen2.5-coder:14b-instruct`:

```sh
brew install aider
mkdir -p ~/git/aitest && cd ~/git/aitest && git init
aider --model ollama_chat/qwen2.5-coder:14b-instruct
```

[https://aider.chat](https://aider.chat)  
[https://opencode.ai](https://opencode.ai)  
[https://github.com/openai/codex](https://github.com/openai/codex)  

### Agentic coding prompt

This is the prompt I gave:

```
Create a Go project with these files:

* `cmd/aitest/main.go`: CLI entry point
* `internal/version.go`: Version information (0.0.0), should be printed when the
   program was started with `-version` flag
* `internal/count.go`: File counting functionality, the program should print out
   the number of files in a given subdirectory (the directory is provided as a
   command line flag with `-dir`), if none flag is given, no counting should be
   done
* `README.md`: Installation and usage instructions
```

It then generated a project, but the result did not work out of the box, as it had some issues with the imports and package names. So I had to use some follow-up prompts to fix those issues, along these lines:

```
* Update import paths to match module name, github.com/yourname/aitest should be
  aitest in main.go
* The package names of internal/count.go and internal/version.go should be
  internal, and not count and version.
```

[![Aider fixing the packages](./local-coding-LLM-with-ollama/aider-fix-package.png "Aider fixing the packages")](./local-coding-LLM-with-ollama/aider-fix-package.png)  

### Compilation & Execution

Once that was done, the project was ready and I could compile and run it:

```sh
go build cmd/aitest/main.go
./main -v
0.0.0
./main -dir .
Number of files in directory .: 4
```

### The code

The code it generated was simple, but functional. The `./cmd/aitest/main.go` file:

```go
package main

import (
	"flag"
	"fmt"
	"os"

	"aitest/internal"
)

func main() {
	var versionFlag bool
	flag.BoolVar(&versionFlag, "v", false, "print version")
	dir := flag.String("dir", "", "directory to count files in")
	flag.Parse()

	if versionFlag {
		fmt.Println(internal.GetVersion())
		return
	}

	if *dir != "" {
		fileCount, err := internal.CountFiles(*dir)
		if err != nil {
			fmt.Fprintf(os.Stderr, "Error counting files: %v\n", err)
			os.Exit(1)
		}
		fmt.Printf("Number of files in directory %s: %d\n", *dir, fileCount)
	} else {
		fmt.Println("No directory specified. No count given.")
	}
}
```

The `./internal/version.go` file:

```go
package internal

var Version = "0.0.0"

func GetVersion() string {
	return Version
}
```

The `./internal/count.go` file:

```go
package internal

import (
	"os"
)

func CountFiles(dir string) (int, error) {
	files, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}

	count := 0
	for _, file := range files {
		if !file.IsDir() {
			count++
		}
	}

	return count, nil
}
```

The code is quite straightforward. For generating boilerplate code in particular, this will be useful for many use cases!

## In-Editor Code Completion

To leverage Ollama for real-time code completion in my editor, I have integrated it with Helix, my preferred text editor. Helix supports the LSP (Language Server Protocol), which enables advanced code completion features. `lsp-ai` is an LSP server that can interface with Ollama models for code completion tasks.

[https://helix-editor.com](https://helix-editor.com)  
[https://github.com/SilasMarvin/lsp-ai](https://github.com/SilasMarvin/lsp-ai)  

### Installation of `lsp-ai`

I installed `lsp-ai` via Rust's Cargo package manager (if you don't have Rust installed, you can install it via Homebrew as well):

```sh
cargo install lsp-ai
```

### Helix Configuration

I edited `~/.config/helix/languages.toml` to include:

```
[[language]]
name = "go"
auto-format= true
diagnostic-severity = "hint"
formatter = { command = "goimports" }
language-servers = [ "gopls", "golangci-lint-lsp", "lsp-ai", "gpt" ]
```

Note that there is also a `gpt` language server configured, which is for GitHub Copilot, but it is out of scope of this blog post. Let's also configure `lsp-ai` settings in the same file:

```
[language-server.lsp-ai]
command = "lsp-ai"

[language-server.lsp-ai.config.memory]
file_store = { }

[language-server.lsp-ai.config.models.model1]
type = "ollama"
model =  "qwen2.5-coder"

[language-server.lsp-ai.config.models.model2]
type = "ollama"
model = "mistral-nemo:latest"

[language-server.lsp-ai.config.models.model3]
type = "ollama"
model = "deepseek-r1:14b"

[language-server.lsp-ai.config.completion]
model = "model1"

[language-server.lsp-ai.config.completion.parameters]
max_tokens = 64
max_context = 8096

## Configure the messages per your needs
[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "system"
content = """\
Instructions:\n- You are an AI programming assistant.\n- Given a \
piece of code with the cursor location marked by \"<CURSOR>\", replace \
\"<CURSOR>\" with the correct code or comment.\n- First, think step-by-step.\n\
- Describe your plan for what to build in pseudocode, written out in great \
detail.\n- Then output the code replacing the \"<CURSOR>\"\n- Ensure that your \
completion fits within the language context of the provided code snippet (e.g., \
Go, Ruby, Bash, Java, Puppet DSL).\n\nRules:\n- Only respond with code or \
comments.\n- Only replace \"<CURSOR>\"; do not include any previously written \
code.\n- Never include \"<CURSOR>\" in your response\n- If the cursor is within \
a comment, complete the comment meaningfully.\n- Handle ambiguous cases by \
providing the most contextually appropriate completion.\n- Be consistent with \
your responses."""

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "func greet(name) {\n    print(f\"Hello, {<CURSOR>}\")\n}"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "assistant"
content = "name"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "func sum(a, b) {\n    return a + <CURSOR>\n}"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "assistant"
content = "b"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "func multiply(a, b int ) int {\n    a * <CURSOR>\n}"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "assistant"
content = "b"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "// <CURSOR>\nfunc add(a, b) {\n    return a + b\n}"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "assistant"
content = "Adds two numbers"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "// This function checks if a number is even\n<CURSOR>"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "assistant"
content = "func is_even(n) {\n    return n % 2 == 0\n}"

[[language-server.lsp-ai.config.completion.parameters.messages]]
role = "user"
content = "{CODE}"
```

As you can see, I have also added other models, such as Mistral Nemo and DeepSeek R1, so that I can switch between them in Helix. Other than that, the completion parameters are interesting. They define how the LLM should interact with the text in the text editor based on the given examples.
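
To illustrate the idea behind those messages: the example user/assistant pairs act as few-shot demonstrations, and the final `{CODE}` placeholder is filled with the current buffer, with `<CURSOR>` marking where the completion should go. The sketch below reflects my reading of the configuration and is not `lsp-ai`'s actual implementation; the type and function names are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// message is one chat turn, matching the role/content pairs in the config.
type message struct {
	Role    string
	Content string
}

// buildMessages copies the few-shot template and replaces the {CODE}
// placeholder with the buffer text, marking the cursor position with <CURSOR>.
func buildMessages(template []message, before, after string) []message {
	code := before + "<CURSOR>" + after
	out := make([]message, len(template))
	for i, m := range template {
		out[i] = message{Role: m.Role, Content: strings.ReplaceAll(m.Content, "{CODE}", code)}
	}
	return out
}

func main() {
	// A shortened version of the template from the configuration above.
	template := []message{
		{"system", "Replace \"<CURSOR>\" with the correct code or comment."},
		{"user", "func sum(a, b) {\n    return a + <CURSOR>\n}"},
		{"assistant", "b"},
		{"user", "{CODE}"},
	}

	// Text before and after the cursor in the editor buffer.
	msgs := buildMessages(template, "func fibonacci(n int) int {\n    ", "\n}")
	for _, m := range msgs {
		fmt.Printf("[%s]\n%s\n\n", m.Role, m.Content)
	}
}
```

The model then sees a conversation in which every previous "user" snippet was answered with just the missing piece, which nudges it to answer the last one the same way.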

If you want to see more `lsp-ai` configuration examples, there are some for Vim and Helix in the `lsp-ai` Git repository!

### Code completion in action

The screenshot shows how Ollama's `qwen2.5-coder` model provides code completion suggestions within the Helix editor. LSP auto-completion is triggered by leaving the cursor at position `<CURSOR>` for a short period in the code snippet, and Ollama responds with relevant completions based on the context.

[![Completing the fib-function](./local-coding-LLM-with-ollama/helix-lsp-ai.png "Completing the fib-function")](./local-coding-LLM-with-ollama/helix-lsp-ai.png)  

In the LSP auto-completion, the one prefixed with `ai - ` was generated by `qwen2.5-coder`, the other ones are from other LSP servers (GitHub Copilot, Go linter, Go language server, etc.).

I found GitHub Copilot to be still faster than `qwen2.5-coder:14b`, but the local LLM is already workable for me. And, as mentioned earlier, things will likely improve in the future regarding local LLMs. So I am excited about the future of local LLMs and coding tools like Ollama and Helix.

> After trying `qwen3-coder:30b-a3b-q4_K_M` (following the publication of this blog post), I found it to be significantly faster and more capable than the previous model, making it a promising option for local coding tasks. Experimentation reveals that even current local setups are surprisingly effective for routine coding tasks, offering a glimpse into the future of on-machine AI assistance.

## Conclusion

Will there ever be a time when we can run larger models (60B, 100B, and beyond) on consumer hardware, or even on our phones? We are not quite there yet, but I am optimistic that we will see improvements in the next few years. As hardware becomes more capable and/or cheaper, and as more efficient models are developed (or new techniques are invented to make language models more effective), the landscape of local AI coding assistants will continue to evolve.

For now, even the models listed in this blog post are already very promising, and they run on consumer-grade hardware. Admittedly, the tests in this blog post are overly simplistic, but they were good for getting started with Ollama and for an initial demonstration. I will continue experimenting with Ollama and other local LLMs to see how they can enhance my coding experience. I may cancel my Copilot subscription, which I currently use only for in-editor auto-completion, at some point.

However, truth be told, I don't think the setup described in this blog post currently matches the performance of commercial models like Claude Code (Sonnet 4, Opus 4), Gemini 2.5 Pro, the OpenAI models and others. Maybe we could get close if we had the high-end hardware needed to run the largest Qwen Coder model available. But, as mentioned already, that is out of reach for occasional coders like me. Furthermore, I want to continue coding manually to some degree, as otherwise I will start to forget how to write for-loops, which would be awkward... However, do we always need the best model when AI can help generate boilerplate or repetitive tasks even with smaller models?

E-Mail your comments to `paul@nospam.buetow.org` :-)

Other related posts are:

[2025-08-05 Local LLM for Coding with Ollama on macOS (You are currently reading this)](./2025-08-05-local-coding-llm-with-ollama.md)  
[2025-06-22 Task Samurai: An agentic coding learning experiment](./2025-06-22-task-samurai.md)  

[Back to the main site](../)