Open-source LLMs put to the test: How well suited are they for development support?

Julian Lang

[Illustration: two screens with code]

1. Intro

In this blog article, we address the question of whether open-source models hosted locally within companies are mature enough to compete with the “big” commercial models such as Claude or ChatGPT. The hype surrounding AI – regardless of the industry – could hardly be greater: everywhere, automation and efficiency gains are promised, because machines are supposed to achieve more in the same amount of time than dozens of people ever could. We investigate the extent to which this also applies to free open-source models. To do this, we took a realistic Angular example project and gave the LLMs Code Llama 3.1, DeepSeek Coder 2, Phind Code Llama, and Wizard Coder a series of tasks.

The results are then compared with Claude Sonnet 4’s answers to provide a solid reference value. Who wins? And by how much? Read on to find out.

The results of this test include unavoidable subjective assessments and therefore inevitably reflect our perspective. Some readers will certainly have had different experiences and therefore not share our opinion on the test results. And that’s perfectly fine!

1.1 Motivation

Large language models (LLMs) are no longer just hype – they have established themselves as practical aids in software development. They can offer valuable support, especially in web development, where frameworks, libraries, and best practices are constantly evolving. But how good are local open-source models with up to 60B parameters really? And is it worth using them in a professional environment?

That’s exactly what we want to find out in our test. Because if open-source LLMs deliver what they promise, they could not only optimize our workflows, but also represent a more independent and, above all, more privacy-friendly alternative to proprietary models.

1.2 Tested LLMs

For our test, we selected the best-known and most successful open-source models. These include models developed and trained by large corporations such as Meta and DeepSeek AI, as well as models that rank highly on the Big Code Models Leaderboard.

Name | Number of parameters | Open source | Description
Code Llama 3.1 | 8B | Yes | Meta's powerful model for code generation and understanding.
DeepSeek Coder 2 | 15.7B | Yes | Specialized in code analysis and generation with extended context.
Phind Code Llama | 34B | Yes | Optimized for accurate answers and fast code assistance.
Wizard Coder | 33B | Yes | Finely tuned for complex programming tasks and refactorings.
Claude Sonnet 4 | not public | No | Anthropic's proprietary model with strong code comprehension capabilities.

2. Test setup

The test takes place in a realistic Angular 18 project, which serves as the basis for our evaluation. We use VS Code as our IDE with the Continue extension, which allows us to integrate locally hosted models. The open-source LLMs are deployed on our own infrastructure using Ollama and wired into Continue, giving us convenient access to the models from within the IDE.

Hardware Limitations

By hosting on our own infrastructure, we are naturally subject to hardware limitations that restrict the maximum number of parameters in the models, for example. The largest models that still run smoothly for us have 60 billion parameters. With larger models, the test would certainly turn out differently here and there.

2.1 Procedure

All models receive the same eight tasks with identical prompts and context data. The tests cover various aspects of web development:

  • Explain app and routing structure (code comprehension): The LLM should explain the routing structure and shed light on the app’s architecture.
  • Analyze a service’s API (code analysis): Here, the responsibility of a service and how it is used in the app should be explained.
  • Add a simple feature (code modification): The LLMs should add the information “word count” and “reading time” to an article preview.
  • Fix a bug (critical thinking): A bug with very subtle effects was built in here. The LLMs should find and fix it.
  • Refactoring initialization observables (refactoring): Here, the models were presented with several component initialization logics based on RxJS that share a common pattern. This pattern should be found and cleanly extracted.
  • Accessibility check (A11y knowledge): This task tests the recognition of accessibility issues in components.
  • Writing unit tests (test automation): The LLMs are presented with a component for which they are to write unit tests.
  • Anti-pattern recognition (framework-specific knowledge): This task tests knowledge of best practices, whether Angular-specific or general to software development.
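For the sake of brevity, the exact prompts and solutions are not reproduced in this article, but the feature task above (word count and reading time) essentially boils down to a small helper like the following. This is only our own illustrative sketch; the function name `computeReadingStats` and the 200-words-per-minute default are our assumptions, not code from the test project:

```typescript
// Hypothetical helper behind a "word count / reading time" article preview.
interface ReadingStats {
  wordCount: number;
  readingTimeMinutes: number;
}

function computeReadingStats(text: string, wordsPerMinute = 200): ReadingStats {
  // Split on whitespace; the filter guards against the empty string
  // that splitting "" would otherwise produce.
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  const wordCount = words.length;
  // Round up and show at least one minute, as article previews usually do.
  const readingTimeMinutes = Math.max(1, Math.ceil(wordCount / wordsPerMinute));
  return { wordCount, readingTimeMinutes };
}
```

In the Angular project, a pure function like this would typically be exposed through a pipe or computed in the preview component, so it stays trivially unit-testable.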

2.2 Evaluation

The answers are evaluated according to the following criteria:

  • Accuracy: Are the statements correct or misleading?
  • Completeness: Are any essential aspects missing?
  • Comprehensibility: Is the answer clear and structured?
  • Relevance: Does the solution address the core of the question?

Each aspect is given a score of 1–10, from which the average is calculated as the overall rating.
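To make the arithmetic explicit: the overall rating is the plain average of the four criteria scores. The `Scores` interface and `overallRating` helper below are just our own sketch of that calculation, not part of the test tooling:

```typescript
// Four criteria, each scored 1-10; the overall rating is their average.
interface Scores {
  accuracy: number;
  completeness: number;
  comprehensibility: number;
  relevance: number;
}

function overallRating(s: Scores): number {
  const values = [s.accuracy, s.completeness, s.comprehensibility, s.relevance];
  const avg = values.reduce((sum, v) => sum + v, 0) / values.length;
  // Round to two decimal places, matching the results table.
  return Math.round(avg * 100) / 100;
}
```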

Results …

For the sake of brevity, we have not included the exact tasks and their answers from the LLMs here; the article only shows the grading and evaluation of the results.

Classification of the evaluation

Although we have made every effort to keep the ratings as objective as possible, a certain degree of bias cannot be ruled out. Not everyone will agree with all the scores awarded, and that is perfectly legitimate. Work environments and requirements often vary greatly, so a “one-size-fits-all” rating is not possible.

Task | Claude Sonnet 4 | Code Llama | DeepSeek Coder 2 | Phind Code Llama | Wizard Coder
1 – Understanding the code | 9.75 | 7.75 | 8.75 | 8.25 | 7.25
2 – Code analysis | 10.0 | 8.25 | 9.25 | 7.0 | 9.0
3 – Feature expansion | 10.0 | 6.25 | 7.75 | 4.0 | 8.75
4 – Bug fixing | 10.0 | 5.25 | 5.75 | 5.0 | 3.5
5 – Refactoring of observables | 5.5 | 4.25 | 1.0 | 4.75 | 2.5
6 – Accessibility check | 9.75 | 5.0 | 5.25 | 6.25 | 5.25
7 – Writing unit tests | 8.75 | 7.0 | 4.25 | 8.75 | 3.0
8 – Angular best practices | 10.0 | 1.25 | 8.25 | 2.0 | 2.75
Ø / 10 | 9.22 | 5.63 | 6.28 | 5.75 | 5.25

Explanations of the extremes

In order to better classify the highest and lowest scores, we have briefly described the extremes here:

Code Llama
Best: 8.25 points for code analysis
  • Recognized methods and HTTP endpoints are correct
  • Clear functional description of the service
  • However: filter parameters are unclear, error handling is missing
Worst: 1.25 points for Angular best practices
  • Almost all suggestions were incorrect or irrelevant
  • Did not understand takeUntilDestroyed
  • Only recognized single responsibility principle (SRP) violations
  • However: since Angular is used less frequently than other front-end frameworks such as React, we consider this result understandable, as the LLM training data will contain correspondingly little Angular knowledge

DeepSeek Coder 2
Best: 9.25 points for code analysis
  • Recognized methods and HTTP endpoints are correct
  • Clear functional description of the service
  • However: missing parameters/return types, outdated DI recommendation
Worst: 1.0 points for refactoring an observable
  • The task could not be completed
  • After repeated attempts, the output always stops at the same point

Phind Code Llama
Best: 8.75 points for writing unit tests
  • Tests compiled directly
  • Best practices such as dependency doubles and maintaining smoke tests were followed
  • The template is also tested
  • However: incorrectly added a standalone component to the declarations array
Worst: 2.0 points for Angular best practices
  • Outdated unsubscribe patterns recommended
  • Unnecessary component extractions suggested
  • No helpful architecture recommendations
  • However: since Angular is used less frequently than other front-end frameworks such as React, we consider this result understandable, as the LLM training data will contain correspondingly little Angular knowledge

Wizard Coder
Best: 9.0 points for code analysis
  • Good, concise, and precise listing of the class API
  • Uses technical terms and concepts
  • Shows that it is a REST API
  • Addresses the topic of error handling
Worst: 2.5 points for refactoring an observable
  • Suggested code that does not compile (ignores encapsulation)
  • Generated code does not match the original in terms of content
  • Introduces a bug because the wrong service is called
  • Poor naming of functions

Claude Sonnet 4
Best: 10.0 points for code analysis, feature expansion, and bug fixing
  • Good and concise description of APIs and their benefits
  • Proposed code is simple, well named, and compiles
  • Cause and scenario of the error are explained coherently and the bug is found
Worst: 5.5 points for refactoring an observable
  • Code compiles, but the abstraction created offers no added value
  • A code snippet was reproduced semantically 1:1

… And evaluation

The proprietary Claude model clearly dominates with an average score of 9.2/10, outperforming open-source models (average score of 5.6–6.3) in almost all tasks, especially complex debugging, refactoring, and framework knowledge. Here is a summary of the results table:

Code Llama
  • Strengths: high scores in routine tasks (Tasks 1 and 2)
  • Weaknesses: complex implementations with semantic details; outdated knowledge (Tasks 5 and 8)

DeepSeek Coder 2
  • Strengths: robust in descriptive tasks (Tasks 1 and 2)
  • Weaknesses: fails at architecture/pattern tasks (Tasks 5 and 8); the refactoring task could not be completed; accessibility analysis rather superficial

Phind Code Llama
  • Strengths: very good in structured language and tests (Tasks 1 and 7)
  • Weaknesses: weak in Angular details and refactoring; complete failure in anti-pattern recognition (Tasks 5 and 8)

Wizard Coder
  • Strengths: very strong in explicit tasks with a clear structure (Tasks 2 and 3)
  • Weaknesses: complete failures in anti-pattern detection (Tasks 5 and 8); difficulty identifying the exact cause of errors and producing testable code

Claude Sonnet 4
  • Strengths: convincing in almost all disciplines; good, up-to-date knowledge of Angular and best practices
  • Weaknesses: weak when semantic details need to be refactored in a meaningful way

Strengths of all LLMs:

  • Code understanding:
      • Summarizing obvious app architectures
      • Summarizing class APIs
      • Explaining how classes and components are used

Mixed performance:

  • Unit test generation:
      • from “too complicated to get up and running” to “compiled out of the box”
      • from “relevant things omitted” to “too many details tested”
  • Feature expansion:
      • from “contains absolute beginner-level errors” to “flawlessly implemented”

Weaknesses of open-source LLMs:

  • Finding subtle bugs:
      • the bug is not pinned down to one or a few specific locations; instead, many possibilities are “suggested”
      • a kind of checklist is provided that does not necessarily have anything to do with the error
      • the explanations for the bug sometimes make little sense in the actual error context
  • Poor critical thinking (bug fixing, best practices)
  • Superficial knowledge of accessibility and Angular
  • Superficial knowledge of accessibility and Angular

Weaknesses of all LLMs:

  • RxJS / observables refactoring:
      • meaningless abstractions are created very frequently
      • this also results in poor and incomprehensible naming
      • in most cases, the code no longer compiles after refactoring
      • the code becomes more complicated, not simpler
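This refactoring weakness is easiest to see by contrast with what a useful extraction looks like. The sketch below is purely our own illustration in plain TypeScript (the real task used RxJS observables, which we omit here so the example stays self-contained); `initWithFallback` is a hypothetical name, not code from the test project:

```typescript
// Several components shared the same initialization shape: run a loading
// step, map its result, and fall back to a safe default if it throws.
// Extracting exactly that shape keeps call sites simpler, not more complex.
function initWithFallback<T, R>(
  load: () => T,
  mapFn: (value: T) => R,
  fallback: R
): R {
  try {
    return mapFn(load());
  } catch {
    return fallback;
  }
}
```

A good extraction like this compiles, has a name that states its intent, and makes the call sites shorter – the opposite of the abstractions the open-source models tended to invent.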

Conclusion

Open-source LLMs are sufficient for simple code generation, but proprietary models are (still) superior for professional development. While experienced developers can gain valuable insights from the models, beginners run the risk of adopting flawed concepts due to incorrect or outdated suggestions from the LLMs, thereby acquiring unfavorable habits. The models have very different strengths and weaknesses, which leads to a wide range of applications but also limitations.

At the end of the day, it’s the results that count. Those who use AI models effectively can turn them into valuable tools for completing a wide variety of tasks quickly and conveniently. For our part, we remain excited to see what future models will bring us and are happy to continue experimenting with them.

It all starts with a good conversation. So let's talk together about the possibilities for your digital product development. We look forward to hearing from you.
