Can you trust your local LLM?

In this article, we put distilled LLMs (deployable on local hardware) to the test. Are they useful? Can they be trusted? How do they compare to their BIG parent systems?
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) applications across many domains, from natural language understanding to machine translation and text generation. Their scale comes at a cost, however: they require massive computing resources that only a handful of organizations can afford. This raises privacy and confidentiality concerns, because user data may be exposed during interactions with online versions of LLMs like ChatGPT, Llama, and others. People sometimes discuss deeply private matters, e.g. soliciting advice on very personal issues, and companies disclose critical proprietary information to generate reports. It is not common knowledge that those questions and answers may be monitored and used by third parties for almost any purpose.
Distilled Models: A Partial Solution?
To address this issue, researchers have developed “distilled” models: smaller versions of LLMs that can operate within more moderate computing environments. These distilled models can be deployed within organizational or private settings without the need for cloud infrastructure, allowing users to control who accesses their data.
However, there is another crucial aspect to consider: how well do these distilled models perform compared to their larger counterparts?
Quantization vs. Parameters:
There are two primary methods for shrinking an LLM: quantization and parameter reduction.
Quantization: When an AI is trained, the numerical values that determine the strength and direction of connections between “neurons” are stored as 16-bit or even 32-bit data blocks. There are many billions of those “weights”, so they are quantized (reduced in size), which of course loses precision.
Parameters: Parameters are the variables that the model learns during the training process. They determine how the model processes information and how it makes predictions; generally, the more parameters a model has, the more it can understand.
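To make the quantization idea concrete, here is a toy sketch. It is only an illustration of the principle; production quantizers such as GPTQ or AWQ are considerably more sophisticated. It squeezes a handful of float32 weights into 4-bit integers and back, showing the precision that gets thrown away:

```python
import numpy as np

# Toy illustration: a handful of float32 "weights" quantized to 4-bit
# signed integers and restored again, showing the precision that is lost.
weights = np.random.randn(8).astype(np.float32)

scale = np.abs(weights).max() / 7                 # 4-bit signed range: -8 .. 7
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print("original :", np.round(weights, 4))
print("restored :", np.round(restored, 4))
print("max error:", float(np.abs(weights - restored).max()))
```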
Comparing Llama3.1 Models:
Take, for example, the Llama3.1 LLM developed by Meta AI. The original model boasts 405 billion (b) parameters and comes in at approximately 250 GB. In contrast, its distilled versions have either 8b or 70b parameters, with corresponding file sizes of only 5 GB and 43 GB respectively. Both distilled models are quantized to 4 bits, demonstrating the trade-off between model size and performance.
Other LLMs created by Google, IBM, and other organizations follow the same pattern: the larger the model, the greater its computing requirements.
The Need for Context:
It’s also crucial to understand the magnitude of memory required to load even one model. For instance, Llama3.1’s 405 billion (b) parameter version demands a minimum of approximately 210 GB of fast GPU memory just to load it. Without quantization, this figure balloons to an astonishing 810 GB.
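As a sanity check on these numbers, weight storage alone scales roughly as parameters times bits per weight. The sketch below reproduces the ballpark figures; real files and runtime memory come out somewhat higher because some tensors stay at higher precision and the KV cache and activations need room as well:

```python
# Rough weight-storage estimate: parameters x bits per weight, in GB.
# Ignores KV cache, activations, and tensors kept at higher precision.
def weights_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

for label, params, bits in [("405B @ 16-bit", 405, 16),
                            ("405B @  4-bit", 405, 4),
                            (" 70B @  4-bit", 70, 4),
                            ("  8B @  4-bit", 8, 4)]:
    print(f"{label}: ~{weights_gb(params, bits):.0f} GB")
```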
When you consider running multiple users in parallel on such systems, the required memory and computational power grow even further. These resource requirements remain unattainable for small businesses or private users, which is why it’s essential to evaluate whether distilled models perform well enough to be a viable solution within confidential environments.
Distilled Models: A Reality Check
Let’s examine how well distilled LLMs perform in comparison to their larger counterparts. Distilled versions aim to provide smaller, more manageable models that can be run on less powerful hardware while maintaining acceptable performance levels. However, there’s a trade-off involved. By reducing the parameter count or using quantization, distilled models sacrifice accuracy compared to their original versions.
Distilled Models: Setup
I used an NVIDIA Jetson AGX Orin with 32 GB of memory to run and test the following models:
llama3.2:1b
llama3.1:8b
gemma3:12b
dolphin3:8b
deepseek-r1:14b
I also tested a few models fine-tuned for medical environments.
All of the models worked fine on my machine. Responses were fast. All models were instructed to provide factual information only and to indicate if an answer could not be provided reliably.
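For readers who want to reproduce a run like this, here is a minimal sketch assuming the models are served locally through Ollama (the exact tooling used for the test may differ), with the factual-only instruction passed as a system prompt:

```python
# Minimal sketch assuming the models are served locally via Ollama
# (pip install ollama); the actual test setup may differ.
import ollama

SYSTEM = ("Provide factual information only. If you cannot answer reliably, "
          "say so instead of guessing.")

MODELS = ["llama3.2:1b", "llama3.1:8b", "gemma3:12b", "dolphin3:8b", "deepseek-r1:14b"]

def ask(model: str, question: str) -> str:
    response = ollama.chat(model=model, messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
    ])
    return response["message"]["content"]

for model in MODELS:
    print(f"--- {model} ---")
    print(ask(model, "Who was Gilberte de Courgenay?"))
```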
Distilled Models: Two questions
For my test, I used two questions:
First question:
A patient weighs 70kg and has burns to both his arms and front of his chest. Using the parkland formula, how much fluid needs to be administered during the first 30 minutes.
Background for this question:
Burn victims require fluid administration. In order to answer the question, the AI needs to determine the TBSA (Total Body Surface Area) by using the “Rules Of Nine” or the “Palmar Method”, then apply the parkland formula: 4mL x %TBSA x weight (kg). Half of the calculated fluid volume is administered within the first 8 hours, and the remaining half is administered over the next 16 hours
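To follow the arithmetic the models were expected to perform, here is a rough worked example. The TBSA figure is an assumption made purely for illustration (roughly 9% per arm plus about 9% for the anterior chest under the Rule of Nines), not a clinical recommendation:

```python
# Illustrative Parkland calculation; the TBSA value is an assumption
# (two arms at ~9% each plus ~9% anterior chest), not clinical guidance.
weight_kg = 70
tbsa_percent = 9 + 9 + 9

total_24h_ml = 4 * tbsa_percent * weight_kg   # Parkland: 4 mL x %TBSA x kg
first_8h_ml = total_24h_ml / 2                # half within the first 8 hours
first_30min_ml = first_8h_ml / 16             # assuming an even rate over 8 h

print(total_24h_ml, first_8h_ml, round(first_30min_ml))   # 7560 3780.0 236
```

On these assumptions the answer lands in the range of a few hundred milliliters for the first half hour, which makes the multi-liter answers reported below obviously absurd.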
The “online” versions of Llama and ChatGPT answered the question correctly. All local models answered it incorrectly, some giving absurd amounts of up to 15 liters for the first ½ hour.
Second question:
Who is Gilberte de Courgenay?
Background for this question:
Gilberte de Courgenay is a 1942 Swiss biographical film about Gilberte Montavon, who was a waitress in a hotel in Courgenay, a small town near the Swiss-French border. She served thousands of Swiss soldiers who were stationed in Courgenay during the First World War, when neutral Switzerland had to protect its borders. Gilberte is well known and beloved in Switzerland.
The “online” version of ChatGPT provided the correct answer, while the online version of Llama informed me that “Gilberte de Courgenay (1890-1957) was a Swiss aristocrat, socialite, and fashion icon. She was known for her striking beauty, elegance, and refined taste.” And yes, I checked. The answer is a complete fabrication.
All distilled models failed to provide the correct answer. Here’s a dialog between me and llama3.1:8b:
Who was Gilberte de Courgenay?
Gilberte de Courgenay (c. 1220 – after 1271) was a noblewoman of the Duchy of Burgundy during the High Middle Ages. Born into the influential Courtenay family, she married Robert II, Count of Artois, in 1237 and became the mother of four children, including Mahaut d’Artois (also known as Matilda of Artois). After Robert’s death in 1250, Gilberte remarried to Guy III de Châlon, who would later become the Seigneur of Saint-Jean-en-Valée.
How much confidence do you have in this answer?
I have a high confidence in this answer as it is based on reliable and verifiable historical information sourced from reputable sources such as scholarly articles, historical documents, and encyclopedias.
Robert II, Count of Artois wife was named Amicie de Courtenay. So, you are wrong.
I apologize for the error in my previous response. You are correct that Robert II, Count of Artois, was married to Amicie de Courtenay, not Gilberte de Courgenay. I strive to provide accurate information and appreciate your correction. In this case, my confidence in the initial answer was misplaced due to incorrect data provided by my training sources.
Distilled Models: Summary
Distilled Large Language Models (LLMs) have emerged as a potential solution for confidential environments by providing smaller models that can run on more accessible hardware configurations. However, their performance and reliability in real-world applications remain essential concerns to consider before implementation. All the tested models hallucinated constantly. Sometimes it’s just half a sentence with improper information, other times it’s completely wrong data.
Take this example: A distilled Llama3 LLM was asked:
Give me details about Roswell, NM. Do not mention things you are not sure or do not know.
Roswell is a city located within Chaves County in southeastern New Mexico, United States. As of 2021, its population was approximately 48,000 residents. The city is best known for the supposed UFO incident that took place on July 8, 1947, when unidentified flying objects crashed nearby. This event has led to Roswell being a popular tourist destination with attractions like the International UFO Museum and Research Center.
In terms of geography, Roswell sits at an elevation of around 4,100 feet (1,250 meters) above sea level and covers about 18 square miles (47 square kilometers). The city is situated along Interstate 40 and US Route 380. Its climate features hot summers with average highs around 94°F (34°C), mild winters with temperatures averaging 45°F (7°C), and moderate precipitation throughout the year.
Most of the answer is OK, but Roswell is not even close to Interstate 40. The highway is about 140 miles north of Roswell. Just a little hallucination that explains why you should doubt EVERY answer you get from LLMs. Every LLM hallucinates. Distilled versions are just more prone to it.
It is absolutely crucial to acknowledge the trade-offs involved with distilled LLMs. While they offer reduced model size, they typically come at the cost of lower accuracy compared to original models. Moreover, these smaller versions may struggle with more complex tasks or edge cases, leading to unpredictable results.
Distilled models should be used in highly specialized environments where users are thoroughly instructed and the AI is well prompted to ensure accurate performance. They may also serve as an entertaining tool, provide creative insights, or work strictly within a user’s field of expertise. However, they are not reliable sources for general information or critical decision-making without additional fact-checking.
Distilled models can be useful in RAG (Retrieval-Augmented Generation) tasks where their ability to generate code snippets or templates can save time and effort. They can also provide supplementary information within a user’s field of expertise, but they should never replace primary sources or trusted data providers.
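As a rough sketch of the RAG pattern (again assuming an Ollama setup; the retrieval step here is a naive keyword lookup rather than a real vector store), the idea is to hand the model trusted local context and instruct it to answer only from that:

```python
# Naive RAG sketch: retrieve local context by keyword, then ask the model
# to answer only from it. Assumes the Ollama setup from the earlier sketch.
import ollama

documents = {
    "gilberte de courgenay": (
        "Gilberte de Courgenay is a 1942 Swiss biographical film about "
        "Gilberte Montavon, a waitress in Courgenay who served Swiss soldiers "
        "stationed there during the First World War."
    ),
}

def retrieve(question: str) -> str:
    # naive retrieval: return every document whose key appears in the question
    q = question.lower()
    return " ".join(text for key, text in documents.items() if key in q)

def ask_grounded(model: str, question: str) -> str:
    context = retrieve(question)
    prompt = ("Answer using only the context below. If the context does not "
              f"contain the answer, say so.\n\nContext: {context}\n\n"
              f"Question: {question}")
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(ask_grounded("llama3.1:8b", "Who was Gilberte de Courgenay?"))
```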
Distilled Models: Conclusion
While distilled LLMs offer some flexibility for confidential environments, their reliability in real-world applications is not assured without additional fact-checking. It’s crucial to approach these technologies with realistic expectations and consider the trade-offs between model performance and confidentiality before implementation. Distilled models should be used within highly specialized environments or RAG tasks where their limitations are recognized, and users are instructed appropriately for accurate results.
This text has been written with the support of our local (modified and especially prompted) AI model. And yes – it made a boo boo explaining RAG as “Rapid Application Generation” where “Retrieval-Augmented Generation” would have been the logical choice.
Michaela Merz is an entrepreneur and first generation hacker. Her career started even before the Internet was available. She invented and developed a number of technologies now considered to be standard in modern web environments. She is a software engineer, a Wilderness Rescue volunteer, an Advanced Emergency Medical Technician, an FAA Part 61 (PPL, IFR) and Part 107 certified UAS pilot, and a licensed ham. More about Michaela …