GGUF, the long way around


Large language models are now consumed in one of several ways:
- As API endpoints for proprietary models hosted by OpenAI, Anthropic, or major cloud providers
- As model artifacts downloaded from HuggingFace's Model Hub and/or trained/fine-tuned using HuggingFace libraries and hosted on local storage
- As model artifacts available in a format optimized for local inference, typically GGUF, and accessed via applications like llama.cpp or ollama
- As ONNX, a format which optimizes sharing between backend ML frameworks
For a side project, I'm using llama.cpp, a C/C++-based LLM inference engine targeting M-series GPUs on Apple Silicon.
When running llama.cpp, you get a long log that consists primarily of key-value pairs of metadata about your model architecture and then its performance (and no yapping).
make -j && ./main -m /Users/vicki/llama.cpp/models/mistral-7b-instruct-v0.2.Q8_0.gguf -p "What is Sanremo? no yapping"
Sanremo Music Festival (Festival di Sanremo) is an annual Italian music competition held in the city of Sanremo since 1951. It is considered one of the most prestigious and influential events in the Italian music scene. The festival features both newcomers and established artists competing for various awards, including the Big Award (Gran Premio), which grants the winner the right to represent Italy in the Eurovision Song Contest. The event consists of several live shows where artists perform their original songs, and a jury composed of musicians, critics, and the public determines the winners through a combination of points. [end of text]
llama_print_timings: load time = 11059.32 ms
llama_print_timings: sample time = 11.62 ms / 140 runs ( 0.08 ms per token, 12043.01 tokens per second)
llama_print_timings: prompt eval time = 87.81 ms / 10 tokens ( 8.78 ms per token, 113.88 tokens per second)
llama_print_timings: eval time = 3605.10 ms / 139 runs ( 25.94 ms per token, 38.56 tokens per second)
llama_print_timings: total time = 3730.78 ms / 149 tokens
ggml_metal_free: deallocating
Log end
These logs can be found in the llama.cpp codebase. There, you'll also find GGUF. GGUF (GPT-Generated Unified Format) is the file format used to serve models on llama.cpp and other local runners like Llamafile, Ollama, and GPT4All.
To understand how GGUF works, we first need to take a deep dive into machine learning models and the kinds of artifacts they produce.
Let's start by describing a machine learning model. At its simplest, a model is a file or a collection of files that contain the model architecture and the weights and biases of the model generated from a training loop.
In LLM land, we're generally interested in transformer-style models and architectures.
In a transformer, we have many moving parts.
If the model is served as a consumer end-product, it only returns the actual text output based on the highest probabilities, with numerous strategies for how that text is selected.

In short, we convert inputs to outputs using an equation. In addition to the model's output, we also have the model itself, generated as an artifact of the modeling process.

Starting with a simple model
Let's take a step back from the complexity of transformers and build a small linear regression model in PyTorch. Lucky for us, linear regression is also a (shallow) neural network, so we can work with it in PyTorch and map our simple model to more complex ones using the same framework.
Linear regression takes a set of numerical inputs and generates a set of numerical outputs. (In contrast to transformers, which take a set of text inputs and generate a set of text outputs and their related numerical probabilities.)
For example, let's say that we produce artisanal hazelnut spread for statisticians, and want to predict how many jars of Nulltella we'll produce on any given day. Let's say we have some data available to us: how many hours of sunshine we get per day, and how many jars of Nulltella we've been able to produce each day.

It turns out that we feel more inspired to produce hazelnut spread when it's sunny out, and we can clearly see this relationship between input and output in our data (we don't produce Nulltella Friday-Sunday because we prefer to spend those days writing about data serialization formats):
| day_id | hours | jars |
|--------|---------|------|
| mon | 1 | 2 |
| tues | 2 | 4 |
| wed | 3 | 6 |
| thu | 4 | 8 |
This is the data we'll use to train our model. We'll need to split this data into three parts:
- used to train our model (training data)
- used to test the accuracy of our model (test data)
- used to tune our hyperparameters, meta-aspects of our model like the learning rate (validation set), during the model training phase
In the specific case of linear regression, there technically are no hyperparameters, although we can plausibly consider the learning rate we set in PyTorch to be one. Let's assume we have 100 of these data points.
We split the data into train, test, and validation, as sketched below. A commonly accepted split is to use 80% of the data for training/validation and 20% for testing. We want our model to have access to as much data as possible so it learns a more accurate representation, so we leave most of the data for train.
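To make that concrete, here's a minimal sketch of the split, using toy stand-in numbers (our real table above only has four rows):

```python
# A toy stand-in dataset of (hours_of_sunshine, jars) pairs -- not our real data
data = [(hours, 2 * hours) for hours in range(100)]

split = int(len(data) * 0.8)
train_val = data[:split]  # 80%: used to fit the model and tune hyperparameters
test = data[split:]       # 20%: held out to measure accuracy at the end

print(len(train_val), len(test))  # 80 20
```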
Now that we have our data, we need to write our algorithm. The equation to get the output \(Y\) from the inputs \(X\) for linear regression is:

$$y = \beta_0 + \beta_1 x_1 + \varepsilon$$

This tells us that the output, \(y\) (the number of jars of Nulltella), can be predicted by:
- \(x_1\) – one input variable (or feature) (hours of sunshine)
- \(\beta_1\) – its given weight, also known as a parameter (how important that feature is)
- plus an error term \(\varepsilon\), the difference between the observed and actual values in a population, which captures the noise of the model
Our task is to continuously predict and adjust our weights to optimally solve this equation for the difference between our actual \(Y\), as presented by our data, and a predicted \(\hat{Y}\) based on the algorithm, finding the smallest sum of squared differences, \(\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\), between each point and the line. In other words, we'd like to minimize \(\varepsilon\), because it means that, at each point, our \(\hat{Y}\) is as close to our actual \(Y\) as we can get it, given the other points.
We optimize this function via gradient descent, where we start with either zeros or randomly-initialized weights and keep recalculating both the weights and the error term until we arrive at an optimal stopping point.
We'll know we're succeeding because our loss, as calculated by RMSE, should incrementally decrease with every training iteration.
Here's the whole model learning process end-to-end (except for tokenization, which we only do for models where the features are text and we want to do language modeling):

Now, let's get more concrete and describe these ideas in code. When we train our model, we initialize our function with a set of feature values.
Let's add our data into the model by initializing both \(x_1\) and \(Y\) as PyTorch Tensor objects.
# Hours of sunshine
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)
# Jars of Nulltella
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]], dtype=torch.float32)
In code, our input data is X, a torch tensor object, and our output data is y. We initialize a LinearRegression, which subclasses the PyTorch Module, with one linear layer that has one input feature (sunshine) and one output feature (jars of Nulltella).
I'm going to include the code for the whole model, and then we'll talk through it piece by piece.
import torch
import torch.nn as nn
import torch.optim as optim

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]], dtype=torch.float32)

# Define a linear regression model and its forward pass
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)  # 1 input feature, 1 output feature

    def forward(self, x):
        return self.linear(x)

# Instantiate the model
model = LinearRegression()

# Inspect the model's state dictionary
print(model.state_dict())

# Define loss function and optimizer
criterion = nn.MSELoss()
# setting our learning rate "hyperparameter" here
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop that includes the forward and backward pass
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    RMSE_loss = torch.sqrt(loss)

    # Backward pass and optimization
    optimizer.zero_grad()   # Zero out gradients
    RMSE_loss.backward()    # Compute gradients
    optimizer.step()        # Update weights

    # Print progress
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# After training, let's test the model
test_input = torch.tensor([[5.0]], dtype=torch.float32)
predicted_output = model(test_input)

print(f'Prediction for input {test_input.item()}: {predicted_output.item()}')
Once we have our input data, we initialize our model, a LinearRegression, which subclasses the Module base class, specifically for linear regression.
A forward pass involves feeding our data into the neural network and making sure it propagates through all the layers. Since we only have one, we pass our data to a single linear layer. The forward pass is what calculates our predicted Y.
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)  # 1 input feature, 1 output feature

    def forward(self, x):
        return self.linear(x)
We determine how we'd like to optimize the results of the model, aka how its loss should converge. In this case, we start with mean squared error, and then adjust it to use RMSE, the square root of the average squared difference between the predicted values and the actual values in a dataset.
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
....
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    RMSE_loss = torch.sqrt(loss)  # RMSE taken inside the training loop
Now that we've defined how we'd like the model to run, we can instantiate the model object itself:
Instantiating the model object
model = LinearRegression()
print(model.state_dict())
Notice that when we instantiate an nn.Module, it has an attribute called the "state_dict". This is important. The state dict holds the information about each layer and the parameters in each layer, aka the weights and biases.
At its heart, it's a Python dictionary.
In this case, the implementation for LinearRegression returns an ordered dict with each layer of the network and the values of those layers. Each of the values is a Tensor.
OrderedDict([('linear.weight', tensor([[0.5408]])), ('linear.bias', tensor([-0.8195]))])
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
linear.weight torch.Size([1, 1])
linear.bias torch.Size([1])
For our tiny model, it's a small OrderedDict of tuples. You can imagine that this collection of tensors becomes extremely large and memory-intensive in a large network such as a transformer. If each parameter (each Tensor object) takes up 2 bytes in memory, a 7-billion-parameter model can take up 14GB of GPU memory.
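As a back-of-the-envelope check on that claim, assuming 2 bytes per parameter (e.g. 16-bit weights):

```python
params = 7_000_000_000   # a 7-billion-parameter model
bytes_per_param = 2      # e.g. 16-bit (half-precision) weights
print(params * bytes_per_param / 1e9)  # 14.0 (GB)
```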
We then run the forward and backward passes for the model in loops. In each step, we do a forward pass to perform the calculation and a backward pass to update the weights of our model object, and then we add all that information to our model parameters.
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    RMSE_loss = torch.sqrt(loss)

    # Backward pass and optimization
    optimizer.zero_grad()   # Zero out gradients
    RMSE_loss.backward()    # Compute gradients
    optimizer.step()        # Update weights

    # Print progress
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
Once we've completed these loops, we've trained the model artifact. What we have once we've trained a model is an in-memory object that represents the weights, biases, and metadata of that model, stored inside our instance of our LinearRegression module.
As we run the training loop, we can see our loss shrink. That is, the actual values are getting closer to the predicted:
Epoch [10/100], Loss: 33.0142
Epoch [20/100], Loss: 24.2189
Epoch [30/100], Loss: 16.8170
Epoch [40/100], Loss: 10.8076
Epoch [50/100], Loss: 6.1890
Epoch [60/100], Loss: 2.9560
Epoch [70/100], Loss: 1.0853
Epoch [80/100], Loss: 0.4145
Epoch [90/100], Loss: 0.3178
Epoch [100/100], Loss: 0.2974
We can also see, if we print out the state_dict, that the parameters have changed as we've computed the gradients and updated the weights in the backward pass:
"""before"""
OrderedDict([('linear.weight', tensor([[-0.6216]])), ('linear.bias', tensor([0.7633]))])
linear.weight torch.Size([1, 1])
linear.bias torch.Size([1])
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1]}]}
Epoch [10/100], Loss: 33.0142
Epoch [20/100], Loss: 24.2189
Epoch [30/100], Loss: 16.8170
Epoch [40/100], Loss: 10.8076
Epoch [50/100], Loss: 6.1890
Epoch [60/100], Loss: 2.9560
Epoch [70/100], Loss: 1.0853
Epoch [80/100], Loss: 0.4145
Epoch [90/100], Loss: 0.3178
Epoch [100/100], Loss: 0.2974
"""after"""
OrderedDict([('linear.weight', tensor([[1.5441]])), ('linear.bias', tensor([1.3291]))])
The optimizer, as we see, has its own state_dict, which consists of those hyperparameters we discussed before: the learning rate, the weight decay, and more:
print(optimizer.state_dict())
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1]}]}
Now that we have a trained model object, we can pass in new feature values for the model to evaluate. For example, we can pass in an X value of 5 hours of sunshine and see how many jars of Nulltella we expect to make.
We do this by passing 5 to the instantiated model object, which is now a combination of the method used to run the linear regression equation and our state dict, the current set of weights and biases, to give a new predicted value. We get 9 jars, which is fairly close to what we'd expect.
test_input = torch.tensor([[5.0]], dtype=torch.float32)
predicted_output = model(test_input)
print(f'Prediction for input {test_input.item()}: {predicted_output.item()}')
Prediction for input 5.0: 9.049455642700195
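As a sanity check, we can plug the (rounded) weight and bias from the state_dict printed after training into the regression equation by hand:

```python
# y_hat = weight * x + bias, using the rounded values from the printed state_dict
w, b = 1.5441, 1.3291
print(w * 5.0 + b)  # ~9.0496, matching the model's prediction up to rounding
```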
I'm abstracting away an enormous amount of detail for the sake of clarity, specifically the huge amount of work PyTorch does in moving this data in and out of GPUs and handling GPU-efficient datatypes for efficient computation, which is a large part of the work of the library. We'll skip these for now for simplicity.
Serializing our objects
So far, so good. We now have stateful Python objects in memory that convey the state of our model. But what happens when we need to persist this very large model, which we likely spent 24+ hours training, and use it again?
This scenario is described here:
Suppose a researcher is experimenting with a new deep-learning model architecture, or a variation on an existing one. Her architecture is going to have a whole bunch of configuration options and hyperparameters: the number of layers, the types of each layer, the dimensionality of various vectors, where and how to normalize activations, which nonlinearity(ies) to use, and so on. Many of the model components will be standard layers provided by the ML framework, but the researcher will be inserting bits and pieces of novel logic as well.
Our researcher needs a way to describe a particular concrete model – a specific combination of these settings – which can be serialized and then reloaded later. She needs this for a few related reasons:
She likely has access to a compute cluster containing GPUs or other accelerators she can use to run jobs. She needs a way to submit a model description to code running on that cluster so it can run her model on the cluster.
While those models are training, she needs to save snapshots of their progress in such a way that they can be reloaded and resumed, in case the hardware fails or the job is preempted. Once models are trained, the researcher will want to load them again (potentially both a final snapshot and some of the partially-trained checkpoints) in order to run evaluations and experiments on them.
What do we mean by serialization? It's the process of writing objects and classes from our programming runtime to a file. Deserialization is the process of converting data on disk to programming language objects in memory. We need to serialize the data into a bytestream that we can write to a file.

Why "serialization"? Because back in the Old Days, data was stored on tape, which required bits to be laid out sequentially on the tape.
Since many transformer-style models are trained using PyTorch these days, artifacts use PyTorch's save implementation for serializing objects to disk.

Again, let's abstract away the GPU for simplicity and assume we're performing all of these computations on CPU. Python objects live in memory. This memory is allocated at the start of their lifecycle in a private heap managed by the Python memory manager, with specialized heaps for different object types.
When we initialize our PyTorch model object, the operating system allocates memory through lower-level C functions, namely malloc, via the default memory allocators.
When we run our code with tracemalloc, we can see how memory for PyTorch is actually allocated on CPU (keep in mind that, again, GPU operations are completely different).
import tracemalloc

tracemalloc.start()
.....
pytorch
...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
[ Top 10 ]
<frozen importlib._bootstrap_external>:672: size=21.1 MiB, count=170937, average=130 B
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/inspect.py:2156: size=577 KiB, count=16, average=36.0 KiB
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/site-packages/torch/_dynamo/allowed_functions.py:71: size=512 KiB, count=3, average=171 KiB
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/dataclasses.py:434: size=410 KiB, count=4691, average=90 B
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/site-packages/torch/_dynamo/allowed_functions.py:368: size=391 KiB, count=7122, average=56 B
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/site-packages/torch/_dynamo/allowed_functions.py:397: size=349 KiB, count=1237, average=289 B
<frozen importlib._bootstrap_external>:128: size=213 KiB, count=1390, average=157 B
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/functools.py:58: size=194 KiB, count=2554, average=78 B
/Users/vicki/.pyenv/versions/3.10.0/lib/python3.10/site-packages/torch/_dynamo/allowed_functions.py:373: size=136 KiB, count=2540, average=55 B
<frozen importlib._bootstrap_external>:1607: size=127 KiB, count=1133, average=115 B
Here, we can see that we imported 170k objects from imports, and that the rest of the allocation came from allowed_functions in torch.
We can also more explicitly see the types of these objects in memory. Among all the other objects created by PyTorch and Python system libraries, we can see our Linear object here, which has state_dict as a property. We need to serialize this object into a bytestream so we can write it to disk.
import gc

# Get all live objects
all_objects = gc.get_objects()

# Extract distinct object types
distinct_types = set(type(obj) for obj in all_objects)

# Print distinct object types
for obj_type in distinct_types:
    print(obj_type.__name__)
InputKind
KeyedRef
ReLU
Manager
_Call
UUID
Pow
Softmax
Options
_Environ
**Linear**
CFunctionType
SafeUUID
_Real
JSONDecoder
StmtBuilder
OutDtypeOperator
MatMult
attrge
PyTorch serializes objects to disk using Python's pickle framework, wrapping the pickle load and dump methods.
Pickle traverses the object's inheritance hierarchy and converts each object encountered into streamable artifacts. It does this recursively for nested representations (for example, understanding nn.Module and Linear inheriting from nn.Module), converting those representations to byte representations so that they can be written to file.
As an example, let's take a simple object and write it to a pickle file.
import torch
import torch.nn as nn
import torch.optim as optim
import pickle

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)

with open('tensors.pkl', 'wb') as f:
    pickle.dump(X, f)
When we inspect the pickled object with pickletools, we get an idea of how the data is organized.
We import some functions that load the data as a tensor, then the actual storage of that data, then its type. The module does the inverse when converting from pickle files to Python objects.
python -m pickletools tensors.pkl
0: x80 PROTO 4
2: x95 FRAME 398
11: x8c SHORT_BINUNICODE 'torch._utils'
25: x94 MEMOIZE (as 0)
26: x8c SHORT_BINUNICODE '_rebuild_tensor_v2'
46: x94 MEMOIZE (as 1)
47: x93 STACK_GLOBAL
48: x94 MEMOIZE (as 2)
49: ( MARK
50: x8c SHORT_BINUNICODE 'torch.storage'
65: x94 MEMOIZE (as 3)
66: x8c SHORT_BINUNICODE '_load_from_bytes'
84: x94 MEMOIZE (as 4)
85: x93 STACK_GLOBAL
86: x94 MEMOIZE (as 5)
87: B BINBYTES b'x80x02x8anlxfcx9cFxf9 jxa8Px19.x80x02Mxe9x03.x80x02}qx00(Xx10x00x00x00protocol_versionqx01Mxe9x03Xrx00x00x00little_endianqx02x88Xnx00x00x00type_sizesqx03}qx04(Xx05x00x00x00shortqx05Kx02Xx03x00x00x00intqx06Kx04Xx04x00x00x00longqx07Kx04uu.x80x02(Xx07x00x00x00storageqx00ctorchnFloatStoragenqx01Xnx00x00x006061074080qx02Xx03x00x00x00cpuqx03Kx04Ntqx04Q.x80x02]qx00Xnx00x00x006061074080qx01a.x04x00x00x00x00x00x00x00x00x00x80?x00x00x00@x00x00@@x00x00x80@'
351: x94 MEMOIZE (as 6)
352: x85 TUPLE1
353: x94 MEMOIZE (as 7)
354: R REDUCE
355: x94 MEMOIZE (as 8)
356: K BININT1 0
358: K BININT1 4
360: K BININT1 1
362: x86 TUPLE2
363: x94 MEMOIZE (as 9)
364: K BININT1 1
366: K BININT1 1
368: x86 TUPLE2
369: x94 MEMOIZE (as 10)
370: x89 NEWFALSE
371: x8c SHORT_BINUNICODE 'collections'
384: x94 MEMOIZE (as 11)
385: x8c SHORT_BINUNICODE 'OrderedDict'
398: x94 MEMOIZE (as 12)
399: x93 STACK_GLOBAL
400: x94 MEMOIZE (as 13)
401: ) EMPTY_TUPLE
402: R REDUCE
403: x94 MEMOIZE (as 14)
404: t TUPLE (MARK at 49)
405: x94 MEMOIZE (as 15)
406: R REDUCE
407: x94 MEMOIZE (as 16)
408: . STOP
highest protocol among opcodes = 4
The main issue with pickle as a file format is not just that it bundles executable code, but that there are no checks on the code being read, and without schema guarantees, you can pass something malicious to the pickle:
The insecurity is not because pickles contain code, but because they create objects by calling constructors named in the pickle. Any callable can be used in place of your class name to construct objects. Malicious pickles will use other Python callables as the "constructors." For example, instead of executing "models.MyObject(17)", a dangerous pickle might execute "os.system('rm -rf /')". The unpickler can't tell the difference between "models.MyObject" and "os.system". Both are names it can resolve, producing something it can call. The unpickler executes either of them as directed by the pickle.
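To make that concrete, here's a minimal (and deliberately harmless) sketch of the mechanism: __reduce__ tells pickle which callable to invoke at load time, and nothing constrains what that callable is.

```python
import pickle

class NotActuallyData:
    # __reduce__ names the "constructor" the unpickler will call.
    # A malicious pickle would name something like os.system here;
    # we use print to keep the demonstration harmless.
    def __reduce__(self):
        return (print, ("this ran at unpickling time",))

payload = pickle.dumps(NotActuallyData())
pickle.loads(payload)  # calls print(...) -- the unpickler just resolves and calls the named callable
```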
How Pickle works
Pickle initially worked for PyTorch-based models because it was also closely coupled to the Python ecosystem, and early ML library artifacts were not the key outputs of deep learning systems.
The primary output of research is knowledge, not software artifacts. Research teams write software to answer research questions and to improve their/their team's/their field's understanding of a domain, more so than they write software in order to have software tools or solutions.
However, as the use of transformer-based models picked up after the release of the Transformer paper in 2017, so did the use of the transformers library, which delegates the load call to PyTorch's load methods, which use pickle.
Once practitioners started creating and uploading pickled model artifacts to model hubs like HuggingFace, machine learning model supply chain security became an issue.
From pickle to safetensors
As machine learning with deep learning models trained in PyTorch exploded, these security issues came to a head, and in 2021, Trail of Bits released a post on the insecurity of pickle files.
Engineers at HuggingFace started developing a library known as safetensors as an alternative to pickle. Safetensors was developed to be efficient, but also safer and more ergonomic than pickle.
First, safetensors is not bound to Python as closely as pickle: with pickle, you can only read or write files in Python. Safetensors is compatible across languages. Second, safetensors also limits the language-execution functionality available on serialization and deserialization. Third, because the backend of safetensors is written in Rust, it enforces type safety more rigorously. Finally, safetensors was optimized to work specifically with tensors as a datatype in a way that pickle was not. That, combined with the fact that it was written in Rust, makes it really fast for reads and writes.
After a concerted push from both Trail of Bits and EleutherAI, a security audit of safetensors was performed and found satisfactory, which led to HuggingFace adopting it as the default format for models on the Hub going forward. (Big thanks to Stella and Suha for this history and context, and to everyone who contributed to the Twitter thread.)
How safetensors works
How does the safetensors format work? As with most things in LLMs on the bleeding edge, the code and commit history will do most of the talking. Let's take a look at the file spec.
- 8 bytes: N, an unsigned little-endian 64-bit integer, containing the size of the header
- N bytes: a JSON UTF-8 string representing the header.
The header data MUST begin with a { character (0x7B).
The header data MAY be trailing padded with whitespace (0x20).
The header is a dict like {"TENSOR_NAME": {"dtype": "F16", "shape": [1, 16, 256], "data_offsets": [BEGIN, END]}, "NEXT_TENSOR_NAME": {…}, …},
data_offsets point to the tensor data relative to the beginning of the byte buffer (i.e. not an absolute position in the file), with BEGIN as the starting offset and END as the one-past offset (so total tensor byte size = END – BEGIN).
A special key, metadata, is allowed to contain a free-form string-to-string map. Arbitrary JSON is not allowed; all values must be strings.
- Rest of the file: byte-buffer.
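Following that spec, a minimal sketch of a header reader is just an 8-byte length followed by JSON ("model.safetensors" below is a placeholder path):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    with open(path, "rb") as f:
        # 8 bytes: N, an unsigned little-endian 64-bit integer (the header size)
        (header_len,) = struct.unpack("<Q", f.read(8))
        # N bytes: a JSON UTF-8 string mapping tensor names to dtype, shape,
        # and data_offsets into the byte buffer that follows
        return json.loads(f.read(header_len))

# e.g. {"weight1": {"dtype": "F32", "shape": [1024, 1024], "data_offsets": [0, 4194304]}, ...}
print(read_safetensors_header("model.safetensors"))
```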
This is different from the state_dict and pickle file specs, but the addition of safetensors follows the natural evolution from Python objects to a full-fledged file format.
A file is a way of storing our data, generated from programming language objects, as bytes on disk. Across different file format specs (Arrow, Parquet, protobuf), we start to notice some patterns in how they're laid out:
- In the file, we need some indicator that this is a file of type "X". Usually this is represented by a magic byte.
- Then, there's a header that represents the metadata of the file (in the case of machine learning, how many layers we have, the learning rate, and other elements).
- The actual data (in the case of machine learning files, the tensors).
- We then need a spec that tells us what to expect in a file as we read it, what kinds of data types are in the file, and how they're represented as bytes. Essentially, documentation for the file's layout and API so that we can program a file reader against it.
- One thing the file spec usually tells us is whether the data is little- or big-endian, that is, whether we store the most significant byte last or first (see the sketch after this list). This becomes important as we expect files to be read on systems with different default byte layouts.
- We then implement code that reads and writes to that file spec specifically.
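Here's a small sketch of what that endianness choice means at the byte level; the same 32-bit integer serializes to two different byte orders:

```python
import struct

value = 0x47475546  # an arbitrary example value ("GGUF" read as a big-endian u32)

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(little.hex())  # 46554747
print(big.hex())     # 47475546
```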
One thing we start to notice, having looked at state dicts and pickle files, is that machine learning data storage follows a pattern. We need to store:
- a large collection of vectors,
- metadata about those vectors, and
- hyperparameters
We then need to be able to instantiate model objects that we can hydrate (fill) with that data and run model operations on.
As an example for safetensors, from the documentation: we start with a Python dictionary, aka a state dict, save the file, and load it back.
import torch
from safetensors import safe_open
from safetensors.torch import save_file

tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024))
}
save_file(tensors, "model.safetensors")

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
We use the save_file(model.state_dict(), 'my_model.st') method to write the file to safetensors.
In the conversion process from pickle to safetensors, we also start with the state dict.
Safetensors quickly became the leading format for sharing model weights and architectures for use in further fine-tuning and, in some cases, inference.
Checkpoint files
We've so far taken a look at simple state_dict files and single safetensors files. But if you're training a long-running model, you'll likely have more than just weights and biases to save, and you'll want to save your state from time to time so you can revert if you start to see issues in your training run. PyTorch has checkpoints for this. A checkpoint is a file that has a model state_dict, but also the optimizer's state_dict, as this contains buffers and parameters that are updated as the model trains. Other items you may want to save are the epoch you left off on, the latest recorded training loss, external torch.nn.Embedding layers, and more. This is also saved as a dictionary and pickled, then unpickled when you need it. All of this is saved to a dictionary, the optimizer_state_dict, distinct from the model_state_dict.
# Additional information
EPOCH = 5
PATH = "model.pt"
LOSS = 0.4

torch.save({
    'epoch': EPOCH,
    'model_state_dict': net.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': LOSS,
}, PATH)
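And a sketch of the other direction, loading that checkpoint back to resume training, assuming net and optimizer have been instantiated with the same architecture and settings as before:

```python
checkpoint = torch.load(PATH)

net.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
last_loss = checkpoint['loss']

net.train()  # or net.eval() if we only want to run inference from this snapshot
```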
In addition, most large language models also now include accompanying files like tokenizers and, on HuggingFace, metadata and so on. So if you're working with PyTorch models as artifacts generated via the Transformers library, you'll get a repo that looks like this.
GGML
As the work to migrate from pickle to safetensors was ongoing for generalized model fine-tuning and inference, Apple Silicon continued to get much better. As a result, people started bringing modeling work and inference from large GPU-based computing clusters to local and on-edge devices.
Georgi Gerganov's project to make OpenAI's Whisper run locally with whisper.cpp was a hit and the catalyst for later projects. The release of Llama-2 as a mostly open-source model, combined with the rise of model compression techniques like LoRA, also acted as a catalyst for thinking about working with and running custom models locally, since large language models had until then generally only been accessible on lab or industry-grade GPU hardware (in spite of the small CPU-based examples we've run here).
Based on the interest in and success of whisper.cpp, Gerganov created llama.cpp, a package for working with Llama model weights, originally in pickle format, in GGML format, for local inference.
GGML was initially both a library and a complementary format created specifically for on-edge inference for Whisper. You can also perform fine-tuning with it, but generally it's used to read models trained with PyTorch in GPU Linux-based environments and converted to GGML to run on Apple Silicon.
As an example, here is the script for GGML that converts PyTorch GPT-2 checkpoints to the correct format, read as a .bin file. The files are downloaded from OpenAI.
The resulting GGML file compresses all of these into one and contains:
- a magic number with an optional version number
- model-specific hyperparameters, including metadata about the model, such as the number of layers, the number of heads, and so on, and a ftype that describes the type of the majority of the tensors; for GGML files, the quantization version is encoded in the ftype divided by 1000
- an embedded vocabulary, which is a list of strings with their length prepended
- finally, a list of tensors with their length-prepended name, type, and tensor data
There are several elements that make GGML more efficient for local inference than checkpoint files. First, it uses 16-bit floating-point representations of model weights. Generally, torch initializes floating-point datatypes as 32-bit floats by default. 16-bit, or half precision, means that model weights use 50% less memory at compute and inference time without significant loss in model accuracy. Other architectural choices include using C, which offers more efficient memory allocation than Python. And finally, GGML was built and optimized for Apple Silicon.
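A quick sketch of the half-precision claim, comparing the same weights stored as 32-bit and 16-bit floats:

```python
import torch

weights_fp32 = torch.randn(1024, 1024)  # torch defaults to float32
weights_fp16 = weights_fp32.half()      # 16-bit (half-precision) copy

print(weights_fp32.element_size() * weights_fp32.nelement())  # 4194304 bytes
print(weights_fp16.element_size() * weights_fp16.nelement())  # 2097152 bytes -- half the memory
```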
Unfortunately, in its move toward efficiency, GGML contained several breaking changes that created issues for users.
The biggest one was that, since everything (data, metadata, and hyperparameters alike) was written into the same file, if a model added hyperparameters it would break backward compatibility, and older readers couldn't pick up the new file. Additionally, no model architecture metadata is present in the file, and each architecture required its own conversion script. All of this led to brittle behavior and to the creation of GGUF.
Finally, GGUF
GGUF has the same type of layout as GGML, with metadata and tensor data in a single file, but it is also designed to be backwards-compatible. The key difference is that where the old format used a list of values for the hyperparameters, the new file format uses key-value lookup tables, which accommodate shifting values.
The intuition we've built up around how machine learning models work and how file formats are laid out now allows us to understand the GGUF format.
First, we know that GGUF models are little-endian by default (with big-endian support for specific architectures), which, we remember, means the least significant bytes come first, and which is optimized for different computer hardware architectures.
Then, we have gguf_header_t, which is the header.
It includes the magic bytes that tell us this is a GGUF file:
Must be `GGUF` at the byte level: `0x47` `0x47` `0x55` `0x46`.
as well as the key-value pairs:
// The metadata key-value pairs.
gguf_metadata_kv_t metadata_kv[metadata_kv_count];
This file format also offers versioning; in this case we see this is version 3 of the file format.
// Must be `3` for version described in this spec, which introduces big-endian support.
//
// This version should only be increased for structural changes to the format.
Then, we have the tensors.
The whole file looks like this, and when we work with readers like llama.cpp and ollama, they take this spec and write code to open these files and read them.
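As a rough illustration of what such a reader does first, here's a minimal sketch that checks the magic bytes and reads the fixed-width fields at the start of gguf_header_t (the exact field layout here is my reading of the spec, and "model.gguf" is a placeholder path):

```python
import struct

def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":  # 0x47 0x47 0x55 0x46
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))             # e.g. 3
        (tensor_count,) = struct.unpack("<Q", f.read(8))        # number of tensors in the file
        (metadata_kv_count,) = struct.unpack("<Q", f.read(8))   # number of metadata key-value pairs
    return version, tensor_count, metadata_kv_count

print(read_gguf_header("model.gguf"))
```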

We've been on a whirlwind journey to build up our intuition of how machine learning models work, what artifacts they produce, and how the machine learning artifact storage story has changed over the past couple of years, and we finally ended up in GGUF's documentation to better understand the log that's presented to us when we perform local inference on artifacts in GGUF format. Hope this is helpful, and good luck!