A Distributed File System in Go: Cutting Average Metadata Memory Usage to 100 Bytes

TL;DR:
JuiceFS, written in Go, can manage tens of billions of files in a single namespace. Its metadata engine uses an all-in-memory approach and achieves remarkable memory efficiency, handling 300 million files with 30 GiB of memory at 100-microsecond response times. Techniques like memory pools, manual memory management, directory compression, and compact file formats reduced metadata memory usage to 100 bytes per file.
JuiceFS Enterprise Edition, a cloud-native distributed file system written in Go, can manage tens of billions of files in a single namespace. After years of iteration, a single metadata service process can manage about 300 million files with 30 GiB of memory, while keeping the average processing time of metadata requests at 100 microseconds. In production, 10 metadata nodes, each with 512 GB of memory, collectively manage over 20 billion files.
For ultimate performance, our metadata engine uses an all-in-memory approach and has undergone continuous optimization. Managing the same number of files, it requires only about 27% of the memory of HDFS NameNode, or 3.7% of that of CephFS Metadata Server (MDS). This extremely high memory efficiency means that, with the same hardware resources, JuiceFS can handle more files and more complex operations, achieving higher overall system performance.
In this post, we'll dive into JuiceFS' architecture, our metadata engine design, and the optimization methods that reduced our average metadata memory usage to 100 bytes per file. Our goal is to give JuiceFS users deeper insight and confidence in handling extreme scenarios, and we hope this post serves as a useful reference for designing large-scale systems.
JuiceFS architecture
JuiceFS consists of three major components:
- Client: the access layer that interacts with the application. JuiceFS supports multiple protocols, including POSIX, Java SDK, Kubernetes CSI Driver, and S3 Gateway.
- Metadata engine: maintains the directory tree structure of the file system and the attributes of individual files.
- Data storage: stores the actual content of regular files, usually handled by object storage services like Amazon S3.

Currently, JuiceFS offers two editions: Community Edition and Enterprise Edition. While their architectures are similar, the key difference lies in the implementation of the metadata engine:
- The Community Edition's metadata engine uses existing database services, such as Redis, PostgreSQL, and TiKV.
- The Enterprise Edition features an in-house metadata engine. This proprietary engine not only delivers better performance with lower resource consumption but also provides additional support for enterprise-level requirements.
The following sections explore our considerations and methodology in developing the metadata engine for JuiceFS Enterprise Edition.
Metadata engine design
Choosing Go as the development language
Low-level system software is usually written in C or C++, but JuiceFS chose Go as its development language because Go offers the following advantages:
- High development efficiency: Go syntax is more concise than C and more expressive. In addition, Go comes with built-in memory management and powerful toolchains such as pprof.
- Excellent runtime performance: Go is a compiled language, and in the vast majority of cases programs written in Go don't lag far behind equivalent C programs.
- Better portability: Go has good support for static compilation, making it easy to run programs directly on different operating systems.
- Support for multi-language SDKs: With the native cgo tool, Go code can be compiled into shared libraries (.so files) that can be loaded by other languages.
While Go brings convenience, it hides some low-level details, which can affect how efficiently the program uses hardware resources, especially memory managed by the garbage collector (GC). Therefore, targeted optimizations are needed at performance-critical points.
Performance boost strategies: all-in-memory, lock-free services
To improve performance, we need to understand the core responsibilities of the metadata engine in a distributed file system. Typically, it is mainly responsible for two important tasks:
- Managing metadata for a huge number of files
- Processing metadata requests quickly
All-in-memory mode for managing massive file metadata
There are two common design approaches to the first task:
- Loading all file metadata into memory, as HDFS NameNode does. This provides excellent performance but inevitably requires a large amount of memory.
- Caching only part of the metadata in memory, as CephFS MDS does. When the requested metadata is not in the cache, the MDS holds the request, fetches the corresponding content from disk (the metadata pool) over the network, parses it, and then retries the operation. This can easily cause latency spikes that hurt the user experience. In practice, to meet applications' low-latency access needs, the MDS memory limit is raised as high as possible to cache more data, even all of it.
JuiceFS Enterprise Edition pursues ultimate performance and therefore adopted the first, all-in-memory, approach, continuously optimizing to reduce the memory footprint of file metadata. An all-in-memory design typically relies on real-time transaction logs to persist data for reliability. JuiceFS also uses the Raft consensus algorithm to replicate metadata across servers and provide automatic failover.
Lock-free approach for fast metadata processing
The key performance metric of a metadata engine is the number of requests it can process per second. Metadata requests typically need transactional guarantees and involve multiple data structures, so complex locking is required under concurrent multithreading to keep data consistent and safe. When transactions conflict frequently, multithreading doesn't effectively improve throughput; instead, latency may rise because of the sheer number of lock operations. This is especially evident in high-concurrency scenarios.
JuiceFS took a different approach, similar to Redis' lock-free model: all core data structure operations are executed in a single thread (a simplified sketch of such an event loop follows this list). This approach has the following advantages:
- The single-threaded approach guarantees the atomicity of each operation (operations can't be interrupted by other threads) and reduces thread context switching and resource contention, improving the overall efficiency of the system.
- It also significantly reduces system complexity and improves stability and maintainability.
- Because metadata is stored entirely in memory, requests can be processed efficiently and the CPU does not easily become the bottleneck.
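To make the idea concrete, below is a minimal, hypothetical sketch of such a single-threaded event loop in Go. It is not JuiceFS code; the names (request, metaServer, Handle) are made up for illustration, and the "directory tree" is just a map.

    package main

    import "fmt"

    // request is a hypothetical metadata request: an operation, a path, and
    // a channel on which the single worker goroutine sends back the reply.
    type request struct {
        op    string
        path  string
        reply chan string
    }

    // metaServer funnels all core data structure operations onto one
    // goroutine, so no locks are needed around the directory tree.
    type metaServer struct {
        requests chan request
        tree     map[string]string // stand-in for the real directory tree
    }

    func newMetaServer() *metaServer {
        s := &metaServer{
            requests: make(chan request, 1024),
            tree:     make(map[string]string),
        }
        go s.loop() // the single worker goroutine
        return s
    }

    // loop is the only goroutine that touches s.tree, so every operation is
    // atomic with respect to the others without any locking.
    func (s *metaServer) loop() {
        for req := range s.requests {
            switch req.op {
            case "create":
                s.tree[req.path] = "inode"
                req.reply <- "ok"
            case "lookup":
                req.reply <- s.tree[req.path]
            }
        }
    }

    // Handle is called from many network goroutines; they only enqueue work
    // and wait for the answer.
    func (s *metaServer) Handle(op, path string) string {
        r := request{op: op, path: path, reply: make(chan string, 1)}
        s.requests <- r
        return <-r.reply
    }

    func main() {
        s := newMetaServer()
        s.Handle("create", "/a")
        fmt.Println(s.Handle("lookup", "/a")) // "inode"
    }

Because only the loop goroutine ever touches the tree, correctness does not depend on locks; throughput then depends on keeping each operation short, which the all-in-memory layout makes possible.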
Multi-partition horizontal scaling
The memory available to a single metadata service process has its limits, and efficiency gradually declines as per-process memory usage grows. JuiceFS achieves horizontal scaling by aggregating metadata distributed across multiple nodes into virtual partitions, supporting larger data scales and higher performance demands.
Specifically, each partition is responsible for a portion of the file system's subtree, and clients coordinate and assemble data across partitions into a single namespace. Data in these partitions can migrate dynamically as needed. For example, a cluster managing over 20 billion files might use 10 metadata nodes with 512 GB of memory each, deployed across 80 partitions. Typically, we recommend limiting a single metadata service process to 40 GiB of memory and managing more files through multi-partition horizontal scaling.
File system access usually shows strong locality, with files moving within the same directory or between adjacent directories. Therefore, JuiceFS implements a dynamic subtree splitting mechanism that keeps subtrees large, so that most metadata operations take place within a single partition. This greatly reduces the use of distributed transactions and ensures that, even after extensive scaling, the cluster maintains metadata response latencies similar to those of a single partition.
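To illustrate the idea, here is a deliberately simplified sketch of how a client might route a path to a virtual partition by its longest matching subtree root. The table, names, and matching logic are assumptions for this example, not JuiceFS' actual protocol.

    package main

    import (
        "fmt"
        "strings"
    )

    // partitionTable maps subtree roots to partition IDs. In a real system
    // the client would fetch and cache this table from the metadata
    // cluster; here it is hard-coded for illustration.
    var partitionTable = map[string]int{
        "/":          0,
        "/data":      1,
        "/data/logs": 2,
    }

    // partitionFor picks the partition owning the longest matching subtree
    // root, so operations within one directory usually hit one partition.
    // (A real implementation would match path components, not raw prefixes.)
    func partitionFor(path string) int {
        best, bestLen := 0, -1
        for root, id := range partitionTable {
            if strings.HasPrefix(path, root) && len(root) > bestLen {
                best, bestLen = id, len(root)
            }
        }
        return best
    }

    func main() {
        fmt.Println(partitionFor("/data/logs/app.log")) // 2
        fmt.Println(partitionFor("/home/alice"))        // 0
    }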
How to reduce memory usage
As the number of files grows, the memory required by the metadata service also rises. This affects system performance and drives up hardware costs. Therefore, reducing metadata memory usage is critical for maintaining system stability and controlling costs when managing massive numbers of files.
To achieve this goal, we have explored and implemented extensive optimizations in memory allocation and usage. Below, we discuss some measures that have proven effective through years of iteration and optimization.
Using memory pools to reduce allocation
Using memory pools to reduce allocation is a common optimization technique in Go programs, mainly relying on the sync.Pool structure from the standard library. The principle is not to discard data structures after use but to return them to a pool. When the same type of data structure is needed again, it can be retrieved directly from the pool without a new allocation. This approach effectively reduces the frequency of memory allocation and deallocation, thereby improving performance.
For example:
    // A pool of reusable 128 KiB buffers; New runs only when the pool is empty.
    pool := sync.Pool{
        New: func() interface{} {
            buf := make([]byte, 1<<17) // 128 KiB
            return &buf
        },
    }

    buf := pool.Get().(*[]byte)
    // do some work with *buf
    pool.Put(buf) // return the buffer so later calls can reuse it
During initialization, we typically define a New function that creates a new structure. When we need an object, we call the Get method and assert it to the corresponding type. When we're done, we call the Put method to return the structure to the pool. Note that a returned structure is only weakly referenced by the pool and may be garbage-collected at any time.
The structure in the example above is a pre-allocated memory slice, essentially forming a simple memory pool. Combined with the finer-grained management techniques discussed in the next section, it enables efficient memory usage in the program.
Manual management of small memory allocations
In the JuiceFS metadata engine, the most critical part is maintaining the directory tree structure, which looks roughly like this:

In this structure (illustrative type definitions follow the list):
- A node records the attributes of each file or directory, typically occupying 50 to 100 bytes.
- An edge describes the relationship between a parent and a child node, typically occupying 60 to 70 bytes.
- An extent records the location of file data, typically occupying about 40 bytes.
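For a rough sense of what these objects might look like, below are illustrative Go type definitions. The field sets are assumptions based on the byte counts above, not the actual JuiceFS structures.

    // Illustrative layouts only; the actual JuiceFS definitions differ.
    package meta

    // node holds the attributes of one file or directory (roughly 50-100 bytes).
    type node struct {
        id                  uint64
        parent              uint64
        mode                uint16
        uid, gid            uint32
        atime, mtime, ctime int64
        length              uint64
    }

    // edge links a parent directory to one child by name (roughly 60-70 bytes).
    type edge struct {
        parent uint64
        child  uint64
        name   string
    }

    // extent records where one contiguous range of file data lives (roughly 40 bytes).
    type extent struct {
        sliceID uint64
        pos     uint64
        length  uint32
    }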
These structures are small but numerous. Go's GC has no generations, which means that if they are all managed by the GC, it has to scan all of them during every GC cycle and mark every referenced object. That process can be slow, delaying memory reclamation and consuming excessive CPU.
To manage these huge numbers of small objects efficiently, we use unsafe pointers (including uintptr) to bypass Go's GC and allocate and manage memory manually. In this implementation, the metadata engine requests large blocks of memory from the system and then splits them into small blocks of the same size. When saving pointers to these manually allocated blocks, we prefer unsafe.Pointer or even uintptr types, relieving the GC from scanning these pointers and significantly reducing its workload during memory reclamation.
We designed a metadata memory pool named Arena, containing multiple buckets to isolate structures of different sizes. Each bucket holds large memory blocks, such as 32 KiB or 128 KiB. When a metadata structure is needed, the Arena interface locates the corresponding bucket and allocates a small segment from it; after use, the segment is returned to the pool. Arena's design diagram is as follows:

The management details are intricate. If you're interested, you can learn more from the implementation ideas of memory allocators such as tcmalloc and jemalloc; our design is similar to theirs.
Below is a piece of key code in Arena:
    // Resident memory blocks
    var slabs = make(map[uintptr][]byte)

    p := pagePool.Get().(*[]byte) // 128 KiB
    ptr := unsafe.Pointer(&(*p)[0])
    slabs[uintptr(ptr)] = *p
Here, slabs is a global map that records all memory blocks allocated in Arena. It lets the GC know that these large blocks are still in use.
The following code creates structures:
    func (a *arena) Alloc(size int) unsafe.Pointer {...}

    size := nodeSizes[typ]
    n := (*node)(nodeArena.Alloc(size))

    // var nodeMap map[uint32]uintptr
    nodeMap[n.id] = uintptr(unsafe.Pointer(n))
Arena's Alloc function requests memory of a specific size and returns an unsafe.Pointer. When we create a node, we first determine the size required by its type and then cast the returned pointer to the desired structure type. If necessary, we convert this unsafe.Pointer to a uintptr and store it in nodeMap, a huge map used to quickly look up the corresponding structure by node ID.
From the GC's perspective, the program has simply requested many 128 KiB memory blocks that are constantly in use, and it doesn't need to care about their contents. Moreover, although nodeMap contains hundreds of millions or even billions of entries, all of its keys and values are numeric, so the GC doesn't have to scan each key-value pair. This design is friendly to the GC: even with hundreds of gigabytes of memory in use, it can easily complete a scan.
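Putting these pieces together, below is a minimal, simplified arena sketch built on the same ideas: fixed-size slots carved out of 128 KiB slabs, with freed slots recycled through a free list. It is an illustration under stated assumptions, not the real Arena implementation, which has multiple buckets per size class, statistics, and more careful bookkeeping.

    package main

    import (
        "fmt"
        "unsafe"
    )

    const slabSize = 128 << 10 // 128 KiB slabs requested from the Go heap

    // bucket hands out fixed-size slots carved from large slabs. Keeping a
    // reference to each slab in slabs tells the GC the memory is live, while
    // the individual slots are returned as unsafe.Pointer values the GC
    // never has to scan.
    type bucket struct {
        slotSize int
        slabs    [][]byte  // resident memory blocks
        free     []uintptr // recycled slots
        off      int       // next unused offset in the current slab
    }

    func (b *bucket) Alloc() unsafe.Pointer {
        // Reuse a freed slot if one is available.
        if n := len(b.free); n > 0 {
            p := b.free[n-1]
            b.free = b.free[:n-1]
            return unsafe.Pointer(p)
        }
        // Otherwise carve a new slot, starting a new slab when needed.
        if len(b.slabs) == 0 || b.off+b.slotSize > slabSize {
            b.slabs = append(b.slabs, make([]byte, slabSize))
            b.off = 0
        }
        slab := b.slabs[len(b.slabs)-1]
        p := unsafe.Pointer(&slab[b.off])
        b.off += b.slotSize
        return p
    }

    // Free returns a slot to the bucket; nothing is handed back to the GC.
    func (b *bucket) Free(p unsafe.Pointer) {
        b.free = append(b.free, uintptr(p))
    }

    // node is a stand-in metadata structure for the example.
    type node struct {
        id   uint64
        size uint64
    }

    func main() {
        nodes := &bucket{slotSize: int(unsafe.Sizeof(node{}))}
        n := (*node)(nodes.Alloc())
        n.id, n.size = 1, 4096
        fmt.Println(n.id, n.size)
        nodes.Free(unsafe.Pointer(n))
    }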
Compressing idle directories
As mentioned above, file system access has strong locality; applications usually access only a few specific directories frequently, leaving the rest idle. Based on this observation, we compress the metadata of inactive directories to reduce memory usage. The process is as follows:

When the dir directory is idle, its metadata, together with all of its immediate children, can be compactly serialized into a contiguous memory buffer according to a predefined format. This buffer can then be compressed to an even smaller size.
Typically, serializing multiple structures together saves nearly half of the memory, and compression can further reduce usage by roughly one half to two thirds. This method therefore significantly lowers the average memory usage of individual file metadata. However, serialization and compression consume some CPU and may increase request latency. To balance efficiency, we monitor CPU status internally and trigger this process only when the CPU is idle, limiting each operation to 1,000 files so that it completes quickly.
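As an illustration of the packing step, the sketch below serializes a directory's children into one contiguous buffer and compresses it with zlib. The record layout and the choice of compression codec are assumptions for this example; JuiceFS' actual format and codec may differ.

    package main

    import (
        "bytes"
        "compress/zlib"
        "encoding/binary"
        "fmt"
    )

    // child is a simplified child entry used only for this example.
    type child struct {
        ID   uint64
        Mode uint16
        Size uint64
        Name string
    }

    // packDir flattens all children into one buffer, then compresses it.
    func packDir(children []child) ([]byte, error) {
        var buf bytes.Buffer
        for _, c := range children {
            binary.Write(&buf, binary.LittleEndian, c.ID)
            binary.Write(&buf, binary.LittleEndian, c.Mode)
            binary.Write(&buf, binary.LittleEndian, c.Size)
            binary.Write(&buf, binary.LittleEndian, uint16(len(c.Name)))
            buf.WriteString(c.Name)
        }
        var out bytes.Buffer
        w := zlib.NewWriter(&out)
        if _, err := w.Write(buf.Bytes()); err != nil {
            return nil, err
        }
        if err := w.Close(); err != nil {
            return nil, err
        }
        return out.Bytes(), nil
    }

    func main() {
        dir := []child{{1, 0o644, 42, "a.txt"}, {2, 0o644, 7, "b.txt"}}
        packed, _ := packDir(dir)
        fmt.Println("packed size:", len(packed))
    }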
Designing more compact formats for small files
To support efficient random reads and writes, JuiceFS indexes the metadata of regular files at three levels: fnodes, chunks, and slices. Chunks form an array, and slices are stored in a hash table. Initially, every file required all three memory allocations. However, we found this inefficient for most small files, because they usually have only one chunk, which in turn has only one slice, and the slice's length equals the file's length.
Therefore, we introduced a more compact and efficient in-memory format for such small files. In the new format, we only need to record the slice ID and can derive the slice length from the file's length, without storing the slice itself. We also adjusted the fnode structure: previously, an fnode stored a pointer to the chunks array, which contained only an 8-byte slice ID; now we store that ID directly in the pointer field. This is similar to a union in C, storing different kinds of data in the same memory location depending on the situation. After these adjustments, each small file has just one fnode object, with no extra chunk list or slice records.
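Here is a rough Go sketch of such a union-style layout. The names, the boolean tag, and the field layout are assumptions for illustration only, not the actual JuiceFS definitions.

    package main

    import (
        "fmt"
        "unsafe"
    )

    type slice struct {
        id  uint64
        len uint32
    }

    type chunk struct {
        slices []slice
    }

    // fnode keeps either a pointer to the chunk list (large files) or the
    // raw slice ID (small files) in the same word. A boolean flag is used
    // here for clarity; a real implementation could steal a tag bit instead.
    type fnode struct {
        size       uint64
        smallFile  bool
        chunksOrID uintptr
    }

    func (f *fnode) setSmall(sliceID uint64) {
        f.smallFile = true
        f.chunksOrID = uintptr(sliceID)
    }

    // setChunks assumes the chunk list lives in manually managed (arena)
    // memory, so storing it as a uintptr does not hide it from the GC.
    func (f *fnode) setChunks(cs *[]chunk) {
        f.smallFile = false
        f.chunksOrID = uintptr(unsafe.Pointer(cs))
    }

    // firstSlice shows how reads recover the slice: for small files the
    // length is derived from the file size instead of being stored.
    func (f *fnode) firstSlice() slice {
        if f.smallFile {
            return slice{id: uint64(f.chunksOrID), len: uint32(f.size)}
        }
        cs := (*[]chunk)(unsafe.Pointer(f.chunksOrID))
        return (*cs)[0].slices[0]
    }

    func main() {
        f := &fnode{size: 100}
        f.setSmall(42)
        fmt.Println(f.firstSlice()) // {42 100}
    }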

The optimized format saves about 40 bytes of memory per small file. It also reduces memory allocation and indexing operations, resulting in faster access.
Overall optimization results
The figure below summarizes our optimization results:

As the figure shows, the average metadata size per file decreased significantly:
- Initially, the average metadata size per file was nearly 600 bytes.
- Through manual memory management, it dropped to about 300 bytes, significantly reducing GC overhead.
- Then, by serializing idle directories, it was further reduced to about 150 bytes.
- Finally, with memory compression, the average size fell to about 50 bytes.
However, the metadata service also performs tasks such as status monitoring, session management, and network transfers, which may push memory usage above this core value. Therefore, we typically estimate hardware requirements at 100 bytes per file.
The per-file memory usage of common distributed file systems is as follows:
- HDFS: 370 bytes (source: online cluster monitoring, 52 GB memory, 140 million files)
- CephFS: 2,700 bytes (source: Nautilus-version cluster monitoring, 32 GB memory, 12 million files)
- Alluxio (heap mode): 2,100 bytes (source: Alluxio documentation, 64 GB memory, 30 million files)
- JuiceFS Community Edition with the Redis engine: 430 bytes (source: Redis Best Practices)
- JuiceFS Enterprise Edition: 100 bytes (source: online cluster monitoring, 30 GB memory, 300 million files)
JuiceFS demonstrates outstanding metadata memory efficiency, using only 27% of the memory of HDFS NameNode and 3.7% of that of CephFS MDS. This not only indicates higher memory efficiency but also means that, with the same hardware resources, JuiceFS can handle more files and more complex operations, improving overall system performance.
Conclusion
One of the core components of a file system is its metadata management. When building a distributed file system capable of handling tens of billions of files, this design task becomes particularly complex.
This article introduced the key decisions behind JuiceFS' metadata engine design and elaborated on four memory optimization techniques: memory pools, manual management of small memory blocks, compression of idle directories, and more compact formats for small files. These measures are the result of continuous exploration, experimentation, and iteration, and they ultimately reduced JuiceFS' average memory usage for file metadata to 100 bytes, making JuiceFS better suited to a wide range of extreme application scenarios.
If you have any questions or would like to learn more, feel free to join JuiceFS discussions on GitHub and our community on Slack.
Author
JuiceFS Core System Engineer