Ask HN: What happened to flatbuffers? Are they being used?


2023-01-17 11:52:49

I love flatbuffers but they’re only worthwhile in a very small problem space.

If your main concern is “faster than JSON” then you’re better off using Protocol Buffers simply because they’re way more popular and better supported. FlatBuffers are cool because they let you decode on demand. Say you have an array of 10,000 complex objects. With JSON or Protocol Buffers you’re going to need to decode and load into memory all 10,000 before you’re able to access the one you want. But with FlatBuffers you can decode item X without touching 99% of the rest of the data. Quicker and much more memory efficient.
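The random-access idea above can be sketched in miniature. This is not the real FlatBuffers wire format, just fixed-stride binary records to show the access pattern: you can decode item X by offset without touching the rest of the buffer, whereas JSON must parse the whole array first.

```python
import struct

# Toy record: int32 id + float64 score, little-endian (12 bytes each).
record = struct.Struct("<id")
buf = bytearray(record.size * 10_000)
for i in range(10_000):
    record.pack_into(buf, i * record.size, i, i * 0.5)

# Random access: decode only item 7777, leaving the other 9,999 untouched.
item_id, score = record.unpack_from(buf, 7777 * record.size)

# The JSON equivalent would parse all 10,000 items before you can index one:
#   json.loads(text)[7777]
```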

But it’s not simple to implement. You have to write a schema then turn that schema into source files in your target language. There’s an impressive array of target languages but it’s a custom executable and that adds complexity to any build. Then the generated API is difficult to use (in JS at least) because of course an array isn’t a JavaScript array, it’s an object with decoder helpers.
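For reference, the schema-then-codegen workflow looks roughly like this. A minimal, hypothetical schema (`Monster` and its fields are made up for illustration):

```
// monster.fbs
namespace Game;

table Monster {
  name: string;
  hp: int = 100;
}

root_type Monster;
```

Running `flatc --python monster.fbs` (or `--cpp`, `--ts`, …) emits the accessor classes; that extra `flatc` step is the custom executable in the build that the comment above refers to.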

It’s also quite easy to trip yourself up in terms of performance by decoding the same data over and over again rather than re-using the first decode like you would with JSON or PB. So you have to think about which decoded items to store in memory, where, for how long, etc… I kind of think of it as the data equivalent of a programming language with manual memory management. Definitely has a place. But the majority of projects are going to be fine with automatic memory management.
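The "decide what to keep decoded" pattern can be as simple as a small caching wrapper around whatever accessor you'd otherwise call repeatedly (the names here are made up; a real version would wrap a generated FlatBuffers table):

```python
class CachedView:
    """Decode a field once, then serve it from memory afterwards."""

    def __init__(self, decode):
        self._decode = decode   # e.g. a generated FlatBuffers accessor
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._decode(name)  # pay the decode cost once
        return self._cache[name]

calls = []
view = CachedView(lambda name: calls.append(name) or name.upper())
first = view.get("hp")
second = view.get("hp")
```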

Are there protobuf throughput benchmarks somewhere? I haven’t been able to verify that they’re faster than JSON.

Edit: I was able to find these at https://github.com/hnakamur/protobuf-deb/blob/master/docs/pe… however these numbers don’t appear conclusive. Protobuf decode throughput for many of the schemas examined is far slower than JSON, though the protobuf payloads will likely also be somewhat smaller. One should compare decode throughput for the same documents serialized both ways rather than just a table of numbers.

> But it’s not simple to implement. You have to write a schema then turn that schema into source files in your target language. There’s an impressive array of target languages but it’s a custom executable and that adds complexity to any build. Then the generated API is difficult to use

Worth noting that all these things are true for protobuf as well.

It’s used for several ML-related projects, including as the model format for TensorFlow Lite (TFLite). The TFLite format also has long-term support as part of Google Play Services. The main attraction is the ability to pass large amounts of data without having to serialize/deserialize all of it to access fields.

(I work for Google but don’t speak for it.)

I’ve played with using it to interop some Python data science code with some C++ code efficiently, without writing the project in Cython or using a tool like pybind11. It worked pretty well in my test scenario, but I’m not sure how great of an idea that truly is.

Is the Cap’n Proto use case similar to something like ZeroMQ or NNG? I’m still not fully sure.

TensorFlow Lite (tflite) uses flatbuffers. This format, and vendor-specific forks of it, ship on hundreds of millions of phones and other embedded devices.

They (and similar technologies) are used where it matters.

Games, data visualization, … numerically heavy applications mainly.

On a side note: JSON has been somewhat of a curse. The developer ergonomics of it are so good that web devs completely disregard how they should lay out their data. You know, sending a table as a bunch of nested arrays, that sort of thing. Yuck.

In web apps, data is essentially unusable until it has been unmarshalled. Fine for small things, horrible for data-heavy apps, which really so many apps are now.

Sometimes I wonder if it will change. I’m optimistic that the popularity of mem-efficient formats like this will establish a new base paradigm of data transfer, and be adopted broadly on the web.

They’re used a lot in video games and embedded systems. They’re not something you see advertised.

gRPC and Thrift are mostly used as backend service interconnects in lieu of RESTful APIs.

Capnproto is also awesome.

I’m using flatbuffers as the basis of communication for my multiplayer game. They’re really quite pleasant to work with after you get into the flow of it.

Yeah, they’re used a lot. I think the difference is JSON is good for data or APIs you want to be easily shared; flatbuffers (or protobuf or Cap’n Proto) are good for data that stays internal. That’s just a guideline and there are plenty of exceptions, but it’s a starting point for thinking about it.

Is using flatbuffers as the on-disk storage format for an application a hare-brained idea?

If yes, is it a less hare-brained idea than using the ctypes Python module to mmap a file as a C struct? That’s what I’m currently doing to get 10x speedup relative to SQLite for an application bottlenecked on disk bandwidth, but it’s unergonomic to say the least.

Flatbuffers look like a way to get the same performance with better ergonomics, but maybe there’s a catch. (E.g. I thought the same thing about Apache Arrow before, but then I realized it’s basically read-only. I don’t expect to need to resize my tables often, but I do need to be able to twiddle individual values inside the file.)
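The ctypes-over-mmap approach mentioned above can be sketched like this (the `Record` layout is hypothetical; a real application would define whatever fields it needs):

```python
import ctypes
import mmap
import os
import tempfile

# Hypothetical fixed-layout record (12 bytes of fields, padded to 16).
class Record(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint32), ("value", ctypes.c_double)]

path = os.path.join(tempfile.mkdtemp(), "records.bin")
n = 1000
with open(path, "wb") as f:
    f.write(b"\x00" * (ctypes.sizeof(Record) * n))  # preallocate the file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    records = (Record * n).from_buffer(mm)  # view the file as an array of structs
    records[42].value = 3.14                # twiddle one value in place
    val = records[42].value
    del records                             # release the buffer before closing
    mm.close()
```

Writes go straight through the mapping to the page cache, which is where the speedup over row-at-a-time storage comes from, at the cost of the ergonomics the commenter mentions.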

> ctypes Python module to mmap a file as a C struct

Tell me more! Is your data larger than memory? You need persistence?

You might take a look at Aerospike, even on a single node if you need low latency persistence.

1. It’s not easier to use them than JSON when just getting started. However, the payoff is the strong typing and zero-copy access that they offer to folks who need to support clients on multiple architectures.

2. No, writers can directly embed structs and primitive data types into binary buffers at runtime through an API generated from an IDL file. Readers use direct memory access to pull values out of the buffers. If you set it up right, this can result in a massive perf boost by eliminating the encoding and decoding steps.

3. Facebook uses them in their mobile app. Another commenter mentioned use of them in the Arrow format. The flatbuffers website isn’t the best, but it clearly documents the flatbuffers IDL.

The Google documentation has a minimal tutorial. There are implementations for all of the major languages. The level of documentation in the ecosystem, though, is poor. My best recommendation is to jump in and get the tutorial/hello-world example working in a language you’re comfortable with.

They aren’t hard to use, but they aren’t the easiest thing either.

Once you get the gist of the API through the tutorial, the other important topics that come up immediately are version control; git repo design; headers and framing.

In production, they’ve been bulletproof. As long as you account for the compile time issues (schema versioning, repo, headers and framing, etc.).

We were using protobufs at Spotify and ditched them for simple JSON calls on the client side. No one complained, and we’re never going back to having anything like that on the client side if I can help it.


Just too many drawbacks.

For server-to-server, they might be fine, but for clients just stick with JSON (which, when compressed, is pretty efficient).

One could combine JSON and a serialization-less library: your JSON would be blown up with whitespace, but reads and updates could be O(1), serialization would be a memcpy, and you could probably canonicalize the JSON during the memcpy using Lemire’s SIMD techniques.
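A minimal sketch of the padded-JSON idea, assuming fixed-width number fields so every element sits at a computed offset (the widths and layout here are made up for illustration):

```python
import json

WIDTH = 12  # every number padded to a fixed width with JSON whitespace

values = [3, 14159, 26535]
body = ",".join(str(v).rjust(WIDTH) for v in values)
buf = bytearray("[" + body + "]", "ascii")  # still perfectly valid JSON

def read(buf, i):
    # O(1): element i always starts at 1 + i * (WIDTH + 1); no parsing.
    start = 1 + i * (WIDTH + 1)
    return int(buf[start:start + WIDTH])

def write(buf, i, v):
    # O(1): overwrite in place, keeping the width constant.
    start = 1 + i * (WIDTH + 1)
    buf[start:start + WIDTH] = str(v).rjust(WIDTH).encode()

write(buf, 0, 99)
```

Any standard JSON parser still accepts the buffer, since the padding is legal JSON whitespace; "serialization" of the whole thing really is just a memcpy.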

I did this once for reading JSON on the fast path: the sending system laid out the arrays in a periodic pattern in memory that enabled parse-less retrieval of individual values.

https://github.com/simdjson/simdjson

That’s an intriguing idea but limits you to strings for your internal representation. Every time you wanted to pull a number out of it you’d be reparsing it.

Also I assume you’d have to have some sort of binary portion bundled with it to hold the field offsets, no?

How is it annoying? To be fair, we’re fronting our gRPC service with an AWS LB that terminates TLS (so our gRPC is plaintext), so we don’t deal with certs as direct dependencies of our server.

Being able to debug through a simple curl or browser devtools is golden.

Also, the browser has JSON parsing built in. Fewer dependencies. Easier tooling overall.

In my experience people overuse protobuf. But I also worked at Google, where it’s the hammer in constant search of any nail it can find.

At the very least, endpoints should offer a JSON representation through content negotiation.

When the library decoding the data is failing with weird errors, and you open the devtools in the browser and the data being transmitted is all binary, you have a very hard time debugging things.

We moved to flatbuffers and back to JSON because, at the end of the day, for our data, JSON+gzip was similarly sized to the original (which had some other fields we weren’t using) and 10–20 times faster to decode.
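The size check described above is easy to reproduce in miniature: repetitive JSON (repeated keys, similar rows) compresses very well, which is why JSON+gzip can be competitive with binary formats. The payload here is made up for illustration:

```python
import gzip
import json

# A typical repetitive API payload: same keys in every row.
rows = [{"id": i, "name": f"item-{i}", "score": i * 0.5} for i in range(1000)]
raw = json.dumps(rows).encode()
packed = gzip.compress(raw)
# gzip exploits the repeated structure, shrinking the payload substantially.
```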

Truth.

That said, the use case for flatbuffers and capnproto isn’t really about data size, it’s about avoiding unnecessary copies in the processing pipeline. “Zero copy” really does pay dividends where performance is a concern if you write your code the right way.

Most people working on typical “web stack” applications won’t hit these concerns. But there are classes of applications where what flatbuffers (and other zero-copy payload formats) offer is important.

The difference in computation time between operating on something sitting in L1 cache vs. not-in-cache is orders of magnitude. And memory bandwidth is a bottleneck in some applications and on some machines (particularly embedded).
