Replicate & Fly cold-start latency

How does cold & warm latency compare between these providers?
Problem Context
I’m building a semantic search engine, and I’m mostly interested in minimizing query-time latency and trying out a number of different models.
One constraint: I don’t want to pay ~$2k/mo to keep a GPU permanently warm, so I’m generally looking for serverless providers with great cold-start latency.
Test #1: Cold-start latency profile on Replicate
I’m testing an MSMARCO-trained bi-encoder from sentence-transformers. I’m not doing anything particularly clever re: model loading – the 100 MB model is downloaded from Huggingface the first time the container is run.
I instrumented the Cog container to report a) the time since the setup function was first called, representing the end of machine startup, and b) the time taken for model inference. Then, I timed it from a Python script calling the Replicate API.
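Here’s a minimal sketch of that instrumentation, assuming a msmarco-MiniLM-L-6-v3 checkpoint – the model name and the timing fields in the response are placeholders, not the exact code I ran:

```python
# predict.py – minimal sketch of the instrumented Cog predictor.
import json
import time

from cog import BasePredictor, Input
from sentence_transformers import SentenceTransformer

CONTAINER_START = time.time()  # module import time, roughly the end of machine startup


class Predictor(BasePredictor):
    def setup(self):
        # Time from container start until setup() is called.
        self.setup_called_s = time.time() - CONTAINER_START
        # Downloads the ~100 MB model from Huggingface the first time the container runs.
        t0 = time.time()
        self.model = SentenceTransformer("msmarco-MiniLM-L-6-v3")  # placeholder model name
        self.model_load_s = time.time() - t0

    def predict(self, query: str = Input(description="Query string to embed")) -> str:
        t0 = time.time()
        embedding = self.model.encode(query).tolist()
        return json.dumps({
            "setup_called_s": self.setup_called_s,
            "model_load_s": self.model_load_s,
            "inference_s": time.time() - t0,
            "embedding": embedding,
        })
```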

Machine startup takes around 60 seconds, downloading the model takes about 10, and embedding a single query string takes just around 5 ms.
70s is too long to wait at a search box.
Test #2: Cold & warm latency tests on Replicate vs Fly
I uploaded the same Cog container to Fly, and updated my instrumentation script to call that endpoint too. With Fly, I haven’t figured out how to configure the automatic scale-to-zero timer yet, so I manually used fly machines kill
to profile cold-start times.
To make sure that Fly didn’t have an additional layer of short-lived caching, I spaced out my cold-start tests by up to 12 hours.
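The client-side timing script is roughly the following sketch; the model reference and the Fly URL are hypothetical placeholders, and the Fly call assumes the stock Cog HTTP server’s /predictions endpoint:

```python
# time_endpoints.py – minimal sketch of the timing script (endpoint names are placeholders).
import time

import replicate  # Replicate's official Python client
import requests

REPLICATE_MODEL = "someuser/msmarco-biencoder"        # hypothetical model reference
FLY_URL = "https://my-embedder.fly.dev/predictions"   # hypothetical Fly endpoint (Cog HTTP server)


def time_replicate(query: str) -> float:
    t0 = time.time()
    replicate.run(REPLICATE_MODEL, input={"query": query})
    return time.time() - t0


def time_fly(query: str) -> float:
    t0 = time.time()
    resp = requests.post(FLY_URL, json={"input": {"query": query}}, timeout=300)
    resp.raise_for_status()
    return time.time() - t0


if __name__ == "__main__":
    # The first request after a scale-down / `fly machines kill` is the cold number;
    # subsequent requests are warm.
    for name, timer in [("replicate", time_replicate), ("fly", time_fly)]:
        cold = timer("what is a bi-encoder?")
        warm = [timer("what is a bi-encoder?") for _ in range(5)]
        print(f"{name}: cold={cold:.2f}s warm_min={min(warm):.2f}s")
```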

[Chart: cold and warm response times on Replicate vs Fly, measured from setup() function invocation.]

Fly far dominates Replicate’s numbers, for both warm and cold inferences. After getting my initial numbers on a Replicate A40, I chatted with a friend who works at Replicate, who mentioned that T4s are hosted differently from A40s and that I’d get better results. I then tried T4s, but I got much worse cold latencies, and warm latencies were still not much improved.
To make sure there wasn’t something surprisingly wrong happening in my inference code, I switched to an effectively no-op predict function of return json.dumps({"success": "True"}), and my results were generally unchanged.
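For reference, the no-op version looked essentially like this (a sketch; only the return line matters):

```python
# Effectively no-op predict(): skips the model to isolate platform + network overhead.
def predict(self, query: str = Input(description="Ignored")) -> str:
    return json.dumps({"success": "True"})
```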
What now?
Replicate seems to have surprisingly poor latency, regardless of cold/warm status. 800ms is a lot to spend on networking for an already-warm model. If you’re thinking of deploying a model to production, and you really want to use the Cog architecture for some reason, you might still consider deploying to Fly.
Fly continues to be excellent. The min(machine_boot) I experienced with cold boots was an astounding 2s – and I’d guess their cold boot time is probably the best available among GPU providers.
I’m likely to still use Replicate for its open source / social aspect. I’d like to containerize and test out a bunch of embedding models for which code but no API currently exists (mainly SPLADE, plus more models from sentence-transformers), and Replicate has a great story for turning that effort into a public good. I’d love to build a universal embeddings API, where it’s trivial to switch the model you’re using in a product. With Replicate’s ability to let other users pay for their own usage, it’s the only provider that could support this.
But if you ask me which provider you should deploy to prod on – it’s probably not Replicate.