Helix Project Report, Feb 2024

Hello people,
It’s been just over a month since we launched Helix v0.1, and today I’m happy to announce the availability of Helix v0.5. Run it yourself on your own secure private infrastructure, or try it out on our SaaS:
Along with a shiny new UI (you can see a screenshot for comparison in our first post), we’ve been extremely focused on improving the quality of the text fine-tuning.
When we first launched Helix, the text fine-tuning of the Mistral-7B-Instruct language model was based on this LlamaIndex docs page. Yeah, we basically decided to build an entire business around this one LlamaIndex notebook.
At the time, sadly, I hadn’t even seen the link from the parent page in the LlamaIndex docs that said “WIP: this isn’t very good yet”. Uh-oh.
Of course, because Helix is focused on being runnable entirely on-prem, we’re using Mistral-7B with axolotl for fine-tuning instead of the GPT-3.5 API, but it’s the same principle. The idea is that you chunk your documents up into pieces, then ask a language model to generate question-answer pairs (which is the shape of the training data you need to provide to the fine-tuning process). You’re using an LLM to automate generating training data for fine-tuning another LLM. The original prompt is essentially the one from that LlamaIndex notebook: read a chunk of the document and write question/answer pairs about it.
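To give a feel for the mechanics, here’s a minimal sketch of that chunk-then-generate loop in Python, assuming an OpenAI-compatible endpoint via the `openai` client; the prompt wording, helper names and output format are illustrative, not exactly what Helix ships:

```python
# Minimal sketch of the idea: chunk a document, then ask an LLM to write
# question/answer pairs about each chunk, which become the fine-tuning data.
# Assumes an OpenAI-compatible endpoint; prompt wording and names are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # point api_key / base_url at whichever endpoint you run


def chunk_text(text: str, size: int = 2000) -> list[str]:
    """Naive fixed-size chunking; real chunkers split on document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ask_llm_for_qapairs(instruction: str, chunk: str,
                        model: str = "mistralai/Mistral-7B-Instruct-v0.1") -> list[dict]:
    """Send one instruction plus one chunk, expect a JSON list of
    {"question": ..., "answer": ...} objects back."""
    prompt = instruction + "\n\nContext:\n" + chunk
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)  # real code must handle bad JSON


def generate_training_data(document: str) -> list[dict]:
    instruction = ("Write 3 question/answer pairs about the context below, "
                   "as a JSON list of objects with 'question' and 'answer' keys.")
    pairs = []
    for chunk in chunk_text(document):
        pairs.extend(ask_llm_for_qapairs(instruction, chunk))
    return pairs  # this is the shape of data the fine-tune (e.g. axolotl) consumes
```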
Did it work? Sorta, but it was a bit crap. Just like LlamaIndex said it would be.
We were able to feed it complex technical papers, and it was able to answer technical questions about them. But it failed at some far more basic tasks.
This one news article, for example: Junior doctors in England to stage more strikes over pay. This article became the bane of my life for a few weeks 😂
Why? Because when we first asked the fine-tuned model the simple question:
What?! You were fine-tuned on information derived from the article. Why are you talking about fine-tuning?!
–me
OK, it turned out this one was pretty simple. One of the text elements meant to tell the user “fine tuning complete” was also being sent back to the model as if the user had said it. Now the only context the model had to go on was the idea of fine-tuning. We got that one out of the way. Cool, let’s try again:
OMG, seriously? Surely you must know that the docs are threatening to go on strike. It’s right there in the title! Stupid fine-tuning, maybe it’ll never work. 🤬
–me
But we persevered. OK, so why isn’t the model able to answer questions about the basic context of the article? Well, it’s because the question-answer pairs generated by the prompt aren’t in the form of simple questions about the title of the document. The solution, it turned out, was that rather than a single qapair-generating prompt, we had to implement a suite of them, to extract context from the document from all sorts of different perspectives: what are the entities in the document and how are they related? What are short, medium and long summaries of the document? Who/what/where questions, and so on. See the full list here.
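To give a flavour of the suite idea, here’s a paraphrased sketch reusing `ask_llm_for_qapairs` from the earlier snippet; the real prompt list is the one linked above:

```python
# Paraphrased sketch of the prompt suite: hit each chunk from several angles
# so the qapairs also cover titles, summaries, entities and so on.
# (In practice each prompt also pins down the JSON output format, as before.)
QA_PROMPT_SUITE = [
    "List the key entities in the context and how they are related, "
    "then write question/answer pairs about those relationships.",
    "Write short, medium and long summaries of the context, each phrased as "
    "the answer to a question like 'What is this document about?'.",
    "Write simple who/what/where/when question/answer pairs about the context, "
    "including questions about its title.",
]


def qapairs_from_all_angles(chunk: str) -> list[dict]:
    pairs = []
    for instruction in QA_PROMPT_SUITE:
        # ask_llm_for_qapairs is the helper from the earlier sketch
        pairs.extend(ask_llm_for_qapairs(instruction, chunk))
    return pairs
```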
It turns out that once we carefully constructed this suite of prompts, we were finally able to get the model to answer the basic question about “what are the docs going to do?” Phew! 😅
Our other insight was that by generating a content-addressed hash for each document, we could also teach the model about the IDs of the individual documents, along with IDs for groups of documents.
We can then map these IDs back onto the documents the model was fine-tuned on. For example: in this session the model is able to tell you that what it learned came from a given document, even linking the user back to that document. I also found this exchange quite hilarious:
Although maybe that says more about my sense of humour than anything else.
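If you’re wondering what those IDs look like, here’s a rough illustrative sketch (not Helix’s exact scheme) of content-addressed document IDs and how they can be baked into the qapairs:

```python
# Illustrative sketch of content-addressed document IDs: hash the document
# bytes, use a short prefix as a stable ID, and weave that ID into the
# generated qapairs so the fine-tuned model learns where each fact came from.
import hashlib


def doc_id(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()[:10]


def group_id(doc_ids: list[str]) -> str:
    """ID for a group of documents: a hash over the sorted member IDs."""
    return hashlib.sha256("".join(sorted(doc_ids)).encode()).hexdigest()[:10]


def tag_qapairs(pairs: list[dict], document: bytes) -> list[dict]:
    """Mention the document ID in every answer so the model can cite it later."""
    did = doc_id(document)
    return [
        {"question": p["question"],
         "answer": p["answer"] + f" (source: document {did})"}
        for p in pairs
    ]

# The app keeps a lookup table from IDs back to the original files, so a
# cited ID can be turned into a link for the user.
```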
We then added a system prompt telling the model to refer only to the specific document IDs it was trained on and not to refer to background knowledge. What do you know, it worked!
So far we’ve been adjusting the prompts and system for this LLM app based on “vibes”. That is, trying stuff, evaluating it by eye, and then changing it. Problem is, vibes don’t scale.
Work is ongoing on an end-to-end “evals” framework so we can automatically build up a library of good and bad sessions, and then every time we change the prompts, code, model and so on, re-run the fine-tuning across all the sessions in the library and grade the results. We might even use an LLM to grade them automatically 🙂
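As a rough sketch of where that’s heading (the function names and the grading prompt here are made up, not the final design):

```python
# Sketch of LLM-graded evals: re-run a library of saved sessions through the
# current pipeline and ask a judge model to score the new answers.
from openai import OpenAI

client = OpenAI()


def grade_answer(question: str, reference: str, candidate: str,
                 judge_model: str = "mistralai/Mistral-7B-Instruct-v0.1") -> int:
    """Ask the judge model for a 1-5 score; reply format kept deliberately simple."""
    prompt = (
        "Grade the candidate answer against the reference answer on a scale of 1-5. "
        "Reply with only the number.\n\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())


def run_eval_suite(sessions: list[dict], answer_fn) -> float:
    """sessions: [{"question": ..., "good_answer": ...}, ...];
    answer_fn runs the current prompts/model and returns an answer string."""
    scores = [grade_answer(s["question"], s["good_answer"], answer_fn(s["question"]))
              for s in sessions]
    return sum(scores) / len(scores)
```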
Please help us by clicking the new thumbs up and thumbs down buttons at the bottom of your sessions! We’ll use these as input to improve the product.
What about RAG? Oh believe me, we’ve talked about it. We’re sticking with fine-tuning for now, because:
- Fine-tuning can memorize far more information than you can fit in a single prompt
- By not needing to cram any custom knowledge into the prompt, you get much better latency
- Fine-tuning is better at copying style (we have qapair prompts planned for this)
- Fine-tuning is better at understanding a large corpus of background knowledge (a “domain”) and being able to draw on all of it when constructing an answer
- Fine-tuned models are easier to run at the edge, without needing the infrastructure of vector stores near the model
- You can use a much smaller fine-tuned model than a general-purpose model plus RAG. What could you do with a fine-tuned Phi-2 model that can run on any CPU?
- We made it work!!
Are we wrong? Come and roast us on Discord!
We now have an OpenAI-compatible API. For example, here I’m configuring Flowise to integrate with my private Helix deployment:
Just set:
- Model Name: mistralai/Mistral-7B-Instruct-v0.1
- Connect Credential: your API key from https://app.tryhelix.ai/account
- BasePath (under additional parameters): https://app.tryhelix.ai/v1
And that’s it! You’ll find your API calls to Helix show up as sessions in your account, so you get a free record of them 🙂
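If you’d rather skip Flowise and hit the API directly, the same three settings work from the standard OpenAI client; a quick sketch, where `HELIX_API_KEY` is just whatever environment variable you keep your key in:

```python
# Calling Helix through its OpenAI-compatible API with the standard client,
# using the same settings as the Flowise example above.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://app.tryhelix.ai/v1",
    api_key=os.environ["HELIX_API_KEY"],  # key from https://app.tryhelix.ai/account
)

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "What are the junior doctors going to do?"}],
)
print(resp.choices[0].message.content)
```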
We’ll automatically fine-tune the model once the qapairs are extracted, then email you when it’s finished. We hope this will encourage more people to dive back into the app once they’ve waited the 10-15 minutes it takes to train their own AI.
Thanks for reading! Soon I’ll blog more about our roadmap, on-prem use cases where we’re seeing significant commercial traction, and our dirty secret. Subscribe and stay in the loop 🙂