
Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages

2024-01-28 22:55:57

An eagle, flying past a transformer-looking robot

Eagle 7B is a 7.52B parameter model.

We are releasing RWKV-v5 Eagle 7B, licensed under the Apache 2.0 license via the Linux Foundation, and it can be used personally or commercially without restrictions.
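For readers who want to try the released weights, here is a minimal loading sketch using the Hugging Face transformers library; the repository name and flags are assumptions for illustration, so check the official release page for the exact model ID and recommended setup.

```python
# Minimal sketch: loading RWKV-v5 Eagle 7B through Hugging Face transformers.
# The repo name "RWKV/v5-Eagle-7B-HF" and the trust_remote_code flag are
# assumptions for illustration, not official instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/v5-Eagle-7B-HF"  # assumed Hugging Face repository name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "The eagle soared past the transformer and"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```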

We measured multi-lingual performance across the following benchmarks: xLAMBDA, xStoryCloze, xWinograd, xCopa

Across a total of 23 languages
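For reference, here is a sketch of how such multi-lingual benchmarks can be run with EleutherAI's lm-evaluation-harness; the task names, model ID, and arguments are assumptions for illustration and may differ from the team's actual evaluation setup.

```python
# Sketch: running multi-lingual benchmarks with EleutherAI's lm-evaluation-harness.
# Task names and the model repository are assumed for illustration; the team's
# actual evaluation configuration may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=RWKV/v5-Eagle-7B-HF,trust_remote_code=True",
    tasks=["xstorycloze", "xwinograd", "xcopa"],  # assumed harness task groups
    batch_size=8,
)
print(results["results"])
```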

Most of these benchmarks cover common-sense reasoning in their respective languages, and show a huge overall jump in multi-lingual performance from the RWKV v4 to v5 architecture, and from the v2 World dataset.

It should also be noted that there is a lack of multi-lingual benchmarks, as the above covers roughly only the top 23 languages.

This makes it hard to directly evaluate the model's language performance on the remaining 75+ of the 100+ trained languages, a shortcoming we hope to improve on in future models.

English performance was measured across 12 separate benchmarks, covering commonsense reasoning and world knowledge.

Once again, we see a huge overall jump from the RWKV v4 to v5 architecture, and from the v2 World dataset.

Where v4 previously lost out to MPT-7b, the top model in the 1T token tier, v5 starts trading blows in benchmarks, in some cases (LAMBADA, StoryCloze16, WinoGrande, HeadQA_en, Sciq) even approaching the top, over Falcon and even LLaMA2.

In addition, v5's performance starts to fall in line with the expected transformer performance level for its approximate training token count.

Mistral-7B maintains its lead with its rumored 2~7 trillion tokens of training.

We expect to narrow the gap as we train on an additional 1T tokens, crossing the LLaMA2 line and hopefully reaching the Mistral line.

Alternatively, as a base model that is only lightly tuned (with a really small instruct set mixed in), we are eager to see how the various community and instruct-tuned variants perform.

A notable observation was that our checkpoints near the 300 billion token mark show performance similar to pythia-6.9b.

This is consistent with previous Pile-based experiments on our RWKV-v4 architecture: linear transformers like RWKV scale similarly in performance to transformers trained on the same token count.

If so, it does beg the question: does the exact architecture matter less than the data for model eval performance?

CUDA computational cost for the RWKV-based architecture vs transformer models: that quadratic-vs-linear difference really scales!
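As a back-of-envelope illustration of why that matters (with placeholder constants, not measured CUDA numbers): full self-attention does work proportional to the square of the context length, while an RWKV-style linear/recurrent formulation does a constant amount of work per token.

```python
# Back-of-envelope sketch of quadratic vs linear scaling with context length.
# The hidden size and cost formulas are illustrative placeholders, not
# measured CUDA costs for any specific model.

def attention_cost(tokens: int, hidden: int = 4096) -> float:
    """Rough self-attention cost: every token attends to every other token."""
    return tokens * tokens * hidden  # O(T^2 * d)

def rwkv_style_cost(tokens: int, hidden: int = 4096) -> float:
    """Rough recurrent / linear-attention cost: constant work per token."""
    return tokens * hidden  # O(T * d)

for t in (1_024, 8_192, 65_536):
    ratio = attention_cost(t) / rwkv_style_cost(t)
    print(f"context={t:>6}: quadratic/linear cost ratio ~ {ratio:,.0f}x")
# The ratio grows with the context length itself: ~1,024x at 1k tokens,
# ~65,536x at 64k tokens.
```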

If true, perhaps we should seek out more efficient and scalable architectures, to increase accessibility, drive the cost of AI downwards for everyone, and lessen the impact on our environment.

A common piece of feedback we receive on the RWKV multi-lingual approach is:

  • it hurts our English evaluation scores and slows the growth of linear transformers

  • that it’s not fair to compare the multi-lingual performance of a multi-lingual model vs. a purely English model

And for the most part, we agree on both points.

But we have no plans to change this, as we are building AI for the world, and the world is not just an English-speaking world.

In 2023, only 17% of the world's population speaks English (1.3 billion people).

World map showing the distribution of regions and people who are fluent in English (source: Wikipedia)

However, by ensuring support for the top 25 languages in the world and beyond, we can cover approximately 4 billion people, or 50% of the world.


Incomplete map of regions where the Eagle language model provides full or partial language support

This aligns well with the team's common goal of having AI support everyone, not just by allowing it to run cheaply and affordably even on lower-end hardware, but by supporting their language.

A major example of this in our community is the Indonesian-NLP Discord group, which fine-tunes an Indonesian-language model from the RWKV line of base models.

This allows them to build strong language-specific models on an affordable basis (i.e. a single node), without needing to do half a million dollars' worth of pre-training.

This launch marks the release of the strongest linear transformer (in terms of eval benchmarks) to date.

While it may not have succeeded in passing LLaMA2 and Mistral, it provides strong evidence of the following:

  • The RWKV-v5 model architecture scales similarly to transformers in performance at the same training token count

  • You can achieve a near LLaMA2-like level of performance with a significantly lower inference cost

  • While supporting multi-lingual levels of performance

We plan to follow up by pushing further ahead with:

  • [Feb 2024] An updated RWKV v5: Eagle paper, where we will go deeper in-depth into the architecture changes since v4, and the model benchmarks and evals

  • [Feb 2024] An additional 1T tokens of training (2T total), for direct comparisons with the LLaMA2 7B model

  • [Mar 2024] An MoE model based on the v5 Eagle 2T model

  • [Mar 2024] RWKV-v6: “Finch” 1.5B and 3B world models

Disclaimer: All dates are approximate and heavily subject to compute availability from our sponsors/providers.

We are grateful and would like to thank the following key groups:

Along with the various developers working on the growing collection of RWKV-related projects.

