A 7B LLM Trained on 8K Input Sequence Length

TLDR
We trained a series of 7B LLMs named XGen-7B with standard dense attention on up to 8K sequence length for up to 1.5T tokens. We also fine-tuned the models on public-domain instructional data. The main takeaways are:
- On standard NLP benchmarks, XGen achieves comparable or better results when compared with state-of-the-art open-source LLMs (e.g. MPT, Falcon, LLaMA, Redpajama, OpenLLaMA) of similar model size.
- Our targeted evaluation on long sequence modeling benchmarks shows benefits of our 8K-seq models over 2K- and 4K-seq models.
- XGen-7B achieves equally strong results both in text (e.g., MMLU, QA) and code (HumanEval) tasks.
- Training cost of $150K on 1T tokens under Google Cloud pricing for TPU-v4.
Codebase: https://github.com/salesforce/xGen
Model Checkpoint: https://huggingface.co/Salesforce/xgen-7b-8k-base
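For reference, here is a minimal sketch of loading the released checkpoint with Hugging Face transformers; the exact arguments may differ slightly from the model card, and `trust_remote_code=True` is used because the XGen tokenizer is tiktoken-based.

```python
# Minimal loading sketch; see the Hugging Face model card for authoritative usage.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Salesforce/xgen-7b-8k-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)  # tiktoken-based tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

inputs = tokenizer("The world is", return_tensors="pt")
sample = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(sample[0]))
```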
Why XGen-7B with 8K Sequence Length
As LLMs become ubiquitous, their applications to long sequences have been a key focus, especially for applications like summarizing text (potentially interleaved with other data sources like tables and images), writing code, and predicting protein sequences, which require the model to effectively consider long-distance structural dependencies. A large context allows a pre-trained LLM to look at customer data (e.g., documents the LLM did not use in training) and respond to useful information-seeking queries.
Yet, most open-source LLMs (e.g., LLaMA, MPT, Falcon) have been trained with a maximum of 2K token sequence length, which is a key limitation in modeling long sequences. Inference-time solutions such as ALiBi have yet to be evaluated for larger models (e.g. MPT-7b-StoryWriter-65k+). Recent work on model scaling has shown that for a given compute budget, the best performance is not necessarily achieved by the largest models, but by smaller models trained on more data (measured by number of tokens). A smaller model is also generally preferred for inference efficiency during serving, including on-device serving. In light of this, we train a series of 7B LLMs named XGen with standard dense attention on up to 8K sequence length for up to 1.5T tokens. We also fine-tune the XGen models on public-domain instructional data, creating their instruction-tuned counterparts (XGen-7B-inst).
| Model | Description |
|---|---|
| XGen-7B-4K-base | We train for 800B tokens with a sequence length of 2K tokens first, then for another 400B tokens (total 1.2T tokens) with 4K. Released under Apache-2.0. |
| XGen-7B-8K-base | Initialized with XGen-7B-4K-base and further trained for 300B more tokens (total 1.5T tokens) with 8K sequence length. Released under Apache-2.0. |
| XGen-7B-{4K,8K}-inst | Supervised fine-tuned on public-domain instructional data including databricks-dolly-15k, oasst1, Baize and GPT-related datasets. Released for research purposes only. |
Pre-training Data
We employ a two-stage training strategy, where each stage uses a different data mixture.
First stage (1.37T tokens)
| Dataset name | Effective number of tokens (B) | Epochs | Sampling prop. (%) |
|---|---|---|---|
| RedPajama-CommonCrawl | 879.37 | 1 | 63.98 |
| RedPajama-GitHub | 62.44 | 1 | 4.54 |
| RedPajama-Books | 65.18 | 2.5 | 4.74 |
| RedPajama-ArXiv | 63.32 | 2 | 4.61 |
| RedPajama-StackExchange | 21.38 | 1 | 1.56 |
| C4 from 6 CC dumps | 191.5 | 0.2 | 13.93 |
| Wikipedia-English | 19.52 | 4 | 1.42 |
| Wikipedia, 21 other languages | 62.04 | 2 | 4.51 |
| Pile_DM_Mathematics | 7.68 | 2 | 0.56 |
| Apex code from 6 CC dumps | 2.09 | 1 | 0.15 |
| Total | 1374.52 | | 100 |
For C4, we processed 6 Common Crawl dumps with the C4 pipeline, and deduplicated the documents across different dumps by only keeping the latest timestamp for documents with the same URL. We trained a linear model that classifies the C4 data as Wikipedia-like documents vs. random documents, and then chose the top 20% Wikipedia-like documents. For Wikipedia, we cover 22 languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk, ja, zh, more than LLaMA (20 languages) and MPT (English only).
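As a small illustration of the cross-dump deduplication step, here is a hedged sketch; the record fields (`url`, `timestamp`, `text`) are assumptions for illustration, not the actual pipeline schema.

```python
# Hypothetical sketch: for documents sharing a URL across Common Crawl dumps,
# keep only the copy with the latest timestamp. Field names are illustrative.
from typing import Dict, Iterable, List


def dedup_by_url(docs: Iterable[Dict]) -> List[Dict]:
    latest: Dict[str, Dict] = {}
    for doc in docs:
        url = doc["url"]
        if url not in latest or doc["timestamp"] > latest[url]["timestamp"]:
            latest[url] = doc
    return list(latest.values())


docs = [
    {"url": "https://example.com/a", "timestamp": "2022-05", "text": "older copy"},
    {"url": "https://example.com/a", "timestamp": "2023-01", "text": "newer copy"},
    {"url": "https://example.com/b", "timestamp": "2022-08", "text": "unique doc"},
]
print(dedup_by_url(docs))  # keeps the 2023-01 copy of /a and the /b document
```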
Second stage (110B tokens)
To support code-generation tasks, in the second stage we mix additional code data from Starcoder with the data from Stage 1.
| Dataset name | Number of tokens used (B) | Sampling prop. (%) |
|---|---|---|
| Data mentioned above | 55 | 50% |
| BigCode Starcoder | 55 | 50% |
We use OpenAI's tiktoken to tokenize our data. We add additional tokens for consecutive whitespaces and tabs, as well as the special tokens described in the Starcoder paper.
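To illustrate how a tiktoken encoding can be extended with extra tokens, here is a hedged sketch; the base encoding (`gpt2`), the token strings, and their IDs are placeholders, not the actual XGen vocabulary.

```python
# Illustrative only: extend a tiktoken base encoding with additional special tokens
# (e.g., fill-in-the-middle markers of the kind described in the Starcoder paper).
import tiktoken

base = tiktoken.get_encoding("gpt2")  # placeholder base encoding

enc = tiktoken.Encoding(
    name="gpt2_extended",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<fim_prefix>": base.max_token_value + 1,
        "<fim_middle>": base.max_token_value + 2,
        "<fim_suffix>": base.max_token_value + 3,
    },
)

print(enc.encode("<fim_prefix>def foo():", allowed_special="all"))
```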
Training Details
The XGen-7b models are trained with our in-house library JaxFormer, which facilitates efficient training of LLMs under both data and model parallelism optimized for TPU-v4 hardware. The training recipe and model architecture follow LLaMA, while we conduct two additional explorations. First, we investigate the occurrence of so-called "loss spikes" [PaLM, loss spikes] during training, that is, the loss suddenly and temporarily explodes while the root cause for these spikes remains unknown. Second, the XGen models support sequence lengths of up to 8,192 tokens (rather than the common 2,048), for which we introduce stage-wise training.
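As a rough illustration of what combined data and model parallelism means in JAX terms, here is a minimal sketch using a 2D device mesh; the mesh shape, array sizes, and sharding choices are illustrative and not taken from JaxFormer.

```python
# Illustrative JAX sketch: shard the batch over a "data" axis and weight columns
# over a "model" axis of a 2D device mesh. Not JaxFormer code.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())
n = len(devices)
# On a TPU pod slice this could be e.g. devices.reshape(16, 4) for 16-way data x
# 4-way model parallelism; (n, 1) keeps the sketch runnable on any machine.
mesh = Mesh(devices.reshape(n, 1), axis_names=("data", "model"))

x = jax.device_put(jnp.ones((8 * n, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 2048)), NamedSharding(mesh, P(None, "model")))


@jax.jit
def matmul(x, w):
    return x @ w  # the compiler propagates a (data, model) output sharding


print(matmul(x, w).shape)
```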
Loss Spikes
As models are scaled to larger sizes, training itself becomes increasingly sensitive to instabilities, which cause poor model performance if not addressed carefully. In our exploration, we have gathered evidence for several factors that individually contribute to unstable training. These preliminary findings include "sequential over parallel circuits", "swish-GLU over GeLU", and "RMS-Norm over LayerNorm". Specifically, widely used parallel circuits, which parallelize the computation of self-attention and feed-forward as adopted in [GPT-J, PaLM, CodeGen], may affect the stability of training.

The figure above displays the loss in terms of cross-entropy over time, following the well-known scaling laws. Remarkably, the training does not suffer from any instabilities or loss spikes. The two loss spikes depicted in the figure are expected when extending the sequence length, say from 2k to 4k tokens, since the model needs to adapt to such longer sequences.
Sequence Length
Training with longer sequences is disproportionately costly because the complexity of self-attention is quadratic, that is, the training process is slow. To mitigate slow training, we introduce training in stages with increasing sequence length. First, 800B tokens are observed with a sequence length of 2k tokens, then 400B tokens with 4k, and finally 300B tokens with 8k length.
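Below is a generic illustration of this staged schedule; the packing scheme is an assumption for illustration, not XGen's actual data pipeline.

```python
# Generic sketch: pack a flat token stream into fixed-length training sequences,
# with the sequence length growing per stage as described above.
from typing import Iterator, List

# (stage name, token budget in billions, sequence length)
STAGES = [("stage-1a", 800, 2048), ("stage-1b", 400, 4096), ("stage-2", 300, 8192)]


def pack(token_stream: List[int], seq_len: int) -> Iterator[List[int]]:
    """Chop a flat token stream into non-overlapping fixed-length sequences."""
    for i in range(0, len(token_stream) - seq_len + 1, seq_len):
        yield token_stream[i : i + seq_len]


token_stream = list(range(100_000))  # stand-in for a tokenized document stream
for name, budget_b, seq_len in STAGES:
    n_seqs = sum(1 for _ in pack(token_stream, seq_len))
    print(f"{name}: {budget_b}B tokens at length {seq_len} -> {n_seqs} demo sequences")
```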

We verify the adaptation to longer sequences by computing the average perplexity at each token position on a held-out validation set containing documents of 8k sequence length or above. If the model successfully learns to utilize the full sequence, we would expect the perplexity to decrease with token position, as earlier tokens carry information for the next to-be-predicted token. That is, for a long passage, the more context in the form of previous words is available, the easier it becomes to guess the next word. The figure above indeed shows that XGen at each stage successfully learns to utilize longer contexts, up to 8k sequence length.
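A minimal sketch of this probe is shown below, assuming a Hugging Face causal LM such as the one loaded in the earlier sketch; the averaging step is left as a comment since it needs a tokenized validation batch.

```python
# Per-position negative log-likelihood; averaging over documents and exponentiating
# gives a perplexity-at-position curve like the one discussed above.
import torch
import torch.nn.functional as F


@torch.no_grad()
def per_position_nll(model, input_ids: torch.Tensor) -> torch.Tensor:
    logits = model(input_ids).logits[:, :-1]   # prediction for token t+1 given the prefix
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    return nll.view(targets.shape)             # shape: (batch, seq_len - 1)


# batch = tokenizer(long_documents, return_tensors="pt")["input_ids"]  # 8k+ docs
# ppl_per_position = torch.exp(per_position_nll(model, batch).mean(dim=0))
```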
Results on Standard Benchmarks
(i) MMLU
We first consider the Measuring Massive Multitask Language Understanding (MMLU) benchmark (see examples here), which is more recent than other benchmarks and thus arguably less susceptible to data contamination, as reported in recent studies (see page 32 of the GPT-4 paper and a related discussion here), and which has been used consistently as a held-out evaluation benchmark. Recently, however, inconsistencies in reporting MMLU scores have come to light, which resulted in incorrect rankings on Hugging Face's Open LLM leaderboard; in fact, Hugging Face later had to write a blog post to clarify this. In our work, we follow the original MMLU standard, which is consistent with the published results (i.e., in LLaMA).
MMLU 5-shot In-context Learning Results: We first show results on the original (and recommended) 5-shot evaluation setting, where the LLM is provided with 5 demonstrations. XGen achieves the best results in most categories, as well as in the weighted average.
| Models | Humanities | STEM | Social Sciences | Other | Weighted average |
|---|---|---|---|---|---|
| XGen-7b | 33.8 | 30.7 | 40.0 | 41.5 | 36.3 |
| LLaMA-7b | 33.9 | 30.6 | 38.2 | 38.2 | 35.1 |
| OpenLLaMA-7b | 28.1 | 28.5 | 31.2 | 32.8 | 29.9 |
| Falcon-7b | 26.5 | 25.4 | 29.2 | 26.8 | 26.9 |
| MPT-7b | 25.9 | 26.2 | 26.9 | 28.1 | 26.7 |
| Redpajama-7b | 26.1 | 25.2 | 27.4 | 26.7 | 26.3 |
| Cerebras-GPT-13b | 26.1 | 26.5 | 25.8 | 26.6 | 26.2 |
| Dolly-v2-12b | 26.9 | 25.7 | 25.3 | 26.5 | 26.2 |
| OPT-13b | 26.2 | 24.3 | 23.4 | 26 | 25.1 |
| GPT-J-6b | 25.9 | 24.0 | 24.0 | 25.8 | 25.1 |
MMLU 0-shot Results: On zero-shot MMLU, we similarly see good results, although the gap with LLaMA is generally smaller here.
| Models | Humanities | STEM | Social Sciences | Other | Weighted average |
|---|---|---|---|---|---|
| XGen-7b | 31.4 | 27.8 | 32.1 | 37.2 | 32.1 |
| LLaMA-7b | 32.3 | 27.1 | 31.3 | 36.8 | 32.0 |
| OpenLLaMA-7b | 28.0 | 27.6 | 28.9 | 30.1 | 28.6 |
| MPT-7b | 27.4 | 25.2 | 26.0 | 30.7 | 27.4 |
| Redpajama-7b | 27.5 | 25.5 | 24.2 | 25.0 | 25.8 |
| GPT-J-6b | 25.3 | 24.5 | 25.5 | 27.6 | 25.7 |
| Dolly-v2-12b | 26.2 | 26.0 | 24.0 | 24.9 | 25.4 |
| Cerebras-GPT-13b | 24.3 | 25.0 | 23.0 | 26.0 | 24.6 |
| OPT-13b | 26.3 | 23.3 | 23.6 | 23.6 | 24.4 |
| Falcon-7b | 24.8 | 21.7 | 24.0 | 24.4 | 23.9 |
(ii) General Zero-shot Results
Next, we report general zero-shot results on common NLP tasks that involve commonsense reasoning and QA.
| Models | MMLU-wavg | ARC_ch | HellaSwag | Winogrande | TruthfulQA | BoolQ | PiQA | OpenBookQA |
|---|---|---|---|---|---|---|---|---|
| XGen-7b | 32.1 | 41.2 | 74.2 | 64.9 | 39.1 | 74.3 | 75.5 | 40.2 |
| LLaMA-7b | 32.0 | 44.8 | 76.2 | 69.6 | 34 | 74.9 | 78.7 | 44.2 |
| Falcon-7b | 23.9 | 43.4 | 76.4 | 67.2 | 34.3 | 73.8 | 79.4 | 44.0 |
| MPT-7b | 27.4 | 41.7 | 76.1 | 68.6 | 33.4 | 74.1 | 79.1 | 41.8 |
| OpenLLaMA-7b | 28.6 | 38.7 | 71.8 | 67.0 | 35.2 | 70.6 | 76.0 | 39.0 |
| Redpajama-7b | 25.8 | 39.1 | 70.3 | 63.8 | 33.3 | 69.3 | 76.9 | 40.0 |
| GPT-neox-20b | 24.5 | 41.1 | 70.5 | 66.1 | 31.4 | 64.9 | 76.7 | 38.8 |
| OPT-13b | 24.4 | 35.8 | 69.9 | 64.7 | 33.9 | 65.0 | 75.7 | 39.8 |
| GPT-J-6b | 25.7 | 36.3 | 66.2 | 64.5 | 36.0 | 65.4 | 75.4 | 38.2 |
| Dolly-v2-12b | 25.4 | 39.6 | 70.8 | 61.8 | 34.4 | 56.3 | 75.4 | 39.2 |
| Cerebras-GPT-13b | 24.6 | 32.4 | 59.4 | 60.8 | 39.2 | 61.1 | 73.5 | 35.8 |
| StableLM-alpha-7b | 24.4 | 27.0 | 40.7 | 51.5 | 41.7 | 59.0 | 65.8 | 32.4 |
(iii) Results on Code Generation
To evaluate XGen's ability to generate code from natural language instructions (docstrings), we evaluate it on the well-known HumanEval benchmark. We set the sampling temperature to 0.2, p to 0.95 (for top-p sampling), and num_samples_per_task (n) to 200. We report the standard zero-shot results with the pass@1 metric.
| Models | pass@1 |
|---|---|
| XGen-7b | 14.20 |
| LLaMA-7b | 10.38 |
| OpenLLaMA-7b | 0 (consecutive whitespaces are treated as one, breaking Python syntax) |
| Falcon-7b | 0 (did not generate meaningful code) |
| MPT-7b | 15.90 |
| Redpajama-7b | 5.24 |
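For reference, pass@k is conventionally computed with the unbiased estimator from the HumanEval paper, averaged over tasks; for k=1 it reduces to the fraction of the n samples that pass the unit tests. A minimal sketch with illustrative numbers:

```python
# Unbiased pass@k estimator (HumanEval paper): 1 - C(n-c, k) / C(n, k),
# where n = samples per task and c = samples that pass the unit tests.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=200, c=28, k=1))  # 0.14 for a hypothetical task with 28/200 passing samples
```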
Results on Long Sequence Generation Tasks
To further evaluate our XGen-7b 8k model in comparison to baselines that are limited to 2k inputs, we turn to long-form dialogue generation, text summarization, and QA. All these tasks benefit from processing and understanding a long context to generate a correct response. Note that for these tasks most of the base pre-trained models failed to generate a plausible response because of the task difficulty. We thus use instruction-tuned models.
Dialogue
To assess long dialogue understanding and summarization capabilities, we report results on three dialogue summarization tasks: AMI meeting summarization, ForeverDreaming (FD), and TVMegaSite (TMS) screenplay summarization. The average source lengths of these datasets are roughly 5570, 6466, and 7653, respectively. We specifically evaluate samples that are less than 8K in length using various instruction-tuned models. Notably, when input truncation was not applied, both MPT-7b-inst and Alpaca-inst did not perform well in this setting. Our model (XGen-7B-inst) achieved the highest ROUGE scores across all metrics.
| Model | AMI R-1 | AMI R-2 | AMI R-L | FD R-1 | FD R-2 | FD R-L | TMS R-1 | TMS R-2 | TMS R-L |
|---|---|---|---|---|---|---|---|---|---|
| XGen-7b-inst | 31.34 | 8.25 | 17.00 | 29.34 | 5.39 | 16.43 | 26.39 | 3.94 | 13.71 |
| Falcon-7b-inst | 14.89 | 1.97 | 9.28 | 18.90 | 1.80 | 9.37 | 18.90 | 1.80 | 9.37 |
| MPT-7b-inst | 11.95 | 1.88 | 8.10 | 14.27 | 1.40 | 8.89 | 19.80 | 2.39 | 10.23 |
| Alpaca-7b-inst | 9.69 | 1.77 | 6.43 | 16.26 | 1.56 | 10.66 | 12.26 | 1.15 | 7.30 |
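The ROUGE-1/2/L scores above can be computed with any standard scorer; here is a minimal sketch using the rouge_score package (the reference and generated strings are placeholders):

```python
# Minimal ROUGE-1/2/L computation with the rouge_score package; strings are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the project kickoff meeting covered the budget and the timeline",   # reference summary
    "the kickoff meeting discussed the budget and a rough timeline",     # generated summary
)
for name, score in scores.items():
    print(name, round(score.fmeasure, 4))
```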
Long-form QA
Next, we evaluate our XGen-7b-inst on a long-form QA task that we have designed in-house. We ask ChatGPT to generate questions from (a) long Wikipedia documents spanning four domains: Physics, Engineering, History, and Entertainment, and (b) summaries of these documents. Then we query the LLMs to generate answers for these questions. The answers are typically up to 256 tokens long. We use GPT-4 to evaluate the answer quality in terms of coherence (structure and organization) and relevance (relevance of the generated answer to the question and the context document) on a scale of 0-3. From the results below, we see that our model has higher scores in various aspects compared to the baselines considered.
| Model | Coherence | Relevance | Avg. Rating |
|---|---|---|---|
| XGen-7b-inst | 2.55 | 2.52 | 2.54 |
| MPT-7b-inst | 2.5 | 2.45 | 2.48 |
| Alpaca-7b-inst | 1.65 | 1.91 | 1.78 |
| Falcon-7b-inst | 2.26 | 2.13 | 2.19 |
Summarization
Here, we evaluate our model on two text summarization datasets included in the SCROLLS benchmark, namely QMSum and GovReport. They cover two different domains: meeting conversations and government reports. Moreover, the QMSum data includes specific natural language queries that instruct the model about the key aspects of the source document that should be included in the summary. We see that our model XGen-7b outperforms other baselines on these tasks.
| Model | QMSum R-1 | QMSum R-2 | QMSum R-L | GovReport R-1 | GovReport R-2 | GovReport R-L |
|---|---|---|---|---|---|---|
| XGen-7b-inst | 27.96 | 5.66 | 24.26 | 21.28 | 8.19 | 20.08 |
| Falcon-7b-inst | 15.68 | 2.81 | 14.01 | 17.8 | 6.13 | 16.66 |
| MPT-7b-inst | 21.75 | 4.38 | 19.29 | 18.11 | 6.96 | 17.11 |
| Redpajama-7b-inst | 19.81 | 2.66 | 17.58 | 19.63 | 6.93 | 18.48 |
While we see encouraging results from our XGen-7b models on these long-sequence tasks, we would like to note that since these models are not trained on the same instructional data, they are not strictly comparable.
Note on Potential Risks
Finally, despite our effort in addressing the risks of bias, toxicity, and hallucinations both in the pre-training and fine-tuning stages, like other LLMs, XGen-7b models are not free from such limitations. We hope our open-sourced codebase will help other researchers better understand these challenges and improve on these key limitations, making AI useful for everyone.