Prompt Engineering | Lil'Log
Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLMs to steer their behavior toward desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post only focuses on prompt engineering for autoregressive language models, so nothing to do with Cloze tests, image generation or multimodal models. At its core, the goal of prompt engineering is about alignment and model steerability. Check my previous post on controllable text generation.
[My personal spicy take] In my opinion, some prompt engineering papers are not worth 8 pages, since those tricks can be explained in one or a few sentences and the rest is all about benchmarking. An easy-to-use and shared benchmark infrastructure would be more beneficial to the community. Iterative prompting or external tool use would not be trivial to set up; it is also non-trivial to align the whole research community to adopt it.
Zero-shot and few-shot learning are the two most basic approaches for prompting the model, pioneered by many LLM papers and commonly used for benchmarking LLM performance.
Zero-Shot#
Zero-shot learning is to simply feed the task text to the model and ask for results.
(All the sentiment analysis examples are from SST-2)
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Few-shot#
Few-shot learning presents a set of high-quality demonstrations, each consisting of both input and desired output, on the target task. As the model first sees good examples, it can better understand human intention and criteria for what kinds of answers are wanted. Therefore, few-shot learning often leads to better performance than zero-shot. However, it comes at the cost of more token consumption and may hit the context length limit when input and output text are long.
Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.
Sentiment: positive
Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative
Text: for the first time in years, de niro digs deep emotionally, perhaps because he has been stirred by the powerful work of his co-stars.
Sentiment: positive
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Many studies looked into how to construct in-context examples to maximize the performance and observed that the choice of prompt format, training examples, and the order of the examples can lead to dramatically different performance, from near random guess to near SoTA.
Zhao et al. (2021) investigated the case of few-shot classification and proposed that several biases with LLMs (they use GPT-3 in the experiments) contribute to such high variance: (1) Majority label bias exists if the distribution of labels among the examples is unbalanced; (2) Recency bias refers to the tendency where the model may repeat the label at the end; (3) Common token bias indicates that LLMs tend to produce common tokens more often than rare tokens. To overcome such bias, they proposed a method to calibrate the label probabilities output by the model to be uniform when the input string is N/A.
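As a rough illustration of this calibration idea, here is a minimal Python sketch. The `get_label_probs` helper is hypothetical (replace it with real LM scoring calls); the content-free input and the diagonal correction follow the "calibrate to uniform on N/A" recipe described above.
```python
import numpy as np

def get_label_probs(prompt: str, labels: list[str]) -> list[float]:
    """Hypothetical helper: probability the LM assigns to each label
    right after `prompt`. Replace with real LM scoring calls."""
    raise NotImplementedError

def make_calibrated_classifier(prompt_template: str, labels: list[str],
                               content_free_input: str = "N/A"):
    # 1. Estimate the model's bias using a content-free input.
    p_cf = np.array(get_label_probs(prompt_template.format(text=content_free_input), labels))
    p_cf /= p_cf.sum()
    # 2. Diagonal correction that maps the content-free input to a uniform distribution.
    W = np.diag(1.0 / p_cf)

    def predict(text: str) -> str:
        p = np.array(get_label_probs(prompt_template.format(text=text), labels))
        p_cal = W @ p                      # rescale raw label probabilities
        return labels[int(np.argmax(p_cal))]

    return predict
```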
Tips for Example Selection#
- Choose examples that are semantically similar to the test example using $k$-NN clustering in the embedding space (Liu et al., 2021); see the sketch after this list.
- To select a diverse and representative set of examples, Su et al. (2022) proposed to use a graph-based approach: (1) First, construct a directed graph $G=(V, E)$ based on the embedding (e.g. by SBERT or other embedding models) cosine similarity between samples, where each node points to its $k$ nearest neighbors; (2) Start with a set of selected samples $\mathcal{L}=\emptyset$ and a set of remaining samples $\mathcal{U}$. Each sample $u \in \mathcal{U}$ is scored by
$$
\text{score}(u) = \sum_{v \in \{v \mid (u, v) \in E, v\in \mathcal{U}\}} s(v) \quad\text{where } s(v)=\rho^{- \vert \{\ell \in \mathcal{L} \mid (v, \ell)\in E \}\vert},\quad \rho > 1
$$
such that $s(v)$ is low if many of $v$'s neighbors are already selected, and thus the scoring encourages picking diverse samples.
- Rubin et al. (2022) proposed to train embeddings via contrastive learning specific to one training dataset for in-context learning sample selection. Given each training pair $(x, y)$, the quality of one example $e_i$ (a formatted input-output pair) can be measured by a conditioned probability assigned by the LM: $\text{score}(e_i) = P_\text{LM}(y \mid e_i, x)$. We can identify other examples with top-$k$ and bottom-$k$ scores as positive and negative sets of candidates for every training pair and use that for contrastive learning.
- Some researchers tried Q-Learning to do sample selection. (Zhang et al. 2022)
- Motivated by uncertainty-based active learning, Diao et al. (2023) suggested identifying examples with high disagreement or entropy among multiple sampling trials, and then annotating those examples to be used in few-shot prompts.
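For the first tip above, here is a minimal sketch of $k$-NN example selection in embedding space. It assumes the sentence-transformers package; the model name and the `{"input": ..., "output": ...}` example format are illustrative choices, not part of the original method.
```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

def select_knn_examples(test_input: str, candidate_pool: list[dict], k: int = 4,
                        model_name: str = "all-MiniLM-L6-v2") -> list[dict]:
    """Pick the k candidate demonstrations whose inputs are closest to the
    test input in embedding space (cosine similarity)."""
    model = SentenceTransformer(model_name)
    emb = model.encode([test_input] + [ex["input"] for ex in candidate_pool])
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize
    sims = emb[1:] @ emb[0]                                  # similarity to test input
    top = np.argsort(-sims)[:k]
    return [candidate_pool[i] for i in top]
```
The selected (input, output) pairs are then formatted as few-shot demonstrations in front of the test input.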
Tips for Example Ordering#
- A general suggestion is to keep the selection of examples diverse, relevant to the test sample and in random order to avoid majority label bias and recency bias.
- Increasing model sizes or including more training examples does not reduce variance among different permutations of in-context examples. The same order may work well for one model but badly for another. When the validation set is limited, consider choosing the order such that the model does not produce extremely unbalanced predictions or become overconfident about its predictions. (Lu et al. 2022)
The goal of presenting few-shot examples in the prompt is to explain our intent to the model; in other words, describe the task instruction to the model in the form of demonstrations. However, few-shot can be expensive in terms of token usage and restricts the input length due to limited context length. So, why not just give the instruction directly?
Instructed LM (e.g. InstructGPT, natural instruction) finetunes a pretrained model with high-quality tuples of (task instruction, input, ground truth output) to make the LM better understand user intention and follow instructions. RLHF (Reinforcement Learning from Human Feedback) is a common method to do so. The benefit of instruction-following style fine-tuning is that it makes the model more aligned with human intention and greatly reduces the cost of communication.
When interacting with instruction models, we should describe the task requirement in detail, trying to be specific and precise, and avoiding saying "do not do something" but rather specifying what to do.
Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative".
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Explaining the desired audience is another smart way to give instructions.
- For example to produce educational materials for kids,
Describe what is quantum physics to a 6-year-old.
- And safe content,
... in language that is safe for work.
In-context instruction learning (Ye et al. 2023) combines few-shot learning with instruction prompting. It incorporates multiple demonstration examples across different tasks in the prompt, each demonstration consisting of instruction, task input and output. Note that their experiments were only on classification tasks and the instruction prompt contains all label options.
Definition: Determine the speaker of the dialogue, "agent" or "customer".
Input: I have successfully booked your tickets.
Output: agent
Definition: Determine which category the question asks for, "Quantity" or "Location".
Input: What's the oldest building in US?
Output: Location
Definition: Classify the sentiment of the given movie review, "positive" or "negative".
Input: i'll bet the video game is a lot more fun than the film.
Output:
Self-consistency sampling (Wang et al. 2022a) is to sample multiple outputs with temperature > 0 and then select the best one out of these candidates.
The criteria for selecting the best candidate can vary from task to task. A general solution is to pick the majority vote. For tasks that are easy to validate, such as a programming question with unit tests, we can simply run the candidates through the interpreter and verify correctness with the unit tests.
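A minimal sketch of the majority-vote flavor, assuming a hypothetical `sample_answer` function that makes one LM call with temperature > 0 and returns the parsed final answer:
```python
from collections import Counter

def sample_answer(question: str, temperature: float = 0.7) -> str:
    """Hypothetical helper: one LM call returning the parsed final answer."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 20) -> str:
    # Sample several candidate answers, then return the most frequent one.
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```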
Chain-of-thought (CoT) prompting (Wei et al. 2022) generates a sequence of short sentences to describe reasoning logic step by step, known as reasoning chains or rationales, to eventually lead to the final answer. The benefit of CoT is more pronounced for complicated reasoning tasks, while using large models (e.g. with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.
Types of CoT prompts#
Two main types of CoT prompting:
- Few-shot CoT. It is to prompt the model with a few demonstrations, each containing manually written (or model-generated) high-quality reasoning chains.
(All the math reasoning examples are from GSM8k)
Question: Tom and Elizabeth have a competition to climb a hill. Elizabeth takes 30 minutes to climb the hill. Tom takes four times as long as Elizabeth does to climb the hill. How many hours does it take Tom to climb up the hill?
Answer: It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill.
It takes Tom 120/60 = <<120/60=2>>2 hours to climb the hill.
So the answer is 2.
===
Question: Jack is a soccer player. He needs to buy two pairs of socks and a pair of soccer shoes. Each pair of socks cost $9.50, and the shoes cost $92. Jack has $40. How much more money does Jack need?
Answer: The total cost of two pairs of socks is $9.50 x 2 = $<<9.5*2=19>>19.
The total cost of the socks and the shoes is $19 + $92 = $<<19+92=111>>111.
Jack needs $111 - $40 = $<<111-40=71>>71 more.
So the answer is 71.
===
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer:
- Zero-shot CoT. Use a natural language statement like "Let's think step by step" to explicitly encourage the model to first generate reasoning chains, and then prompt with "Therefore, the answer is" to produce answers (Kojima et al. 2022). Or a similar statement, "Let's work this out in a step by step way to be sure we have the right answer" (Zhou et al. 2022).
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer: Let's think step by step.
Tips and Extensions#
- Self-consistency sampling can improve reasoning accuracy by sampling a number of diverse answers and then taking the majority vote. (Wang et al. 2022a)
- Another approach for ensemble learning is to alter the example order or use model-generated rationales to replace human-written ones, introducing randomness across multiple sample trials. Then aggregate model outputs with a majority vote to get the final answer. (Wang et al. 2022b)
- If training examples are only associated with true answers (easy to verify!) but no rationales, we can follow the STaR (Self-Taught Reasoner; Zelikman et al. 2022) method: (1) Ask the LLM to generate reasoning chains and only keep those leading to correct answers; (2) Then fine-tune the model with generated rationales and repeat the process until convergence. Note that higher temperature is more likely to generate incorrect rationales with correct answers. If training examples do not have ground truth answers, maybe consider using majority votes as the "correct" answers.
- Prompts with demonstrations of higher reasoning complexity can achieve better performance, where complexity is measured by the number of reasoning steps in the chains. When separating reasoning steps, the newline symbol "\n" works better than "step i", a period "." or a semicolon ";". (Fu et al. 2023)
- Complexity-based consistency is to explicitly prefer complex chains among all the generations by taking the majority vote among only the top $k$ most complex chains; see the sketch after this list. (Fu et al. 2023)
- Later, Shum et al. (2023) found that in their experiments CoT prompts with only complex examples can improve the accuracy of complex questions, but perform poorly on simple questions; evidence shown on GSM8k.
- Changing "Q:" to "Question:" is found to be helpful. (Fu et al. 2023)
- Ye & Durrett (2022) found that the benefit of including explanations in the prompt is small to moderate for NLP tasks that involve reasoning over text (i.e. QA and NLI) and the effects vary by model. They observed that explanations are more likely to be nonfactual than inconsistent (i.e. whether an explanation entails the prediction). Nonfactual explanations are most likely to lead to incorrect predictions.
- Self-Ask (Press et al. 2022) is a method to repeatedly prompt the model to ask follow-up questions to construct the thought process iteratively. Follow-up questions can be answered by search engine results. Similarly, IRCoT (Interleaving Retrieval CoT; Trivedi et al. 2022) and ReAct (Reason + Act; Yao et al. 2023) combine iterative CoT prompting with queries to Wikipedia APIs to search for relevant entities and content and then add it back into the context.
(Image source: Press et al. 2022)
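As a rough sketch of the complexity-based consistency tip above: rank sampled reasoning chains by their number of newline-separated steps and vote only among the most complex ones. The `parse_final_answer` helper below is a placeholder, not from the paper.
```python
from collections import Counter

def parse_final_answer(chain: str) -> str:
    """Hypothetical helper: extract the final answer from a reasoning chain."""
    raise NotImplementedError

def complexity_based_vote(chains: list[str], top_k: int = 10) -> str:
    # Approximate complexity by the number of newline-separated reasoning steps.
    ranked = sorted(chains, key=lambda c: c.count("\n"), reverse=True)
    # Majority vote only among the top-k most complex chains.
    answers = [parse_final_answer(c) for c in ranked[:top_k]]
    return Counter(answers).most_common(1)[0][0]
```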
A prompt is a sequence of prefix tokens that increase the probability of getting the desired output given the input. Therefore we can treat them as trainable parameters and optimize them directly in the embedding space via gradient descent, such as AutoPrompt (Shin et al., 2020), Prefix-Tuning (Li & Liang, 2021), P-tuning (Liu et al. 2021) and Prompt-Tuning (Lester et al. 2021). This section in my "Controllable Neural Text Generation" post has a good coverage of them. The trend from AutoPrompt to Prompt-Tuning is that the setup gets gradually simplified.
APE (Automatic Prompt Engineer; Zhou et al. 2022) is a method to search over a pool of model-generated instruction candidates and then filter the candidate set according to a chosen score function to ultimately choose the best candidate with the highest score; a sketch of this loop follows the steps below.
- Prompt the LLM to generate instruction candidates based on a small set of demonstrations in the form of input-output pairs, e.g. "{{Given desired input-output pairs}}\n\nThe instruction is".
- Given a dataset of $\mathcal{D}_\text{train} = \{(x, y)\}$, we would like to find an instruction $\rho$ such that $\rho^* = \arg\max_\rho \mathbb{E}_{(x, y) \in \mathcal{D}_\text{train}} [f(\rho, x, y)]$, where $f(.)$ is a per-sample score function, such as execution accuracy $\mathbb{1}[\text{LM}(.\vert \rho, x)=y]$ or log probability $p_\text{LM}(y \mid \rho, x)$.
- Use an iterative Monte Carlo search method to improve the best candidates by proposing semantically similar variants via prompts like "Generate a variation of the following instruction while keeping the semantic meaning.\n\nInput: ...\n\nOutput:...".
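Here is a minimal sketch of this generate-then-score loop (not the official APE implementation). `llm` is assumed to be a text-completion callable and `score` a per-sample score function like the execution accuracy above.
```python
def ape_search(llm, score, train_pairs, n_candidates: int = 50, keep_top: int = 5):
    """Propose instruction candidates from demonstrations, score them on the
    training pairs, and return the top candidates."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in train_pairs[:5])
    gen_prompt = f"{demos}\n\nThe instruction was"
    # 1. Propose candidates by sampling the LLM several times.
    candidates = {llm(gen_prompt, temperature=0.9) for _ in range(n_candidates)}
    # 2. Score each candidate instruction over the training pairs.
    ranked = sorted(candidates,
                    key=lambda inst: sum(score(inst, x, y) for x, y in train_pairs),
                    reverse=True)
    # 3. Optionally, ask the LLM for paraphrased variants of the best candidates
    #    and rescore them, mirroring the iterative Monte Carlo refinement step.
    return ranked[:keep_top]
```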
To construct chain-of-thought prompts automatically, Shum et al. (2023) suggested augment-prune-select, a three-step process:
- Augment: Generate multiple pseudo-chains of thought given a question using few-shot or zero-shot CoT prompts;
- Prune: Prune pseudo chains based on whether generated answers match ground truths.
- Select: Apply a variance-reduced policy gradient strategy to learn the probability distribution over selected examples, while considering the probability distribution over examples as the policy and the validation set accuracy as the reward.
Zhang et al. (2023) instead adopted clustering techniques to sample questions and then generate chains. They observed that LLMs tend to make certain types of mistakes. One type of mistakes can be similar in the embedding space and thus get grouped together. By only sampling one or a few from frequent-error clusters, we can prevent too many wrong demonstrations of one mistake type and collect a diverse set of examples (see the sketch after the steps below).
- Question clustering: Embed questions and run $k$-means for clustering.
- Demonstration selection: Select a set of representative questions from each cluster; i.e. one demonstration from one cluster. Samples in each cluster are sorted by distance to the cluster centroid and those closer to the centroid are selected first.
- Rationale generation: Use zero-shot CoT to generate reasoning chains for the selected questions and construct few-shot prompts to run inference.
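A minimal sketch of the question-clustering and demonstration-selection steps, assuming scikit-learn and sentence-transformers are available (the embedding model name is an arbitrary choice):
```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # assumed available

def pick_representative_questions(questions: list[str], n_clusters: int = 8,
                                  model_name: str = "all-MiniLM-L6-v2") -> list[str]:
    """Cluster questions in embedding space and pick the question closest to
    each centroid, so every demonstration comes from a different cluster."""
    emb = SentenceTransformer(model_name).encode(questions)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(emb)
    reps = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        reps.append(questions[idx[np.argmin(dists)]])   # closest to the centroid
    return reps
```
Each representative question is then answered with zero-shot CoT and the resulting (question, rationale, answer) triples form the few-shot prompt.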
A survey on augmented language models by Mialon et al. (2023) has great coverage over multiple categories of language models augmented with reasoning skills and the ability to use external tools. Recommend it.
Retrieval#
Often we need to complete tasks that require the latest knowledge, after the model pretraining time cutoff, or knowledge from an internal/private knowledge base. In that case, the model would not know the context if we do not explicitly provide it in the prompt. Many methods for Open Domain Question Answering depend on first doing retrieval over a knowledge base and then incorporating the retrieved content as part of the prompt. The accuracy of such a process depends on the quality of both the retrieval and generation steps.
Lazaridou et al. (2022) studied how to use Google Search for document retrieval to augment LLMs. Given a question $q$, clean text is extracted out of 20 URLs returned by Google, resulting in a set of documents. Because these documents are long, each document is split into paragraphs of 6 sentences, $\{p\}$. Paragraphs are ranked by TF-IDF based cosine similarity between evidence paragraphs and the query. Only the most relevant paragraph is used in the prompt to produce an answer $a$.
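A minimal sketch of the TF-IDF reranking step with scikit-learn (splitting documents into 6-sentence paragraphs is assumed to happen upstream):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_paragraphs(question: str, paragraphs: list[str]) -> list[str]:
    """Rank evidence paragraphs by TF-IDF cosine similarity to the question."""
    vec = TfidfVectorizer().fit(paragraphs + [question])   # shared vocabulary
    P = vec.transform(paragraphs)
    q = vec.transform([question])
    sims = cosine_similarity(q, P).ravel()
    return [paragraphs[i] for i in sims.argsort()[::-1]]
```
The top-ranked paragraph then goes into the few-shot prompt as the "Evidence:" field below.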
For closed-book QA, each demonstration is formatted as follows to construct few-shot prompts. Swapping the question with the evidence (longer distance between questions and answers) is found to consistently yield lower results across all datasets.
Evidence: ...
Question: ...
Answer: ...
The answer probability is computed in three ways:
- RAG style, $p(a_i \mid q) = \sum_{i=1}^n p_\text{tf-idf} (p_i \mid q) \cdot p_\text{LM}(a_i \mid q, p_i)$, where $p_\text{tf-idf} (p_i \mid q)$ is the normalized cosine similarity between the TF-IDF representations of the passage and the question.
- Noisy channel inference, $p(a_i\mid q) = \frac{p_\text{LM}(q \mid a_i, p_i) \cdot p_\text{LM}(a_i \mid p_i)}{p_\text{LM}(q \mid p_i)}$
- Product-of-Experts (PoE), which combines all probabilities used above in addition to $p_\text{LM}(p_i \mid q)$.
According to their experiments on generation and classification tasks, among the three answer reranking scores, PoE > Noisy channel > RAG. Among individual probabilities, $p_\text{LM}(a \mid q, p_i)$ and $p_\text{LM}(q \mid p_i, a)$ are found to be most informative. $p_\text{LM}(q \mid p_i, a)$ captures how well the question can be explained by the LM given the evidence paragraph and the answer, and can reliably be used for reranking answer candidates.
One observation with the SituatedQA dataset for questions grounded in different dates is that despite the LM (pretraining cutoff is year 2020) having access to the latest information via Google Search, its performance on post-2020 questions is still a lot worse than on pre-2020 questions. This suggests the existence of some discrepancies or conflicts between the contextual information and the model's internal parametric knowledge.
Interestingly, it is found to be helpful even to do only "internal retrieval", that is, to generate knowledge about a topic before answering the question (Liu et al. 2022). First we can use the following template to extract knowledge:
Generate some knowledge about the input. Examples:
Input: What type of water formation is formed by clouds?
Knowledge: Clouds are made of water vapor.
Input: {question}
Knowledge:
And then, with model-generated knowledge, prompt the LM further to get the answer.
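Putting the two stages together, a minimal sketch might look like the following, where `llm` stands in for any completion API (an assumption, not a specific library):
```python
KNOWLEDGE_PROMPT = (
    "Generate some knowledge about the input. Examples:\n\n"
    "Input: What type of water formation is formed by clouds?\n"
    "Knowledge: Clouds are made of water vapor.\n\n"
    "Input: {question}\nKnowledge:"
)

def answer_with_generated_knowledge(llm, question: str, n_knowledge: int = 3) -> str:
    # Stage 1: sample a few knowledge statements about the question.
    knowledge = [llm(KNOWLEDGE_PROMPT.format(question=question), temperature=0.7)
                 for _ in range(n_knowledge)]
    # Stage 2: condition the final answer on the generated knowledge.
    context = "\n".join(knowledge)
    return llm(f"{context}\n\nQuestion: {question}\nAnswer:", temperature=0.0)
```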
Programming Language#
Both PAL (Program-aided language models; Gao et al. 2022) and PoT (Program of Thoughts prompting; Chen et al. 2022) ask the LLM to generate programming language statements to resolve natural language reasoning problems, hence offloading the solution step to a runtime such as a Python interpreter. Such a setup decouples complex computation from reasoning. It relies on a LM with good enough coding skills.
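A rough sketch of this pattern (not the PAL or PoT implementation): the prompt format and the convention of storing the result in an `answer` variable are assumptions, and model-written code should only ever be executed in a sandbox.
```python
PAL_PROMPT = (
    "Write Python code that solves the problem and stores the final result "
    "in a variable named `answer`. Return only the code.\n\n"
    "Problem: {question}\n\nCode:\n"
)

def solve_with_program(llm, question: str):
    code = llm(PAL_PROMPT.format(question=question))
    scope: dict = {}
    # Offload the arithmetic/logic to the Python runtime instead of the LM.
    exec(code, scope)        # NOTE: run model-written code only in a sandbox
    return scope.get("answer")
```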
External APIs#
TALM (Tool Augmented Language Models; Parisi et al. 2022) is a language model augmented with text-to-text API calls. The LM is guided to generate a "|tool-call" marker and the tool input text conditioned on the task input text to construct API call requests. When "|result" shows up, the specified tool API is called and the returned result gets appended to the text sequence. The final output is generated following the "|output" token.
TALM adopts a self-play approach to iteratively bootstrap a dataset of tool-use examples and finetune the LM with it. This iterative self-play pipeline mimics a RL process where the LM is the policy network and it is trained by policy gradient with a binary reward signal.
(Image source: Parisi et al. 2022)
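A rough sketch of decoding with such delimiters: the tool registry, the `stop` argument on the `llm` callable, and the single-call flow are simplifying assumptions rather than TALM's actual implementation.
```python
TOOLS = {"weather": lambda q: "it is sunny"}   # placeholder tool registry

def talm_generate(llm, task_input: str) -> str:
    # Step 1: the LM drafts text until it requests a tool (generation stops at "|result").
    draft = llm(task_input, stop="|result")
    if "|tool-call" in draft:
        call = draft.split("|tool-call", 1)[1].strip()
        tool_name, _, tool_input = call.partition(" ")
        result = TOOLS.get(tool_name, lambda q: "")(tool_input)
        # Step 2: append the tool result and let the LM continue to the "|output" text.
        draft = f"{draft} |result {result} " + llm(f"{draft} |result {result}")
    return draft.split("|output")[-1].strip()
```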
Toolformer (Schick et al. 2023) is a LM that can use external tools via simple APIs, which is built in a self-supervised manner and only requires a handful of demonstrations for each API. The toolbox of Toolformer includes:
- Calculator to help the LM with its lack of precise math skills;
- Q&A system to help with unfaithful content and hallucination;
- Search engine to provide up-to-date information after the pretraining cutoff time;
- Translation system to improve performance on low-resource languages;
- Calendar to make the LM aware of time progression.
(Image source: Schick et al. 2023)
Toolformer is trained as follows:
- Prompting to annotate potential API calls. Ask a pre-trained LM to annotate a dataset via few-shot learning with API call usage examples. Formatting example:
Fig. 6. How the dataset is annotated with API calls. (Image source: Schick et al. 2023)
- Each API call is represented as a tuple of (API name, corresponding input), $c=(a_c, i_c)$, and its corresponding result is denoted as $r$. The API call sequences with and without results are labeled as follows, respectively:
$$
\begin{aligned}
e(c) &= \langle\texttt{API}\rangle\, a_c(i_c)\, \langle\texttt{/API}\rangle \\
e(c, r) &= \langle\texttt{API}\rangle\, a_c(i_c) \to r\, \langle\texttt{/API}\rangle
\end{aligned}
$$
- Sample API calls based on the probabilities $p_\text{LM}(\langle\texttt{API}\rangle \mid \text{prompt}(\mathbf{x}), \mathbf{x}_{1:i})$ and select the top $k$ candidate positions for doing API calls at position $i$ if the probability is larger than a threshold.
- Then we sample potential API calls from the LM given the sequence $[\text{prompt}(\mathbf{x}), x_1, \dots, x_{i-1}, \langle\texttt{API}\rangle]$ as prefix and $\langle\texttt{/API}\rangle$ as suffix.
- Filter annotations based on whether API calls help the model predict future tokens. Use a self-supervised loss to decide which API calls are actually helpful.
- Execute each API call $c_i$ to get the corresponding result $r_i$.
- Compute a weighted cross entropy loss for the LM over tokens $x_i, \dots, x_n$ when the model is prefixed with the prompt. Two versions are computed, one with the API result and the other with an empty sequence $\varepsilon$:
$$
\begin{aligned}
L^+_i &= L_i(e(c_i, r_i)) \\
L^-_i &= \min(L_i(\varepsilon), L_i(e(c_i, \varepsilon)))
\end{aligned}
$$
Only API calls with $L^-_i - L^+_i$ larger than a threshold are kept, meaning that adding this API call and its result helps the model predict future tokens; see the sketch after this list.
- Fine-tune the LM on this annotated dataset. The new training sequences are constructed as $\mathbf{x}^* = x_{1:i-1}, e(c_i, r_i), x_{i:n}$. The training data is a combination of the original dataset (e.g. a subset of CCNet, as in the paper) and its augmented version.
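A minimal sketch of the filtering criterion above, assuming a hypothetical `weighted_ce_loss(prefix, continuation)` helper that returns the LM's weighted cross-entropy loss over the continuation tokens:
```python
def weighted_ce_loss(prefix: str, continuation: str) -> float:
    """Hypothetical helper: weighted cross-entropy of `continuation` given `prefix`."""
    raise NotImplementedError

def keep_api_call(x_before: str, x_after: str, call_with_result: str,
                  call_without_result: str, tau: float = 1.0) -> bool:
    # L+ : loss on future tokens when the API call *and* its result are prefixed.
    l_plus = weighted_ce_loss(x_before + call_with_result, x_after)
    # L- : best loss without the result (no call at all, or the call without its result).
    l_minus = min(weighted_ce_loss(x_before, x_after),
                  weighted_ce_loss(x_before + call_without_result, x_after))
    # Keep the annotation only if it improves future-token prediction by at least tau.
    return (l_minus - l_plus) >= tau
```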
At inference time, decoding runs until the model produces the "$\to$" token, indicating that it is expecting the response from an API call next.
Toolformer currently does not support tool use in a chain (i.e. using the output of one tool as an input for another tool) or in an interactive way (i.e. adopting an API response after human selection). Both are interesting future directions to expand the model for.
Cited as:
Weng, Lilian. (Mar 2023). Prompt Engineering. Lil'Log. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/.
Or
@article{weng2023prompt,
title = "Prompt Engineering",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2023",
month = "Mar",
url = "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/"
}
[1] Zhao et al. “Calibrate Before Use: Improving Few-shot Performance of Language Models.” ICML 2021
[2] Liu et al. “What Makes Good In-Context Examples for GPT-3?” arXiv preprint arXiv:2101.06804 (2021).
[3] Lu et al. “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity.” ACL 2022
[4] Ye et al. “In-Context Instruction Learning.” arXiv preprint arXiv:2302.14691 (2023).
[5] Su et al. “Selective annotation makes language models better few-shot learners.” arXiv preprint arXiv:2209.01975 (2022).
[6] Rubin et al. “Learning to retrieve prompts for in-context learning.” NAACL-HLT 2022
[7] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022
[8] Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023.
[9] Diao et al. “Active Prompting with Chain-of-Thought for Large Language Models.” arXiv preprint arXiv:2302.12246 (2023).
[10] Zelikman et al. “STaR: Bootstrapping Reasoning With Reasoning.” arXiv preprint arXiv:2203.14465 (2022).
[11] Ye & Durrett. “The unreliability of explanations in few-shot in-context learning.” arXiv preprint arXiv:2205.03401 (2022).
[12] Trivedi et al. “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions.” arXiv preprint arXiv:2212.10509 (2022).
[13] Press et al. “Measuring and narrowing the compositionality gap in language models.” arXiv preprint arXiv:2210.03350 (2022).
[14] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023.
[15] Fu et al. “Complexity-based prompting for multi-step reasoning.” arXiv preprint arXiv:2210.00720 (2022).
[16] Wang et al. “Rationale-augmented ensembles in language models.” arXiv preprint arXiv:2207.00747 (2022).
[17] Zhang et al. “Automatic chain of thought prompting in large language models.” arXiv preprint arXiv:2210.03493 (2022).
[18] Shum et al. “Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data.” arXiv preprint arXiv:2302.12822 (2023).
[19] Zhou et al. “Large Language Models Are Human-Level Prompt Engineers.” ICLR 2023.
[20] Lazaridou et al. “Internet augmented language models through few-shot prompting for open-domain question answering.” arXiv preprint arXiv:2203.05115 (2022).
[21] Chen et al. “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.” arXiv preprint arXiv:2211.12588 (2022).
[22] Gao et al. “PAL: Program-aided language models.” arXiv preprint arXiv:2211.10435 (2022).
[23] Parisi et al. “TALM: Tool Augmented Language Models” arXiv preprint arXiv:2205.12255 (2022).
[24] Schick et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv preprint arXiv:2302.04761 (2023).
[25] Mialon et al. “Augmented Language Models: a Survey” arXiv preprint arXiv:2302.07842 (2023).