Toy Models of Superposition


It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

In this paper, we use toy models – small ReLU networks trained on synthetic data with sparse input features – to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.

Consider a toy model where we train an embedding of five features of varying importance (where "importance" is a scalar multiplier on the mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the two most important features (similar to what Principal Component Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:

Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that the models we train can be thought of as doing "the same thing as" an imagined much-larger model, representing the exact same features but with no interference.

Feature superposition isn't a novel idea. A number of earlier interpretability papers have considered it, and it's very closely related to the long-studied topic of compressed sensing in mathematics, as well as the ideas of distributed, dense, and population codes in neuroscience and deep learning. What, then, is the contribution of this paper?

For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a phase diagram for superposition. This explains why neurons are sometimes "monosemantic", responding to a single feature, and sometimes "polysemantic", responding to many unrelated features. We also discover that, at least in our toy model, superposition exhibits complex geometric structure.

But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly interpretable neurons in larger models, but we have found these toy models to be surprisingly interesting in their own right.

Key Results From Our Toy Models

In our toy models, we are able to demonstrate that:

  • Superposition is a real, observed phenomenon.
  • Both monosemantic and polysemantic neurons can form.
  • At least some kinds of computation can be performed in superposition.
  • Whether features are stored in superposition is governed by a phase change.
  • Superposition organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons.

Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear what to generalize to real networks.


In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.

Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and by theoretical arguments. One might think of this as two separate properties, which we'll explore in more detail shortly:

  • Decomposability: Network representations can be described in terms of independently understandable features.
  • Linearity: Features are represented by direction.

If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.

Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons). Why is it that we sometimes get this extremely helpful property, but in other cases don't? We hypothesize that there are really two countervailing forces driving this:

  • Privileged Basis: Only some representations have a privileged basis which encourages features to align with basis directions (i.e. to correspond to neurons).
  • Superposition: Linear representations can represent more features than dimensions, using a strategy we call superposition. This can be seen as neural networks simulating larger networks. It pushes features away from corresponding to neurons.

Superposition has been hypothesized in previous work, and in some cases, assuming something like superposition has been shown to help find interpretable structure. However, we're not aware of feature superposition having been unambiguously demonstrated to occur in neural networks before (although prior work demonstrates a closely related phenomenon of model superposition). The goal of this paper is to change that, demonstrating superposition and exploring how it interacts with privileged bases. If superposition occurs in networks, it deeply influences which approaches to interpretability research make sense, so an unambiguous demonstration seems important.

The goal of this section is to motivate these ideas and unpack them in detail.

It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc.), compressed sensing, and many other lines of work. This section focuses on articulating our perspective on the problem. We discuss these other lines of work in detail in Related Work.

Empirical Phenomena

When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:

  • Word Embeddings – A famous result by Mikolov et al. found that word embeddings appear to have directions which correspond to semantic properties, allowing for embedding arithmetic such as V("king") - V("man") + V("woman") = V("queen") (but see ).
  • Latent Spaces – Similar "vector arithmetic" and interpretable-direction results have also been found for generative adversarial networks (e.g. ).
  • Interpretable Neurons – There is a significant body of results finding neurons which appear to be interpretable (in RNNs; in CNNs; in GANs), activating in response to some understandable property. This work has faced some skepticism. In response, several papers have aimed to give extremely detailed accounts of a few specific neurons, in the hope of dispositively establishing examples of neurons which truly detect some understandable property (notably Cammarata et al., but also ).
  • Universality – Many analogous neurons responding to the same properties can be found across networks.
  • Polysemantic Neurons – At the same time, there are also many neurons which appear not to respond to an interpretable property of the input, and in particular, many polysemantic neurons which appear to respond to unrelated mixtures of inputs.

As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.

What are Features?

Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties! In the context of vision, these have ranged from low-level neurons like curve detectors and high-low frequency detectors, to more complex neurons like oriented dog-head detectors or car detectors, to extremely abstract neurons corresponding to celebrities, emotions, geographic regions, and more. In language models, researchers have found word embedding directions such as a male-female or singular-plural direction, low-level neurons disambiguating words that occur in multiple languages, much more abstract neurons, and "action" output neurons that help produce certain words. We'd like to use the term "feature" to encompass all these properties.

But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:

  • Features as arbitrary functions. One approach would be to define features as any function of the input (as in ). But this doesn't quite seem to fit our motivations. There's something special about the features we're observing: they seem to in some sense be fundamental abstractions for reasoning about the data, with the same features forming reliably across models. Features also seem identifiable: cat and car are two features, while cat+car and cat-car seem like mixtures of features rather than features in some important sense.
  • Features as interpretable properties. All the features we described are strikingly understandable to humans. One could try to use this for a definition: features are the presence of human-understandable "concepts" in the input. But it seems important to allow for features we might not understand. If AlphaFold discovers some important chemical structure for predicting protein folding, it very well might not be something we initially understand!
  • Neurons in Sufficiently Large Models. A final approach is to define features as properties of the input which a sufficiently large neural network will reliably dedicate a neuron to representing. This definition is trickier than it seems: something is a feature if there exists a large enough model size such that it gets a dedicated neuron, creating a kind of "epsilon-delta"-like definition. Our present understanding – as we'll see in later sections – is that arbitrarily large models can still have a large fraction of their features in superposition. However, for any given feature, assuming the feature importance curve isn't flat, it should eventually be given a dedicated neuron. This definition can be helpful for saying that something is a feature – curve detectors are a feature because you find them across a wide range of models larger than some minimal size – but it is unhelpful for the much more common case of features we only hypothesize about or observe in superposition. For example, curve detectors appear to reliably occur across sufficiently sophisticated vision models, and so are a feature. For interpretable properties which we presently only observe in polysemantic neurons, the hope is that a sufficiently large model would dedicate a neuron to them. This definition is slightly circular, but it avoids the problems with the earlier ones.

We have written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important not to prematurely attach to a definition. (A famous book by Lakatos illustrates the importance of uncertainty about definitions, and how important rethinking definitions often is in the context of research.)

Features as Directions

As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen"). Examples of interpretable neurons are also cases of features as directions, since the amount a neuron activates corresponds to a basis direction in the representation.

Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2, … activating with values x_{f_1}, x_{f_2}, … is represented by x_{f_1}W_{f_1} + x_{f_2}W_{f_2} + \ldots. To be clear, the features being represented are almost certainly nonlinear functions of the input. It's only the map from features to activation vectors which is linear. Note that whether something is a linear representation depends on what you consider to be the features.
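As a minimal illustration (with made-up feature directions and values, not taken from any real model), composing a linear representation is just a weighted sum of direction vectors, and an orthogonal feature can be read back out with a dot product:

```python
import numpy as np

# Hypothetical 3-dimensional activation space with two orthogonal feature directions.
W_f1 = np.array([1.0, 0.0, 0.0])   # direction representing feature f_1
W_f2 = np.array([0.0, 1.0, 0.0])   # direction representing feature f_2

# Features f_1 and f_2 active with values 0.8 and 0.3:
activation = 0.8 * W_f1 + 0.3 * W_f2

# With orthogonal unit directions, each feature's value can be recovered
# by projecting the activation onto that feature's direction.
assert np.isclose(activation @ W_f1, 0.8)
assert np.isclose(activation @ W_f2, 0.3)
```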

We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:

  • Linear representations are the natural outputs of obvious algorithms a layer might implement. If one sets up a neuron to pattern-match a particular weight template, it will fire more as a stimulus matches the template better and less as it matches it less well.
  • Linear representations make features "linearly accessible." A typical neural network layer is a linear function followed by a non-linearity. If a feature in the previous layer is represented linearly, a neuron in the next layer can "select it" and have it consistently excite or inhibit that neuron. If a feature were represented non-linearly, the model would not be able to do this in a single step.
  • Statistical Efficiency. Representing features as different directions may allow non-local generalization in models with linear transformations (such as the weights of neural nets), increasing their statistical efficiency relative to models which can only locally generalize. This view is especially advocated in some of Bengio's writing (e.g. ). A more accessible argument can be found in this blog post.

It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.

One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition allows models to store more features – potentially many more features – in linear representations.

For discussion of how this view of features squares with a conception of features as multidimensional manifolds, see the appendix "What about Multidimensional Features?".

Privileged vs Non-privileged Bases

Even if features are encoded as directions, a natural question to ask is: which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?

When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random linear transformation M to the word embedding, and applying M^{-1} to the following weights. This would produce an identical model in which the basis dimensions are totally different. This is what we mean by a non-privileged basis. Of course, it's possible to study activations without a privileged basis; you just have to identify interesting directions to study somehow, such as creating a gender direction in a word embedding by taking the difference vector between "man" and "woman".
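A small sketch of this argument (with random stand-in weights rather than a real embedding) shows why no basis direction of such a representation is distinguished:

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, n_tokens, d_out = 8, 100, 4

E = rng.normal(size=(d_embed, n_tokens))    # stand-in embedding (one column per token)
W_next = rng.normal(size=(d_out, d_embed))  # stand-in weights of the layer reading the embedding

M = rng.normal(size=(d_embed, d_embed))     # a random invertible change of basis
E_rot = M @ E                               # rotate the embedding...
W_rot = W_next @ np.linalg.inv(M)           # ...and undo the rotation in the next layer's weights

# Both parameterizations compute identical outputs, so the original
# basis directions of the embedding aren't special in any way.
assert np.allclose(W_next @ E, W_rot @ E_rot)
```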

But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and we call the basis directions "neurons." Often, these neurons correspond to interpretable features.

From this perspective, it only makes sense to ask whether a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)

Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.

The Superposition Hypothesis

Even when there’s a privileged foundation, it is usually the case that neurons are “polysemantic”, responding to a number of unrelated options. One rationalization for that is the superposition hypothesis. Roughly, the thought of superposition is that neural networks “need to signify extra options than they’ve neurons”, in order that they exploit a property of high-dimensional areas to simulate a mannequin with many extra neurons.

Several results from mathematics suggest that something like this might be plausible:

  • Almost Orthogonal Vectors. Although it's only possible to have n orthogonal vectors in an n-dimensional space, it's possible to have exp(n) many "almost orthogonal" (cosine similarity less than some small ε) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma. (A quick empirical check is sketched after this list.)
  • Compressed Sensing. In general, if one projects a vector into a lower-dimensional space, one can't reconstruct the original vector. However, this changes if one knows that the original vector is sparse. In that case, it is often possible to recover the original vector.
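As a rough empirical illustration of the first point (a sketch with arbitrary sizes, not a proof), random directions in a moderately high-dimensional space are already close to orthogonal, even when there are several times more of them than there are dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_vectors = 500, 2000   # four times more vectors than dimensions

V = rng.normal(size=(n_vectors, n_dims))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize to unit vectors

cos = V @ V.T                   # all pairwise cosine similarities
np.fill_diagonal(cos, 0.0)      # ignore each vector's similarity with itself

# The worst-case overlap between any two of the 2000 directions is still
# far below 1 (on the order of 0.2 with these sizes).
print("max |cosine similarity|:", np.abs(cos).max())
```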

Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the cost, since sparse features are rarely active to interfere with one another, and non-linear activation functions create opportunities to filter out small amounts of noise.)

One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:

Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.

Summary: A Hierarchy of Feature Properties

The ideas in this section can be thought of in terms of four progressively stricter properties that neural network representations might have.

  • Decomposability: Neural network activations which are decomposable can be decomposed into features, the meaning of which is not dependent on the values of other features. (This property is ultimately the most important – see the role of decomposition in defeating the curse of dimensionality.)
  • Linearity: Features correspond to directions. Each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2, … activating with values x_{f_1}, x_{f_2}, … is represented by x_{f_1}W_{f_1} + x_{f_2}W_{f_2} + \ldots.
  • Superposition vs Non-Superposition: A linear representation exhibits superposition if W^TW is not invertible. If W^TW is invertible, it does not exhibit superposition. (A minimal check of this condition is sketched after this list.)
  • Basis-Aligned: A representation is basis-aligned if all W_i are one-hot basis vectors. A representation is partially basis-aligned if all W_i are sparse. This requires a privileged basis.
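Under this definition, superposition can be checked directly from W. A minimal sketch with an arbitrary random W: whenever the number of features n exceeds the number of dimensions m, W^TW has rank at most m < n, so it cannot be invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 6, 3

W = rng.normal(size=(n_dims, n_features))   # columns W_i are feature directions
gram = W.T @ W                              # W^T W, shape (n_features, n_features)

# rank(W^T W) <= n_dims < n_features, so the Gram matrix is singular:
# by the definition above, this representation exhibits superposition.
print("rank of W^T W:", np.linalg.matrix_rank(gram))                 # 3
print("invertible?", np.linalg.matrix_rank(gram) == n_features)      # False
```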

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.


If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis can be comfortably dismissed.

The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)

Experiment Setup

Our goal is to explore whether a neural network can project a high-dimensional vector x \in \mathbb{R}^n into a lower-dimensional vector h \in \mathbb{R}^m and then recover it. (This experiment setup can also be thought of as an autoencoder reconstructing x.)

The Feature Vector (x)

We begin by describing the high-dimensional vector x: the activations of our idealized, disentangled larger model. We call each element x_i a "feature" because we're imagining features to be perfectly aligned with neurons in the hypothetical larger model. In a vision model, this might be a Gabor filter, a curve detector, or a floppy ear detector. In a language model, it might correspond to a token referring to a specific famous person, or a clause being a particular kind of description.

Since we have no ground truth for features, we need to create synthetic data for x which simulates any important properties we believe features have from the perspective of modeling them. We make three major assumptions:

  • Feature Sparsity: In the natural world, many features seem to be sparse in the sense that they only rarely occur. For example, in vision, most positions in an image don't contain a horizontal edge, or a curve, or a dog head. In language, most tokens don't refer to Martin Luther King and aren't part of a clause describing music. This idea goes back to classical work on vision and the statistics of natural images (see e.g. Olshausen, 1997, the section "Why Sparseness?"). For this reason, we choose a sparse distribution for our features.
  • More Features Than Neurons: There are an enormous number of potentially useful features a model might represent. A vision model of sufficient generality might benefit from representing every species of plant and animal and every manufactured object it might potentially see. A language model might benefit from representing every person who has ever been mentioned in writing. These only scratch the surface of plausible features, but already there seem to be more than any model has neurons. In fact, large language models demonstrably do know about people of very modest prominence – presumably more such people than they have neurons. This point is a standard argument in discussions of the plausibility of "grandmother neurons" in neuroscience, but it seems even stronger for artificial neural networks. This imbalance between features and neurons in real models seems like it must be a central tension in neural network representations.
  • Features Vary in Importance: Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance. (For computational reasons, we won't focus on this in this article, but we often imagine an unlimited number of features, with importance asymptotically approaching zero.)

Concretely, our synthetic data is defined as follows. The input vectors x are synthetic data intended to simulate the properties we believe the true underlying features of our task have. We consider each dimension x_i to be a "feature". Each has an associated sparsity S_i and importance I_i. We let x_i = 0 with probability S_i; otherwise it is uniformly distributed between [0, 1]. (The choice to have features distributed uniformly is arbitrary; an exponential or power law distribution would also be very natural.) In practice, we focus on the case where all features have the same sparsity, S_i = S.
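A minimal sketch of this data generation process (our own code; names and batch sizes are illustrative):

```python
import numpy as np

def sample_features(batch_size, n_features, sparsity, rng):
    """Sample synthetic feature vectors x.

    Each x_i is 0 with probability `sparsity` and otherwise
    uniformly distributed in [0, 1], independently per feature.
    """
    active = rng.random((batch_size, n_features)) >= sparsity
    values = rng.random((batch_size, n_features))
    return active * values

rng = np.random.default_rng(0)
x = sample_features(batch_size=1024, n_features=20, sparsity=0.9, rng=rng)
print(x.shape, (x == 0).mean())   # roughly 90% of entries are zero
```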

The Model (x → x')

We will actually consider two models, which we motivate below. The first, a "linear model", is a well-understood baseline which does not exhibit superposition. The second, a "ReLU output model", is a very simple model which does exhibit superposition. The two models differ only in their final activation function.

Linear Model

h = Wx
x' = W^T h + b

x' = W^T W x + b

ReLU Output Model

h = Wx
x' = ReLU(W^T h + b)

x' = ReLU(W^T W x + b)
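A minimal PyTorch sketch of both models, following the equations above (this is our own illustrative code; initialization details are an assumption, not taken from any released implementation):

```python
import torch

class ToyModel(torch.nn.Module):
    """x' = ReLU(W^T W x + b) if relu_output else W^T W x + b."""

    def __init__(self, n_features, n_hidden, relu_output=True):
        super().__init__()
        self.W = torch.nn.Parameter(torch.empty(n_hidden, n_features))
        torch.nn.init.xavier_normal_(self.W)          # initialization choice is a guess
        self.b = torch.nn.Parameter(torch.zeros(n_features))
        self.relu_output = relu_output

    def forward(self, x):                 # x: (batch, n_features)
        h = x @ self.W.T                  # h = W x, shape (batch, n_hidden)
        out = h @ self.W + self.b         # W^T h + b, back to (batch, n_features)
        return torch.relu(out) if self.relu_output else out

linear_model = ToyModel(n_features=20, n_hidden=5, relu_output=False)
relu_model = ToyModel(n_features=20, n_hidden=5, relu_output=True)
```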

Why these models?

The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down-projection as a linear map h = Wx. Note that each column W_i corresponds to the direction in the lower-dimensional space that represents the feature x_i.

To recover the original vector, we use the transpose of the same matrix, W^T. This has the advantage of avoiding any ambiguity about which direction in the lower-dimensional space really corresponds to a feature. It also seems relatively mathematically principled (recall that W^T = W^{-1} if W is orthonormal; although W can't literally be orthonormal here, our intuition from compressed sensing is that it will be "almost orthonormal" in the sense of Candes & Tao), and it empirically works.

We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.

The final question is whether to add an activation function. This turns out to be essential to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.

The Loss

Our loss is mean squared error weighted by the feature importances, I_i, described above:

L = \sum_x \sum_i I_i (x_i - x'_i)^2
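A sketch of this loss in the same PyTorch setting (our own code; averaging over the batch rather than summing is a choice on our part, and the importance curve shown is the I_i = 0.7^i used in the experiments below):

```python
import torch

def weighted_mse(x, x_hat, importance):
    # L = sum over features of I_i * (x_i - x'_i)^2, averaged over the batch
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean()

n_features = 20
importance = 0.7 ** torch.arange(n_features)   # I_i = 0.7^i
```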

Basic Results

Our first experiment will simply be to train several ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on the sparsity level.)

The main question is how to visualize the results. The simplest way is to visualize W^TW (a features-by-features matrix) and b (a feature-length vector). Note that features are arranged from most important to least, so the results have a fairly nice structure. Here's an example of what this type of visualization might look like, for a small model (n=20, m=5) which behaves in the "expected linear-model-like" way, only representing as many features as it has dimensions:

But the thing we really care about is the hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is simply how many features the model learns to represent. For any feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector.

We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate \sum_{j \neq i} (\hat{W}_i \cdot W_j)^2, projecting all other features onto the direction vector of W_i. This will be 0 if the feature is orthogonal to all other features (dark blue below). On the other hand, values ≥ 1 mean that there is some group of other features which can activate W_i as strongly as feature i itself!
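Both quantities can be computed directly from the learned W. A small sketch (our own helpers, assuming W has shape (m, n) with feature directions as columns):

```python
import numpy as np

def feature_norms(W):
    # ||W_i|| for each feature i: roughly, whether the feature is represented at all.
    return np.linalg.norm(W, axis=0)

def interference(W, eps=1e-9):
    # sum_{j != i} (W_i_hat . W_j)^2 for each feature i.
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / np.maximum(norms, eps)   # unit direction of each represented feature
    overlaps = W_hat.T @ W               # overlaps[i, j] = W_i_hat . W_j
    return (overlaps ** 2).sum(axis=1) - np.diag(overlaps) ** 2
```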

We can visualize the model we looked at previously in this way:

Now that we have a way to visualize models, we can start to actually do experiments. We'll begin by considering models with only a few features (n=20, m=5, I_i = 0.7^i). This will make it easy to visually see what happens. We consider a linear model and several ReLU-output models trained on data with different feature sparsity levels: