# Deep Neural Networks As Computational Graphs | by Tyler Elliot Bettilyon | Teb’s Lab

*by* Phil Tadros

A lot of people say that neural nets are a "black box" whose successful predictions are impossible to explain. I hate treating anything as a black box; it grates against my curious nature. It's also not a very useful mental model: understanding *what* neural nets are and *how* they arrive at the conclusions they do can help practitioners gain insight into using them.

Viewed through the right lens, neural nets' predictive capability makes a lot of sense. This article is about taking nets out of their black box by understanding what a neural network really represents. Later in the series we'll explore exactly how they're trained with gradient descent and backpropagation.

At its core, every neural network **represents a single mathematical function.** That means when you set out to use a neural network for some task, your hypothesis is that there is some mathematical function that can approximate the observed behavior reasonably well. When we train a neural network we're trying to find *one such reasonable approximation.*

Because these functions are often monstrously complex, we use graphs to represent them rather than standard function notation. These graphs help us organize our thinking about the functions we set out to build, and it turns out some graphs work much better than others for particular tasks. A lot of research and development in the neural network space is about inventing new architectures for these graphs, rather than inventing brand new algorithms.

So what is a computational graph, and how are they used by neural networks?

A computational graph is a way to represent a mathematical function in the language of graph theory. Recall the premise of graph theory: nodes are connected by edges, and everything in the graph is either a node or an edge.

In a computational graph nodes are either input values or functions for combining values. Edges receive their weights as the data flows through the graph. Outbound edges from an input node are weighted with that input value; outbound edges from a function node are weighted by combining the weights of the inbound edges using the specified function.

For example, consider the relatively simple expression: f(x, y, z) = (x + y) * z. This is how we would represent that function as a computational graph:

There are three input nodes, labeled X, Y, and Z. The two other nodes are function nodes. In a computational graph we generally compose many simple functions into a more complex function. We can do composition in mathematical notation as well, but I hope you'll agree the following isn't as clear as the graph above:

`f(x, y, z) = h(g(x, y), z)`

`g(i, j) = i + j`

`h(p, q) = p*q`

In both of these notations we can compute the answer to each function individually, provided we do so in the correct order. Before we know the answer to f(x, y, z) we first need the answer to g(x, y), and then h(g(x, y), z). In mathematical notation we resolve these dependencies by computing the deepest parenthetical first; in a computational graph we have to wait until all the edges pointing *into* a node have a value before computing the output value for that node. Let's walk through the example of computing f(1, 2, 3).

`f(1, 2, 3) = h(g(1, 2), 3)`

`g(1, 2) = 1 + 2 = 3`

`f(1, 2, 3) = h(3, 3)`

`h(3, 3) = 3*3 = 9`

`f(1, 2, 3) = 9`

And in the graph, we use the output from each node as the weight of the corresponding edge:

Either way, graph or function notation, we get the same answer, because these are just two ways of expressing the same thing.
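The same evaluation order can be sketched in code. Here is a minimal Python sketch, using the `g` and `h` names from the notation above, that evaluates the graph by computing each function node only once all of its inputs are available:

```python
def g(i, j):
    # First function node: addition
    return i + j

def h(p, q):
    # Second function node: multiplication
    return p * q

def f(x, y, z):
    # Resolve dependencies in order: g's output becomes
    # the weight of an edge flowing into h
    g_out = g(x, y)
    return h(g_out, z)

print(f(1, 2, 3))  # prints 9
```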

In this simple example it might be hard to see the advantage of using a computational graph over function notation. After all, there isn't anything terribly hard to understand about the function f(x, y, z) = (x + y) * z. The advantages become more apparent when we reach the scale of neural networks.

Even relatively "simple" deep neural networks have *hundreds of thousands* of nodes and edges; it's quite common for a neural network to have more than a million edges. Try to imagine the function expression for such a computational graph… can you do it? How much paper would you need to write it all down? This issue of scale is one of the reasons computational graphs are used.

Let's look at one concrete example: suppose we're building a deep neural network for predicting whether someone is single or in some kind of relationship; a binary predictor. Further, assume we've gathered a dataset that tells us four things about a person: their age, gender, what city they live in, and whether they are single or in some kind of relationship.

When we say we want to "build a neural network" to make this prediction, we're really saying that we want to find a mathematical function of the form:

`f(age, gender, city) = predicted_relationship_status`

Where the output value is 0 if that person is in a relationship, and 1 if that person is not in a relationship.

We're making a big (and wrong) assumption here that age, gender, and city tell us everything we need to know about whether or not someone is in a relationship. But that's okay: all models are wrong, and we can use statistics to find out whether this one is *useful* or not. Don't focus on how much this toy model oversimplifies human relationships; focus on what this means for the neural network we want to build.

As an aside, before we move on: encoding the value for "city" can be tricky. It's not at all clear what the numerical value of "Berkeley" or "Salt Lake City" should be in our mathematical function. Topics such as tokenization and processing categorical data are beyond this article's scope but absolutely worthy of your time if you haven't encountered them before. One-hot encoding is a popular tactic.

In fact, a one-hot encoded vector could be used as the output layer for this network as well. The details of using one-hot vectors this way are in the next article in the series.
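As a rough illustration (the list of cities here is invented for the example), one-hot encoding maps each category to a vector containing a single 1:

```python
# Hypothetical set of cities appearing in our dataset
cities = ["Berkeley", "Salt Lake City", "Denver"]

def one_hot(city):
    # Each city becomes a vector with a 1 in its own
    # position and 0s everywhere else
    vec = [0] * len(cities)
    vec[cities.index(city)] = 1
    return vec

print(one_hot("Salt Lake City"))  # prints [0, 1, 0]
```

This sidesteps the question of what number "Berkeley" should be: no city is numerically "bigger" than any other.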

I like to think of the architecture of a deep neural network as a template for a function. When we define the architecture of a neural network we're laying out the series of sub-functions and specifying how they should be composed. When we *train* the neural network we're experimenting with the parameters of those sub-functions. Consider this function as an example:

`f(x, y) = ax² + bxy + cy²; where a, b, and c are scalars`

The *component sub-functions* of this function are all of the operators: two squares, two additions, and four multiplications. The *tunable parameters* of this function are a, b, and c; in neural network parlance these are called **weights**. The inputs to the function are x and y; we can't tune these values in machine learning because they're the values from the dataset, which we would have (hopefully) gathered earlier in the process.

By changing the values of our weights (a, b, and c) we can dramatically affect the output of the function. But regardless of the values of a, b, and c there will always be an x², a y², and an xy term, so our function has a limited range of possible configurations.
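A quick sketch makes this concrete: the structure of the template is fixed, but different weight settings (the specific values below are arbitrary) give very different outputs for the same inputs:

```python
def f(x, y, a, b, c):
    # The template: the x², xy, and y² terms are always present;
    # only the weights a, b, and c can change
    return a * x**2 + b * x * y + c * y**2

# Same inputs (x=1, y=2), different weights, different outputs
print(f(1, 2, a=1, b=1, c=1))   # 1 + 2 + 4  ->  7
print(f(1, 2, a=3, b=0, c=-1))  # 3 + 0 - 4  -> -1
```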

Here is a computational graph representing this function:

This isn't *technically* a neural network, but it's very close in all the ways that count. It's a graph that represents a function; we could use it to predict some kinds of trends; and we could train it using gradient descent and backpropagation if we had a dataset that mapped two inputs to an output. This particular computational graph will be good at modeling some quadratic trends involving exactly 2 variables, but bad at modeling anything else.

In this example, training the network would amount to changing the weights until we find some combination of a, b, and c that causes the function to work well as a predictor for our dataset. If you're familiar with linear regression, this should feel similar to tuning the weights of the linear expression.

This graph is still quite simple compared to even the simplest neural networks used in practice, but the main idea, that a, b, and c can be adjusted to improve the model's performance, remains the same.

The reason this neural network wouldn't be used in practice is that it isn't very *flexible*. This function only has 3 parameters to tune: a, b, and c. Making matters worse, we've only given ourselves room for 2 features per input (x and y).

Fortunately, we can easily solve this problem by using *more complex functions* and allowing for *more complex input.* Huzzah!

Recall two facts about deep neural networks:

- DNNs are a special kind of graph, a "computational graph".
- DNNs are made up of a series of "fully connected" layers of nodes.

"Fully connected" means that the output from each node in the first layer becomes one of the inputs for *every node* in the second layer. In a computational graph the edges are the output values of functions, so in a fully connected layer the output of each sub-function is used as one of the inputs for each of the sub-functions in the next layer. But what are these functions?

The function performed by each node in the neural net is called a **transfer function** (which is also called the **activation function**). There are two steps in every transfer function. First, all of the input values are combined in some way; usually this is a weighted sum. Second, a "nonlinear" function is applied to that sum; this second function might change from layer to layer within a single neural network.

Popular nonlinear functions for this second step are tanh, log, max(0, x) (called the Rectified Linear Unit, or ReLU), and the sigmoid function. At the time of this writing, ReLU is the most popular choice of nonlinearity, but things change quickly.
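For reference, two of these nonlinearities are simple enough to sketch in a few lines of Python:

```python
import math

def relu(x):
    # max(0, x): negative inputs become 0, positive inputs pass through
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-3.0))              # prints 0.0
print(relu(2.5))               # prints 2.5
print(round(sigmoid(0.0), 2))  # prints 0.5
```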

If we zoom in on a neural network, we'll find that each "node" in the network is actually 2 nodes in our computational graph:

In this case, the **transfer function** is a sum followed by a sigmoid. Typically, all the nodes in a layer have the same transfer and activation function. Indeed it is common for all the layers in the same network to use the same activation function, though it isn't a requirement by any means.

The last sources of complexity in our neural network are **biases** and **weights**. Every incoming edge has a unique **weight**; the output value from the previous node is multiplied by this weight *before* it's given to the transfer function. Each transfer function also has a single **bias** which is added *before* the nonlinearity is applied. Let's zoom in one more time:

In this diagram we can see that each input to the sum is first **weighted** via multiplication, *then* it's summed. The bias is added to that sum as well, and finally the total is sent to our nonlinear function (sigmoid in this case). These weights and biases are the parameters that are ultimately fine-tuned during training.
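Putting the pieces together, a single node can be sketched as follows (the specific weights and bias below are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node(inputs, weights, bias):
    # Step 1: multiply each input by its edge weight,
    # sum the results, and add the bias
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the nonlinearity to that total
    return sigmoid(total)

# Three incoming edges with arbitrary weights and an arbitrary bias:
# total = 0.5*1.0 + (-0.25)*2.0 + 0.1*3.0 + 0.2 = 0.5
out = node(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.1], bias=0.2)
print(round(out, 3))  # prints 0.622
```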

In the earlier example, I said we didn't have enough flexibility because we only had 3 parameters to fine-tune. So just how many parameters are there in a deep neural network for us to tune?

If we define a neural net for binary classification (in/not in a relationship) with 2 hidden layers of 512 nodes each and an input vector with 20 features, we will have 20*512 + 512*512 + 512*2 = 273,408 weights that we can fine-tune, plus 1024 biases (one for each node in the hidden layers). This is a "simple" neural network. "Complex" neural networks frequently have several million tunable weights and biases.
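That arithmetic generalizes to any stack of fully connected layers; here is a small sketch (which counts biases for the hidden layers only, matching the tally above):

```python
def count_parameters(layer_sizes):
    # Each pair of adjacent layers contributes in_size * out_size weights
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per hidden-layer node, as in the example above
    biases = sum(layer_sizes[1:-1])
    return weights, biases

# 20 input features, two hidden layers of 512 nodes, 2 output nodes
weights, biases = count_parameters([20, 512, 512, 2])
print(weights)  # prints 273408
print(biases)   # prints 1024
```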

This extraordinary flexibility is what allows neural nets to find and model complex relationships. It's also why they require **tons** of data to train. Using backpropagation and gradient descent we can purposefully change the millions of weights until the output becomes more correct, but because we're doing calculations involving millions of variables it takes a lot of time and a lot of data to find the right combination of weights and biases.

While they're often called a "black box", neural networks are really just a way of representing very complex mathematical functions. The neural nets we build are particularly useful functions *because* they have so many parameters that can be fine-tuned. The result of the fine-tuning is that rich complexities between different components of the input can be plucked out of the noise.

Ultimately, the "architecture" of our computational graph will have a big impact on how well our network can perform. Questions like how many nodes per layer, which activation functions to use at each layer, and how many layers to use are the subject of research and might change dramatically from neural network to neural network. The architecture will depend on the type of prediction being made and the kind of data being fed into the system; just as we shouldn't use a linear function to model parabolic data, we shouldn't use *any* neural net to solve *every* problem.

In the next article in this series I'm going to build and examine a few "simple" neural networks using the Keras library. I'll be working through the "hello world" of machine learning: classifying handwritten digits using the MNIST dataset. My goal in that article is to explain how different neural net architectures impact training time and the performance of the model.

After an interlude in practical-land we'll return to the theory to discuss two important topics: gradient descent and backpropagation. See you then!

Part 3: Classifying MNIST Digits With Different Neural Network Architectures