# Tensors and Convolution – Art, Tech and other Nonsense

*by* Phil Tadros

This is the second episode of a miniseries into the vast world of Deep Neural Networks, where I start from the Intel® Open Image Denoise open-source library (OIDN) and create a CUDA/cuDNN-based denoiser that produces high-quality results at real-time frame rates. In this series, I narrate the story in first person.

The first episode focused on a quick dive into the OIDN library source code to determine whether the Deep Neural Network it contains can be easily extracted and reproduced elsewhere. The library contains ‘tensor archive’ files that can be easily decoded, which provides the data necessary to assemble a DNN. I identified the decoder source code and the procedure by which tensors are linked to form the network, but I also gathered many questions about some of the more nebulous concepts: what is a tensor? How do tensors relate to Neural Networks? What is a Convolutional Neural Network? What is a U-Net? And more… It’s time to get some answers.

One thing first, I want to thank two people:

- Attila Áfra, the architect behind the OIDN library, for publishing the library for anybody to study.
- Marco Salvi, for the help and guidance on theoretical notions during my exploration.

## Tensors

I let mathematicians explain what a tensor is. However, as correct as the following explanation may be, the broad generalization in it only helps me so much in understanding what I’m dealing with here.

> Tensors are simply mathematical objects that can be used to describe physical properties, just like scalars and vectors. In fact, tensors are merely a generalization of scalars and vectors; a scalar is a zero-rank tensor, and a vector is a first-rank tensor.

In my own words, and for the purpose of this project, a tensor is often a 3D or 4D matrix, typically of floating-point numbers; higher numbers of dimensions are possible, but I will not encounter those in this series. A 1080p RGB image is a 3D tensor with dimensions (W*H*C) 1920x1080x3, where W and H stand for *width* and *height* in pixels, respectively; and C is the number of channels: RGB. A sequence of 10 such images can be seen as a 4D tensor. Such a tensor has 4 dimensions (N*W*H*C) 10x1920x1080x3, where N is the number of images. Generalizing this, one could say that a single image can still be considered a 4D tensor where N = 1. Dimensions here have nothing to do with Cartesian dimensions of course; they are purely relational.

When expressing the resolution of an image, we commonly say something along the lines of “1920 by 1080”. However, if we consider how uncompressed image data is stored in memory, we’ll observe that images are stored line by line, pixel by pixel, and channel by channel. In the case of a sequence, we have N full images, each with H rows made of W columns of pixels, each pixel made of C channels. Thus, if one had to come up with a notation to describe how images are commonly stored in memory, it could be NHWC. These letters are commonly used in Machine Learning to describe tensors, and when an image is loaded to be processed by a neural network, it is stored in a tensor with a “format” such as NHWC, though I prefer to use the term *data layout*.
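To make the layout concrete, here is a small sketch of how the flat memory offset of a single value is computed in an NHWC tensor (`nhwcOffset` is my own illustrative helper, not OIDN code):

```cpp
#include <cstddef>

// Flat offset of channel c of the pixel at (row h, column w) of image n,
// for a tensor stored in NHWC order: n varies the slowest, c the fastest.
size_t nhwcOffset(size_t n, size_t h, size_t w, size_t c,
                  size_t H, size_t W, size_t C)
{
    return ((n * H + h) * W + w) * C + c;
}
```

For a 1920x1080x3 image, the green channel of the very first pixel sits at offset 1, and the first channel of the next pixel on the row at offset 3: channels are interleaved, just like the familiar rgbrgb… ordering.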

Are there other data layouts? Sure, there are! My mind is quickly drawn to a familiar similarity in concept with the SIMD programming model (Single Instruction Multiple Data), where there are different ways of storing data, resulting in potentially significant differences in performance. In the context of SIMD one may have AOS (Array of Structs) or SOA (Struct of Arrays) layouts. For example:

```
struct color
{
    float r, g, b;
};

// An AOS data layout with 64 color elements
color aos[64];

template<int size>
struct color_soa
{
    float r[size];
    float g[size];
    float b[size];
};

// An SOA data layout with 64 color elements
color_soa<64> soa;
```

In the AOS case, the data as stored in memory contains a sequence rgbrgbrgbrgb…, while in the SOA case, the data in memory appears as rrrr…gggg…bbbb… SIMD instructions prefer reading SOA data because, with each individual load and store instruction, the processor can fill wide registers with multiple elements accessed consecutively and sequentially. This makes good use of the memory latency and bandwidth, resulting in faster processing speeds.

Back from this direct analogy: tensors may be organized in different data layouts depending on the operation we need to run on them, and on how the available hardware may prefer to access the data. So, if NHWC is analogous to AOS, NCHW is to SOA. In an NCHW data layout, you’ll have N images concatenated, each made of an H*W planar representation of the red channel, followed by the planar representation of the green channel, then the blue…
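As a sketch of the difference (again my own illustrative code, not from any library), repacking an interleaved NHWC buffer into planar NCHW order is just a reshuffling of indices:

```cpp
#include <cstddef>
#include <vector>

// Repack an interleaved NHWC buffer into planar NCHW order.
std::vector<float> nhwcToNchw(const std::vector<float>& src,
                              size_t N, size_t H, size_t W, size_t C)
{
    std::vector<float> dst(src.size());
    for (size_t n = 0; n < N; ++n)
        for (size_t c = 0; c < C; ++c)
            for (size_t h = 0; h < H; ++h)
                for (size_t w = 0; w < W; ++w)
                    dst[((n * C + c) * H + h) * W + w] =
                        src[((n * H + h) * W + w) * C + c];
    return dst;
}
```

A two-pixel rgb row {r0,g0,b0, r1,g1,b1} comes out as {r0,r1, g0,g1, b0,b1}: each channel becomes its own contiguous plane.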

The ‘N’ dimension is a bit of an impostor. More than a dimension of the data, it’s a way to express a batch of identical elements. A notion for those algorithms that apply the exact same computation to many individual entries, without any of them overlapping or interfering.

## Convolution

A graphics programmer is likely to be familiar with the concept of convolution:

> In mathematics (specifically, functional analysis), **convolution** is a mathematical operation on two functions (f and g) that produces a third function (f ∗ g) that expresses how the shape of one is modified by the other. The term *convolution* refers both to the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result (see commutativity). The integral is evaluated for all values of shift, producing the convolution function.

While convolution is an integral of the product of functions, in image processing the term convolution is often used improperly; a pedantic person would say discrete convolution instead, which is achieved as a much simpler weighted summation.

Discrete convolution is the process of sliding a weighted window (the *kernel*) on top of an image, pixel by pixel. As the kernel overlaps a region of pixels in the image, it provides the weights with which to sum those pixels. The result obtained is then stored to the output image. Common uses of convolution in image processing include blur, sharpen, edge detection, etc…
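The sliding-window summation can be sketched in a few lines of C++ (a minimal single-channel version of my own, not library code; strictly speaking it computes a cross-correlation since the kernel is not mirrored, which is the common practice in image processing):

```cpp
#include <cstddef>
#include <vector>

// Discrete convolution of a single-channel H*W image with a K*K kernel,
// no padding: the output shrinks by K-1 pixels in each dimension.
std::vector<float> convolve2D(const std::vector<float>& img, int H, int W,
                              const std::vector<float>& kernel, int K)
{
    const int outH = H - K + 1, outW = W - K + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
        {
            float sum = 0.0f;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    sum += kernel[ky * K + kx] * img[(y + ky) * W + (x + kx)];
            out[y * outW + x] = sum;
        }
    return out;
}
```

With a 3×3 kernel of all ones (a box filter before normalization), each output pixel is simply the sum of the 3×3 input region under the window.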

During convolution, something happens at the edges and corners of the image. Say the kernel is a 3×3 window. When its center is aligned to the very first pixel of the input image, some of the kernel extends outside the image boundary. There are several ways to define how this situation should be handled:

- The values of the pixels at the border implicitly extend outside of the image.
- Rings of values outside of the image are considered zeroes (this is called *padding*).
- The convolution kernel never extends outside the borders of the input image (no padding). With a 3×3 kernel, the output image has 1 pixel trimmed off each side.

By controlling *padding,* one can produce an output that is equal to or larger than the input image. A common configuration is to preserve the image resolution, which enables the application of multiple filters in succession without reducing the output resolution. For this configuration, the padding value should be half the kernel width, rounded down to the nearest integer. For example, for a 3×3 kernel the padding should be 1 pixel on each side, and for a 5×5 kernel the padding should be 2 pixels on each side.
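The rule fits in two lines (hypothetical helper names of my own, assuming a stride of 1):

```cpp
// Padding that preserves the input resolution for an odd-sized kernel.
int paddingForSameSize(int kernel) { return kernel / 2; } // rounds down

// Output resolution of a convolution along one axis, stride 1.
int outputSize(int input, int kernel, int padding)
{
    return input + 2 * padding - kernel + 1;
}
```

A 1920-wide image convolved with a 3×3 kernel and padding 1 stays 1920 wide; with no padding it shrinks to 1918.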

#### Convolution and Tensors

So far the diagrams have shown convolution applied to a single channel. Things become more interesting when multiple channels come into play. The simplest extension is to apply the same filter weights uniformly to all channels, such as for an RGB image where the input rgb pixels are weighted by scalar weights to produce new rgb values. This is common in image processing, such as with blur filters that apply the same effect to all image channels. However, convolution filters can have channels too, allowing a different set of weights to be applied per input channel. Moreover, a filter can express a matrix multiplication between a set of input channels and a set of filter channels to produce a set of output channels that may differ in number from the input channels.

Say I have an RGB image, and the goal of the convolution is to extract a variety of features from it, say vertical edges, horizontal edges, and two diagonal edges. To achieve this, a 4D tensor can be used as the filter. This tensor defines how to compute each of the 4 output features with respect to the 3 input channels. Each of the 12 combinations is an H*W filter. So, we have 4 outputs, times 3 inputs, times H*W filter weights. Following the NCHW notation for tensors, ‘O’ stands for output and ‘I’ stands for input. In our example, the filter is a tensor with data layout OIHW and dimensions [4, 3, 3, 3].

The number of elements in a tensor is given by the product of its dimensions. In this example, the filter tensor has a total of 4*3*3*3 = 108 weights. These weights connect a 3*3-pixel region in the input tensor, across its 3 channels, to a pixel of the output tensor across its 4 channels.
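Both facts are easy to express in code (illustrative sketches of my own; the real OIDN tensor class has its own equivalent of the element count):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Number of weights in a tensor = product of its dimensions.
size_t numElements(const std::vector<int64_t>& dims)
{
    if (dims.empty()) return 0;
    return std::accumulate(dims.begin(), dims.end(), (int64_t)1,
                           std::multiplies<int64_t>());
}

// Flat offset of weight (o, i, ky, kx) in an OIHW filter with
// I input channels and a square K*K kernel window.
size_t oihwOffset(size_t o, size_t i, size_t ky, size_t kx,
                  size_t I, size_t K)
{
    return ((o * I + i) * K + ky) * K + kx;
}
```

For the [4, 3, 3, 3] filter above, `numElements` yields 108, and each output feature o owns a contiguous block of I*K*K = 27 weights.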

In the first episode we described Deep Neural Networks as a sequence of neural layers, interconnected by weights… Connecting the dots, a 4D convolution filter whose weights are produced by an ML training process is in fact a type of neural network!

## Convolutional Neural Networks

Also referred to as ConvNets, or CNNs, Convolutional Neural Networks arise from the observation that certain types of processing should be applied uniformly across the input data. If I want to identify handwritten numbers in an image, I would like a Neural Network to identify the feature independently of where it may appear in the image, and independently of its resolution. If I want to denoise an image, I would like the noise to be consistently recognized as such, from the center of the image to the corners. ConvNets are practical and effective at this, as they can be expressed as a sequence of convolution filters (plus a few more types of layers I’ll describe in a future episode), rather than rigid and more expensive fully-connected networks.

In a ConvNet, the input and output tensors of a convolution layer are the neurons, and the filter tensor is the weights. The operation of convolution instantiates the same weights across the image as the convolution window slides along, connecting the various regions of neurons to the respective output neurons. The result can be seen as a vast, and very well compressed, neural network. This is the answer to one of my naive initial questions: how can a neural network process images of arbitrary size? Now I know, and this opens the door to a whole new universe previously unknown to me. I was blind and now I see!!

Time for a break.

## Extraction

Armed with some new theoretical understanding, I feel optimistic about extracting the *parseTZA* function from the OIDN code base. Here is the function pseudocode from the first episode:

```
parseTZA(...):
    Parse the magic value
    Parse the version
    Parse the table offset and jump to the table
    Parse the number of tensors
    Parse the tensors in a loop:
        Parse the name of the tensor
        Parse the number of dimensions
        Parse the shape of the tensor (vector of dimensions)
        Parse the layout of the tensor, either "x" or "oihw"
        Parse the data type of the tensor, only float is supported
        Parse the offset to the tensor data
        Create a tensor using dimensions, layout, type, and pointer to data
        Add it to the map by its name
    Return map of tensors
```

I have a good understanding of what a tensor is, what “dimensions” means, and what the OIHW layout is, while the layout “x” remains mysterious… Appropriate label though! I’m going to roll up my sleeves and begin copy-pasting code into a new header file. I begin with the function itself; the function requires the types *Tensor*, *Device*, and *Exception*. Class *Device* seems to be involved…

I’m not going to use any of it. Classes like this one tend to orchestrate the computation in a complex heterogeneous system. Since I only need to read the tensors, I declare an impostor *struct Device {}* instead, and see if I can eliminate it later.

Looking at class Tensor I see bits I need and bits I don’t… Here is an example of the before and after, to give you a sense of the kind of aggressive pruning: its 261 lines go down to 21.

The whole point of what I keep is to retain only the information decoded in the *parseTZA* function. Any dead code that comes along with the copy-paste, I quickly identify and remove. The process is recursive: *class Tensor* requires *class TensorDesc*, copy-paste the implementation, simplify, simplify, simplify. It would be too involved to show this process unfolding, and I don’t think it would be interesting to document either; it’s merely an extraction refactoring.

This is what I’m left with after a couple of hours with a machete.

```
// Note: this implementation is extracted from the OIDN library and simplified for the purpose
// of just loading the OIDN tza weights blob files. For more, check:
// https://github.com/OpenImageDenoise/oidn.git
//
// Copyright 2009-2021 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <exception>
#include <vector>
#include <map>
#include <string>
#include <memory>

namespace oidn
{
    // Error codes
    enum class Error
    {
        None                = 0, // no error occurred
        Unknown             = 1, // an unknown error occurred
        InvalidArgument     = 2, // an invalid argument was specified
        InvalidOperation    = 3, // the operation is not allowed
        OutOfMemory         = 4, // not enough memory to execute the operation
        UnsupportedHardware = 5, // the hardware (e.g. CPU) is not supported
        Cancelled           = 6, // the operation was cancelled by the user
    };

    class Exception : public std::exception
    {
    private:
        Error error;
        const char* message;

    public:
        Exception(Error error, const char* message)
            : error(error), message(message) {}

        Error code() const noexcept
        {
            return error;
        }

        const char* what() const noexcept override
        {
            return message;
        }
    };

    enum class DataType
    {
        Float32,
        Float16,
        UInt8,
        Invalid
    };

    // Tensor dimensions
    using TensorDims = std::vector<int64_t>;

    // Tensor memory layout
    enum class TensorLayout
    {
        x,
        chw,
        oihw,
    };

    // Tensor descriptor
    struct TensorDesc
    {
        TensorDims dims;
        TensorLayout layout;
        DataType dataType;

        __forceinline TensorDesc() = default;
        __forceinline TensorDesc(TensorDims dims, TensorLayout layout, DataType dataType)
            : dims(dims), layout(layout), dataType(dataType) {}

        // Returns the number of elements in the tensor
        __forceinline size_t numElements() const
        {
            if (dims.empty())
                return 0;
            size_t num = 1;
            for (size_t i = 0; i < dims.size(); ++i)
                num *= dims[i];
            return num;
        }

        // Returns the size in bytes of an element in the tensor
        __forceinline size_t elementByteSize() const
        {
            switch (dataType)
            {
            case DataType::Float32: return 4;
            case DataType::Float16: return 2;
            case DataType::UInt8:   return 1;
            default:
                return 0;
            }
        }

        // Returns the size in bytes of the tensor
        __forceinline size_t byteSize() const
        {
            return numElements() * elementByteSize();
        }
    };

    // Tensor
    class Tensor : public TensorDesc
    {
    public:
        const void* ptr; // Data is only temporarily referenced, not owned

    public:
        Tensor(const TensorDesc& desc, const void* data)
            : TensorDesc(desc),
              ptr(data)
        {}

        Tensor(TensorDims dims, TensorLayout layout, DataType dataType, const void* data)
            : TensorDesc(dims, layout, dataType),
              ptr(data)
        {}

        __forceinline const void* data() { return ptr; }
        __forceinline const void* data() const { return ptr; }
    };

    // Checks for buffer overrun
    __forceinline void checkBounds(char* ptr, char* end, size_t size)
    {
        if (end - ptr < (ptrdiff_t)size)
            throw Exception(Error::InvalidOperation, "invalid or corrupted weights blob");
    }

    // Reads a value from a buffer (with bounds checking) and advances the pointer
    template<typename T>
    __forceinline T read(char*& ptr, char* end)
    {
        checkBounds(ptr, end, sizeof(T));
        T value;
        memcpy(&value, ptr, sizeof(T));
        ptr += sizeof(T);
        return value;
    }

    // Decode DNN weights from the binary blob loaded from .tza files
    int parseTZA(void* buffer, size_t size,
                 // results
                 std::map<std::string, std::unique_ptr<Tensor>>& tensorMap)
    {
        [...]
    }
} // namespace oidn
```

This listing could be simplified further, by removing the use of inheritance, the granular heap allocations, the reliance on *std::unique_ptr*, and the exceptions. These are just a few among the things I could spend a few more minutes cleaning up. But for now, I don’t want to modify the body of the *parseTZA* function; I can always come back to it later. I’m more curious about what it is parsing from the files, so I add some logging code, and this is what I find:
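The logging itself is trivial; a sketch of the row formatting along these lines produces the table below (`formatTensorRow` is my own hypothetical helper, and I leave out the exact column alignment):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Format one row of the tensor table from a name, its dimensions,
// a layout label, and the byte size.
std::string formatTensorRow(const std::string& name,
                            const std::vector<int64_t>& dims,
                            const std::string& layout, size_t byteSize)
{
    std::string row = name + " |";
    for (int64_t d : dims)
        row += " " + std::to_string(d) + ",";
    if (!dims.empty())
        row.pop_back(); // drop the trailing comma
    row += " | " + layout + " | " + std::to_string(byteSize);
    return row;
}
```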

```
Tensor Name       | Dimensions     | Layout | ByteSize
------------------+----------------+--------+-----------
enc_conv0.weight | 32, 9, 3, 3 | oihw | 10368
enc_conv0.bias | 32 | x | 128
enc_conv1.weight | 32, 32, 3, 3 | oihw | 36864
enc_conv1.bias | 32 | x | 128
enc_conv2.weight | 48, 32, 3, 3 | oihw | 55296
enc_conv2.bias | 48 | x | 192
enc_conv3.weight | 64, 48, 3, 3 | oihw | 110592
enc_conv3.bias | 64 | x | 256
enc_conv4.weight | 80, 64, 3, 3 | oihw | 184320
enc_conv4.bias | 80 | x | 320
enc_conv5a.weight | 96, 80, 3, 3 | oihw | 276480
enc_conv5a.bias | 96 | x | 384
enc_conv5b.weight | 96, 96, 3, 3 | oihw | 331776
enc_conv5b.bias | 96 | x | 384
dec_conv4a.weight | 112, 160, 3, 3 | oihw | 645120
dec_conv4a.bias | 112 | x | 448
dec_conv4b.weight | 112, 112, 3, 3 | oihw | 451584
dec_conv4b.bias | 112 | x | 448
dec_conv3a.weight | 96, 160, 3, 3 | oihw | 552960
dec_conv3a.bias | 96 | x | 384
dec_conv3b.weight | 96, 96, 3, 3 | oihw | 331776
dec_conv3b.bias | 96 | x | 384
dec_conv2a.weight | 64, 128, 3, 3 | oihw | 294912
dec_conv2a.bias | 64 | x | 256
dec_conv2b.weight | 64, 64, 3, 3 | oihw | 147456
dec_conv2b.bias | 64 | x | 256
dec_conv1a.weight | 64, 80, 3, 3 | oihw | 184320
dec_conv1a.bias | 64 | x | 256
dec_conv1b.weight | 32, 64, 3, 3 | oihw | 73728
dec_conv1b.bias | 32 | x | 128
dec_conv0.weight | 3, 32, 3, 3 | oihw | 3456
dec_conv0.bias | 3 | x | 12
```

The mysterious “x” tensor layout is revealed: those are the biases! Not surprisingly, all the decoded tensors come in pairs of weights and biases. The weight tensors all seem to be 3×3 convolution windows, whereas I expected to see larger sizes. Earlier questions are replaced with new questions. This is progress!

## Conclusion and subsequent steps

I made some strides in the practical understanding of tensors and convolution, and how those become the foundation of a variety of DNN architectures referred to as Convolutional Neural Networks. I then bit the bullet and began extracting the minimal source code I need from the OIDN library. I know there will be more snippets to extract, but hopefully those will be more mathematical in nature, and less structural.

In the next episode I’m going to study how the CNN is connected and discover some new types of layers that I currently know nothing about. Finally, I’ll build a theoretical understanding of the U-Net architecture and how easily it can be ported to other problems and domains.

Hopefully you enjoyed this second episode. I’ll try to publish one new episode per week during the series. Stay tuned!

List of previous episodes: