Deep learning

An introduction for the layperson

Giorgio Sironi - SETI@eLife


Giorgio Sironi (@giorgiosironi)

Software Engineer in Tools and Infrastructure
What do I do
- Distributed systems
- Automated complex tests, integrating many different projects
- Continuous Delivery
- Pasta and risotto

We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.

John McCarthy (LISP), Claude Shannon (information theory), Marvin Minsky et al., 1956

The perceptron

Training a perceptron

x₁	x₂	ŷ	y	e
0	0	0	-0.1	0.1
0	1	0	0.2	-0.2
1	0	0	1.1	-1.1
1	1	1	0.9	0.1
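
The error column can be checked with a few lines of Python. This is only a sketch: the rows are copied from the table, while the names `rows` and `errors` are made up for illustration. Note that in this table ŷ is the expected output and y is the output the perceptron actually produced.

```python
# Rows copied from the table: (x1, x2, expected ŷ, actual y).
rows = [
    (0, 0, 0, -0.1),
    (0, 1, 0, 0.2),
    (1, 0, 0, 1.1),
    (1, 1, 1, 0.9),
]

# The error is the expected output minus the actual output: e = ŷ - y.
# Rounding to one decimal hides floating-point noise such as 0.0999...
errors = [round(expected - actual, 1) for _, _, expected, actual in rows]

for (x1, x2, expected, actual), e in zip(rows, errors):
    print(f"AND({x1}, {x2}): expected {expected}, got {actual}, error {e}")
```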

Derivatives

When we start from a random set of weights, we want to understand how to modify those weights when we are shown a new example, complete with input and expected output. If what we see here is an error function based on (ŷ - y), w₀ will place the output at some point on the curve, usually not where the error is at its minimum. For example, if AND(1, 1) outputs 0.4, the error is 0.6.

If we compute the derivative of the error function, we find that:
- it's 0 at a minimum
- it's positive if we have overshot a minimum
- it's negative if we are before a minimum

Therefore the derivative of the error with respect to a weight indicates how we should update that weight:
- do nothing if we are at a minimum, since any change will worsen the error
- subtract something from the weight if we are to the right of a minimum
- add something to the weight if we are to the left of a minimum

Keep in mind that these are all local minima: following the derivative does not guarantee reaching the lowest possible error.
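
To make the sign argument concrete, here is a minimal sketch under stated assumptions: a single-weight linear neuron with output w * x, a squared error (ŷ - w * x)², and an input and expected output both equal to 1. The squared error, the function names and the learning rate `eta` are choices made for this example, not something the slides prescribe. With w = 0.4 the output is 0.4 and the raw error is 0.6, as in the AND(1, 1) example.

```python
def error(w, x=1.0, expected=1.0):
    """Squared error of a one-weight linear neuron: (ŷ - w*x)²."""
    return (expected - w * x) ** 2


def derivative(w, x=1.0, expected=1.0):
    """Derivative of the squared error with respect to w: -2 * x * (ŷ - w*x)."""
    return -2.0 * x * (expected - w * x)


# With x = 1 and ŷ = 1 the minimum of the error is at w = 1.
left = derivative(0.4)      # before the minimum: negative, so add to w
minimum = derivative(1.0)   # at the minimum: zero, so do nothing
right = derivative(1.6)     # past the minimum: positive, so subtract from w

# Repeatedly stepping against the derivative moves w towards the minimum.
w = 0.4
eta = 0.1  # assumed learning rate
for _ in range(50):
    w -= eta * derivative(w)

print(left, minimum, right, w)
```

Subtracting `eta * derivative(w)` covers all three cases at once: the update is positive to the left of the minimum, negative to the right, and zero at the minimum itself.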

Perceptron learning rule

                    for each example (x, ŷ):
                        y = f(w * x) // * is the dot product of weights and inputs
                        e = ŷ - y // the error has the opposite sign to the derivative
                        for each weight i:
                            w_i = w_i + η * e * x_i // step ∝ -derivative
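
The rule can be tried out in runnable Python on the AND function. This is a sketch rather than a reference implementation, and several details are assumptions made here: f is a step function, the weights start at zero instead of random values (so the run is reproducible), η is 0.25, and a constant input of 1 is added so that the first weight acts as a bias.

```python
def step(s):
    """Activation f: fire (1) only when the weighted sum is positive."""
    return 1 if s > 0 else 0


def predict(weights, inputs):
    # Dot product of weights and inputs, passed through the activation.
    return step(sum(w * x for w, x in zip(weights, inputs)))


def train(examples, eta=0.25, epochs=20):
    weights = [0.0, 0.0, 0.0]  # bias weight, w1, w2 (assumed zero start)
    for _ in range(epochs):
        for inputs, expected in examples:
            e = expected - predict(weights, inputs)  # e = ŷ - y
            # Increment every weight in proportion to the error and its input.
            weights = [w + eta * e * x for w, x in zip(weights, inputs)]
    return weights


# AND truth table; the leading 1 is the constant input for the bias weight.
AND = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]

weights = train(AND)
predictions = [predict(weights, inputs) for inputs, _ in AND]
print(weights, predictions)
```

Each weight is incremented by η * e * x_i rather than overwritten, so a correctly classified example (e = 0) leaves the weights untouched.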

Linear boundaries

On to networks

Networks and backpropagation

In from three to eight years we will have a machine with the general intelligence of an average human being. -- Marvin Minsky, 1970

Deep neural networks

Here we see a convolutional neural network for the recognition of handwritten digits, one of the standard problems against which new approaches and models are measured. Each pixel of a 32x32 image of the digit has a luminance value from 0 to 1. The pixels feed the first layer, which is not fully connected but convolutional: convolution is a mathematical operation that makes each neuron of the next layer a combination of the 5x5 square of pixels around its position. The next layer is a subsampling layer, which divides the previous layer into 2x2 squares and takes the maximum (or another function) of the convolutional neuron values in each square. More layers are added on top until we get to the output layer, consisting of 10 neurons corresponding to the different classes, the 10 digits from 0 to 9. The output neuron with the maximum value wins, and the input image is classified as that digit.
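
The three operations described here (convolution over 5x5 patches, 2x2 subsampling, and picking the winning output neuron) can be sketched in plain Python. Everything concrete in this example is an assumption for illustration: the checkerboard image, the averaging kernel, the made-up output scores and the function names. A real network learns its kernels and has several feature maps per layer.

```python
def convolve(image, kernel):
    """Slide the kernel over the image without padding.

    Each output value combines one k x k patch of pixels. As in most
    neural network libraries, the kernel is not flipped.
    """
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]


def max_pool(feature_map, block=2):
    """Subsampling: keep the maximum of each block x block square."""
    size = len(feature_map) // block
    return [
        [
            max(feature_map[r * block + i][c * block + j]
                for i in range(block) for j in range(block))
            for c in range(size)
        ]
        for r in range(size)
    ]


def classify(scores):
    """The output neuron with the highest value decides the class."""
    return max(range(len(scores)), key=lambda digit: scores[digit])


# Hypothetical 32x32 input: a checkerboard with luminance 0 or 1.
image = [[(r + c) % 2 for c in range(32)] for r in range(32)]
kernel = [[1 / 25] * 5 for _ in range(5)]  # assumed 5x5 averaging kernel

feature_map = convolve(image, kernel)  # 28x28
pooled = max_pool(feature_map)         # 14x14

# Made-up values for the 10 output neurons, one per digit.
scores = [0.1, 0.0, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(len(feature_map), len(pooled), classify(scores))
```

A 5x5 kernel fits in 28 positions along each side of a 32x32 image, which is why the feature map shrinks to 28x28 before the subsampling halves it to 14x14.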

A lot to learn

possibly millions of weights
feature extraction: which and how many layers work best for this problem?
does the brain really work like that?

AlphaGo: supervised

On training set: This data set contains 29.4 million positions from 160,000 games played by KGS 6 to 9 dan human players

On training time and resources: The value network was trained for 50 million mini-batches of 32 positions, using 50 GPUs, for one week.
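
To get a feeling for the scale, the figures quoted above can be multiplied out. This is plain arithmetic on the quoted numbers; the variable names are illustrative, and the position count includes repetitions, since the same position may appear in more than one mini-batch.

```python
mini_batches = 50_000_000   # mini-batches used to train the value network
positions_per_batch = 32    # positions in each mini-batch
gpus = 50
days = 7                    # one week

positions_processed = mini_batches * positions_per_batch
gpu_days = gpus * days

print(f"{positions_processed:,} positions processed, {gpu_days} GPU-days")
```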

Alpha(Go) Zero: reinforcement

My conclusions

There is both hype and beauty
Big hardware and, unless it's a game, big data
We weren't able to do OCR or speech recognition; now they're normal