How Neural Turing Machines actually relate to Turing Machines

The (plain) Turing Machine (TM) served as the foundation of computer science by formalizing the concept of an algorithm, or even the very concept of a computer, in the sense of a device that computes. A specific TM has a finite set of states Q and a discrete memory: an infinite tape of cells in which symbols can be written and read, pre-filled with the (arbitrarily sized, but finite) input to process (a sequence of symbols from the finite set of allowed symbols \Gamma), followed by infinitely many blank symbols _. At any moment, the machine is in a specific state from Q and its head is looking at one memory cell. It then obeys a transition function, a rule such as « in state q, with symbol x under the head, replace x with y, then shift the head to the cell on the left (or right) ». At some point, it may reach a special state which accepts the input, or rejects it. Interestingly, some inputs may cause the machine to never reach such a special state, in which case it runs for eternity and beyond.
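To make this concrete, here is a minimal sketch of a TM simulator, with an assumed toy machine (the transition table `flip`, which inverts every bit of a binary input, is my own illustration, not from any reference):

```python
from collections import defaultdict

def run_tm(transitions, tape, state="q0", max_steps=1000):
    """Run a TM; transitions maps (state, symbol) -> (new_state, new_symbol, move)."""
    tape = defaultdict(lambda: "_", enumerate(tape))  # infinite tape of blanks
    head = 0
    for _ in range(max_steps):
        if state in ("accept", "reject"):
            return state, "".join(tape[i] for i in range(min(tape), max(tape) + 1))
        state, tape[head], move = transitions[(state, tape[head])]
        head += 1 if move == "R" else -1
    return "diverged", None  # some machines never halt on some inputs

# Toy machine: replace each 0 with 1 and vice versa, moving right until a blank.
flip = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "_"): ("accept", "_", "R"),
}
state, out = run_tm(flip, "101")  # accepts, leaving "010_" on the tape
```

Note how the machine itself is just the transition table; the simulator is the same for every machine, which is the seed of the "computer view" discussed below.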

This specific Turing Machine is analogous to the notion of an algorithm: the states are somewhat like lines of code (some inputs will reach the inside of an if statement, some won’t; some will make the algorithm reach return True, some return False), the symbols are the values that variables may take, and the tape is obviously the memory where those values are stored. It has been shown that any algorithm in any programming language can be translated into a TM, and vice versa (given that the language is Turing-complete, meaning it is as expressive as the TM). One can also imagine a generic TM which takes another TM as input and runs it, hence the computer view of the Turing Machine, but we will focus on the algorithmic view. Algorithms in general often output values more complicated than just true or false, but it is straightforward to reduce such an algorithm to a decision algorithm (deciding whether or not the input is accepted). For instance, one could change What does 3 times 5 equal? into Does 3 times 5 equal 15? This greatly simplifies complexity analysis and, talk about lucky! such decision algorithms are very well suited for classification in machine learning.
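The reduction from a function problem to a decision problem can be sketched in two lines (the function names here are my own, purely illustrative):

```python
def multiply(a, b):
    # function problem: "what does a times b equal?"
    return a * b

def decide_product(a, b, c):
    # decision problem: "does a times b equal c?"
    return multiply(a, b) == c

# By asking "does a*b equal c?" for candidate values of c, one can recover
# the answer to the function problem, so the two are essentially as hard.
```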

This wooden Turing Machine is, sadly, not differentiable end-to-end.

The Neural Turing Machine (NTM), invented by Graves, Wayne and Danihelka in their paper Neural Turing Machines, is a machine learning structure designed to learn to solve typically algorithmic tasks. In principle, an NTM aims at doing the same work as a specific Turing Machine, but where every single operation is differentiable. That way, for a given algorithmic problem (such as sorting, or « is this list sorted or not? ») we could start with an arbitrary TM, run it on some inputs (both sorted and unsorted), compute the loss (how poorly we solved the problem on these inputs) along with the gradient of the loss with respect to every parameter, then change every parameter in the direction opposite to the gradient for the next batch of inputs, in the hope of converging to a TM which can actually answer the question well.

In practice, NTMs do not use the same formalism as TMs, although they are Turing-complete. Instead they are made of several components, all enclosed in the dashed square in the figure below:

Neural Turing Machine, figure taken from the paper

I will analyze each component in light of the aforementioned paper, my subconscious knowledge of Turing Machines, and the following PyTorch implementation of an NTM:

The controller


This is the neural part of the NTM: a simple neural network. The authors mention it can be vanilla, recurrent (RNN) or Long Short-Term Memory (LSTM); the latter is used in the implementation. RNNs and LSTMs have the advantage of having their own little memory in addition to the large external memory, but this is akin to having registers in the processor in addition to the RAM: although practical, it will likely only complicate analysis. Therefore I will assume the use of a vanilla neural network. What matters is simply that the controller is a differentiable function approximator. It takes a batch of external inputs (within one batch, all inputs (e.g. unsorted lists) have the same length, in order to have a clean tensor, but the length can change from batch to batch (or can it? this is unclear in the implementation)), and outputs an internal embedding, which is rather abstract (in the implementation, a controller state is also given as input and returned as output, but this is specific to RNNs/LSTMs). This embedding is used for both memory interactions and output. The figure shows an arrow going from the read head to the controller: in fact, the reading is both combined with the embedding to create the output, and used to alter the embedding for the next forward pass. Therefore the following modified figure may be more accurate:

Such a pretty picture

In short, the controller creates an embedding from the data and some parts of the memory. This embedding is given to every write head in order to alter the memory for the next forward passes, as well as to a special linear layer that combines it with a part of the memory to generate an output.

The controller plays the role of the TM transition function. Indeed, from the input and the memory, it says how to alter the memory for future reference. The TM transition function also specifies which state to move to afterwards; here there are no explicit states, but states can be encoded in the memory. For instance, one read head could tell which state we are in, and another which symbol is read from the tape. The reading used for the output can then act as an accepting or rejecting state. Importantly, this leaves the number of implicit states unbounded, so the NTM can simulate an arbitrarily large TM. The vast majority of the parameters to learn are those of the controller neural network.

The read and write heads


As we were focusing on the controller, we left aside the question of what read and write heads actually are in NTMs. Read and write heads are for the most part linear functions (therefore they have a weight matrix \mathbf{W} and a bias \mathbf{b} to be learned) mapping the embedding \mathbf{x} to parameters that specify where to go in the memory (\mathbf{k}, \beta, g, s and \gamma, present in all heads) and what to change there (\mathbf{e} and \mathbf{a}, for write heads only). See the Memory section for the use of each parameter.

Linear function of both types of heads:

    \[\mathbf{x} \rightarrow \mathbf{W} \mathbf{x} + \mathbf{b} = ( \mathbf{k} ,\beta,g,s,\gamma, \mathbf{e}, \mathbf{a})\,\]

where (\cdot, \cdot) denotes concatenation. The vector parameter \mathbf{k} takes the first M entries (see the Memory section), the scalars \beta and g the following two, s the following three (as it is a tuple of size 3), and the scalar \gamma the next one. For read heads, \mathbf{e} and \mathbf{a} have size 0, and for write heads they both have size M (meaning write heads need more learned parameters!).
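A numpy sketch of this splitting, assuming M = 4 and a shift tuple of size 3 (the function name and sizes are illustrative, not taken from the implementation):

```python
import numpy as np

M = 4  # assumed memory width for this sketch

def split_head_output(out, is_write):
    """Split the linear map's output W x + b into (k, beta, g, s, gamma[, e, a])."""
    k = out[:M]                      # key for content addressing (size M)
    beta = out[M]                    # key strength (scalar)
    g = out[M + 1]                   # interpolation gate (scalar)
    s = out[M + 2:M + 5]             # shift weighting (tuple of size 3)
    gamma = out[M + 5]               # sharpening exponent (scalar)
    if not is_write:
        return k, beta, g, s, gamma  # read heads stop here (e and a have size 0)
    e = out[M + 6:2 * M + 6]         # erase vector (size M, write heads only)
    a = out[2 * M + 6:3 * M + 6]     # add vector (size M, write heads only)
    return k, beta, g, s, gamma, e, a

rng = np.random.default_rng(0)
out = rng.normal(size=3 * M + 6)     # stand-in for W x + b of a write head
k, beta, g, s, gamma, e, a = split_head_output(out, is_write=True)
```

The write head's output is 2M entries longer than the read head's, which is where the extra learned parameters come from.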

The decision of where to go in the memory also takes into account where the head pointed during the last forward pass, meaning similar embeddings can lead to different outcomes at different steps of the computation. The last location \ell could be stored as a state in the head, but in the implementation it is kept outside the head and given to the forward function. When any kind of head asks the memory where to focus, as a function of \mathbf{k}, \beta, g, s, \gamma and its last location \ell, it receives the new location \ell', which is saved for the next forward pass.

Read heads read (see the Memory section) from the location \ell' and return \mathbf{r}, while write heads write (see the Memory section) at location \ell', using \mathbf{e} and \mathbf{a}.

I should get a better tool than paint

Just like in a TM, the heads find which cell to read or write as a function of where they previously were, the state and the symbol read (as given by the controller). The heads encapsulate the concept of being at a certain location.

The memory


The third and last main component of the NTM is the memory. In the paper, the memory in itself is only a rather large N \times M (funny how saying N times M resembles saying NTM!) matrix, which we will denote by \mathcal{M}. N is the total, fixed number of memory cells, and M is the length of each cell (since cells contain vectors). The memory is initialized with the equivalent of blank characters: zeros plus a little noise. Then, at each computation step, the matrix can be altered by the write heads.
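The initialization can be sketched in a couple of lines, assuming illustrative sizes N = 128 and M = 20 (the noise scale is my own guess, not taken from the paper):

```python
import numpy as np

N, M = 128, 20  # assumed number of cells and cell width
rng = np.random.default_rng(42)

# "Blank" memory: zeros plus a little noise, altered later by the write heads.
memory = np.zeros((N, M)) + 0.01 * rng.normal(size=(N, M))
```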

What crucially distinguishes TMs and NTMs is that operations on the latter must be differentiable. But reading from and writing to memory cells are fundamentally discrete operations. In NTMs the global strategy is to read everywhere at once instead of in just one cell, and to focus on the right cell with a softmax function. Of course, in practice it is a little more complicated:

Memory.address uses the arguments \mathbf{k}, \beta, g, s, \gamma and \ell to output \ell', through four auxiliary functions:

def address(k, beta, g, s, gamma, ell):
    c = similarity(k, beta)        # content focus: which cells resemble k
    i = interpolate(ell, c, g)     # mix content focus with the last location
    h = shift(i, s)                # move the focus left/right (convolution)
    ell_new = sharpen(h, gamma)    # undo the blurring introduced by the shift
    return ell_new

similarity(k, \beta) returns a vector c giving a large weight to the cell that most resembles the key k and small weights to all the others. It uses cosine similarity and a softmax (parameterized by \beta) to isolate the winner.
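A numpy sketch of this step, under the assumption that memory rows are cells (the toy 3×2 memory below is purely illustrative):

```python
import numpy as np

def similarity(k, beta, memory):
    """Cosine similarity of k to each memory row, sharpened by a beta-scaled softmax."""
    eps = 1e-8  # avoids division by zero for near-blank cells
    cos = memory @ k / (np.linalg.norm(memory, axis=1) * np.linalg.norm(k) + eps)
    w = np.exp(beta * cos)
    return w / w.sum()  # a probability distribution over the N cells

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.7, 0.7]])
c = similarity(np.array([1.0, 0.0]), beta=10.0, memory=memory)
# the first row matches the key best, so c[0] dominates
```

A large \beta makes the distribution nearly one-hot; a small \beta spreads the focus over several cells.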

interpolate(\ell, c, g) returns the following: i := gc + (1-g)\ell. In other words, g lets the head choose between staying exactly where it was (g=0), warping to where the memory most resembles \mathbf{k} (g=1), or keeping track of both locations (0<g<1). I believe always using g=0 (and never using similarity) would theoretically be enough to simulate a TM, but g=1 enables content search, which is very practical (although it may complicate analysis).

shift(i, s) simply allows the focus of i to move to its neighbour on the right or on the left. As a convolution (parameterized by s) is used in order to stay differentiable, the focus can become blurred in the resulting h.

sharpen(h, \gamma) applies one last simple normalization (resembling a softmax and parameterized by \gamma) to h in order to undo this blurring. The end result is an N-sized vector \ell' with a large weight in one coordinate and very small weights in all the others.
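The three location-based steps can be sketched in numpy, assuming a shift tuple s over the offsets (-1, 0, +1) and circular shifts at the tape's edges, as in the paper (the composition and toy values below are my own illustration):

```python
import numpy as np

def interpolate(ell, c, g):
    # g = 0: keep the old location; g = 1: jump to the content focus c
    return g * c + (1 - g) * ell

def shift(i, s):
    # circular convolution; s holds the weights for offsets (-1, 0, +1)
    return s[0] * np.roll(i, -1) + s[1] * i + s[2] * np.roll(i, 1)

def sharpen(h, gamma):
    # raise to the power gamma and renormalize to undo the shift's blur
    h = h ** gamma
    return h / h.sum()

def address(c, g, s, gamma, ell):
    # c is the output of the similarity step (unused here since g = 0)
    return sharpen(shift(interpolate(ell, c, g), s), gamma)

# With g = 0 the head ignores content and just moves relative to ell,
# exactly like a TM head stepping one cell to the right:
ell = np.array([0.0, 1.0, 0.0, 0.0])   # focused on cell 1
s = np.array([0.0, 0.0, 1.0])          # all weight on the "+1" offset
ell_new = address(c=ell, g=0.0, s=s, gamma=2.0, ell=ell)
# the focus moves from cell 1 to cell 2
```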

In short, the addressing mechanism can either return the cell left or right of the previous cell (like in a TM), or warp to a cell with a specific known value and go right or left from there.

Memory.read outputs the content at the cell(s) given by \ell', with a simple product:

    \[\mathbf{r} = \mathcal{M}^\top \ell'\,\]

Memory.write takes the arguments \ell', \mathbf{e} and \mathbf{a}. Each cell (of size M) is first partially erased: it is multiplied elementwise by 1 minus the M-sized, [0,1]-valued erase vector \mathbf{e} scaled by the entry of \ell' corresponding to this cell (we don’t want to erase the old content of non-focused cells, but we have to erase it a little to remain differentiable). Then a similar strategy adds a new value at these locations: the M-sized, real-valued vector \mathbf{a}, scaled by the corresponding entry of \ell', is added to the remaining value in the cell.
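Both operations fit in a few lines of numpy, assuming memory rows are cells and \ell' is a near-one-hot vector of attention weights (the toy values are illustrative):

```python
import numpy as np

def read(memory, ell_new):
    # soft read: a weighted sum of all cells, dominated by the focused one
    return memory.T @ ell_new

def write(memory, ell_new, e, a):
    # erase: fade each cell by (1 - ell'_i * e), then add ell'_i * a
    erased = memory * (1 - np.outer(ell_new, e))
    return erased + np.outer(ell_new, a)

memory = np.zeros((4, 3))
ell_new = np.array([0.0, 1.0, 0.0, 0.0])  # focused on cell 1
memory = write(memory, ell_new, e=np.ones(3), a=np.array([1.0, 2.0, 3.0]))
r = read(memory, ell_new)  # recovers [1., 2., 3.] from cell 1
```

Note that non-focused cells (where \ell'_i \approx 0) are multiplied by almost 1 and have almost nothing added, so they are left essentially untouched.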

Of course the read and write operations differ from the TM’s by being neither discrete nor local, but this is the price to pay to be optimizable. Also, the memory is not infinite, but hey, neither are those of our computers!

Questions about NTMs

We just saw the inner workings of Neural Turing Machines and how they relate to vanilla Turing Machines. But some questions remain.

Can NTMs run for eternity, just like TMs? Because of the global structure of the NTM, my answer would be that it depends. In a way, we get an output at every forward pass, which means at every loop of the program, so the program never actually runs for a long time. At each step this output (hopefully) gets closer to the right answer, but it never quite reaches it for sure. So an NTM would run for infinity if you wanted it to be 100% sure of its output.

Could we make a Universal NTM, i.e. an NTM that takes as input another NTM together with an input for that NTM, and returns whether that NTM accepts its input? I am not sure it is infeasible in principle, but I find it hard to see how the controller could build an embedding from a whole algorithm without deleting crucial information.
