Blog - HubSpot Product Team

Misnomers and Confusing Terms in Machine Learning

Written by Marco Lagi | Feb 18, 2021

Like every other field, machine learning has its fair share of misnomers and confusing terms that, for one reason or another, have stuck. Usually that’s not a problem: meaning follows usage. It’s still useful for newcomers to be aware of them, though, so that these terms don’t propagate implicit wrong assumptions about the underlying concepts. 

Disclaimer: At HubSpot, we use many of the terms listed below within the same context as the machine learning community at large. There is undeniable value in being able to share a common vocabulary.

Introduction

Strawberries are not berries, the funny bone is a nerve, and Greenland is... mostly not green. Naming things is surprisingly difficult: words and concepts can be slippery. But changing a name is even harder. Evidence for this is the widespread presence of misnomers that have stuck in all fields of knowledge. While it’s easy to find lists of these terms for dermatology, culinary arts, flutes, and many other topics, I couldn’t find a good one for machine learning, and decided to put together this list with the 25 most confusing terms.

A misnomer does not make the current usage of the term incorrect, because meaning follows usage and it evolves through space and time. Most of these terms are here to stay, and that’s OK. While their use has no practical consequences for experts, it can be argued that it’s riskier for newcomers. This is particularly true for people who come from a different background, where those terms might refer to a related concept, and create uncanniness or downright confusion. So hopefully this post doesn’t come across as a case of “Old man yells at cloud,” but more as a reference for disambiguation.

One could start with the name of the discipline itself. While I think machine learning is a solid name, the names of the adjacent fields of artificial intelligence (in its current meaning), data mining, and data science are all pretty puzzling. There’s not much intelligence in cognitive automation, the mining refers to the patterns and not the data, and generic data cannot be the subject of science. The debates around these terms are already in the spotlight, so I’ll set these aside and focus on more technical terms.

For the terms listed below, the machine learning (ML) definition is taken from canonical ML textbooks when possible (Bishop, Murphy, Goodfellow, etc.). Each term includes a “Confusion” section, which gives some context around the ambiguity, and an “Alternative” section. The latter should just be taken as a potential way to disambiguate the term in your head (i.e. what could have been) rather than advocacy for change (i.e. what should be).

Here's a handy table of contents if you'd like to skip to a certain section.

Section 1. Statistical terms

1.1 Multinomial distribution

1.2 Inference

1.3 \(R^2\)

1.4 Multi-armed bandit

1.5 Regression / Logistic regression

Section 2. The “model” catch-all

2.1 ML model / algorithm

2.2 Model drift

2.3 Black-box model

2.4 Non-parametric model

Section 3. Optimization terms

3.1 Learning rate

3.2 Stochastic gradient descent

3.3 Momentum

3.4 Backpropagation

Section 4. Losses and activations

4.1 Cross-entropy

4.2 Softmax / Hardmax / Softplus

4.3 Softmax loss

Section 5. Neural networks

5.1 Neural networks

5.2 Multi-layer perceptron

5.3 Input layer

5.4 Hidden layer

Section 6. Deep learning

6.1 Tensor

6.2 Convolutional layer

6.3 Deconvolutional layer

6.4 BERT

6.5 RNN gates

Conclusion

References

Section 1. Statistical terms

1.1 Multinomial distribution

Confusion. In probability theory, the discrete distributions for \(k\) categories and \(n\) trials are commonly referred to as:

  • \(k=2\),  \(n=1\) -- Bernoulli distribution
  • \(k=2\),  \(n>1\) -- binomial distribution
  • \(k>2\),  \(n=1\) -- categorical distribution
  • \(k>2\),  \(n>1\) -- multinomial distribution

In practical terms, the Bernoulli distribution is sampled by a single coin flip, the binomial by multiple coin flips, the categorical by a single die roll, and the multinomial by multiple dice rolls (Murphy 2012, page 35).

In machine learning, especially in natural language processing, the categorical distribution is often referred to as the multinomial distribution. For example, in order to generate text from a language model, practitioners tend to use the more convoluted multinomial on one trial followed by argmax (Chollet 2018, page 276), instead of sampling from a categorical distribution, which would directly give the index of the class (although possibly slower). Given that np.random.categorical doesn’t exist, I wonder if the semantics and performance of the APIs have contributed to keeping the categorical / multinomial conflation going, or maybe the fact that the output of the multinomial with \(n=1\) is a one-hot encoded vector makes it more palatable.
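The two sampling routes described above can be sketched with NumPy. This is an illustration, not the code from the cited book: the probability vector `probs` is made up, and using the generator's `choice` method as the categorical sampler is my own assumption about what a direct draw would look like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token probabilities over a 4-word vocabulary.
probs = np.array([0.1, 0.2, 0.3, 0.4])

# Route 1: one multinomial trial gives a one-hot vector; argmax recovers the index.
one_hot = rng.multinomial(1, probs)
index_via_multinomial = int(np.argmax(one_hot))

# Route 2: draw the class index directly, i.e. a categorical sample.
index_via_choice = int(rng.choice(len(probs), p=probs))

print(one_hot, index_via_multinomial, index_via_choice)
```

Both routes return a valid class index; the first one just takes a detour through a one-hot vector.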

Alternative. There was already an established name for the categorical distribution; we could have picked that.

1.2 Inference 

Confusion. In statistics, inference is the process of estimating properties of the unknown distribution that generates the data at hand, \(P(X, Y)\), where \(X\) are the features and \(Y\) the labels. The goal is to estimate a function \(f(X)\) such that (James 2017, page 16):

\[Y = f(X) + \epsilon \]

where \(\epsilon\) is the error term. In the context of inference, \(f\) is used to understand how the labels are affected by changes in the features. The output of an inference can be either constructing confidence intervals or testing hypotheses about the population parameters (Mann 2010, page 439).

In machine learning, estimating properties of \(P(X,Y)\) is referred to as training or learning. Inference instead (especially in the deep learning literature) is often reserved for predicting the labels given the features, using an already trained model (Goodfellow 2016, page 262). Therefore, training time and inference time are the two main modes of a model (Goodfellow 2016, page 450).  

Alternative. The consensus seems to be that prediction would have been a better choice for estimating the labels given the features (James 2017, page 17). 

1.3 \(R^2\)

Confusion. After inadvertently choosing the wrong model, you fit your data \(y_i\) obtaining predictions \(f_i\), and check the coefficient of determination, \(R^2\). And you find out it’s negative. Aside from the fact that you could probably use a better statistic, how in the world is a squared real number negative? There are many definitions of the coefficient of determination, but the most general and common is (Mann 2010, page 584):

\[R^2 = 1 - {\sum_i(y_i-f_i)^2 \over \sum_i(y_i-\overline{y})^2}\]

All it takes for \(R^2\) to be negative is for the model to be worse than the trivial baseline, i.e. a horizontal straight line corresponding to the average of the data, \(\overline{y}\). It is not necessarily defined as the square of anything. The reason why it’s denoted as \(R^2\) is that, under particular conditions, it coincides with the square of the multiple correlation coefficient, \(R\) (Wright 1921, page 574). 
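To make this concrete, here is a minimal sketch of the definition above on made-up numbers. The data and the deliberately bad model are assumptions chosen only to make the arithmetic easy to follow.

```python
def r_squared(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                # made-up observations
baseline = [sum(y) / len(y)] * len(y)   # always predict the mean
bad_model = [4.0, 3.0, 2.0, 1.0]        # a model that gets the trend backwards

r2_baseline = r_squared(y, baseline)
r2_bad = r_squared(y, bad_model)
print(r2_baseline, r2_bad)  # prints 0.0 -3.0
```

The trivial baseline scores exactly zero, and anything worse than it lands below zero, with no square root of a negative number in sight.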

Alternative. Any other symbol would have worked, possibly excluding \(\sqrt{R^4}\). Its inventor, Sewall Wright, used \(d\) (Wright 1921, page 562).

1.4 Multi-armed bandit

Confusion. I get why slot machines are also called one-armed bandits: they have one lever, they steal your money. But I’ve never seen a slot machine with multiple levers, so calling multi-armed bandit the problem of choosing between different options with partial information (Murphy 2012, page 184) doesn’t really hit the spot. It’s an analogy with no correspondence to common experience.

As Andrew Gelman puts it, another reason why this is an “obscure and overly cute” term is that slot machines have negative expected value, so the best strategy would be not to play. 

Alternative. In the same note, Gelman argues that it’s really just many one-armed bandits, so that could have been an option. Or maybe, since it’s a reinforcement learning problem with actions and rewards but no features, something like stateless reinforcement learning could have worked.

In the HubSpot product we have implemented multi-armed bandits so that our users can test different webpage variations. When we asked our users to describe what they thought a feature with that name might do, these were some of the responses:

“Create an octopus bank robber.”

“Sounds like it steals bits of information; I'd be hesitant to utilize it for privacy concerns.”

And the best one:

“Honestly, that name is a little too "millennial," even for me, a millennial. HubSpot is an expensive professional business tool, not a toy.”

That’s why we ended up calling it adaptive tests.   

1.5 Regression / Logistic regression

Confusion. Supervised learning is usually split into classification, prediction of a categorical variable, and regression, prediction of a continuous variable (Murphy 2012, page 3). So why is logistic regression used for classification problems? Because logistic regression is a regression in the statistical sense, but not in the machine learning sense.

First, logistic regression is definitely not a classifier, in that it does not output a class but probabilities (i.e. a continuous output). It’s only when a decision rule (e.g. a probability threshold) is chosen and applied to the output that we can convert one to the other, but this step is not part of the modeling process.

Second, it can be argued that the problem is not with the term itself, but with the use of regression as the prediction of a continuous variable. In statistics, this term is defined as the estimation of the relationships between a dependent variable and one or more independent variables (Mann 2010, page 565). This means that, in statistics, regression:

  1. Refers to the analysis rather than the goal. The goal can indeed be to predict a quantity, but it might also be to explain a relationship. 
  2. Does not require the dependent variable to be continuous. It could very well be categorical.

Alternative. Calling it a logistic model might avoid all this ambiguity (James 2017, page 131).

Section 2. The “model” catch-all 

This section includes terms where the word model is used in place of a more appropriate, or at least less ambiguous, noun.

2.1 ML model / algorithm 

Confusion. The standard definition of an ML model is distinct from that of an ML algorithm (Murphy 2012, page XXVIII). It is the artifact created by the training process, i.e. an algorithm trained to recognize certain types of patterns (see also Microsoft and Amazon API docs). And yet the concepts of model and algorithm are usually not differentiated. Sometimes the former definition is considered too narrow, and ML model is meant to be the ML pipeline deployed to production, including feature transformation (Schelter 2018).

To add to the confusion, the term machine learning algorithm is also pretty ambiguous (Deisenroth 2020). One might be talking about the system that makes predictions on input data (i.e. the estimator), or the system that allows the parameters of the estimator to minimize empirical risk (i.e. the training process).

Alternative. The terms machine learning model / algorithm are intrinsically fuzzy and hard to pin down. Providing more context when they are used would probably be enough to disambiguate.

2.2 Model drift 

Confusion. Several practitioners, both in industry and academia, refer to the degradation of a model’s performance over time due to a shift in the feature and/or label distributions as model drift (e.g. Nelson 2015, Kang 2018). This might be confusing because the model isn’t drifting at all. In fact, that’s the whole problem! As the features or labels shift away from the training distributions, the model doesn’t follow the drift.

Alternative. The terms covariate shift and data drift typically refer to changes in the distribution of features, so they’re not a great substitute for the more general phenomenon. The term concept drift might work (Žliobaitė 2010), although it is often reserved for shifts in the distribution of labels.

Given that an example in machine learning is defined as “an instance (with its features) and a label,” example drift or generically distribution drift may have worked better.

2.3 Black-box model 

Confusion. The term black box model is often used to refer to the lack of interpretability or explainability of a machine learning model (Murphy 2012, page 585). This is especially true in deep learning, where models are non-linear and have many parameters. But even the most basic model (say a linear regression with two features) can be fairly hard to interpret if the features are correlated, or if they have different units (King 1986, page 669). Nevertheless, I think black box model is not a great term, not because of the black box part, but because of the model part. 

The internal workings of machine learning models are in general entirely transparent. One could follow the computation arbitrarily closely, without finding anything hidden or uninspectable about how the model produces its output. Models are clear boxes. It’s true that we have yet to understand in a reductionist way how features are globally combined to reach a prediction. But this is evidence that those models are emergent complex systems, not that they are inscrutable. 

Alternative. What’s really missing is not the logic of how we get from the input to the output, but how that logic maps to our own causal narratives. So black-box mapping instead of black box model here might have been a more accurate name, albeit pretty goofy.

There are actually a couple of cases where the term black box model is completely appropriate: either when the model is not available for inspection, or when black box doesn’t refer to the algorithm itself, but to the system the algorithm is trying to model. These cases are not how the term is commonly used. 

By the way, black box is also definitely a misnomer for its own traditional reference, the flight recorder, which has to be brightly colored for obvious reasons.

2.4 Non-parametric model 

Confusion. There are actually two things that can be confusing about a non-parametric model. The first one is the non-parametric part, which doesn’t mean the model has no parameters. It means that the number of parameters is not fixed: it grows with the size of the training set, like in k-nearest neighbors (Goodfellow 2016, page 115) or uncapped decision trees (Murphy 2012, page 270).

The second one is the model part, as usual a loaded term. In statistics it can refer to the data generation model or to the estimator, while in machine learning it usually refers only to the estimator. So one can have a dataset produced by a non-parametric data generation model fitted by a parametric machine learning model, and vice versa.

Alternative. Maybe non-parametric algorithm would cut the ambiguity in half… or would it?

Section 3. Optimization terms

3.1 Learning rate

Confusion. The learning rate is regarded as one of the most important levers to tweak during training: “If you have time to tune only one hyperparameter, tune the learning rate.” (Goodfellow 2016, page 429). In machine learning and optimization, the gradient descent algorithm can be written as (Goodfellow 2016, page 85), 

\[w_{t+1} - w_t = -\epsilon \nabla E(w_t)\]

where the weights \(w\) that minimize \(E\) are to be estimated, and \(\epsilon\) is the learning rate. While it’s true that when this parameter is very small the model learns slowly, there is no necessary connection between a large value of \(\epsilon\) and learning quickly. If the rate is set to a value that is too large, the system might become unstable, and not learn at all (Goodfellow 2016, page 295). If we interpret “rate” as meaning the amount of learning done per iteration, then the real learning rate should be the derivative of the generalization error w.r.t. time, which can change even if \(\epsilon\) is kept constant. The metaphor leaks left and right.
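A toy run shows how loosely \(\epsilon\) tracks the speed of learning. The sketch below applies the update rule to the one-dimensional loss \(E(w) = w^2\); the loss, the starting point, and the three values of \(\epsilon\) are all assumptions picked for illustration.

```python
def gradient_descent(epsilon, w0=1.0, steps=20):
    """Minimize E(w) = w**2 with the update w <- w - epsilon * dE/dw."""
    w = w0
    for _ in range(steps):
        w = w - epsilon * 2 * w  # dE/dw = 2w
    return w

for epsilon in (0.01, 0.4, 1.1):
    w_final = gradient_descent(epsilon)
    print(f"epsilon={epsilon}: |w| after 20 steps = {abs(w_final):.3g}")
```

The smallest value crawls towards the minimum at zero, the middle one gets there quickly, and the largest one moves further away at every step, so "more rate" does not mean "more learning".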

Alternative. Maybe just step size, as it is sometimes known (Goodfellow 2016, page 311), would have been a better name for \(\epsilon\), although the whole term \(\epsilon \nabla E (w_t)\) might be a better candidate for step size.

3.2 Stochastic gradient descent

Confusion. The idea of stochastic gradient descent (SGD) is to apply the gradient descent algorithm on a random example (or minibatch) of the training data (Goodfellow 2016, page 152). The gradient computed on that example is an unbiased estimator of the full gradient, so, in expectation, one recovers the full gradient descent. While this is enough to guarantee convergence under suitable conditions on the step size, it means that the vector returned by SGD is not guaranteed to be the descent direction of the loss landscape at that point. In general, the sometimes-descent will be along a deviated direction that depends on the particular example (or minibatch) at hand.
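Both halves of that statement can be checked on a tiny made-up problem. In the sketch below, each example \(i\) has the assumed loss \(\frac{1}{2}(w - t_i)^2\), so everything is one-dimensional and easy to verify by hand.

```python
targets = [0.0, 0.0, 3.0]  # made-up data; example i has loss 0.5 * (w - t_i)**2

def example_gradient(w, t):
    return w - t

def full_gradient(w):
    return sum(example_gradient(w, t) for t in targets) / len(targets)

def full_loss(w):
    return sum(0.5 * (w - t) ** 2 for t in targets) / len(targets)

w = 0.5
epsilon = 0.1

# Averaged over all examples, the per-example gradient equals the full gradient.
per_example = [example_gradient(w, t) for t in targets]
mean_of_examples = sum(per_example) / len(per_example)

# But a step along a single example's gradient can move uphill on the full loss.
w_after = w - epsilon * example_gradient(w, targets[0])

print(per_example, full_gradient(w))
print(full_loss(w), full_loss(w_after))
```

Here the first example pulls \(w\) towards zero while the full gradient points the other way, so that particular "descent" step increases the overall loss.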

Alternative. SGD does add stochasticity to the GD algorithm, so in that sense the term works. But since it’s part of the bigger family of stochastic approximation methods, maybe it would have been less confusing if approximation wasn’t dropped: Stochastic Approximation of General Gradient of at-this-point-Expected Descent (SAGGED).

3.3 Momentum

Confusion. This is mostly for people coming from physics, where the momentum is the product of mass and velocity of an object, \(p=mv\). In machine learning and optimization, the most naive member in the family of gradient descent algorithms is the steepest descent, i.e. the plain gradient descent update shown in Section 3.1.

This can be very slow. In particular, when the system finds itself in a ravine of the loss landscape, it starts oscillating mostly in the direction of the steep walls, making little progress over time. The authors of the original backpropagation paper (Rumelhart 1986) therefore introduced a momentum term:

\[\Delta w_t = -\epsilon \nabla E (w_t) + \alpha \Delta w_{t-1}\]

The analogy here is clear: \(\Delta w_{t-1}\) is the speed in the weight space, the mass of the object is set to \(1\), and the hyperparameter \(\alpha \in [0, 1)\) controls the importance of the term. While the authors didn’t give a name to this last parameter in the original paper, it is usually referred to as the momentum hyperparameter (Goodfellow 2016, page 298), which then is often truncated to momentum (Chollet 2018, page 51). But using a synecdoche here leads to weird consequences, where a “momentum” \(\alpha\) is multiplied with the “speed” \(\Delta w_{t-1}\) to give… kinetic energy? 
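The effect of the momentum term in a ravine can be sketched numerically. The quadratic loss below, with one shallow and one steep direction, as well as the starting point and the values of \(\epsilon\) and \(\alpha\), are assumptions chosen for illustration; setting \(\alpha = 0\) recovers plain steepest descent.

```python
def grad(w):
    """Gradient of the ravine-like loss E(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)."""
    return [w[0], 100.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def descend(alpha, epsilon=0.019, steps=200):
    """Iterate delta_w_t = -epsilon * grad E(w_t) + alpha * delta_w_{t-1}."""
    w = [1.0, 1.0]
    delta = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        delta = [-epsilon * gi + alpha * di for gi, di in zip(g, delta)]
        w = [wi + di for wi, di in zip(w, delta)]
    return loss(w)

loss_plain = descend(alpha=0.0)
loss_momentum = descend(alpha=0.9)
print(loss_plain, loss_momentum)
```

With the same \(\epsilon\) and the same number of steps, the run with \(\alpha = 0.9\) ends at a much lower loss, because the accumulated \(\Delta w_{t-1}\) keeps pushing along the shallow floor of the ravine.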

Alternative. It can be demonstrated that there is a deeper analogy between the equation above and the Newtonian equation for a point mass \(m\) with coordinates \(w\) moving in a medium with friction coefficient \(\mu\) (Qian 1999), which leads to 

\[\alpha = {m \over m + \mu}\]

which means that \(\alpha\) plays the role of an effective mass, not momentum. So if one sets the mass to \(1\) and writes the hyperparameter as a reciprocal, a more appropriate name could be friction coefficient. While \(\alpha\) could have a better name, it’s amazing that the name for the momentum term \(\alpha\Delta w_{t-1}\) was so well given that it’s even more precise than the authors’ original intent. Not only does it look like a momentum; under particular circumstances, it is one. 

3.4 Backpropagation

Confusion. While the name is pretty apt for the algorithm that efficiently computes the gradient of the loss function with respect to the weights of the network (Rumelhart 1986), it might be a bit too generic. There are a number of alternatives that still rely on propagating a signal backwards through the network, starting from the output layer. For example, backpropagating desired states (Plaut 1986), target values (Lee 2015), perturbations (Scellier 2017), activations (Choromanska 2019), etc. 

There’s a second source of ambiguity, where this term is often used to include not only the algorithm to compute the gradient, but also the whole learning algorithm, including, for example, the gradient descent part (Goodfellow 2016, page 204).

Alternative. For the first source of confusion, maybe loss gradient backpropagation might have worked better? The second usage just seems unnecessary.

Section 4. Losses and activations

4.1 Cross-entropy 

Confusion. In information theory, the cross-entropy for a probability mass function \(q\) relative to another distribution \(p\) with the same support is defined as (Goodfellow 2016, page 75):

\[H(p,q) = -\sum_x p(x)\log q(x)\]

In machine learning, this is often used as a loss function, where \(p(x)\) represents the empirical distribution of the label, and the given distribution \(q(x)\) is the predicted value of the current model. There are no further assumptions on the particular distributions at hand. And yet, several authors use this term only when \(q(x)\) is the output of a logistic or softmax function (Goodfellow 2016, page 132), that is to say either a Bernoulli or a Boltzmann distribution (Bishop 2006, page 206). The generic negative logarithm of the likelihood function is instead referred to as an error function (Bishop 2006, page 23). 
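As a small sketch of the general definition, the function below accepts any pair of distributions on the same support, with no reference to a logistic or softmax output. The numbers are made up, and skipping the terms with \(p(x) = 0\) is a convention I am assuming here.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), skipping terms where p(x) = 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.0, 1.0, 0.0]   # empirical distribution of the label (one-hot)
q = [0.2, 0.7, 0.1]   # made-up model prediction

h = cross_entropy(p, q)
print(h, -math.log(q[1]))  # the two values coincide for a one-hot p
```

For a one-hot \(p\), the sum collapses to the negative log-probability that the model assigns to the observed label, which is the form most often seen as a loss.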

Alternative. Probably less confusing to stick with the information theoretical terminology.  

4.2 Softmax / Hardmax / Softplus 

Confusion. The softmax function, commonly used as the last activation function of a neural network among other things, is usually written as (Goodfellow 2016, page 81): 

\[softmax(x)_i = {e^{x_i} \over \sum_j e^{x_j}}\]

The effect of this transformation on the vector \(x\) is to:

  1. Squash each component in the interval \((0, 1)\)
  2. Normalize the sum of the components to \(1\)
  3. Amplify (usually) the largest components, and suppress the rest. 

This means that the output will be a soft version (i.e. a smooth approximation) of the one-hot encoded representation of the argmax of the original vector, not of the max function. For example,

\[softmax([1, 0.5, 0.2, 5]) = [0.02, 0.01, 0.01, 0.96] \sim [0, 0, 0, 1]\]

The real soft version of the max function is the LogSumExp (also called realsoftmax, I kid you not), and it would be:

\[realsoftmax(x) = \log (\sum_i e^{x_i})\]
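The difference between the two functions is easy to see on the example vector used above. This is a bare-bones sketch that follows the formulas literally and skips the usual numerical-stability tricks.

```python
import math

def softmax(x):
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(x):
    return math.log(sum(math.exp(xi) for xi in x))

x = [1, 0.5, 0.2, 5]

probs = softmax(x)
print([round(p, 2) for p in probs])   # close to the one-hot vector of argmax(x)
print(logsumexp(x), max(x))           # close to max(x)
```

The first output is a vector that points at the position of the largest element, while the second is a single number close to the largest element itself.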

Another theory is that the softmax function is called such because the one-hot encoded representation of the argmax is sometimes called hardmax. Not only is this equally confusing, but most likely softmax came first, and the term is being retrofitted.

A similar problem arises with softplus, named this way because it’s a smooth approximation of the rectifier, also written with the plus superscript notation (Goodfellow 2016, page 68),

\[softplus(x) = \log (1 + e^x) \sim rectifier(x) = x^+ = max(0, x)\]

which is confusing since the plus function has nothing to do with it.
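A quick numerical comparison, on a few arbitrarily chosen inputs, shows what softplus is actually smoothing.

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def rectifier(x):
    return max(0.0, x)

for x in (-5.0, 0.0, 5.0):
    print(x, round(softplus(x), 4), rectifier(x))
```

Away from zero the two functions nearly coincide, and around zero softplus rounds off the kink of the rectifier.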

Alternative. The soft part of these names makes sense. It’s the function they are trying to approximate that can create confusion. Softargmax and softrectifier would have been more precise. If you can’t shake the feeling that all these terms feel very 1984, you’re not alone. 

As an alternative, the output of the softmax is also known as the Boltzmann distribution in physics, and as the Gibbs measure in mathematics. Given that information theory has already borrowed entropy from statistical mechanics, maybe we could have called it the Boltzmann function.

4.3 Softmax loss

Confusion. Not only is there confusion around cross-entropy and softmax, but to complicate things further they’re often taken as synonyms! Softmax loss (maybe defined first in Liu 2016) is essentially the contraction of softmax activation on a dense layer followed by cross-entropy loss — just take the first and last word. It kind of reminds me of the joke where the scientist says: 

"My findings are meaningless if taken out of context" 

and the media reports: 

“Scientist claims her findings are meaningless”

Alternative. There’s already a name for the cross-entropy loss; it would have been less confusing to stick with that.

Section 5. Neural networks 

5.1 Neural networks

Confusion. As François Chollet puts it, “[Neural networks] are neither neural nor networks. They're chains of differentiable, parameterized geometric functions.” The term has stuck as a reference to natural neural networks, by which they were partially inspired, but there’s no evidence they might be a plausible model of the brain (Chollet 2018, page 8). And yet, I think the term is able to convey a useful structure of the representation, both conceptually and visually, in the same way that the term decision tree does. 

Alternative. “Differentiable, parameterized geometric functions” is probably not the way to go. Not sure why I read them all, but there are other... interesting ideas from Chollet’s twitter thread: functionchain (in the wake of blockchain), multi-layer modeling (possibly a bit too close to multi-level marketing) and my favorite, chain train.

5.2 Multi-layer perceptron

Confusion. A multi-layer perceptron (MLP) is not a perceptron with multiple layers. It’s a neural network (sorry, Chollet) that contains many perceptrons (sort of), organized in layers. Peeling the onion, it’s a function composed of simpler functions that maps some set of input values to output values (Goodfellow 2016, page 5). A multi-layer multi-perceptron network, if you will. 

The reason why even this term is not rigorous enough is that usually perceptrons use a step function as activation, while the nodes of an MLP use continuous activations, which allows its weights to be trained with backpropagation. Also, usually perceptrons are thought of as linear classifiers, while MLPs are thought of as non-linear regressors or probabilistic classifiers. The “usually” above is because a plethora of variants for these algorithms exist. But even considering all of them, an MLP is not a P with ML. 

Alternative. Sometimes an MLP is simply referred to as a feed-forward neural network (Murphy 2012, page 563), although this term is more general, and comprises in principle any directed acyclic graph trained with backpropagation (Goodfellow 2016, page 168). Maybe stacked logistic regression would have been more appropriate or, again, my favorite: plain chain train.

5.3 Input layer

Confusion. The standard presentation of a multi-layer perceptron includes the statement that this architecture is composed of at least three layers of neurons: an input layer, a hidden layer, and an output layer (Haykin 2009, page 21). An artificial neuron (sorry again, Chollet) is supposed to 1) receive inputs, 2) combine them (often linearly), and 3) produce an output (often non-linearly). Instead, the neurons in the input layer start with a value, do nothing, and hand it off to the next layer. They perform no computation, and the already precarious abstraction of an artificial neuron becomes even more problematic. Most textbooks recommend not counting it in the number of layers of a network (Bishop 2006, page 229).

Alternative. Whatever.

5.4 Hidden layer

Confusion. Some textbooks report that hidden layers are named this way because this part of a neural network is not seen directly from either the input or output of the network (Haykin 2009, page 22). The name is confusing because hidden layers are not any more hidden than the internals of any other machine learning algorithm. As in the black box discussion, they are entirely transparent. Maybe this term is still around because, in an alternative context, one can think of them as representing latent concepts not explicitly present in the training data (Goodfellow 2016, page 6). 

Alternative. Given that latent variable is more common than hidden variable (see the number of hits on Google Scholar for the former versus the latter), maybe latent layers would have been slightly less confusing.

Section 6. Deep learning

6.1 Tensor

Confusion. In machine learning, a tensor \(T\) is a multi-dimensional array (Goodfellow 2016, page 33), i.e. the generalization of a matrix to an arbitrary number of axes. So if a matrix size can only be \(m \times n\), a tensor can be \(m \times n \times p \times ...\)

In physics, on the other hand, a tensor is an element of a tensor product that behaves in a particular way under a change of coordinates. A tensor is something that transforms like a tensor (Zee 2013, page 313). As a formula (Dodson 1991, page 105), given a vector space \(X\) and its dual \(X^*\):

\[T \in X \otimes X \otimes ... \otimes X^* \otimes X^*\]

This definition implies that if the dimensionality of \(X\) is \(d\), tensors can only be of size \(d \times d\), or \(d \times d \times d\), etc. In fact, a physicist is familiar with: 

  • the inertia tensor (\(3 \times 3\))
  • the stress tensor (\(3 \times 3\))
  • the electromagnetic field tensor (\(4 \times 4\))
  • the permutation tensor (\(n \times n \times ... \times n\), \(n\) times)

In principle, one could define a tensor to be an element of any tensor product \(X \otimes Y\), but that’s not how physicists (and most mathematicians) talk about it. So while all physics tensors are machine learning tensors, the reverse is not true. The machine learning definition has retained only the “multi-dimensional array” aspect of the original one, and relaxed every other constraint. The reason might be that the notation and tools of tensor calculus were convenient to keep around, or that other fields adjacent to ML had already generalized the concept, or both. This is a great example of how a concept can evolve over time without changing its name.

Alternative. In computer science, the generalization of a matrix is often just called a multi-dimensional array, although dimension in this context refers to the number of indices needed to specify an element and not to the total number of elements. Another term was coined in the mid 1980s by Moon and Spencer: holor (Moon 1986). It didn’t quite stick though. A search on Google Scholar reveals that only about 1,000 papers have chosen to use this term as of today. Had it been adopted more broadly, now we might have HolorFlow and Holor Processing Units.  

6.2 Convolutional layer

Confusion. In mathematics, the convolution over a matrix \(A\) with a kernel \(K\) is defined as (Goodfellow 2016, page 332):

\[S(i,j) = \sum_m \sum_n A(i-m, j-n)K(m,n)\]

while the cross-correlation over the same objects is defined as:

\[S(i,j) = \sum_m \sum_n A(i+m,j+n)K(m,n)\]

The difference between the two operators is just the reflection of one of the two functions. They both involve sliding the kernel across the matrix (e.g. an image), but convolutions flip the kernel while cross-correlations don’t. In fact, for the sake of simplicity, many animations around the web that try to explain convolutions really display cross-correlations.

How about the implementations of convolutional layers in common libraries: do they use convolutions? No. Both TensorFlow and PyTorch implement the simpler cross-correlations instead of convolutions. Does it matter for model performance? Also no. The parameters are learnable, and if convolutions were implemented instead, the layer would just learn the flipped orientation (Goodfellow 2016, page 333).
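The "flipped orientation" argument can be verified directly. The sketch below is a plain-Python illustration, not the code of any library: it computes both operators only where the kernel fits entirely inside the matrix, and the convolution's output indices are shifted (an assumption made here) so that every access stays in bounds. The matrix and kernel are made up.

```python
def cross_correlate(A, K):
    """Valid cross-correlation: S(i, j) = sum_m sum_n A(i+m, j+n) K(m, n)."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = len(A) - kh + 1, len(A[0]) - kw + 1
    return [[sum(A[i + m][j + n] * K[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def convolve(A, K):
    """Valid convolution: the kernel indices run backwards over the window.

    Output indices are offset by (kh-1, kw-1) relative to the textbook
    formula so that every access stays inside A.
    """
    kh, kw = len(K), len(K[0])
    out_h, out_w = len(A) - kh + 1, len(A[0]) - kw + 1
    return [[sum(A[i + kh - 1 - m][j + kw - 1 - n] * K[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def flip(K):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in K[::-1]]

A = [[1, 2, 3, 0],
     [4, 5, 6, 1],
     [7, 8, 9, 2]]
K = [[1, 0],
     [2, -1]]

print(convolve(A, K))
print(cross_correlate(A, flip(K)))
```

The two printed results are identical, which is why a layer with learnable kernels loses nothing by implementing the simpler operator.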

Alternative. As far as I know there is no term for their superset, so sticking with one of the two seems like it was the best option. Convolutions are more common in computer vision than cross-correlation, so that might be a reason the term was chosen. Either way they would have been called CNNs!

6.3 Deconvolutional layer

Confusion. In mathematics, deconvolution is a process that tries to find the original function \(f\) that was convolved with a filter \(g\) to give the output \(h\),

\[f*g=h\]

if the output and the filter are known.

In machine learning instead, the term has been used as a synonym for transposed convolution (see Zeiler 2010), which is an entirely different operator.

Alternative. Sticking with transposed convolution would have made the most sense.

6.4 BERT

Confusion. In the Bidirectional Encoder Representations from Transformers, the confusing bit might be the bidirectional part. The encoder of a transformer looks at the whole sequence at once using self-attention, so there’s really no directionality involved.
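As a rough illustration, the NumPy sketch below computes single-head attention weights for a toy sequence. The sizes, the random queries and keys, and the comparison with a causally masked (decoder-style) variant are assumptions added for illustration; this is not BERT's actual code. Without a mask, every position puts nonzero weight on every other position, so nothing is processed left-to-right or right-to-left.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 4, 8
queries = rng.normal(size=(seq_len, d))
keys = rng.normal(size=(seq_len, d))

scores = queries @ keys.T / np.sqrt(d)

# Encoder-style attention: no mask, each token attends to the whole sequence.
encoder_weights = softmax(scores)

# Decoder-style attention, for contrast: a causal mask hides later positions.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
decoder_weights = softmax(np.where(causal_mask, -np.inf, scores))

print(np.round(encoder_weights, 2))
print(np.round(decoder_weights, 2))
```

The masked variant has zeros above the diagonal, which is where a notion of direction actually enters; the unmasked one has none.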

Alternative. It’s been suggested that non-directional would be more appropriate. So... NERT?

6.5 RNN gates

Confusion. A gate, in recurrent neural network parlance, is defined as a unit that applies a logistic sigmoid to an input vector, squashing its elements between 0 and 1. The activated vector is then multiplied element-wise with another vector, effectively selecting how much of each component is let through to the next step. The function of the gates is to protect and control the cell state.

A common LSTM layer is composed of three gates: information gets into the cell thanks to the input gate, stays in the cell thanks to the forget gate, and exits the cell thanks to the output gate. The most confusing term is the forget gate. If one follows the above logic, when an element of the activated vector is \(1\) it means that the component will be completely retained. So the action the gate controls is to remember, not to forget.
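A minimal NumPy sketch of the forget gate's effect on the cell state. The pre-activation values are hypothetical and chosen to push the sigmoid close to 1, 0.5, and 0; in a real LSTM they would come from learned weights applied to the input and the previous hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c_prev = np.array([0.5, -1.2, 2.0])            # previous cell state
pre_activation = np.array([10.0, 0.0, -10.0])  # hypothetical gate inputs

f = sigmoid(pre_activation)  # roughly 1, 0.5, 0
kept = f * c_prev            # element-wise gating of the cell state

print(np.round(f, 3))
print(np.round(kept, 3))
```

Where the gate is close to 1 the component passes through almost unchanged, and where it is close to 0 the component is erased, so a high "forget" activation means remembering.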

Alternative. A better name for the forget gate might have been remember gate (or keep gate, as it is sometimes referred to). Input gate and output gate are not confusing, but maybe better names would have been write gate and read gate.

Conclusion

Knowing why a term is confusing can shed light not only on subtleties of meaning, but also on the history of the underlying concept it represents. It’s also a great excuse to strike up a conversation at parties, tell people they’re wrong, and implicitly claim your superiority. 

Speaking of which, a training set is definitely not a set.


References

[Bishop 2006] Bishop, C. M. (2006) Pattern Recognition and Machine Learning. Springer. 

[Chollet 2018] Chollet, F. (2018) Deep Learning with Python. Manning.

[Choromanska 2019] Choromanska, A. et al (2019) Beyond Backprop: Online Alternating Minimization with Auxiliary Variables. arXiv. 1806.09077

[Deisenroth 2020] Deisenroth, M. P. et al (2020) Mathematics for Machine Learning. Cambridge University Press.

[Dodson 1991] Dodson, C.T.J. et al (1991) Tensor Geometry. Springer.

[Goodfellow 2016] Goodfellow, I. et al (2016) Deep Learning. MIT Press.

[Haykin 2009] Haykin, S. (2009) Neural Networks and Learning Machines. Pearson.

[James 2017] James, G. et al (2017) An Introduction to Statistical Learning. Springer, 8th Edition. 

[Kang 2018] Kang, D. et al (2018) Model assertions for debugging machine learning. NeurIPS MLSys Workshop. 

[King 1986] King, G. (1986) How Not to Lie With Statistics. American Journal of Political Science. 30: 666–687

[Lee 2015] Lee, D. H. et al (2015) Difference Target Propagation. arXiv. 1412.7525

[Mann 2010] Mann, P. S. (2010) Introductory Statistics. Wiley, 7th Edition. 

[Moon 1986] Moon, P. et al (1986) Theory of Holors: A Generalization of Tensors. Cambridge University Press.

[Murphy 2012] Murphy, K. (2012) Machine Learning: A Probabilistic Perspective. The MIT Press.

[Nelson 2015] Nelson, K. et al (2015) Evaluating model drift in machine learning algorithms. IEEE, CISDA.

[Plaut 1986] Plaut, D. C. et al (1986) Experiments on Learning by Back Propagation. Technical Report CMU-CS-86-126. Carnegie-Mellon University.

[Qian 1999] Qian, N. (1999) On the momentum term in gradient descent learning algorithms. Neural Networks. 12: 145–151

[Rumelhart 1986] Rumelhart, D. E. et al (1986) Learning representations by back-propagating errors. Nature. 323: 533–536

[Scellier 2017] Scellier, B. et al (2017) Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation. arXiv. 1602.05179

[Schelter 2018] Schelter, S. et al (2018) On Challenges in Machine Learning Model Management. IEEE Data Eng. Bull. 41: 5–15

[Wright 1921] Wright, S. (1921) Correlation and causation. Journal of Agricultural Research. 20: 557–585

[Zee 2013] Zee, A. (2013) Einstein Gravity in a Nutshell. Princeton University Press.

[Žliobaitė 2010] Žliobaitė, I. (2010) Learning under Concept Drift: an Overview. arXiv. 1010.4784