# Watch Cincinnati vs. SMU: How to Live Stream, TV channel, start time for Tuesday’s NCAA

A pivotal AAC matchup is on tap on Saturday night when №16 SMU plays host to №9 Cincinnati. The Bearcats are back in action after multiple weeks off and look to remind people why it is they have such a low number next to their name. Both programs have expectations of reaching the AAC Championship Game in December and this matchup will go a long way in determining that.

The Bearcats had a scheduled off week on Oct. 10 and then had to postpone their Oct. 17 date against Tulsa because of positive COVID-19 tests. By Saturday, it will have been 21 days since their last competition, a 28–7 win against USF. Prior to that conference opener, the Bearcats rolled Austin Peay 55–20 and bested Army in a low-scoring 24–10 win. There’s no reason to doubt Cincinnati’s status as one of the best teams in the country, but there is still some unknown with only two games against FBS competition.

Knowing the mathematics behind machine learning algorithms is a superpower. If you have ever built a model for a real-life problem, you probably experienced that being familiar with the details can go a long way if you want to move beyond baseline performance. This is especially true when you want to push the boundaries of state of the art.

However, most of this knowledge is hidden behind layers of advanced mathematics. Understanding methods like stochastic gradient descent might seem difficult since it is built on top of multivariable calculus and probability theory.

With proper foundations, though, most ideas can be seen as quite natural. If you are a beginner and don’t necessarily have formal education in higher mathematics, creating a curriculum for yourself is hard. In this post, my goal is to present a roadmap, taking you from absolute zero to a deep understanding of how neural networks work.

To keep things simple, the aim is not to cover everything. Instead, we will focus on getting our directions. This way, you will be able to study other topics without difficulties, if need be.

Instead of reading through in one sitting, I recommend using this article as a reference point through your studies. Go deep into a concept that is introduced, then check the roadmap and move on. I firmly believe that this is the best way to study: I will show you the road, but you must walk it.

The fundamentals

Most of machine learning is built upon three pillars: linear algebra, calculus, and probability theory. Since the last one builds on the first two, we should start with them. Calculus and linear algebra can be studied independently, as is usually the case in a standard curriculum.

Image for post

Calculus

Calculus is the study of differentiation and integration of functions. Essentially, a neural network is a differentiable function, so calculus will be a fundamental tool to train neural networks, as we will see.

To familiarize yourself with the concepts, you should make things simple and study functions of a single variable for the first time. By definition, the derivative of a function is defined by

Image for post

where the ratio for a given h is the slope of the line between the points (x, f(x)) and (x+h, f(x+h)).

In the limit, this is essentially the slope of the tangent line at the point x. The figure below illustrates the concept.

Image for post

Visualizing the derivative of a function

Differentiation can be used to optimize functions: the derivative is zero at local maxima or minima. (However, this is not true in the other direction; see f(x) = x³ at 0.) Points where the derivative is zero, are called critical points. Whether a critical point is minima or maxima can be decided by looking at the second derivative:

Image for post

There are several essential rules regarding differentiation, but probably the most important is the so-called chain rule:

Image for post

which tells us how to calculate the derivative of composed functions.

Integration is often called the inverse of differentiation. This is true because

Image for post

which holds for any integrable function f(x). The integral of a function can also be thought of as the signed area under the curve. For instance,

Image for post

because when the function is negative, the area there also has a negative sign.

Image for post

The signed area under a sine curve between -π and π

Integration itself plays a role in understanding the concept of expected value. For instance, quantities like entropy and Kullback-Leibler divergence are defined in terms of integrals.

Further study

I would recommend the Single Variable Calculus course from MIT. (In general, online courses from MIT are always excellent learning resources.) If you are more of a book person, there are many textbooks available. The Calculus book by Gilbert Strang accompanying the previously mentioned course is again a great resource, completely free of charge.

Image for post

Linear algebra

As I mentioned, neural networks are essentially functions, which are trained using the tools of calculus. However, they are described with linear algebraic concepts like matrix multiplication.

Linear algebra is a vast subject with many essential aspects of machine learning, so this will be a significant segment.

Vector spaces

To have a good understanding of linear algebra, I would suggest starting with vector spaces. It is better if we talk about a special case first. You can think of each point in the plane as a tuple

Image for post

These are essentially vectors pointing from zero to (x₁, x₂). You can add these vectors together and multiply them with scalars:

Image for post

This is the prototypical model of a vector space. In general, a set of vectors V is a vector space over the real numbers if you can add vectors together and multiply a vector with a real number, such that the following properties hold:

Image for post

Don’t panic! I know that this looks very scary (at least it looked that to me when I was a freshman student with a math major), but it really isn’t. These just guarantee that vectors can be added together and scaled just as you would expect. When thinking about vector spaces, it helps if you mentally model them as

Image for post

Normed spaces

If you feel like you have a good understanding of vector spaces, the next step would be to understand how to measure a vector’s magnitude. By default, a vector space in itself gives no tools for this. How would you do this on the plane? You have probably learned that there, we have

Image for post

This is a special case of a norm. In general, a vector space V is normed if there is a function

Image for post

called norm such that

Image for post

Again, this might be scary, but this is a simple and essential concept. There are a bunch of norms out there, but the most important is the p-norm family

Image for post

(with p = 2 we obtain the special case mentioned above) and the supremum norm

Image for post

Sometimes, like for p = 2, the norm comes from a so-called inner product, which is a bilinear function

Image for post

such that

Image for post

A vector space with an inner product is called inner product space. An example is the classical Euclidean product

Image for post

Every inner product can be turned into a norm by

Image for post

When the inner product for two vectors is zero, we say that the vectors are orthogonal to each other. (Try to come up with some concrete examples on the plane to understand the concept deeper.)

Basis and orthogonal/orthonormal basis

Although vector spaces are infinite (in our case), you can find a finite set of vectors that can be used to express all vectors in the space. For example, on the plane, we have

Image for post

where

Image for post

This is a special case of basis and orthonormal basis.

In general, basis is a minimal set of vectors

Image for post

such that their linear combinations span the vector space:

Image for post

A basis always exists for any vector space. (It may not be a finite set, but that shouldn’t concern us now.) Without a doubt, a basis simplifies things greatly when talking about linear spaces.

When the vectors in a basis are orthogonal to each other, we call it an orthogonal basis. If each basis vector’s norm is 1 for an orthogonal basis, we say it is orthonormal.

Linear transformations

One of the key objects related to vector spaces are linear transformations. If you have seen a neural network before, you know that one of the fundamental building blocks are layers of the form

Image for post

where A is a matrix, b and x are vectors, and σ is the sigmoid function. (Or any activation function, really.) Well, the part Ax is a linear transformation. In general, the function

Image for post

is a linear transformation between vector spaces V and W if

Image for post

holds for all x, y in V, and all a real number.

To give a concrete example, rotations around the origin in the plane are linear transformations.

Undoubtedly, the most crucial fact about linear transformations is that they can be represented with matrices, as you’ll see next in your studies.

Matrices and their operations

If linear transformations are clear, you can turn to the study of matrices. (Linear algebra courses often start with matrices, but I would recommend it this way for reasons to be explained later.)

The most important operation for matrices is the matrix product. In general, if A and B are matrices defined by

Image for post

then their product can be obtained by

Image for post

This might seem difficult to comprehend, but it is actually quite straightforward. Take a look at the figure below, demonstrating how to calculate the element in the 2nd row, 1 st column of the product.

Image for post

The reason why matrix multiplication is defined the way it is because matrices represent linear transformations between vector spaces. Matrix multiplication is the composition of linear transformations.

If you would like to read more about this, there is a great article right here at Towards Data Science, explaining things in detail.

Why is Linear Algebra Taught So Badly?

Linear algebra is one of the cornerstones of machine learning. It’s more intuitive than you might think

towardsdatascience.com

Determinants

In my opinion, determinants are hands down one of the most challenging concepts to grasp in linear algebra. Depending on your learning resource, it is usually defined by either a recursive definition or a sum that iterates through all permutations. None of them are tractable without significant experience in mathematics.

To understand this concept, watch this video below. Trust me, it is magic.

To summarize, the determinant of a matrix describes how the volume of an object scales under the corresponding linear transformation. If the transformation changes orientations, the sign of the determinant is negative.

You will eventually need to understand how to calculate the determinant, but I wouldn’t worry about it now.

Eigenvalues, eigenvectors, and matrix decompositions

A standard first linear algebra course usually ends with eigenvalues/eigenvectors and some special matrix decompositions like the Singular Value Decomposition.

Let’s suppose that we have a matrix A. The number λ is an eigenvalue of A if there is a vector x (called eigenvector) such that

Image for post

holds. In other words, the linear transformation represented by A is a scaling by λ for the vector x. This concept plays an essential role in linear algebra. (And practically in every field which uses linear algebra extensively.)

At this point, you are ready to familiarize yourself with a few matrix decompositions. If you think about it for a second, what type of matrices are the best from a computational perspective? Diagonal matrices! If a linear transformation has a diagonal matrix, it is trivial to compute its value on an arbitrary vector.

Image for post

Most special forms aim to decompose a matrix A to a product of matrices, where preferably at least one is diagonal. Singular Value Decomposition, or SVD in short, the most famous one, states that there are special matrices U, V, and a diagonal matrix Σ such that

Image for post

holds. (U and V are so-called unitary matrices, which I don’t define here, suffice to know that it is a special family of matrices.)

SVD is also used to perform Principal Component Analysis, one of the simplest and most well-known dimensionality reduction methods.

Further study

Linear algebra can be taught in many ways. The path I outlined here was inspired by the textbook Linear Algebra Done Right by Sheldon Axler. For an online lecture, I would recommend the Linear Algebra course from MIT OpenCourseWare, an excellent resource.

If a course might be too much, there are great articles out there, for instance, the following.

Mathematics for AI: Linear Algebra and How to Understand It Better

How it works, where it is used and how to learn it faster.

towardsdatascience.com

Image for post

Multivariable calculus

This is the part where linear algebra and calculus come together, laying the foundations for the primary tool to train neural networks: gradient descent. Mathematically speaking, a neural network is simply a function of multiple variables. (Although, the number of variables can be in the millions.)

Similar to univariable calculus, the two main topics here are differentiation and integration. Let’s suppose that we have a function

Image for post

mapping vectors to real numbers. In two dimensions (that is, for n = 2), you can imagine its plot as a surface. (Since humans don’t see higher than three dimensions, it is hard to visualize functions with more than two real variables.)

Image for postImage for post

Graph of a function of two variables.

Differentiation in multiple variables

In a single variable, the derivative was the slope of the tangent line. How would you define tangents here? A point on the surface has several tangents, not just one. However, there are two special tangents: one is parallel to the x-z plane, while the other is to the y-z plane. Their slope is determined by the partial derivatives, defined by

Image for postImage for post

That is, you take the derivative of the functions obtained by fixing all but one variable. (The formal definition is identical for ≥ 3 variables, just more complicated notation.)

Tangents in these special directions span the tangent plane.

Image for postImage for post

Tangent plane.

The gradient

There is another special direction: the gradient, which is the vector defined by

Image for postImage for post

The gradient always points to the direction of the largest increase! So, if you would take a tiny step in this direction, your elevation would be the maximal among all the other directions you could have chosen. This is the basic idea of gradient descent, which is an algorithm to maximize functions. Its steps are the following.

Calculate the gradient in the point x₀, where you currently are.

Take a small step in the direction of the gradient to arrive at the point x₁. (The step size is called the learning rate.)

Go back to Step 1. and repeat the process until convergence.

Of course, there are several flaws in this basic algorithm, which was improved several times over the years. Modern gradient descent based optimizers employ many tricks like adaptive step size, momentum, and other methods, which we are not going to detail here.

Calculating the gradient in practice is difficult. Functions are often described by the composition of other functions, for instance, the familiar linear layer

Image for post

where A is a matrix, b and x are vectors, and σ is the sigmoid function. (Of course, there can be are other activations, but we shall stick to this for simplicity.) How would you calculate this gradient? At this point, it is not even clear how to define the gradient for vector-vector functions such as this, so let’s discuss! A function

Image for postImage for post

can always be written in terms of vector-scalar functions like

Image for postImage for post

The gradient of g is defined by the matrix whose k-th row is the k-th component’s gradient. That is,

Image for postImage for post

This matrix is called the total derivative of g.

In our example

Image for post

things become a bit more complicated because it is the composition of two functions:

Image for postImage for post

and

Image for postImage for post

defined by applying the univariate sigmoid componentwise. The function l can be decomposed further to m functions mapping from the n-dimensional vector space to the space of real numbers:

Image for postImage for post

where

Image for postImage for post

If you calculate the total derivative, you’ll see that

Image for postImage for post

This is the chain rule for multivariate functions in its full generality. Without it, there would be no easy way to calculate the gradient of a neural network, which is ultimately a composition of many functions.

Higher-order derivatives

Similarly to the univariate case, the gradient and the derivatives play a role in determining whether a given point in space is local minima or maxima. (Or neither.) To provide a concrete example, training a neural network is equivalent to minimizing the loss function on the parameters’ training data. It is all about finding the optimal parameter configuration w for which the minimum is attained:

Image for postImage for post

where

Image for postImage for post

are the neural network and the loss function, respectively.

For a general differentiable vector-scalar function of say n variables, there are n² second derivatives, forming the Hessian matrix

Image for postImage for post

In multiple variables, the determinant of the Hessian takes the role of the second derivative. Similarly, it can be used to tell whether a critical point (that is, where all of the derivatives are zero) is minima, maxima, or just a saddle point.

Further study

There are a lot of awesome online courses on multivariable calculus. I have two specific recommendations:

Khan Academy on multivariable calculus,

MIT multivariable calculus.

Now we are ready to take on the final subject: probability theory!

Image for postImage for post

Probability theory

Probability theory is the mathematically rigorous study of chance, fundamental to all fields of science.

Setting exact definitions aside, let’s ponder a bit about what probability represents. Let’s say I toss a coin, with a 50% chance (or 0.5 probability) of it being heads. After repeating the experiment 10 times, how many heads did I get?

If you have answered 5, you are wrong. Heads being 0.5 probability doesn’t guarantee that every second throw is heads. Instead, what it means that if you repeat the experiment n times where n is a really large number, the number of heads will be very close to n/2.

To get a good grip on probability, I would recommend my post below, which I wrote a while ago, to offer a concise yet mathematically correct presentation of the concept.

The mathematical foundations of probability

A measure-theoretic introduction

towardsdatascience.com

Besides the basics, there are some advanced things you need to understand, first and foremost, expected value and entropy.

Expected value

Supposed that you play a game with your friend. You toss a classical six-sided dice, and if the outcome is 1 or 2, you win 300 bucks. Otherwise, you lose 200. What are your average earnings per round if you play this game long enough? Should you be even playing this game?

Well, you win 100 bucks with probability 1/3, and you lose 200 with probability 2/3. That is if X is the random variable encoding the result of the dice throw, then

Image for postImage for post

This is the expected value, that is, the average amount of money you will receive per round in the long run. Since this is negative, you will lose money, so you should never play this game.

Generally speaking, the expected value is defined by

Image for postImage for post

for discrete random variables and

Image for postImage for post

for real-valued continuous random variables.

In machine learning, loss functions for training neural networks are expected values in one way or another.

Law of large numbers

People often falsely attribute certain phenomena to the law of large numbers. For instance, gamblers who are on a losing streak believe that they should soon win because of the law of large numbers. This is totally wrong. Let’s see what this is really!

Suppose that

Image for postImage for post

are random variables representing the independent repetitions of the same experiment. (Say, rolling a dice or tossing a coin.)

Essentially, the law of large numbers states that

Image for postImage for post

That is the average of the outcomes in the long run equal to the expected value.

An interpretation is that if a random event is repeated enough times, individual results might not matter. So, if you are playing in a casino with a game that has negative expected value (as they all do), it doesn’t matter that you win occasionally. The law of large numbers implies that you will lose money.

To get a little bit ahead, LLN is going to be essential for stochastic gradient descent.

Information theory

Let’s play a game. I have thought of a number between 1 and 1024, and you have to guess it. You can ask questions, but your goal is to use as few questions as possible. How much do you need?

If you play it smart, you will perform a binary search with your questions. First you may ask: is the number between 1 and 512? With this, you have cut the search space in half. Using this strategy, you can figure out the answer in

Image for postImage for post

questions.

But what if I didn’t use the uniform distribution when picking the number? For instance, I could have used a Poisson distribution.

Image for postImage for post

Probability mass functions for Poisson distributions. Source: Wikipedia

Here, you would probably need fewer questions because you know that the distribution tends to concentrate around specific points. (Which depends on the parameter.)

In the extreme case, when the distribution is concentrated on a single number, you need zero questions to guess it correctly. Generally, the number of questions depends on the information carried by the distribution. The uniform distribution contains the least amount of information, while singular ones are pure information.

Entropy is a way to quantify this. It is defined by

Image for postImage for post

for discrete random variables and

Image for postImage for post

for continuous, real-valued ones. (The base of the logarithm is usually 2, e, or 10, but it doesn’t really matter.)

If you have ever worked with classification models before, you probably encountered the cross-entropy loss, defined by

Image for postImage for post

where P is the ground truth (a distribution concentrated to a single class), while the hatted version represents the class predictions. This measures how much “information” the predictions have compared to the ground truth. When the predictions match, the cross-entropy loss is zero.

Another frequently used quantity is the Kullback-Leibler divergence, defined by

Image for postImage for post

where P and Q are two probability distributions. This is essentially cross-entropy minus the entropy, which can be thought of as quantifying how different the two distributions are. This is useful, for instance, when training generative adversarial networks. Minimizing the Kullback-Leibler divergence guarantees that the two distributions are similar.

Further study

Here, I would recommend two books for you:

Pattern Recognition and Machine Learning by Christopher Bishop,

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

These are the two fundamental textbooks, and they teach you much more than probability theory. Both of them go way beyond the basics, but the corresponding chapters provide an excellent introduction.

Beyond the mathematical foundations

With this, we reviewed the necessary mathematics for understanding neural networks. Now, you are ready for the fun part: machine learning!

To really understand how neural networks work, you still have to learn some optimization and mathematical statistics. These subjects build upon the foundations we set. I won’t go into details, since this is beyond the scope of the article, but I have prepared a study roadmap to guide you.