
Important Math Topics You Need To Learn For Data Science

Without these, your potential will be severely limited

Leon Lok

Feb 15·7 min read

Photo by Artturi Jalli on Unsplash

Introduction

Mathematics.

It's always the big elephant in the room: nobody wants to talk about it, but everyone has to address it eventually.

From my experience, asking whether you need to learn maths for data science is a redundant question. Instead, it's almost always a question of how much and what type of maths you need to learn.

Having come from a maths background, I can say that most of what I've learnt during my maths degree has never been explicitly used in a real-life situation.

That time when we had to prove Pythagoras' theorem? Nope, I've never needed it.

But this doesn't imply that you can get by with just the absolute basics. The problem is, the maths you need to learn varies greatly depending on the type of data science role you're after.

With that being said, I believe there's a minimum amount of maths knowledge needed for most entry-level data science roles; this creates a good, solid foundation for doing data science and learning more advanced concepts.

If you want to watch something instead, you can check out my video below on the same topic.

Functions, Variables and Graphs

Photo by Dan-Cristian Pădureț on Unsplash

Before going into the more advanced topics, it's important to get comfortable with the basics.

Most of you reading this might already know what functions, variables and graphs are. But if you don't, these topics form the foundation for tasks like exploratory data analysis and statistical / machine learning modelling.

When I studied machine learning during my data science master's degree, students who weren't familiar with or had forgotten these topics had a harder time progressing at the start.

There were a few students who struggled with plotting simple equations and interpreting graphs. It didn't take long for them to pick this up, but the importance of these basics can't be overstated.
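To make this concrete, here's a minimal sketch of those basics in Python (the quadratic here is made up for illustration): a function takes an input variable and returns an output, and evaluating it over a range of inputs gives you the points you'd plot.

```python
# A function maps an input variable x to an output value y.
def f(x):
    return x ** 2 - 3 * x + 2  # a simple quadratic

# Evaluate the function over a range of x values -- these (x, y)
# pairs are exactly what you'd hand to a plotting library.
xs = [x / 10 for x in range(-20, 51)]
ys = [f(x) for x in xs]

# The graph crosses the x-axis where f(x) = 0:
print(f(1), f(2))  # 0 0
```

Being able to read off facts like "the roots are at x = 1 and x = 2" from a plotted curve is the kind of interpretation skill I mean here.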

Statistics

Photo by Edge2Edge Media on Unsplash

Basic statistical understanding is probably the most important skill in data science.

Statistics is about quantifying uncertainty. It lets you rigorously interpret your results, thus helping you make better-informed decisions.

Probability Theory

A foundational statistical topic is probability theory: it's about quantifying uncertainty and understanding randomness.

Beginner statistics courses usually start with this topic because it forms the foundation for a lot of advanced statistical concepts; for example, it helps with understanding statistical distributions, hypothesis testing, and inferential statistics.

Probability theory is what I'd suggest starting with if you don't already have a basic foundation in statistics.
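A quick way to build intuition for this is simulation. This sketch estimates the probability of heads from a fair coin by flipping it many times; the relative frequency converges towards the true probability (the law of large numbers):

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

# Simulate 100,000 fair coin flips: each flip is heads with
# probability 0.5.
flips = [random.random() < 0.5 for _ in range(100_000)]

# The estimated probability is just the relative frequency of heads.
estimate = sum(flips) / len(flips)
```

With this many flips, the estimate lands very close to 0.5, and the same idea scales to distributions that are much harder to reason about on paper.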

Descriptive Statistics

Descriptive statistics is for analysing and understanding the basic features of your data.

We use descriptive statistics to understand:

  • The distribution of the data.
  • The central tendency of the data, i.e. mean, median, and mode.
  • The spread of the data, i.e. standard deviation and variance.

By understanding the basic makeup of your data, you'll know which statistical methods to apply. This makes a big difference to the credibility of your results.

Hypothesis Testing

As the name suggests, hypothesis testing is about testing the plausibility of your hypothesis.

This is similar to A/B testing. The difference is that A/B testing is a randomised controlled trial: we compare a treatment group with a control group, and users are randomly assigned to each. In hypothesis testing, we compare the outcome of a group from an experiment against a null group to see if there's any statistically significant difference.

Hypothesis testing evaluates the significance of your experimental results and lets you ask questions scientifically based on data.
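One simple way to run such a test, without any statistical tables, is a permutation test. This sketch uses made-up measurements for a treatment and a null group: if the null hypothesis were true and the group labels didn't matter, shuffling them should often reproduce a difference as large as the one we observed.

```python
import random
import statistics

random.seed(0)

# Hypothetical outcomes for a treatment group and a null group.
treatment = [12.1, 13.4, 11.8, 14.2, 13.0, 12.7]
control = [11.0, 11.5, 12.0, 10.8, 11.9, 11.2]

observed_diff = statistics.mean(treatment) - statistics.mean(control)

# Permutation test: shuffle the group labels many times and count
# how often a random split beats the observed difference.
pooled = treatment + control
n = len(treatment)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if diff >= observed_diff:
        count += 1

p_value = count / trials
```

A small p-value (conventionally below 0.05) means a difference this large would rarely arise by random labelling alone, so we'd call the result statistically significant.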

Regression

Regression is often used for prediction and forecasting.

It models the relationship between variables, i.e. a dependent variable and one or more independent variables. For a model to be considered a regression model, the dependent variable needs to be continuous.

Plenty of companies use regression in some way to predict or forecast things like sales or seasonal events that occur each year.

If you know regression well, it'll significantly help with understanding machine learning, as there's a big overlap.
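To show how little machinery simple linear regression needs, here's an ordinary least squares fit of y = a + b·x done by hand with made-up data: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means.

```python
# Made-up data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)

# Intercept: the fitted line passes through the point of means.
a = mean_y - b * mean_x

def predict(x):
    return a + b * x
```

The fitted slope comes out very close to 2, as you'd hope, and `predict` is exactly the kind of forecasting function the section describes.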

Model Evaluation

Evaluating how your models perform is extremely important in data science.

There's no point training multiple models without knowing which one to use. Being able to evaluate your statistical or machine learning models gives you a proper way of picking the best model for your data science projects.
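As a minimal sketch of what "picking the best model" looks like in practice, here's a comparison of two hypothetical sets of predictions against held-out actual values, using root mean squared error (RMSE) as the metric:

```python
import math

# Held-out actual values and two hypothetical models' predictions.
actual = [3.0, 5.0, 7.0, 9.0]
model_a = [2.8, 5.1, 7.3, 8.9]
model_b = [3.5, 4.0, 8.0, 10.0]

def rmse(y_true, y_pred):
    # Root mean squared error: penalises large errors more heavily
    # than small ones.
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Pick whichever model has the lower error on the held-out data.
scores = {"model_a": rmse(actual, model_a), "model_b": rmse(actual, model_b)}
best = min(scores, key=scores.get)
```

RMSE is only one of many metrics (accuracy, precision/recall, and so on, depending on the task), but the workflow is the same: score every candidate on data it wasn't trained on, then choose.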

Linear Algebra

Photo by Robin Spielmann on Unsplash

Machine learning algorithms like deep learning are built entirely on linear algebra. Hence, this is an important topic to know if you want to take machine learning seriously.

Vectors and Matrices

Vectors and matrices are foundational for linear algebra. In addition, large datasets are much easier to work with when we represent them in the form of vectors and matrices; this is vital in machine learning.

In machine learning, we use them in cost functions, neural networks, support vector machines, and many more.

If you want to write faster data processing pipelines, popular Python libraries like NumPy are also designed to handle vectors and matrices extremely efficiently.
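Here's a small sketch of that idea in NumPy: storing a dataset as a matrix lets one matrix-vector product compute a weighted sum for every sample at once, with no Python loop.

```python
import numpy as np

# A tiny dataset: 3 samples with 2 features each, stored as a matrix.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# A weight vector, as you'd find in a linear model.
w = np.array([0.5, -1.0])

# One matrix-vector product scores all samples simultaneously.
scores = X @ w
print(scores)  # [-1.5 -2.5 -3.5]
```

This is exactly the pattern cost functions and neural network layers are built on, just with much bigger matrices.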

Eigenvectors and Eigenvalues

After getting comfortable with vectors and matrices, it would then make sense to read up on eigenvectors and eigenvalues.

When we break down matrices into their simplest representation, we get eigenvectors and eigenvalues. These give valuable insights into the properties of a matrix. And as we said already, large datasets are much easier to work with in the form of matrices and vectors.

We also need eigenvectors and eigenvalues to understand principal component analysis (PCA), which is a technique that reduces the dimensionality of data whilst minimising information loss. It's an important technique for finding features in a large dataset.
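The defining property is worth seeing once in code: for an eigenpair, A·v = λ·v, meaning the matrix only stretches its eigenvectors rather than rotating them. This sketch checks that on a small symmetric matrix (the kind of covariance matrix PCA decomposes):

```python
import numpy as np

# A symmetric 2x2 matrix, e.g. a covariance matrix in PCA.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Verify A v = lambda v for each eigenpair: the matrix only
# stretches its eigenvectors, it never rotates them.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```

In PCA, the eigenvectors of the covariance matrix give the directions of greatest variance, and the eigenvalues say how much variance each direction carries, which is how you decide which dimensions to keep.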

Calculus

Photo by Daniel Price on Unsplash

Calculus isn't explicitly used as commonly as statistics, but we need it to solve optimisation problems. You should at least be comfortable with how basic derivatives and integrals work, as they form the foundation of calculus.

In machine learning, we often talk about loss functions, of which there are many different types. These functions are minimised using a derivative-based technique called gradient descent to find the best set of parameters. Thus, without understanding how derivatives work, you won't truly know how these parameters were calculated.
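Gradient descent fits in a few lines. This sketch minimises a toy loss f(x) = (x − 3)², using its derivative f′(x) = 2(x − 3) to step downhill from an arbitrary starting point:

```python
# Toy loss function: f(x) = (x - 3)^2, minimised at x = 3.
def grad(x):
    # Its derivative: f'(x) = 2 * (x - 3).
    return 2 * (x - 3)

x = 0.0    # initial parameter guess
lr = 0.1   # learning rate: how big a step to take downhill
for _ in range(100):
    x -= lr * grad(x)  # step against the gradient

# x has converged to (very nearly) the minimum at x = 3.
```

Real loss functions have millions of parameters instead of one, but the update rule is the same: move each parameter a small step against its gradient.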

Furthermore, neural networks use derivatives during back-propagation, a technique for adjusting their weights after making a prediction. Most people who train neural networks don't even know why it works, but by understanding derivatives and the chain rule, you'll have a much easier time understanding it in the future.

Discrete Mathematics

Photo by Alexandre Debiève on Unsplash

Modern computer science is built almost entirely on discrete mathematics.

Here are a few examples to illustrate this: computers store data as zeros and ones, and they use boolean algebra to perform calculations on the data; low-level programming languages rely on logical operators; and things like blockchain, cryptography, and computer security also use number theory.

Algorithmic Complexity

Knowing the complexity of an algorithm gives you a better idea of how long it'll take to run, and how difficult it'll be to use it to solve a problem.

Since I don't come from a computer science background, this was something I learnt later on. You might've heard of this as big-O notation.
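A classic way to feel the difference big-O makes is to count the steps taken by linear search, which is O(n), against binary search, which is O(log n), on the same sorted data:

```python
def linear_search(items, target):
    # O(n): check every element in turn, counting comparisons.
    steps = 0
    for i, item in enumerate(items):
        steps += 1
        if item == target:
            return i, steps
    return -1, steps

def binary_search(items, target):
    # O(log n): halve the search range each step (items must be sorted).
    lo, hi, steps = 0, len(items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1_000_000))
_, linear_steps = linear_search(data, 999_999)  # worst case for linear
_, binary_steps = binary_search(data, 999_999)
```

On a million items, linear search needs a million comparisons in the worst case while binary search needs around twenty, which is the kind of gap big-O notation is designed to capture.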

Set Theory

During my maths degree, I always thought set theory felt a bit pointless until I started learning about relational databases.

A set is basically a collection of elements. These elements can be any kind of mathematical object. In the context of databases, you can think of a set as a table, with its elements being the rows in the table.

You don't need set theory to work with databases, but it's definitely good to know. Set theory helps with understanding how SQL joins work, and it'll help you better optimise database models.
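The correspondence between set operations and joins is easy to sketch. Here the two "tables" are reduced to sets of key values (the table names are made up for illustration):

```python
# Keys present in two hypothetical tables.
customer_ids = {1, 2, 3, 4}
order_customer_ids = {3, 4, 5}

# Intersection: keys that match in both tables, like an INNER JOIN.
inner = customer_ids & order_customer_ids

# Difference: customers with no orders, like the unmatched side
# of a LEFT JOIN.
left_only = customer_ids - order_customer_ids

# Union: every key from either table, like a FULL OUTER JOIN.
full_outer = customer_ids | order_customer_ids
```

Thinking of joins this way makes it much easier to predict how many rows a query will return and why.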

Graph Theory

Graph theory is the bedrock of graph databases. This type of database is for modelling data that consists of nodes and relationships.

A good example of this would be a social network. Each person would be a node, and whenever someone follows another person, then that would be a relationship.

A lot of social network data is held in graph databases. Doing any social network analysis might require some knowledge of graphs and how to apply algorithms in this setting.
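Here's a minimal sketch of the social network example: people are nodes, "follows" edges are relationships stored in an adjacency list (the names are made up), and a breadth-first search answers a typical analysis question: whose posts can eventually reach this person?

```python
from collections import deque

# Adjacency list: person -> the people they follow.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph, start):
    # Breadth-first search: everyone reachable from `start` by
    # following chains of follow relationships.
    seen, queue = {start}, deque([start])
    while queue:
        person = queue.popleft()
        for other in graph[person]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen - {start}

print(sorted(reachable(follows, "alice")))  # ['bob', 'carol', 'dave']
```

Graph databases run far more sophisticated traversals than this, but node-plus-relationship modelling and graph algorithms like BFS are the common core.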

Conclusion

In the end, mathematics is unavoidable in data science. Without a good foundation in maths, your potential as a data scientist will be severely limited.

Hopefully, this has given you some ideas on where to start and how much you actually need to learn. Of course, this also depends on what kind of data scientist you are or are aiming to be.

For the expert data scientists out there, please let me know if I've missed anything. As always, if you enjoyed this article, you can check out my other videos on YouTube. And if you want to see what I'm up to via email, you can consider signing up to my newsletter!

Originally published at //leonlok.co.uk on February 15, 2022.
