
Data With Purpose | Fernando Carvalho

Empowering Decisions Through Data Science & Analysis


Data Science Foundation Self-Study Guide

Everything has a beginning! Before jumping into the world of Data Science, it's wise to build a good foundation. In this post, my intention is to share my journey and perhaps inspire you to find your own path to becoming a Data Scientist.

Since I already have a Bachelor's Degree in Computer Science, some of the topics were treated just as a review; I was not looking to earn credits or certificates for things I had already studied in college. I took notes in Evernote throughout this process, creating a “book-like structure” so I can review what I learned from the videos and courses I watched.

In this post I'll cover what I consider the basic steps. Let's begin!

1. Strong Foundation in Mathematics

Data science is rooted in mathematics, so it’s essential to have a solid grasp of concepts such as linear algebra, calculus, probability, and statistics. Since I took these classes in college a while ago, and to save time, I decided to review the most important concepts related to machine learning instead of taking the full courses all over again.

My course of choice was the Mathematics for Machine Learning and Data Science Specialization from DeepLearning.AI, found on Coursera. I managed to finish this course in a week.

This is a very dense course, with some tough programming assignments. If you decide to take it, make sure you have some programming skills. The math is simplified and well explained, but if you have studied these topics before, you will enjoy it more.

It covers basically the following subjects:

1.1 Linear Algebra

Linear algebra is the secret language of data science, a foundation on which intricate models and analyses are built. At its core, linear algebra deals with vectors and matrices, and understanding these concepts is crucial for anyone aspiring to become a data scientist. I needed to review matrix operations, eigenvalues, and eigenvectors, and how these mathematical tools form the backbone of machine learning and data analysis.

For example, neural networks involve interconnected layers that can be understood as linear transformations, and optimization algorithms rely on linear algebra to train models. Moreover, linear algebra is essential for feature engineering, dimensionality reduction, and understanding how data transforms within models.

In very simple terms, imagine data as a big spreadsheet with rows and columns. Linear algebra teaches us how to perform special operations on this data, like adding, multiplying, and organizing it. It's a toolkit for dealing with information stored in matrices and vectors.
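To make the “spreadsheet” idea concrete, here is a minimal sketch using NumPy (the numbers are made up for illustration): a small data matrix, a matrix-vector product like the one inside a linear model, and the eigenvalue decomposition mentioned above.

```python
import numpy as np

# A tiny "spreadsheet": 3 observations (rows) with 2 features (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# A weight vector, as used in a linear model y = X @ w
w = np.array([0.5, -1.0])
y = X @ w  # matrix-vector multiplication

# Eigenvalues and eigenvectors of a square matrix
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print(y)            # [-1.5 -2.5 -3.5]
print(eigenvalues)  # [2. 3.]
```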

If you are new to this subject, here is a good place to start:

  • Khan Academy – Linear Algebra

1.2 Calculus

Calculus helps us understand the rate of change in data and is instrumental in optimizing machine learning algorithms. I had to review the essentials of differentiation and integration, highlighting their significance in functions, optimization, and predictive modeling. To properly understand gradients in deep learning or curve fitting in regression analysis, calculus is an important step in the data science journey.

The course briefly explains Differentiation (Derivatives) and explores Gradient Descent and Newton's Method in optimization.

Gradient Descent is an optimization algorithm commonly used in machine learning to minimize the cost or loss function of a model. It is an iterative method used to update the model’s parameters in the direction that reduces the cost. It operates by computing the gradient (derivative) of the cost function with respect to the model parameters and taking steps in the opposite direction of the gradient to minimize the cost.
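The update rule described above can be sketched in a few lines. This is a toy example, not from the course: minimizing the quadratic cost f(x) = (x − 3)², whose gradient is f'(x) = 2(x − 3), by repeatedly stepping opposite to the gradient.

```python
# Gradient descent on a toy quadratic cost: f(x) = (x - 3)**2
# The minimum is at x = 3, and the gradient is f'(x) = 2 * (x - 3)

def gradient(x):
    return 2 * (x - 3)

x = 0.0             # initial guess for the parameter
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * gradient(x)  # step opposite to the gradient

print(round(x, 4))  # converges close to 3.0
```

The learning rate controls the step size: too small and convergence is slow, too large and the iterates can overshoot and diverge.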

Newton’s method is also an optimization algorithm used for finding the minimum of a function. It is not specific to machine learning but has applications in optimization problems, including those related to machine learning. Newton’s method uses second-order information, in addition to the gradient, to determine the direction and step size for optimization. It can converge faster than gradient descent in some cases; however, it may not always be feasible to compute for large-scale problems.
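For comparison, here is a sketch of Newton's method on another made-up function, f(x) = x⁴ − 8x. Unlike gradient descent, each step divides the first derivative by the second derivative, which sets the step size automatically and typically converges in far fewer iterations.

```python
# Newton's method for minimizing the toy function f(x) = x**4 - 8*x
# Uses the first derivative  f'(x)  = 4*x**3 - 8
# and the second derivative  f''(x) = 12*x**2 to pick the step.

def f_prime(x):
    return 4 * x**3 - 8

def f_double_prime(x):
    return 12 * x**2

x = 1.0  # initial guess
for _ in range(10):
    x = x - f_prime(x) / f_double_prime(x)  # Newton update

print(round(x, 4))  # close to 2 ** (1/3) ≈ 1.2599, the true minimum
```

In machine learning, the second derivative becomes the Hessian matrix of all the parameters, which is why computing it is often infeasible for large models.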

The main goal, in the end, is to minimize the cost or loss function of a model using these methods.

Khan Academy is a good free resource for newbies to understand the building blocks of these methods:

  • Precalculus
  • Calculus 1

1.3 Probability & Statistics

Probability and statistics are the core of data science, enabling us to navigate the complexities of uncertainty and extract meaningful insights from raw data. Probability is the cornerstone of addressing uncertainty and variability, giving us the tools to model, quantify, and reason about inherent randomness, noise, and unpredictability.

Statistics complements probability by providing the methodology to draw meaningful insights from data, whether you’re making sense of complex datasets, detecting hidden patterns, making informed decisions based on empirical evidence, or validating hypotheses, which is the final goal of a data science project.

To review these fundamental principles of probability, from random variables and probability distributions to statistical inference, statistical modeling, hypothesis testing, and decision-making in data analysis, I used the following sources:

  • Khan Academy – Statistics and Probability
  • Krish Naik
    • Statistics in Machine Learning Playlist
    • Live Statistics Playlist

Keep the following concepts in mind at first: sample and population, describing distributions, probability distributions, the central limit theorem, point estimation, confidence intervals, and hypothesis testing (p-value). Pay a little extra attention and get a deeper understanding of Bayes' Theorem, because it will be very important down the line!
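To show why Bayes' Theorem deserves that extra attention, here is a classic sketch with made-up numbers: a test for a condition affecting 1% of a population. Even with a fairly accurate test, the posterior probability of actually having the condition after a positive result is surprisingly low.

```python
# Bayes' Theorem with hypothetical numbers:
# P(condition) = 0.01, P(positive | condition) = 0.95,
# P(positive | no condition) = 0.05. What is P(condition | positive)?

p_condition = 0.01
p_pos_given_condition = 0.95
p_pos_given_no_condition = 0.05

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_no_condition * (1 - p_condition))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos

print(round(p_condition_given_pos, 3))  # roughly 0.161
```

The low prior (1%) dominates the result: most positive tests come from the much larger healthy group. This interplay between prior and evidence is exactly what Bayesian methods exploit later on.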

1.4 Stochastic Processes

1.4.1 – What is a Stochastic Process?

Stochastic Processes (or Random Processes) are not covered in the above course, but they are another good topic to have some general knowledge about. Stochastic processes are mathematical models used to describe the evolution of random variables over time. They are a fundamental concept in probability theory and statistics and are commonly used to model uncertainty and randomness in various real-world phenomena.

It seems complicated, but it is not that bad. Imagine you’re playing a game with a spinner that can land on different colors. Each time you spin it, you don’t know which color it will land on – it’s random. The sequence of colors you get after each spin is a bit like a stochastic process.

In the real world things are uncertain or unpredictable. Stochastic processes help us understand and model these uncertainties. They can be used to describe things like the stock market, where the prices of stocks go up and down unpredictably, or the weather, which can change from day to day.

So, think of a stochastic process as a way to study and make sense of randomness in different aspects of life, just like our spinner game where you can’t predict the outcome, but you can analyze the patterns over many spins to make better decisions or predictions.
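The spinner analogy can be simulated in a few lines (a toy sketch, with three made-up colors): each individual spin is unpredictable, but the frequencies over many spins settle into a stable pattern you can analyze.

```python
import random

# Simulating the spinner as a stochastic process: each spin is random,
# but the pattern over many spins is predictable.
random.seed(42)  # fixed seed so the simulation is reproducible
colors = ["red", "blue", "green"]
spins = [random.choice(colors) for _ in range(10_000)]

# A single spin is unpredictable, but each frequency settles near 1/3
for color in colors:
    freq = spins.count(color) / len(spins)
    print(color, round(freq, 2))  # each close to 0.33
```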

1.4.2 – Stochastic Process in the context of Machine Learning

Stochastic processes may not be directly used as a primary technique for modeling and making predictions, but they play a significant role in understanding and dealing with uncertainty in various aspects of machine learning, including data modeling, inference, optimization, and simulation. They can be indirectly related to machine learning in several ways:

  • Data Generation: Stochastic processes are sometimes used to model the generation of data in machine learning. For instance, time series data, where observations occur at different time points, can be modeled using stochastic processes. A common example is the autoregressive (AR) or moving average (MA) models used to describe the behavior of time series data.
  • Bayesian Inference: In Bayesian machine learning, stochastic processes play a role in modeling uncertainty. Bayesian models often use prior distributions, which can be viewed as stochastic processes that describe prior beliefs about the parameters of a model. These prior distributions are combined with likelihood functions to update our beliefs in a probabilistic manner, allowing us to make predictions and perform inference.
  • Reinforcement Learning: In reinforcement learning, the interaction between an agent and its environment can be considered a stochastic process. The agent takes actions in an environment, and the rewards it receives are often subject to randomness. Stochastic processes, like Markov decision processes (MDPs), are used to model and solve such problems.
  • Monte Carlo Methods: Monte Carlo methods involve using random sampling to estimate complex quantities or perform simulations. These methods are often used in machine learning for tasks like sampling from high-dimensional probability distributions, which is fundamental in Bayesian inference, or for estimating expected values.
  • Stochastic Optimization: Stochastic optimization algorithms, such as stochastic gradient descent (SGD), are widely used in training machine learning models. While not exactly stochastic processes in the traditional sense, they incorporate randomness through the use of random mini-batches of data to update model parameters. This randomness helps the optimization process escape local minima and reach better solutions.
  • Random Variables and Distributions: Machine learning involves working with random variables and probability distributions. These random variables and distributions can be seen as elements of stochastic processes, especially when dealing with probabilistic models like Gaussian Processes or Hidden Markov Models.
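Of the items above, stochastic optimization is the easiest to sketch. Here is a toy stochastic gradient descent fitting a one-parameter line y = w·x to synthetic noisy data (all numbers invented for illustration): the random mini-batches are what make the process stochastic.

```python
import random

# Stochastic gradient descent sketch: fit y = w * x on noisy synthetic
# data using random mini-batches.
random.seed(0)
true_w = 2.0
data = [(x, true_w * x + random.gauss(0, 0.1))
        for x in [i / 100 for i in range(1, 101)]]

w = 0.0
learning_rate = 0.05
for _ in range(500):
    batch = random.sample(data, 10)  # random mini-batch adds stochasticity
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= learning_rate * grad

print(round(w, 1))  # close to the true weight 2.0
```

Because each mini-batch is a random sample, the gradient estimate is noisy; that noise is precisely what helps SGD escape poor local minima in more complex models.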

These processes are widely used in financial applications of Machine Learning, the domain knowledge of my interest, so it made sense to have a high-level understanding of them.

Some good references I found to introduce this topic:

  • A Guide to Stochastic Process and Its Applications in Machine Learning
  • What Does Stochastic Mean in Machine Learning?
  • Stochastic Calculus and Processes: Introduction (Markov, Gaussian, Stationary, Wiener, and Poisson)

2. Programming Languages (Python or R)

Python and R are the most common programming languages used by Data Scientists. This is a choice you have to make, or you can get to know both of them if you wish. With my background in Computer Science, and since I have been working on Python personal projects for a while, I chose to skip reviewing Python.

For those looking for a good reference to learn programming, I suggest the following sources:

  • https://pythonprogramming.net/ which was my main resource in the beginning.
  • Datacamp Introduction to Python, a good place for a newbie.

Another suggestion I would give to someone who is just starting to learn how to program is to go for the “project-based learning” approach, as described in the book “Prepared”. I believe it makes learning more interesting and easier. By focusing on a real project of your interest, you will acquire problem-solving skills and ways to overcome the challenges you face. You will learn faster than by just reading or watching courses online. It is a more active learning approach. This also makes it harder to quit, since you are working on something you want.

Some examples of my earlier projects that may inspire you:

  • Stock market news WebScraping tool
  • Stock market news Classifier (with scikit-learn)
  • Stocks in Play identification tool (similar to what SMB Capital Prop Firm uses)
  • An intraday market simulator for trading practice based on Yahoo Finance data
  • Market Profile Tool

3. Operational Research

4. Next steps

Now that you have built your foundation, let's move on to Data Science!

  • Self-Study guide to become a Data Scientist
