3 Introduction to Your Project
3.1 Purpose of the Project Guide
Welcome to the project guide for your TechAcademy data science project! This document will guide you through the different steps of your project and provide useful hints along the way. However, it is not a detailed step-by-step manual, because we think it is important that you develop the skill of finding your own way to solve different tasks. This is a great way to apply the knowledge and tools you have acquired on DataCamp.
It might happen that you don’t know how to solve a task. This is a normal part of the coding process, so don’t worry. It is part of the learning experience, and we have provided helpful tips throughout this guide. We have also included pictures of what your results could look like. They are meant as guidance so that you know what you are working towards. Your plots don’t need to look exactly the same. We compiled this document with the most recent data available on June 7, 2020, so with newer data your plots will look different anyway. Furthermore, you can find helpful links in the introductory chapters, where your questions might already have been answered. If not, and in the unlikely case that even Google can’t help you, the TechAcademy mentors will help you via Slack or directly during the coding meetups.
At the end of the project guide you will find an overview of all tasks that have to be completed, depending on your track (beginner/advanced). You can use this list to check which tasks still need to be completed and which tasks are relevant for your track.
3.2 What is this Project About?
The coronavirus, or COVID-19, is probably the topic you have heard the most about in these past few months. What was especially striking is that people talked a lot about coronavirus statistics, and instead of bringing clarity to the discussion, we all got a little more confused. Some people said it’s not more deadly than the flu, others said it was 10 times as deadly. Sometimes politicians confidently stated that the situation was under control, and the next day schools were closed and people started buying toilet paper like their lives depended on it. In times like these it is especially useful to be able to evaluate the data and see for yourself what is happening. Fortunately, there is a lot of data on the coronavirus available, so all that is left to do is to learn how to handle it and how to interpret it in a meaningful way.
Following the typical data science workflow, we split this project into two parts. First you are going to learn how to perform an Exploratory Data Analysis (EDA). You will have a closer look at the data, transform it, and then get to know the different variables and what they look like in different types of visualizations. Beginners will have completed the project after this, but it will be beneficial to also try the next part. In the second part of the project you will come up with a model that predicts the coronavirus development as accurately as possible. You are going to start with a linear regression model, which you can modify as you please, and then you can explore all the other possibilities of modeling and predicting data.
But first things first: What exactly is EDA and what can you achieve with it?
3.3 Exploratory Data Analysis – Getting to Know the Data Set
As a first step you will get to know the data set. This means you will describe the data. A crucial part of data science is to familiarize yourself with the data set. What variables are contained in the data set and how are they related? You can answer these questions easily by creating plots with the data.
This first part of the project is structured in a way that lets you get to know the data thoroughly by completing the given tasks one after the other. As a beginner, you can stop after this part, because you will have fulfilled the necessary requirements. However, if this first part inspires you to learn more, we encourage you to also work on the second part.
This project guide is structured in the following format. Since the concepts of data science are independent of any specific programming language, we describe the general approach in the main text. Once you have grasped the overall concept and understood the task we are asking you to do, you will find language-specific tips and tricks in visually separated boxes. If you decided to participate in our program in R, you only need to look at the R boxes. Conversely, you only need to look at the Python boxes if you are coding in that language. From time to time it might be interesting to check out the other language: although you can do the same things in both, they sometimes take a different approach to the identical problem. Before you start, it makes sense to complete the first few beginner courses mentioned in the introductory chapter. We recommend that you finish the courses up to and including “Exploratory Data Analysis” for both tracks.
3.4 Prediction – Apply Statistical Methods
This part is mainly for the advanced TechAcademy participants. If you are a beginner and you were able to complete the first part without too many difficulties, we highly recommend trying the second part as well. Statistical models are a major part of data science and this is your chance to develop skills in this area.
You got to know the data in the first part, so you should now be familiar enough with it to make predictions about the future development of the coronavirus. After having completed the second part, you will send us your predictions and we will then check how accurate your model was. The best model will win!
For this part of the project we recommend the advanced courses mentioned in the introductory chapter. Please note that there are more courses available, so if you want to extend your skills even further, feel free to complete more courses on the topics that interest you. We recommend that you finish the courses up to and including “Unsupervised Learning in Python” for the Python track and “Machine Learning Toolbox” for the R track.
Ready? After getting a first impression of what this project is all about, let’s get started!