2 What’s Data Science and How Do I Do It?
Data science is a multi-layered field in which applying the latest machine learning methods is only one sub-area. Many steps come first, from collecting and manipulating the data to exploring it. And eventually, you will need to communicate your findings.
But first things first. Before data can be analyzed, it has to be obtained. You need to know where to find it and how to load it into your tools. Data is rarely available in the form needed for further processing. Familiarizing yourself with the available information, cleaning it up and converting it into formats that both humans and machines can read are important steps that often make up a large part of the work.
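As a rough illustration of the cleaning step, the sketch below tidies a few raw records using only the Python standard library. The column names (`country`, `cases`) and the sample values are made up for this example and do not come from the project data.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent casing,
# and a missing value. Not taken from the project data.
RAW = """country,cases
 germany ,12
FRANCE,
Italy, 7
"""


def clean_rows(text):
    """Return a list of dicts with tidy country names and integer case counts."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        country = row["country"].strip().title()
        cases = row["cases"].strip()
        if not cases:
            # Skip records without a case count instead of guessing a value.
            continue
        cleaned.append({"country": country, "cases": int(cases)})
    return cleaned


rows = clean_rows(RAW)
print(rows)
```

Dropping incomplete records is only one possible choice; depending on the question at hand, keeping them and marking the value as missing may be more appropriate.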
Before the obtained data can be analyzed, the right tool must be selected and mastered: the programming language. The most commonly used languages for data science are R, which was developed explicitly for statistics, and Python, which is characterized by its versatility. A data scientist does not have to be a perfect software developer who masters every detail and paradigm, but competent handling of the syntax and idiosyncrasies of the language is essential.
There are some well-developed method collections, so-called packages or libraries, which provide a lot of functionality. The use of these collections also has to be learned and mastered.
Once all of this is achieved, the data can finally be analyzed. Here too, it is important to know and understand the multitude of statistical approaches in order to be able to choose the right method for the problem at hand. The newest, best, most beautiful neural network is not always the solution for everything.
One step is still missing in the data science process: understanding and communicating the results. The results are often not immediately intuitive and are sometimes even surprising. This is where domain expertise and creativity come into play, especially in visualization.
2.1 What’s R?
R is a programming language that was developed by statisticians in the early 1990s for statistical computing and visualization. A lot has happened since then, and by now R is one of the most widely used programming languages in the field of data science. R code does not have to be compiled but can be run interactively. This makes it possible to quickly gain basic insights into existing data and to display them graphically.
R offers not just a programming language but a complete system for solving statistical problems. A large number of packages and interfaces are available that extend its functionality and allow integration into other applications.
2.1.1 RStudio Cloud
Before you can use R, you usually have to install some separate programs locally on your computer. Typically, you first install a “raw” version of R. In theory, you can then already start programming. However, it is very difficult to carry out an entire project with it. That’s why there is RStudio, an Integrated Development Environment (IDE) for R. It includes many essential features that simplify programming with R, among them auto-completion of your code, a nicely structured user interface and many expansion options.
Experience has shown that installing R and RStudio locally takes some effort. Fortunately, RStudio also has a cloud solution that eliminates these steps: RStudio Cloud. There it is possible to edit your project in exactly the same IDE in the browser without any prior installations. You can also easily switch your project from private to public and give your team an insight into your code via a link or by giving them access to the workspace directly. In this way you are able to easily exchange ideas within your team.
We will introduce RStudio Cloud and unlock access to our workspace on our first Coding Meetup. Until then, focus on learning the hard skills of programming with your courses on DataCamp. This brings us to your curriculum in the next section.
2.1.2 Curriculum
The following list shows the required DataCamp courses for the Data Science with R Track. As a beginner, please stick to the courses of the “beginner” program; ambitious beginners can of course also take the advanced courses afterwards. However, the courses should be worked through in the order in which they are listed.
The same applies to the advanced courses. Here, too, the specified courses should be worked through in the given order. Since you may already have mastered the topics of an advanced course, individual courses can be replaced. The topics of the advanced courses are given in key points. If these key points seem familiar to you, take a look at the table of contents of the corresponding DataCamp course. If you are convinced that this course does not provide any added value for you, it can be replaced by one of the courses in the “Exchange Pool” (see list). However, this exchange course should not be started until all other courses in the advanced program have been completed.
Both beginners and advanced learners must have completed at least two thirds of the curriculum in order to receive the certificate. For the beginners this means at least up to the course “Data Visualization with ggplot2 (Part 1)” and for the advanced at least up to “Supervised Learning in R: Classification”. In addition, at least two thirds of the project tasks must have been completed.
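The two-thirds rule for the beginner program can be checked with a little arithmetic. The sketch below assumes that the rule counts courses rather than hours, which is an interpretation and not something the guide states; the function name `required_count` is made up for this example.

```python
# Beginner courses in the order given in this section.
BEGINNER_COURSES = [
    "Introduction to R",
    "Intermediate R",
    "Introduction to Importing Data in R",
    "Cleaning Data in R",
    "Data Manipulation with dplyr",
    "Data Visualization with ggplot2 (Part 1)",
    "Exploratory Data Analysis in R",
    "Correlation and Regression in R",
    "Multiple and Logistic Regression in R",
]


def required_count(total):
    """Smallest whole number of courses that is at least two thirds of total."""
    return (2 * total + 2) // 3  # integer ceiling of 2 * total / 3


needed = required_count(len(BEGINNER_COURSES))
last_required = BEGINNER_COURSES[needed - 1]
print(needed, "courses, up to and including:", last_required)
```

With nine courses, the threshold falls on the sixth course, which matches the course named in the paragraph above.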
R Fundamentals (Beginner)
- Introduction to R (4h)
- Intermediate R (6h)
- Introduction to Importing Data in R (3h)
- Cleaning Data in R (4h)
- Data Manipulation with dplyr (4h)
- Data Visualization with ggplot2 (Part 1) (5h)
- Exploratory Data Analysis in R (4h)
- Correlation and Regression in R (4h)
- Multiple and Logistic Regression in R (4h)
Machine Learning Fundamentals in R (Advanced)
- Intermediate R (6h): conditionals, loops, functions, apply
- Introduction to Importing Data in R (3h): utils, readr, data.table, XLConnect
- Cleaning Data in R (4h): raw data, tidying & preparing data
- Importing & Cleaning Data in R: Case Studies (4h): case studies
- Data Visualization with ggplot2 (Part 1) (5h): aesthetics, geometries, qplot
- Supervised Learning in R: Classification (4h): kNN, naive Bayes, logistic regression, classification trees
- Supervised Learning in R: Regression (4h): linear & non-linear regression, tree-based methods
- Unsupervised Learning in R (4h): k-means, clustering, dimensionality reduction
- Machine Learning with caret in R (4h): train() function, cross-validation, AUC
Data Science R (Advanced) – Exchange Pool
2.1.3 Links
- RStudio Cheat Sheets: https://rstudio.cloud/learn/cheat-sheets
- RMarkdown Explanation (to document your analyses): https://rmarkdown.rstudio.com/lesson-1.html
- StackOverflow (forum for all kinds of coding questions): https://stackoverflow.com/
- CrossValidated (Statistics and Data Science forum): https://stats.stackexchange.com/
2.2 What’s Python?
Python is a dynamic programming language. The code is executed by an interpreter, which means that it does not have to be compiled first. This makes Python very easy and quick to use. Good usability, easy readability and simple structuring were and still are core ideas in the development of this programming language.
In principle, Python supports programming in any paradigm: structured and object-oriented programming come most naturally given the design of the language, but functional or aspect-oriented programming is also possible. These options give users great freedom to design projects the way they want, but also great freedom to write code that is confusing and difficult to understand. For this reason, certain standards based on the so-called Python Enhancement Proposals (PEPs) have developed over the decades.
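To make the different paradigms a little more concrete, here is a minimal sketch that solves the same small task (summing the squares of the even numbers in a list) in a structured, an object-oriented and a functional style. The task and all names are made up for illustration. The naming follows PEP 8, the style guide among the PEPs: lowercase with underscores for functions, CapWords for classes.

```python
from functools import reduce

values = [1, 2, 3, 4]


# Structured (procedural) style: an explicit loop and an accumulator.
def sum_even_squares_loop(numbers):
    total = 0
    for number in numbers:
        if number % 2 == 0:
            total += number ** 2
    return total


# Object-oriented style: data and behaviour live together in a class.
class NumberCollection:
    def __init__(self, numbers):
        self.numbers = list(numbers)

    def sum_even_squares(self):
        return sum(n ** 2 for n in self.numbers if n % 2 == 0)


# Functional style: compose filter, map and reduce without mutating state.
def sum_even_squares_functional(numbers):
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(lambda n: n ** 2, evens)
    return reduce(lambda a, b: a + b, squares, 0)


print(sum_even_squares_loop(values))
print(NumberCollection(values).sum_even_squares())
print(sum_even_squares_functional(values))
```

All three variants compute the same result; which one reads best depends on the surrounding code and on what your team is used to.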
2.2.1 Anaconda and Jupyter
Before you can use Python, it must be installed on the computer. Python is often preinstalled on Linux and other Unix-like systems (such as macOS), but frequently in an older version. Since Python version 2, which is no longer supported, and version 3 differ in how they are used, we decided to work with version 3.6 or higher.
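If you are unsure which version you are running, a quick check along the following lines can help. This is only a sketch: the helper `is_supported` and the constant `MINIMUM_VERSION` are names made up for this example.

```python
import sys

# Minimum version used in this project, as stated above.
MINIMUM_VERSION = (3, 6)


def is_supported(version_info=sys.version_info):
    """Return True if the given version is at least MINIMUM_VERSION."""
    return tuple(version_info[:2]) >= MINIMUM_VERSION


print("Running Python", sys.version.split()[0])
print("Meets the 3.6 requirement:", is_supported())
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.6".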
One of the easiest ways to get both Python and most of the best-known programming libraries is to install Anaconda. Detailed installation instructions for all operating systems are available on the provider’s website.
With Anaconda installed, all you have to do is open the Anaconda Navigator and you’re ready to go. There are two ways to get started: Spyder or Jupyter. Spyder is an integrated development environment (IDE) for Python and offers everything from syntax highlighting to debugging (links to tutorials below). The other option is to use Jupyter notebooks, a browser-based interface for executing commands. The big advantage is that you can quickly write short pieces of code and try them out interactively without writing an entire executable program.
Now you can get started! If you have not worked with Jupyter before, we recommend that you complete the course on DataCamp (https://www.datacamp.com/projects/33) first. There you will get to know many tips and tricks that will make your workflow with Jupyter much easier.
In order to make your work and, above all, the collaboration easier, we are providing you with a platform that contains a Jupyter environment with the necessary libraries, as well as all data necessary for the project. There you will also find a brief explanation in the form of a running Jupyter notebook and introductions to the work with the data sets we have selected. Using a regular folder structure, you can quickly navigate to the folder of your group and your personal folder, in which the necessary files are stored for a smooth start.
We will introduce this environment and unlock access to it during our first Coding Meetup. Until then, focus on learning the hard skills of programming with your courses on DataCamp. This brings us to your curriculum in the next section.
2.2.2 Curriculum
The following list shows the DataCamp courses for the Python data science track. As a beginner, please follow the courses for the beginner level. These should be processed in the order in which they are listed.
The same applies to the advanced courses. Here, too, the specified courses should be processed in the given order. Since it can of course happen that you have already mastered the topics of an advanced course, individual courses can be replaced. The topics of the advanced courses are given in brief. If these key points seem familiar to you, then take a look at the table of contents of the corresponding DataCamp course.
If you are convinced that this course does not provide any added value for you, it can be replaced by one of the courses in the “Exchange Pool” (see list). However, this exchange course should not be started until all other courses in the advanced Python program have been completed.
Both beginners and advanced learners must have completed at least two thirds of the curriculum in order to receive the certificate. For beginners this means at least up to the course “Manipulating DataFrames with pandas” and for advanced learners at least up to the project “Exploring the Bitcoin Cryptocurrency Market”. In addition, at least two thirds of the project tasks have to be completed.
Python Fundamentals (Beginner)
- Introduction to Data Science in Python (4h)
- Intermediate Python (4h)
- Python Data Science Toolbox (Part 1) (3h)
- Introduction to Data Visualization with Matplotlib (4h)
- Manipulating DataFrames with pandas (4h)
- Merging DataFrames with pandas (4h)
- Exploratory Data Analysis in Python (4h)
- Introduction to DataCamp Projects (2h)
- Introduction to Linear Modeling in Python (4h)
Data Science with Python (Advanced)
- Intermediate Python (4h): Matplotlib, dictionaries, pandas, loops
- Python Data Science Toolbox (Part 1) (3h): default arguments, lambdas, error handling
- Python Data Science Toolbox (Part 2) (4h): iterators, generators, list comprehensions
- Cleaning Data in Python (4h): using pandas for data cleaning
- Exploring the Bitcoin Cryptocurrency Market (3h): Small project
- Exploratory Data Analysis in Python (4h): How to start a data analysis
- Introduction to Linear Modeling in Python (4h): Linear Regression, sklearn
- Supervised Learning with Scikit-Learn (4h): Classification, Regression, Tuning
- Linear Classifiers in Python (4h): Logistic regression, SVM, Loss functions
Data Science with Python (Advanced) - Exchange Pool
- TV, Halftime Shows and the Big Game (4h)
- Interactive Data Visualization with Bokeh (4h)
- Time Series Analysis (4h)
- Machine Learning for Time Series Data in Python (4h)
- Advanced Deep Learning with Keras (4h)
- Data Visualization with Seaborn (4h)
- Web Scraping in Python (4h)
- Writing Efficient Python Code (4h)
- Unsupervised Learning in Python (4h)
- Writing Efficient Code with pandas (4h)
- Introduction to Deep Learning in Python (4h)
- ARIMA Models in Python (4h)
2.2.3 Links
Official Tutorials/Documentation:
Further Explanations:
2.3 Your Data Science Project
2.3.1 Coding Meetups and Requirements
Now that you have learned the theoretical foundation in the DataCamp courses, you can put your skills into practice. We have put together a project for you based on real data sets. You can read about the details in the following chapters of this project guide.
Of course, we will also go into more detail about the project and the tools that go with it. We will discuss everything you need to know during the first Coding Meetup, which will take place on May 20, 2020. After that, the work on the project will officially begin.
You can find the exact project tasks together with further explanations and hints in the following chapters.
To receive the certificate, it is essential that you have solved at least two thirds of the “Exploratory Data Analysis” part of the project. For the advanced participants, the entire “forecast using statistical models” part is added. In addition, two thirds of the respective curriculum on DataCamp must be completed. You can find more detailed information on this in the “Curriculum” section of the respective programming language above.
2.3.2 Data Sources
For this project, we use several publicly available data sources. The Covid-19 infection data is provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). They provide an accessible, daily updated GitHub repository with worldwide case data.
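Time series tables of this kind are often stored in a "wide" layout, with one row per region and one column per date, whereas most analyses are easier with one record per region and date. The sketch below shows such a reshaping step using only the standard library. It assumes the column layout `Province/State`, `Country/Region`, `Lat`, `Long` followed by date columns in month/day/two-digit-year form; please check this against the actual files. The sample numbers are invented.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the assumed wide layout. The numbers are invented
# and only serve to demonstrate the reshaping.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Germany,51.0,9.0,0,1
Hubei,China,30.9,112.2,10,25
"""

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(text):
    """Turn one row per region into one record per region and date."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue
            records.append({
                "country": row["Country/Region"],
                "province": row["Province/State"] or None,
                "date": datetime.strptime(column, "%m/%d/%y").date(),
                "confirmed": int(value),
            })
    return records


long_records = wide_to_long(SAMPLE)
for record in long_records:
    print(record)
```

In practice you would probably do this with a data frame library, but the idea stays the same: the identifier columns are kept, and every date column becomes its own record.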
For Google search data, we use Google Trends, a very interesting free service. If you’re interested in diving further into this data source, you can easily download data from this website or directly via R or Python packages.
Our stock market data comes from Yahoo! Finance, which is also easily downloadable directly into your programming environment with convenient packages in both languages.