Disclaimer: This article is part of “Secrets of the MSc in Statistics and Data Science at KU Leuven”. Note that the information below is purely subjective and does not represent the official views of the program.
Author: Alex Puyneers
About the other articles:
- Secrets of the Master of Statistics and Data Science at KU Leuven (Introduction)
- Belgian University Education System & The Master of Statistics and Data Science at KU Leuven
- KU Leuven Master of Statistics and Data Science: Tips for international students & FAQs
- DataCamp: what is it and how can it help me?
- Statistics vs Data Science: Understanding the Differences
PYTHON VS R
Most students who enroll in the Master of Statistics and Data Science want to become data scientists, “the sexiest job” according to Forbes. When starting the program, they will encounter two programming languages that are widely used in data science: R and Python. This article introduces both and shows the main differences between them.
HISTORY
R is an open-source language made especially for statistical computing and the visual representation of data, and it is very popular in academia. It was built for statistical analysis and has many packages for visualizing data.
Python is also an open-source language, but it is used for a multitude of purposes: not only data science and statistics but also web development and even game development. One of the biggest reasons for its popularity is its large community of users and developers, who build libraries for many different purposes.
INTEGRATED DEVELOPMENT ENVIRONMENT (IDE)
The first main difference between the two languages is their “integrated development environment” (IDE), a software application that helps the programmer write efficient code. While it is possible in theory to write code in a plain text editor, an IDE will be fundamental for the program, since multiple courses require installing packages.
Python has multiple IDEs, the most used being Jupyter Notebook, VS Code, and PyCharm. While Jupyter can also be used with R, the main IDE for programming in R is RStudio.
INSTALLATION
Installing R is very straightforward: download and install R, then install RStudio, and it is already possible to start coding. A fresh installation lacks some functions needed for the statistics courses. Fortunately, R has a wide range of packages that you can install by simply writing the line install.packages("x"), where x is the name of the package. Once the package is installed, writing library(x) loads it into the current session.
In Python, packages are installed from the command prompt (for example with pip, Python's package installer) and then imported as modules in your Python source code. It is recommended to download Anaconda, a Python distribution that comes with the conda package manager and the most useful packages for mathematics, science, and engineering.
Installing Python is not as simple, and it is very easy to make a mistake that later causes problems when importing the installed packages. It is recommended to follow a step-by-step guide when installing Python and its tools.
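One quick way to check whether an installation worked is to ask Python which packages it can actually find. The snippet below is a minimal sketch using only the standard library. The helper name missing_packages and the list of required packages are made up for illustration and are not an official list from the program.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names Python cannot find in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list of packages a course might ask for.
required = ["numpy", "pandas", "matplotlib"]

missing = missing_packages(required)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages can be imported.")
```

If a package you installed shows up as missing, the IDE is probably using a different Python environment than the one you installed the package into.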
PACKAGES
R has a wide range of libraries and functions that are specifically designed for data manipulation and statistical analysis. It is possible to find well over 12,000 packages on CRAN, the Comprehensive R Archive Network (an open-source repository).
Python can perform many of the same activities as R, including data manipulation, engineering, feature selection, web scraping, and app development. In the past it did not have many machine learning and data analysis libraries. Nowadays Python does have libraries for data manipulation, although they are not as extensive as those in R.
The most common R packages are dplyr for data manipulation, tidyr for cleaning data, and ggplot2 for visualizing data.
For Python, the most used packages for data science are NumPy, which has a lot of functions for scientific computing, pandas for data manipulation, and Matplotlib for data visualization.
STRUCTURE
Code written in R looks very different from code written in Python. While the two share some similarities, like not needing semicolons at the end of a line, they are distinct.
After a package is installed in R, all you need to do is type library(package), and every command of the package becomes available.
In Python, to import a package the user needs to write “import [package]”, and every time they want to use a command of the package they need to write [package].[command].
The user can also shorten the name of the package if they wish: after writing “import pandas as pd”, the command pandas.DataFrame becomes pd.DataFrame.
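The two import styles can be sketched as follows. To keep the example runnable without installing anything, it uses the standard-library modules math and statistics as stand-ins (an assumption made for illustration). The same pattern applies to “import pandas as pd”.

```python
import math              # full name: commands are written as math.<command>
import statistics as st  # alias: commands are written as st.<command>

# Prefix the command with the package name.
print(math.sqrt(16))

# Prefix the command with the alias chosen in the import line.
print(st.mean([1, 2, 3]))
```

The alias is only a shorter name for the same package, so st.mean and statistics.mean refer to the same function.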
Python has some “quirks”. One of them is the way it counts rows and columns: Python starts at zero. If the user wants the first row of a pandas data frame called df, they need to write df.iloc[0], and df.iloc[:, 0] returns the first column.
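A minimal sketch of zero-based positions in pandas is shown here. It assumes pandas is installed. The data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cem"], "score": [15, 12, 18]})

first_row = df.iloc[0]        # position 0 is the first row
first_column = df.iloc[:, 0]  # all rows, column at position 0
last_row = df.iloc[-1]        # negative positions count from the end

print(first_row["name"])
print(list(first_column))
print(last_row["score"])
```

In R the same first row would be df[1, ], because R starts counting at one.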
Unlike in R, indentation is important in Python. What is merely a common practice for readability in other languages is mandatory here.
In an if-else statement, for instance, the print command MUST be indented further than the if and else lines; otherwise, the code will not work. Please pay attention!
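The indentation rule can be illustrated with a small if-else statement. This is only a sketch. The function name report and the pass mark of 10 out of 20 are assumptions chosen for the example.

```python
def report(score):
    """Print and return a verdict for a hypothetical exam score out of 20."""
    if score >= 10:
        verdict = "pass"
        print(verdict)  # indented under if: runs only when score >= 10
    else:
        verdict = "fail"
        print(verdict)  # indented under else: runs only when score < 10
    return verdict      # aligned with if/else: runs in both cases


report(14)
report(7)
```

If the print lines were moved back to the same level as if and else, Python would stop with an IndentationError instead of running the code.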
DIFFICULTY
Python is a popular general-purpose, object-oriented language that is well known for being beginner-friendly, while R is described as a special-purpose language, which can make it more difficult to learn.
Despite that, students of a master in statistics have a better grasp of statistical equations than the average person, and that is why R may feel easier to use: it was specifically built for statistical models and graphs. Python, despite being the simpler language in general, can be more convoluted for statistical work.
One example is importing data. In Python you need to write a line of code specifying the location of your data. RStudio, on the other hand, has a button called “Import Dataset”, which not only lets you browse for the file but also automatically generates the line of code needed. It also previews the data before importing it, with some options to tweak it so that it is structured as the user intends.
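On the Python side, importing a file could look like the sketch below. It assumes pandas is installed, and it first writes a small made-up CSV file (the file name grades.csv and its contents are hypothetical) so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a small, made-up CSV file so the example is self-contained.
folder = Path(tempfile.mkdtemp())
csv_path = folder / "grades.csv"  # hypothetical file name
csv_path.write_text("name,score\nAnn,15\nBob,12\nCem,18\n")

# The one line the text refers to: tell pandas where the file is.
grades = pd.read_csv(csv_path)

# Rough equivalents of the preview RStudio shows before importing.
print(grades.head())   # first rows of the data
print(grades.dtypes)   # column types that pandas inferred
```

With your own data, only the path passed to pd.read_csv changes. Checking head() and dtypes right after loading plays the role of the RStudio preview.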
ADDITIONAL INFORMATION
While R is widely adopted in scientific research and academia, nowadays Python is more popular than R in the workplace due to its simplicity and its many uses besides data science. Therefore, even though R has better tools for data science, it is advisable to learn Python well, depending on what career the student intends to follow.
Python and R in the Master of Statistics and Data Science at KU Leuven
You may have guessed already. In this program, you will meet (and have to study) more R-based courses than Python-based ones, because R is known to be strong in academia and statistics. But there are many Python-focused courses as well, and you may design your Individual Study Program with these courses if your interest lies more in Python than in R. Again, everything depends on your career path.