ggplot2 style plotting in Python

R is my language of choice for data science but a good data scientist should have some knowledge of all of the great tools available to them. Recently, I have been gleefully using Python for machine learning problems (specifically pandas and the wonderful scikit-learn). However, for all its greatness, I couldn’t help but feel it lacks a bit in the data visualisation department. Don’t get me wrong, matplotlib can be used to produce some very nice visualisations but I think the code is a bit messy and quite unintuitive when compared to Hadley Wickham’s ggplot2.

I’m a huge fan of the ggplot2 package and was delighted to discover that there has been an attempt to replicate its style in Python via the ggplot package. I wanted to compare the two packages and see just how well ggplot matches up to ggplot2. Both packages contain built-in datasets and I will use the mtcars data to build a series of plots to see how they compare, both visually and syntactically.

Here we go….

Scatterplots

Scatterplots are great for bivariate profiling and revealing relationships between variables.  A simple scatterplot using ggplot2 in R:

library(ggplot2)  # provides ggplot(), aes() and the geoms; mtcars ships with base R
ggplot(mtcars , aes(x = hp , y = mpg)) +
  geom_point()

[Figure: R ggplot2 scatterplot of mpg against hp]

The same scatterplot using ggplot in Python:

from ggplot import *  # brings in ggplot, aes, the geoms and the mtcars data
ggplot(mtcars , aes(x = 'hp' , y = 'mpg')) +\
    geom_point()

[Figure: Python ggplot scatterplot of mpg against hp]

Not much of a difference there. The syntax is also very similar, but in Python’s ggplot each + that ends a line is followed by a \, Python’s line-continuation character, so that the statement can carry on to the next layer. When mapping variables to the x and y coordinates, ggplot also requires the column names to be passed as quoted strings.
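
The backslashes are a Python requirement rather than anything specific to ggplot, and wrapping the whole expression in parentheses should work just as well. The sketch below shows the two styles side by side. The Layer class is a hypothetical stand-in for a plot object (it is not part of the ggplot package) so that the snippet runs on its own.

```python
class Layer:
    """Hypothetical stand-in for a plot object that supports `+`."""

    def __init__(self, *parts):
        self.parts = list(parts)

    def __add__(self, other):
        # Adding two layers returns a new object holding both sets of parts.
        return Layer(*self.parts, *other.parts)


# Style used in this post: a backslash continues the statement on the next line.
with_backslash = Layer("ggplot") +\
    Layer("geom_point")

# Alternative: inside parentheses Python continues lines implicitly,
# so no backslashes are needed.
with_parens = (
    Layer("ggplot")
    + Layer("geom_point")
)

print(with_backslash.parts == with_parens.parts)
```

Both forms build the same object, so the choice between them is purely one of taste.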

Boxplots

Boxplots are very nice for visualising how a variable is distributed across the levels of a discrete variable. In R’s ggplot2, I discretise the cyl variable with the factor() function to create a boxplot showing the distribution of mpg for each number of cylinders (4, 6 and 8).

ggplot(mtcars , aes(x = factor(cyl) , y = mpg)) +
  geom_boxplot()


[Figure: R ggplot2 boxplot of mpg by number of cylinders]

In Python, we need to discretise the cyl variable with the pandas.factorize() function before plotting with ggplot. The ordering is different in the Python plot output, but reordering may be possible, as it is in R. Also, note that factorize() replaces the cylinder counts with integer codes in order of first appearance, so 0 = 6 cylinders, 1 = 4 cylinders and 2 = 8 cylinders.

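import pandas as pd
# Overwrites cyl with integer codes (0 = 6, 1 = 4, 2 = 8 cylinders)
mtcars['cyl'] = pd.factorize(mtcars.cyl)[0]
ggplot(mtcars , aes(x = 'cyl' , y = 'mpg')) +\
  geom_boxplot()

[Figure: Python ggplot boxplot of mpg by cyl code]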
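
The unusual ordering comes from pandas.factorize() numbering values in the order it first meets them. A minimal sketch of that behaviour, using only the first five cyl values of mtcars, and of the sort=True argument that numbers the values in sorted order instead:

```python
import pandas as pd

# First five cyl values from mtcars, in row order.
cyl = pd.Series([6, 6, 4, 6, 8])

# Default: codes follow the order of first appearance (6 -> 0, 4 -> 1, 8 -> 2).
codes, uniques = pd.factorize(cyl)

# sort=True: codes follow the sorted values (4 -> 0, 6 -> 1, 8 -> 2).
sorted_codes, sorted_uniques = pd.factorize(cyl, sort=True)

print(list(codes), list(uniques))
print(list(sorted_codes), list(sorted_uniques))
```

Whether the sorted codes also fix the order of the boxes in ggplot is something I have not checked. Writing the codes to a new column rather than overwriting cyl would also keep the original cylinder counts available.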

Histograms

A fundamental tool for univariate profiling, histograms show the frequency distribution of a variable. In R’s ggplot2, I plot the distribution of mpg across the mtcars data and add a few more components: black bar outlines, a red fill, ten bins and modified x axis tick labels.

ggplot(mtcars , aes(x = mpg)) +
  geom_histogram(colour = "black" , fill = "red" , bins = 10) +
  scale_x_continuous(breaks = seq(0 , 40, 5))

[Figure: R ggplot2 histogram of mpg with ten bins]

With Python’s ggplot, the histogram is not as tidy. The shape of the distribution also looks a little different despite bins being set to ten in both. This is most likely down to how each package places its bin edges rather than to any difference in the data; the information within the plots is still the same.

ggplot(mtcars , aes(x = 'mpg')) +\
  geom_histogram(fill = 'red' , bins = 10 , color = 'black')
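
To see why ten bins need not mean the same ten bars, here is a small pure-Python sketch that counts the mtcars mpg values under two different binning rules. Both rules are assumptions made for illustration; I have not confirmed which rule either package actually uses, and count_bins is a helper written for this example.

```python
# mpg values from the mtcars dataset.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]


def count_bins(values, start, width, n_bins):
    """Count values in n_bins equal-width bins beginning at start."""
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value on the last edge lands in the final bin.
        idx = min(int((v - start) / width), n_bins - 1)
        counts[idx] += 1
    return counts


n_bins = 10
lo, hi = min(mpg), max(mpg)

# Rule A (assumed): ten bins laid edge to edge from the minimum to the maximum.
width_a = (hi - lo) / n_bins
counts_a = count_bins(mpg, lo, width_a, n_bins)

# Rule B (assumed): ten bins whose first and last are centred on the minimum
# and maximum, so the bins are slightly wider and start below the minimum.
width_b = (hi - lo) / (n_bins - 1)
counts_b = count_bins(mpg, lo - width_b / 2, width_b, n_bins)

print(counts_a)
print(counts_b)
```

Both lists account for all 32 cars, but the bar heights differ, which is enough to change the apparent shape of the distribution.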


Facet Plots

The facet wrapping function in ggplot2 can create fantastic visualisations when using larger datasets. A simple example is given in both implementations.

In R’s ggplot2, quarter mile time (qsec) is plotted against horsepower (hp) for each number of cylinders category. The facet wrapping splits the data by the specified discrete variable, in this case cyl, and plots the qsec/hp relationship for each level.

ggplot(mtcars , aes(x = hp , y = qsec)) +
 geom_point() +
 facet_wrap(~factor(cyl))

[Figure: R ggplot2 scatterplots of qsec against hp, faceted by cyl]

And in Python’s ggplot (note that the same codes for cyl from the boxplot example are used), a similar output is seen. The slight difference is the absence of the grey border along the top of each plot in ggplot.

ggplot(mtcars , aes(x = 'hp' , y = 'qsec')) +\
  geom_point() +\
  facet_wrap('cyl')

[Figure: Python ggplot scatterplots of qsec against hp, faceted by cyl code]
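
Conceptually, facet wrapping is just a group-by followed by one small plot per group. The sketch below mimics the grouping step in plain Python on five mtcars rows; the panels dictionary is purely illustrative and is not how either package stores its facets.

```python
from collections import defaultdict

# A handful of mtcars rows as (cyl, hp, qsec); the full dataset has 32.
rows = [
    (6, 110, 16.46),  # Mazda RX4
    (4, 93, 18.61),   # Datsun 710
    (8, 175, 17.02),  # Hornet Sportabout
    (6, 105, 20.22),  # Valiant
    (8, 245, 15.84),  # Duster 360
]

# One "panel" per distinct cyl value, each holding that group's (hp, qsec) points.
panels = defaultdict(list)
for cyl, hp, qsec in rows:
    panels[cyl].append((hp, qsec))

for cyl in sorted(panels):
    print(f"cyl = {cyl}: {panels[cyl]}")
```

Each key would become one facet, with its list of points drawn on that facet’s own axes.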

Making Things a Little Fancier

The previous examples are very simple, but fundamental, plots in data science. With ggplot2 in R, one can be highly creative with their data visualisations by representing categories by colour, facet wrapping, etc. to create plots which hold a lot of information. The following plots are quick examples of how one can be more creative using both packages.

In ggplot2, mpg is plotted against hp with each data instance now coloured according to number of cylinders.

ggplot(mtcars , aes(x = hp , y = mpg , colour = factor(cyl))) +
  geom_point()

[Figure: R ggplot2 scatterplot of mpg against hp, coloured by number of cylinders]

In Python’s ggplot, a similar approach is used to colour each data instance by model of vehicle.

ggplot(mtcars , aes(x = 'hp' , y = 'mpg' , color = 'name')) +\
  geom_point()

[Figure: Python ggplot scatterplot of mpg against hp, coloured by vehicle name]

Using another built-in dataset (diamonds), we can look at the distribution of diamond prices by cut. Ironically, colour is used to represent diamond colour.

ggplot(diamonds , aes(x = price , fill = color)) +
  geom_histogram(colour = "black") +
  facet_wrap(~cut)

[Figure: R ggplot2 histograms of diamond price, filled by colour and faceted by cut]

Python’s ggplot can produce a similar type of visualisation which, in my opinion, is less slick than the ggplot2 version.

ggplot(diamonds , aes(x = 'price' , fill = 'color')) +\
  geom_histogram(colour = 'black') +\
  facet_wrap('cut')

[Figure: Python ggplot histograms of diamond price, filled by colour and faceted by cut]

Conclusion

That concludes this brief comparison of ggplot2 and ggplot. It is by no means exhaustive and I’m sure there are many ways of modifying Python plots in ggplot which I am unaware of for now. However, I am very grateful to the ggplot package creator Greg Lamp for allowing R fans to create ggplot2 style plots in Python and look forward to using the package in my Python endeavours.
