Python Plotting for Exploratory Data Analysis

Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).

In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.

To that end, I made, a brief introduction to Python plotting libraries and a “rosetta stone” comparing how to use them. I also included comparison to ggplot2, the R plotting library that I and many others consider a gold standard.

Understanding Probabilistic Topic Models By Simulation

I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specificly LDA) by using Python for simulation. Here’s the description:

Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.

You can watch the video on Youtube:

I gave a shorter version of the talk at PyData NYC 2015.

Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use

When I worked at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Internet Medical Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.1

Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. … We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,5081,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184471) followed by home/room/bed (12.5%, 59471), school (12.1%, 57471), in public (8.7%, 41471), the bathroom (5.7%, 27471), and at work (4.5%, 21471).

  1. I have no idea what “Infoveillance” means. [return]

Nonparametric Latent Dirichlet Allocation

Today is my last day at Qadium. Next week, I am joining the data science team at Distil Networks.

I’ve been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of Nonparametric Latent Dirichlet Allocation.

I published a collection of notes on nonparametric Bayesian methods and Latent Dirichlet Allocation at I hope this will be useful to other students and researchers of these methods.

Last year, I published nine interviews with Internet friends about how an academically-minded, 22-year old college senior should work on a Ph.D. Many people have told me the interviews have been helpful for them or that they’ve emailed them to others.

I decided to make a dedicated website to host the interviews. You can find it at

I hope this continues to be a valuable resource. I’d encourage you to share this with anyone you know who is thinking through this question.