Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).
In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.
To that end, I made pythonplot.com, a brief introduction to Python plotting libraries and a “rosetta stone” comparing how to use them. I also included a comparison to ggplot2, the R plotting library that I and many others consider a gold standard.
I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specifically LDA) by using Python for simulation. Here’s the description:
Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.
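To make the idea of simulating LDA as a generative model concrete, here is a minimal sketch using only the Python standard library. It is not the code from the talk. The function names (`sample_dirichlet`, `generate_corpus`), the toy vocabulary, and the hyperparameter values are all assumptions chosen for illustration. Each topic is a Dirichlet-distributed distribution over words, each document draws its own Dirichlet-distributed mixture of topics, and every word is produced by first picking a topic and then picking a word from that topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Simulate documents from the LDA generative process.

    alpha and beta are symmetric Dirichlet hyperparameters (illustrative
    values, not tuned). Returns (topics, docs).
    """
    rng = random.Random(seed)

    # Each topic is a probability distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(n_topics)]

    docs = []
    for _ in range(n_docs):
        # Each document has its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a topic for this word position, then a word from it.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(vocab, weights=topics[z])[0]
            words.append(w)
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    # Hypothetical toy vocabulary, purely for illustration.
    toy_vocab = ["model", "topic", "word", "plot", "data", "python"]
    _, toy_docs = generate_corpus(toy_vocab, n_topics=2, n_docs=3, doc_length=8)
    for doc in toy_docs:
        print(" ".join(doc))
```

Note that the simulation ignores word order entirely (the bag-of-words assumption) and fixes the number of topics in advance. Both are model assumptions that a black box LDA implementation inherits.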
Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. … We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).
I have no idea what “Infoveillance” means.
Last year, I published nine interviews with Internet friends about whether an academically minded, 22-year-old college senior should work on a Ph.D. Many people have told me the interviews have been helpful for them or that they’ve emailed them to others.
I decided to make a dedicated website to host the interviews. You can find it at shouldigetaphd.com.
I hope this continues to be a valuable resource. I’d encourage you to share this with anyone you know who is thinking through this question.
I gave a talk at a recent Research Triangle Analysts meetup on Scikit-learn, the excellent machine learning library for Python. The talk wasn’t recorded, but you can see the IPython notebook that I presented from.