class: center, middle ### W4995 Applied Machine Learning # Visualization and Matplotlib 01/27/20 Andreas C. Müller ??? Hi everybody. Today we'll be diving into visualization and matplotlib. Next time we'll finally actually start with machine learning. Again, slides and slide notes for this are online. FIXME add slide on how much worse hue and saturation are than bar plot / position / length. Show by example. FIXME Make more clear that not picking appropriate parameters might hide structure? FIXME add covariance plot with dendrogram for ordering! FIXME add example of subsampling for overplotting FIXME Actually perfect length! FIXME how to add legends to scatter plots maybe? FIXME weird structure, going back between matplotlib and principles. reorder! FIXME first matplotlib basics, then details and general thoughts maybe? FIXME more density plots? FIXME https://github.com/astrofrog/mpl-scatter-density FIXME better graph for different channels / destinguish categorical and continuous channels? FIXME clean up colormap evaluation plot? FIXME talk about legends? FIXME gridspec instead of matlab style subplots? FIXME do some seaborn FIXME more from Jake's book FIXME more from Claus Wilke's book FIXME 3d animations /interactivity FIXME constraint layouts? --- class: center, middle # Principles of Data Visualization ??? Today I want to talk about visualization and matplotlib. We will not talk about numpy and pandas, even though they are very important for doing data science with python. There are two reasons: I find it easier to talk about visualization in the abstract, while pandas usage is often very case specific. Also, there is not a lot of material out there about matplotlib, unfortunately, while there is a lot of great material out there on pandas. I highly recommend the book by Wes McKinney on "Python data analysis", which is just a pandas handbook. The best guide to all the python libraries that we need, including matplotlib, is the "Python Data Science Handbook" that I mentioned earlier, and you should really check it out. In terms of visualization, all the DSI folks in this course are also taking the Exploratory data analysis and visualization course, which will go into way more detail than we will here - and we'll focus a bit more on the python interfaces. --- .center[ [![:scale 45%](images/wilke.png)](https://serialmentor.com/dataviz/) ] ??? There's this great free book by Claus Wilke on Data Visualization. It's a great introduction and a pretty short read and you should read it. It's not a python book, it's just theory but it's great. --- class: center, padding-top, spacious # Why? ??? So first, I want to ask why we might want to visualize data. And I think there are two main reasons. The first is for ourselves, to explore the data, make hypothesis and find interesting trends. The other is to communicate either the whole data, or particular aspects or findings about the data. I think both are important, but for the matter of this class, I think the exploration is more important. In general, you should think about what the point of any visualization is, and whether the method you chose is the best for this purpose. In many of my capstone reports last I have seen people show all kinds of charts and graphs and often the description was "here is a bar-plot of quantity X". So what? Why are you showing me this? What's the conclusion? Ideally a visualization should answer a question. In exploratory analysis, the answer might be "there's nothing interesting here", but you should still have some idea what you're looking for. So I want to briefly talk about some very basics of data visualization. I’ll post two great books as references if you’re curious, and I linked to a PhD thesis which has a great introduction, and from which I stole some illustrations. -- # Explore -- # Communicate --- class: spacious Above else, show the data.
Maximize the data-ink ratio. .quote_author[E. Tufte] ??? Here two quotes I want to start with. The first two are from Edward Tufte, the godfather of statistical visualization. He says “above else, show the data”. So the data should be in focus, and not any fancy method you use to show it. He also says “Maximize the data-ink ratio”. What he means here is that all the ink you’re using (even if it’s virtual ink) should show the data, and should vary with the data. You want fewer static elements in your graphic, and more elements that are actually important to show the data. The other quote, from William Cleveland is: Tools matter. I think for him the tools are more the visualization methods, but I think also the software tools matter, which is why I want to talk about them. Finally, My 2 cent here are: label stuff. All axis should be labels, all elements should have legends. There is not reason not to label something. If anyone ever shows me an unlabeled graph, I tell them to do the graph again. Also: just spend more time on visualizations. A simple graph for a paper usually takes me a couple of hours to make. And I do them entirely using python and matplotlib. If you hand in any graphs with your homework, don't hand in anything you haven't spend at least 10 or 20 minutes on. You make a graph to communicate something clearly. That needs thinking about it, and spending time making it right. -- Tools matter. .quote_author[W. S. Cleveland] -- Label stuff.
Spend more time. .quote_author[me] --- class: center, middle # Visual Channels ![:scale 60%](images/visual_channels.png) ??? Before we start with plotting, here’s a quick summary of visual channels that I took from the thesis “Systematising glyph design for visualization”. These are ways in which you can show information. Visualization means taking some multivariate data and mapping different aspects of the data to different visual channels. There is obviously length, as in bar-charts, angle, curvature, shape as in graphs and scatter plots. There’s area, which is often useful, but it’s important to say whether you’re using area or distance to show information. For circles that’s often unclear. There is 3d volume, which is really something you probably want to avoid. Then there’s color. There’s lightness, saturation, transparency and hue, and they all work very differently. It’s generally accepted that using hues for quantitative changes is not a great idea, and lightness or saturation works much better. Textures are something I also try to avoid, position is clearly one of the most important channels, while the rest are not something I use very often - though containment and connection can be helpful. --- class: center, middle # Channels for quantitative data .center[ ![:scale 25%](images/visual_channels2.png) ] ??? There have been studies of which of these are good ways to relate quantities. The winner is position, closely followed by length. These are clearly the most accurate ways to depict a value. Then angle, area, volume, texture, saturation and finally hue. So hue is basically the worst way to encode any quantitative information. Unfortunately these studies didn’t include brightness. It’s also good to be aware of which of these channels have pop-out effect, which allows you to find particular items very quickly. Here you can see color, orientation, curvature, size, symbol, movement, blur and contrast all working well - actually you can’t see the movement, but I think you would agree. It’s important to know that the number of different values are important to how much something pops out, and if you use too many orientations or sizes or colors, this effect is lost. --- class: center ![scale: 100%](images/channels_fuel.png) ??? Here's an image from "Fundamentals of visualization" illustrating how different quantities can be mapped to different aspects of the data. These are fuel efficiencies of cars, and we have five variables we're visualizing for each kind of car: fuel efficiency, displacement, power, weight, and number of cylinders. I think this is a great illustration, and it also shows you what channel "pops" more. For a real visualization this might be a bit too much information squeezed in for my taste. Though given that there's not that many data points, it's maybe fine. Using five channels is a lot! Usually I would use three channels in most graphs. --- class: center, middle # Colormaps .left-column[ ![:scale 90%](images/sequential_color_maps.png)
![:scale 90%](images/qualitative_color_maps.png) ] .right-column[ ![:scale 90%](images/diverging_color_maps.png)
![:scale 90%](images/misc_color_maps.png) ] .reset-column[ ![:scale 50%](images/perceptually_uniform_colormaps.png) ] ??? But let’s talk more about color, in particular color maps. Color maps are ways to map quantities from a continuous number to a color, and they are used whenever you plot something with color - or even with gray scale. There are several different types that are important to distinguish. Sequential colormaps usually go from one hue or saturation to another, while also changing lightness. The have two extremes and interpolate between them. That works well if you’re particularly interested in the extreme values, but they might not offer a lot of contrast in the middle. There’s also diverging colormaps, which have a focus point in the middle, these here have either grey or white, and then have different hues going in either direction. This is great if you have a neutral point and then deviations for that, like profits that can be zero or positive and negative, and you can easily see which side of the zero any point is on. --- class: center, middle # Colormaps .left-column[ ![:scale 90%](images/sequential_color_maps.png)
![:scale 90%](images/qualitative_color_maps.png) ] .right-column[ ![:scale 90%](images/diverging_color_maps.png)
![:scale 90%](images/misc_color_maps.png) ] .reset-column[ ![:scale 50%](images/perceptually_uniform_colormaps.png) ] ??? Quantitative colormaps are mean to not show a continuum, but just some discrete values, and they are designed in such a way to have optimum contrast for a particular number of discrete values. Finally, there are the miscellaneous colormaps shown here, and you can see them really everywhere, in particular rainbow colormaps like jet and rainbow. Don’t use them. Really, never ever use them. I’ll explain why in a second, but if I see it in this course, not only will I take points of your homework, I’ll also look at you really disappointedly. And there are other colormaps that have more than two hues, but that don’t have the problems that these very colorful ones have, and these are the perceptually uniform ones down here. These are carefully designed to ideally show quantitative information. So let's talk about how they are different. --- class: center ![:scale 35%](images/colormap_evaluation_gray.png) ??? So what’s the problem with these other colormaps. This is an evaluation of the grey color map. Let's look at the panels on the right first. There are three heat maps here. So each pixel here represents a floating point value on some scale. On the left you see at the top the perceived changes in the color map, using a model of the human visual system. You can see that it's pretty constant. And at the bottom left you can see the the color map as a path in the 3d color space. This space is not RGB, it's a CIE color space that corresponds more to how humans perceive color. In the RGB space, humans can detect finer nuances in green than in red, for example. Each color map is a path in this 3d space. And ideally, you'd like this to be a smooth path, and you'd like the speed to be constant. The perceptual contrast at the top is something like the derivative, or speed, of the color map going through the perceptual space. --- class: center ![:scale 35%](images/colormap_evaluation_gray.png) .left-column[ ![:scale 80%](images/colormap_evaluation_jet.png) ] .right-column[ ![:scale 80%](images/colormap_evaluation_viridis.png) ] ??? Now let’s look at this in the jet color map. you see that green ring that looks like it has some boundary? Or the red core? Where are these in the greyscale? They are not there, because the are not in the data. So why does it look like there are edges? here is the color map, and this shows the differences in perceived color. You see these spikes? that’s where we perceive edges, because the color map has edges! Also here is the colormap converted to greyscale, say if someone printed it. and below are the brightness deltas. Do you see something? It’s not monotonous. It doesn’t go from dark to bright, it backs up on itself. -- ??? And here’s one of the perceptual uniform colormaps In comparison. You can see that there’s a bit more detail then in the gray one, but no artificial contours appear. That’s because there are no perceptual edges. Also, the change in brightness is constant, so if you convert it to greyscale, you just got the greyscale colormap back. The other plots here show how it looks for various forms of color-blindness, and the path of the colormap in a perceptual 3d color space. I posted a cool video explaining way more details on this on the course website. --- class: center, middle # matplotlib ??? Which brings us to matplotlib. So matplotlib is the python plotting library on which basically all python plotting libraries build, and because it’s so important, we’re gonna go through it in some detail. Matplotlib was designed as a general plotting library and not really with data analysis in mind, which is why it is a bit cumbersome in some cases, and not as elegant for data science as you might be used to if you're coming from R. But I think it’s still good to know how to effectively create graphics in python if you’re gonna work with python, which we will. Matplotlib is very powerful, and any figure you can imagine, you can create with matplotlib. Unfortunately sometimes it takes a bit of code, though. --- class: center, middle # matplotlib v2 (also v3 now) update now! (you can enable classic style if your really want) ??? So matplotlib version 2 has been out for quite a while now, though I still see people using matplotlib v1. I would really highly encourage you to upgrade. version two changes some of the default styles, but you can still enable the classic styles if you want. In particular version two uses virids as the default color map. If you create plots in version 1.X and show them in 2.X, they might look very differently or might not be comprehensible at all. --- # Other libs - pandas plotting - convenience - seaborn - ready-made stats plots - bokeh - alternative to matplotlib for in-browser - several ggplot translations / interfaces - bqplot - plotly - altair (the cool new kid) - yellowbrick (plotting for sklearn) ??? There are some other interface you should look into which I won’t discuss in class. Pandas has some built-in plotting on top of matplotlib. It’s basically shortcuts to do some operations more easily on dataframes, and I highly encourage you to use it. You’ll get the most out of it if you actually understand matplotlib, though. Similarly seaborn has many tools for statistical analysis and visualization, also using pandas dataframes, also based on matplotlib. Seaborn is really great for more complex statistical plots. There is a completely separate project, bokeh, which uses javascript (but not d3) to do visualizations in the browser. I’m not entirely convinced yet, though. There’s also several reimplementations and interfaces to ggplot, if that’s what you’re into, and bq plot, a tool for time series visualization in the jupyter notebook. And finally there's a startup called plotly that also provides a plotting interface. --- # Imports ```python from matplotlib.pyplot import * from pylab import * from numpy import * ``` ??? So before we start, a word on imports. What do you think about these imports? Why are they bad? They hide where functionality came from. Is this squareroot from numpy or math or your own function? Is this plot from seaborne or pandas or matplotlib or something else? Never use import *. It makes your code less readable. And we don’t want that, right? Always use explicit imports and standard naming conventions. I haven’t decided if I’ll subtract points for that, but if I see that I will know that I failed. -- NO! ```python import matplotlib.pyplot as plt import numpy as np ``` YES! --- # matplotlib & Jupyter .left-column[ `%matplotlib inline` .smaller[ - sends png to browser - no panning or zooming - new figure for each cell - no changes to previous figures ] ] -- .right-column[ `%matplotlib notebook` .smaller[ - interactivity in old notebook - doesn't work in jupyter lab - need to create separate figures - ability to update figures ] ] -- .clear-column[ `%matplotlib widget` .smaller[ - interactive widget - all figure features - need to create separate figures explicitly - ability to update figures - installation instructions: https://github.com/matplotlib/jupyter-matplotlib ] ] ??? I highly recommend that you use jupyter to play around with matplotlib figures and visualizations. To show matplotlib figures embedded in jupyter, you have to call one of two magics. Jupyter magics are those that start with a percentage. There’s matplotlib inline, there was matplotlib notebook and now there's matplotlib widget, and you should run one of them, and only one of them, at the very start of the notebook. Both will allow you to show the figures inside the the browser, but they are quite different. [read/explain table] So with inline you're always in the clear but get the least features. I recommend using jupytelab, so notebook is not possible. Installing the support for widget seems worth it and takes only 5 minutes, if you care about interactivity or updating figures. FIXME: add https://github.com/matplotlib/jupyter-matplotlib/blob/master/matplotlib.gif ?? Ugly slide, maybe just do screenshots instead? --- # Figures and Axes figure = one window or one image file axes = drawing areas with coordinate system ![:scale 40%](images/figure_axis.png) by default: each figure has one axis ??? Now let’s start with the actual matplotlib. The most basic concepts in matplotlib are Figures and axes. A figure is a single window or a single output file. A figure can contain multiple axes, which are basically subplots. Axes are the things you actually draw on, and usually each one has a single coordinate system. By default, each figure has one axes. So her is a figure with two axes from the scikit-learn examples. --- class: some-space # Creating Figures and Axes .smaller[ .wide-left-column[ 1st way: don’t.
Creates figure with axes on plot command 2nd way: `fig = plt.figure()`
Creates a figure with axes, sets current figure. Can add more / different axes later. 3rd way: `fig, ax = plt.subplots(n, m)`
Creates a figure with a regular grid of n x m axes. ] .narrow-right-column[ ![:scale 100%](images/subplots.png) ] ] ??? There’s several ways to create figures and axes. The first one is that you don’t. With your first plotting command, matplotlib will automatically create a figure, and that figure will contain axes and that’s what you’ll draw on. Once you created a figure, each subsequent plot command will apply to the current axes, which is usually the last axes you created. So if you want a second figure, you’ll have to create it explicitly, for example using the plt.figure command. That creates the single default axes, but you can also add more later. You can also create a figure with a grid of axes by using the subplots command (note the s in the end). You can see how that looks on the right for a 2 by 2 grid. I use that a lot. fig is an object representing the figure and ax is a numpy array of size n time m, with each entry being one axes object. --- class: spacious # More axes via subplots .smaller[ .wide-left-column[ ```python ax = plt.subplot(n, m, i) # or plt.subplot(nmi) # places ax at position i in n x m grid # (1-based index) ax11 = plt.subplot(2, 2, 1) ax12 = plt.subplot(2, 2, 2) ax21 = plt.subplot(2, 2, 3) ax22 = plt.subplot(2, 2, 4) ``` equivalent: ```python fig, axes = plt.subplots(2, 2) ax11, ax12, ax21, ax22 = axes.ravel() ``` ] .narrow-right-column[ ![:scale 100%](images/subplots2.png) ]] ??? If you just created a figure but you want more axes, you can also add them one subplot at a time with the subplot command, this time without the s. The subplot command takes three numbers - you can leave out the comma between them if they are single digits, but please dont. The first two numbers specify a imaginary grid for the whole figure, let’s say I want a 2 by 2 grid. The third number is at which position in that grid I want my figure created. The numbers start with one and go column by column. I created the 2 by 2 grid here and labeled the axes according to the variable name. you could create all of them at the same time with the subplot commands, but there are two reason why you might choose not to. --- # More axes via subplots ```python ax = plt.subplot(n, m, i) # or plt.subplot(nmi) # places ax at position i in n x m grid # (1-based index) ``` .left-column[ ```python ax11 = plt.subplot(2, 2, 1) ax12 = plt.subplot(2, 2, 2) ax2 = plt.subplot(2, 1, 2) ``` ] .right-column[ ![:scale 90%](images/complex_subplots.png) ] ??? The first one is that the grid you specify doesn’t need to be the actual grid. It just tells the single subplot about where it should be. So you can create all kinds of different layouts. For example, I can create a 2 by 2 plot where the second row is a single axes, by telling it it should be at the second position for a 2 by 1 plot. This allows me to create irregular grids, which is often quite useful. --- # Two Interfaces Stateful interface - applies to current figure and axes object oriented interface - explicitly use object .wide-left-column[ ```python sin = np.sin(np.linspace(-4, 4, 100)) plt.subplot(2, 2, 1) plt.plot(sin) plt.subplot(2, 2, 2) plt.plot(sin, c='r') fig, axes = plt.subplots(2, 2) axes[0, 0].plot(sin) axes[0, 1].plot(sin, c='r') ``` ] .narrow-right-column[ ![:scale 120%](images/subplots_sin.png) ] ??? So now is the part where things become a bit tricky. There’s basically two different ways to use matplotlib, two interfaces if you want, one is the stateful one, and one is the object oriented interface. If you’re using the stateful interface, you just call the module-level plot functions like plt.plot here, and it will plot in the current axes. What do you think will happen here? In the object based interface, you are explicit about which axes you want to draw into, so you hold onto your axes objects, and call the plotting commands on those! What do you think will happen here? This is another reason why some people like to add subplots one by one: they use the stateful interface. They add a subplot, plot into it, then add the next one and so on. So the plot commands are exactly the same for both interfaces, but there are some differences. --- # Differences Between the Interfaces .smaller[ .left-column[ `plt.title` `plt.xlim, plt.ylim` `plt.xlabel, plt.ylabel` `plt.xticks, plt.yticks` ] .right-column[ `ax.set_title` `ax.set_xlim, ax.set_ylim` `ax.set_xlabel, ax.set_ylabel` `ax.set_xticks, ax.set_yticks (& ax.set_xtick_labels)` ]] .reset-column[ ```python ax = plt.gca() # get current axes fig = plt.gcf() # get current figure ``` ] ??? The formatting options in the object oriented interface all have a “set_” in front of them, while they don’t in the stateful interface. Also, for setting the tick marks, there are separate commands for the locations and the labels in the object oriented interface, but not in the stateful interface. In general, the object oriented interface provides more functionality and is more explicit and powerful, but the stateful interface is a bit terser. If you started plotting but then you want to use something that’s only part of the object oriented interface, you can always get the current axes or current figure with the gca or gcf commands. I use the stateful interface if I have a single axes and I don’t need any of the advanced functionality of the object oriented interface, so basically for simple plots. Whenever I have more than one axes objects, so for any grid, I always use the object oriented interface - but that’s just my personal preference. --- # Plotting commands - Gallery: http://matplotlib.org/gallery.html - Plotting commands summary: http://matplotlib.org/api/pyplot_summary.html ??? I hope that sufficiently obfuscated everything. So lets finally start with the plotting. matplotlib has two pages that are helpful for figuring out how to plot something: the gallery and the plotting commands summary. I will go through only some of the commands from the summary that I think are particular important and their most important aspects. --- class: compact # plot .smallest[ ```python fig, ax = plt.subplots(2, 4, figsize=(10, 5)) ax[0, 0].plot(sin) ax[0, 1].plot(range(100), sin) # same as above ax[0, 2].plot(np.linspace(-4, 4, 100), sin) ax[0, 3].plot(sin[::10], 'o') ax[1, 0].plot(sin, c='r') ax[1, 1].plot(sin, '--') ax[1, 2].plot(sin, lw=3) ax[1, 3].plot(sin[::10], '--o') plt.tight_layout() # makes stuff fit - usually works ``` ] .center[ ![:scale 60%](images/plot.png) ] ??? Clearly plot it the most important one. Plot allows you to do line plots and scatter plots. If you give a single variable, it will plot it against it’s index, if you provide two, it will plot them against each other. By default, plot creates a line-plot, but you can also use “o” to create a scatterplot. You can change the appearance of the line in many ways, including width, color, dashing and markers. --- class: center # plot ![:scale 70%](images/matplotlib_plot_figsize_why.png) ??? One thing that I find slightly annoying and that always trips me off is that in subplots, it's rows, then columns, while the figure size is width, then height. --- class: compact # scatter .tiny[ ```python fig, ax = plt.subplots(1, 4, figsize=(10, 3), subplot_kw={'xticks': (), 'yticks': ()}) ax[0].plot(x, y, 'o') ax[1].scatter(x, y) ax[2].scatter(x, y, c=x-y, cmap='bwr', edgecolor='k') ax[3].scatter(x, y, c=x-y, s=sizes, cmap='bwr', edgecolor='k') ``` .center[ ![:scale 100%](images/matplotlib_scatter.png) ] ] ??? While plot can create scatter plots, those are quite limited. the scatter function allows you to do scatter plots that not only encode the position of the points, but visualizes additional variables via color or size. In the bottom left, I colored the points by their distance to the diagonal, in the bottom right, I also added random size variations. Here I also used “subplot_kw” which allows you to specify some arguments for all subplots in a figure. I use it here to remove the ticks. --- # histogram .tiny[ ```python fig, ax = plt.subplots(1, 3, figsize=(20, 3)) ax[0].hist(data) ax[1].hist(data, bins=1000) ax[2].hist(data, bins="auto") ``` ] ![:scale 100%](images/matplotlib_histogram.png) ??? Histograms are clearly also important. By default, histograms have ten bins, which is never right. You can adjust the number of bins, and I usually try to find the point when it will be too fine. There’s also a heuristic for finding the binsize which you can use with bins=”auto” --- # bars .left-column[ .tiny[ ```python plt.bar(range(len(gross)), gross) plt.xticks(range(len(gross)), movie, rotation=90) plt.tight_layout() ``` ] ![:scale 100%](images/matplotlib_bar.png) ] .right-column[ .tiny[ ```python plt.barh(range(len(gross)), gross) plt.yticks(range(len(gross)), movie, fontsize=8) ax = plt.gca() ax.set_frame_on(False) ax.tick_params(length=0) ``` ] ![:scale 100%](images/matplotlib_barh.png) ] ??? For barcharts, you always need to provide the position of the bar as well as the length. Usually that’s done via a range. If you use ticklabels, it’s usually a good idea to rotate them so you can actually read them. But I don’t really like tilting my head, so I often prefer horizontal bar-charts, which work the same way. The way I specified the positions here, the first bar is at the bottom. We could flip the axes or change the position if we wanted it at the top. --- # heatmaps .left-column[ .smallest[ ```python fig, ax = plt.subplots(2, 2) im1 = ax[0, 0].imshow(arr) ax[0, 1].imshow(arr, interpolation='bilinear') im3 = ax[1, 0].imshow(arr, cmap='gray') im4 = ax[1, 1].imshow(arr, cmap='bwr', vmin=-1.5, vmax=1.5) plt.colorbar(im1, ax=ax[0, 0]) plt.colorbar(im3, ax=ax[1, 0]) plt.colorbar(im4, ax=ax[1, 1]) ``` ]] .right-column[ ![:scale 115%](images/matplotlib_heatmap.png) ] ??? You can plot heatmaps with the imshow command. Previous to matplotlib v2 this automatically enabled interpolation, which you can see at the top right. Interpolation might hide data or might give the impression of more data then there is, by showing a smooth transition. You should always disable interpolation for heatmaps. At the bottom you can see some results with a gray colormap and with a diverging color map. Here, the background is zero and it makes sense to represent the neutral differently. So I ensured that white is mapped to zero, and we can clearly distinguish positive from negative entries, which is much harder with the other color maps. Doing colorbars on multiple axes can be tricky. You need to store the matplotlib image that is returned by imshow in an object, and then call the colorbar command with the image and the axes to which you want to attach the colorbar. --- class: compact # .left-column[ .tiny[ ```python fig, ax = plt.subplots(1, 3, figsize=(10, 4), subplot_kw={'xlim': (-1, 1), 'ylim': (-1, 1)}) ax[0].scatter(x, y) ax[1].scatter(x, y, alpha=.1) ax[2].scatter(x, y, alpha=.01) ``` ] .center[ ![:scale 100%](images/matplotlib_overplotting.png)] ] ??? A command that I discovered way too late is the hexgrid. Hexgrids basically allow two-dimensional density maps. If you have a lot of points, a scatterplot can become too crowded to understand what’s going on. You can work around that a bit by using the alpha value, but that often throws away a lot of information. A better way is to use a hexgrid and plot the density in each grid cell directly. That also allows the use of arbitrary colormaps. Using hexgrids also allows you to map the density, for example using a logarithm, if the differences in density are very large between different regions. --- class: compact # hexgrids .left-column[ .tiny[ ```python fig, ax = plt.subplots(1, 3, figsize=(10, 4), subplot_kw={'xlim': (-1, 1), 'ylim': (-1, 1)}) ax[0].scatter(x, y) ax[1].scatter(x, y, alpha=.1) ax[2].scatter(x, y, alpha=.01) ``` ] .center[ ![:scale 100%](images/matplotlib_overplotting.png)] ] .right-column[ .tiny[ ```python plt.figure() plt.hexbin(x, y, bins='log', extent=(-1, 1, -1, 1)) plt.colorbar() plt.axis("off") ``` ] .center[![:scale 100%](images/matplotlib_hexgrid.png)] ] ??? --- class: compact # twinx, twiny .left-column[ .tiny[ ```python ax1 = plt.gca() ax.plot(years, phds, label="math PhDs awarded") ax.plot(years, revenue, c='r', label="revenue by arcades") ax.set_ylabel("Math PhDs awarded") ax.set_ylabel("revenue by arcades") ax.legend() ``` ] .center[ ![:scale 100%](images/matplotlib_twinx1.png)] ] -- .right-column[ .tiny[ ```python ax1 = plt.gca() line1, = ax1.plot(years, phds) ax2 = ax1.twinx() line2, = ax2.plot(years, revenue, c='r') ax1.set_ylabel("Math PhDs awarded") ax2.set_ylabel("revenue by arcades") ax2.legend((line1, line2), ("math PhDs awarded", "revenue by arcades")) ``` ] .center[![:scale 100%](images/matplotlib_twinx2.png)] ] ??? The last thing I want to mention is twin axes. Here I show two time series, the number of math PhDs awarded in the us and the revenue made by arcades in the US. If I plot them both in the same coordinate system, the number of PhDs will look just flat, because it lives on a completely different scale. Using the object oriented interface, I can create a twin x axis for the revenue to show both time series with their own scales. --- # Aspect Ratios .center[ ![:scale 60%](images/aspect-ratios.png) ] https://eagereyes.org/basics/banking-45-degrees ??? One very important but often overlooked aspect of plotting is aspect ratios. Here are three plots showing the same data with different aspect ratios. If you're using jupyter, the aspect ratio is usually given by the figure size or subplot size. With interactive plots you can resize them, for non-interactive plots you have to usually change the grid or figure size, but it might be worth it. I stole this figure from thie interesting blogpost on eagereyes. This actually talks about what's called banking to 45 degrees. There is a famous paper that showed that it's in some sense perceptually optimal to have most of the line segments be around 45 degrees, but later it was found there is issues with the study. It seems to be a good idea in general, but it also depends on what you want to highlight about the data. --- # Baselines .padding-top[ .center[ ![:scale 80%](images/baselines.png) ]] https://eagereyes.org/basics/baselines ??? Another very related problem is baselines. Again, these two plots show the same data, but they look very different because of where the zero is. And depending on what you want to show, either one might be the right one. Baselines and aspect ratio issues often show up if we display multiple things in the same graph. Because matplotlib automatically adjusts the boundaries, showing more things changes the axes ranges. The twinx command we just discussed can sometimes help with that. Again, there's a cool blogpost here. FIXME show how adding data distorts exsisting data (i.e. gaussian blob can be made to look narrow or wide by adding another gaussian blob at a differnt prosition) --- # 3D plotting .left-column[ .smallest[ ```python from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt import numpy as np fig = plt.figure() ax = fig.gca(projection='3d') ax.plot(xs, ys, zs, lw=0.5) ``` ] ] .right-column[ ![:scale 100%](images/lorenz.png) ] ??? Here's how to do 3d plotting. This example looks kind of cool, using only standard line plots, but in 3d. You have to import Axes3D as shown above, even though it's not used. Then you need to set the projection for the axes you're plotting into to "3d", and off you go. --- # 3D Scatters .left-column[ .smallest[ ```python from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt from sklearn.datasets import load_iris iris = load_iris() X, y = iris.data, iris.target fig = plt.figure() ax = fig.add_subplot( 111, projection='3d') ax.scatter(X[:, 1], X[:, 2], X[:, 3], c=y) ``` ] ] .right-column[ ![:scale 100%](images/3dscatter.png) ] ??? However, our plots are more commonly scatterplots and they don't involve some fancy lower-dimensional manifold. Here's a plot of the first three features of the iris dataset, and I think doing this in 3d adds nothing. If you have interactivity it might add a bit for exploration, but if you show this plot to anyone, it is just some arbitrary 2d projection of these three dimensions, and we don't need a 3d plot for that. --- # Just Don't .center[ ![:scale 70%](images/3dhist.png)] ??? There's some other 3d plots you might be tempted to do, like this. Usually 3d is just confusing and hard to read. This example of 2d data as a 3d histogram for example is terrible. You could easily do this with a heatmap, or maybe an array of bar-charts. I hope none of you ever creates a graph like this. Some people use these 3d bar charts if the data was originally 1d, that's even worse. This is taken from the matplotlib user guide, btw. Just because you can doesn't mean you should. --- # Do's and Don't summarized ## Do - take your time - label everything - think about questions and story -- ## Don't - Use 3D unless really necessary - Use piecharts (usually) - have non-varying "ink" - Edit figure manually --- # More Resources .padding-top[ ![:scale 23%](images/tufte.jpg) ![:scale 23%](images/cleveland.jpeg)  [![:scale 23%](images/wilke.png)](https://serialmentor.com/dataviz/) ![:scale 23%](images/pdsh.png) ] Also: https://matplotlib.org/tutorials/ ??? The first two are a bit dated but are definitely classics. The last one is still in draft form, it's done using R but all the content is language agnostic. For the matplotlib part, look at Jake's Data Science Handbook, which you hopefully all check out yet. There's a lot more detail in there than what I covered here. The last two are free, the first two are not. --- class: middle # Questions ?