class: center, middle

### W4995 Applied Machine Learning

# Testing, Visualization and Matplotlib

01/24/18

Andreas C. Müller

???
Hi everybody. Some quick announcements. I decided to give everybody that wants to audit access to courseworks, mostly so that they can follow the announcements and discussions on piazza. If you're auditing and you haven't gotten an invite to courseworks yet, please drop me an email.
Also, for all of you on the waiting list: spots that open up in the class are constantly being filled from the waiting list, which is still over 100 students. Dropping by my office without an appointment, writing me emails, or hassling our staff will not get you in faster; it might do the opposite, though.
Today we'll pick up where we left off on Monday, talking about a bit more software engineering, and then diving into visualization and matplotlib. Next time we'll finally actually start with machine learning. Again, slides and slide notes for this are online.

---
class: center, middle

# Unit Tests and integration tests

???
So last time we talked about how we can guard ourselves against some simple issues with our syntax checkers, but that doesn't find all errors, and it doesn't tell us if our code works. So now we'll talk about unit testing and integration testing.
How many of you have worked with an automatic testing framework? What is it for and why would we want it?

---
class: spacious

# Why test?

- Ensure that code works correctly.
- Ensure that changes don't break anything.
- Ensure that bugs are not reintroduced.
- Ensure robustness to user errors.
- Ensure code is reachable.

???
So yes, we want to make sure that our code is correct. We also want to make sure that if we rewrite something, it remains correct. If you have good tests, it is much easier to aggressively refactor your code. If your tests are passing, everything works!
Another important kind of test is the non-regression test. You found a bug, you fixed it, and now you want to make sure you don't re-introduce it. The easiest way is to write a test that tests for the bug.
You also want to make sure that your program behaves reasonably, even in edge cases such as invalid input.
And finally, testing can help you find code that is actually unreachable by measuring coverage. If you can't write a test that will reach a particular part of the code, it's never executed, and you should probably think about that.

---
class: center, middle

# Test-driven development?

???
I love tests. But there are some people that love tests even more, and they practice what's called test-driven development. Who here has heard about that?
In test-driven development, you write the tests before you write the code. And there are advantages to that. I don't usually do that myself. However, I do test-driven debugging. If I find a bug, I first write a test that fails if the bug is present. I need to do that later anyhow, because I need a non-regression test, and it usually makes debugging much easier.
I often do example-driven development, where I write down some simple use cases and then write implementations to fulfill them.

---
# Types of tests

- Unit tests – function does the right thing.
- Integration tests – system / process does the right thing.
- Non-regression tests – bug got removed (and will not be reintroduced).

???
I usually think of tests in terms of three kinds: unit tests, which test the smallest possible unit, usually one function.
Then, there’s integration tests, that test that the different parts of the software actually work together in the right way. That is often by testing several application scenarios. And finally, there’s non-regression tests, which can be either a unit tests, an integration test, or both, but they are added for a specific scenario that failed earlier, and you’ll accumulate them as your project ages. --- # How to test? - py.test – http://doc.pytest.org - Searches for all test_*.py files, runs all test_* methods. - Reports nice errors! - Dig deeper: http://pybites.blogspot.com/2011/07/behind-scenes-of-pytests-new-assertion.html ??? There are several frameworks to help you with unit testing in python. There’s the built-in unittest module, there’s the now somewhat abandoned nosetests. For the course we’ll be using the py.test module, which is also what I’d recommend for any new projects. What it does is it searches all files starting with `test`, runs them, and reports a summary. The tests should contain assert statements, and if any of them fail, you’ll get an informative error. This is actually done with a considerable amount of magic, which you can read up on here, if you’re into rewriting the AST. --- # Example .smaller[ ```python # content of test_sample.py def inc(x): return x + 2 def test_answer(): assert inc(3) == 4 ``` ] ![:scale 60%](images/pytest_failure.png) ??? So here I have a slightly modified version of the example from the pytest website. We have a function called inc that’s supposed to increment a number. But there’s a bug: it adds two instead of one. And we have a test, which is called `test_answer`, that calls the inc function with the number three and checks if three is correctly incremented. If we call py.test on this file, or on the folder of this file, py.test will run the `test_answer` function - because it starts with `test_`. It tells us it ran one test, and that test failed. We also get a traceback showing us the line and what the actual value was. We got 5 instead of 4. Depending on how much you know about programming, this error message should come as a bit of a surprise to you. The assert only gets a boolean, but pytest actually unpacks the `False` and gives us more information. --- # Example ```python # content of test_sample.py def inc(x): return x + 1 def test_answer(): assert inc(3) == 4 ``` ![:scale 70%](images/pytest_passed.png) ??? So now we go back and fix our increment function, and run py.test again. This time, it tells us the tests passed. Does that mean the function is correct? No, but we could test more. How? We could do the same with more numbers. Tests are usually only necessary, not sufficient for correctness. Usually all test for a project are in a separate file or even a separate folder. In this example we had the test and the function to be tested in the same file, but that’s not good style. The actual implementation should be completely separate from the tests. --- # Test coverage .smaller[ .left-column[ ```python # inc.py def inc(x): if x < 0: return 0 return x + 1 def dec(x): return x - 1 ``` ] .right-column[ ```python # test_inc.py from inc import inc def test_inc(): assert inc(3) == 4 ``` ] ] .reset-column[ ![:scale 70%](images/pytest_coverage1.png)] ??? Here’s a more complex example, with inc.py containing two functions and test_inc.py containing the same test. I added an if into the inc function, and a dec function. If we run the test, they still pass. But clearly we’re not testing everything. 
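As a sketch, two extra tests along these lines (the test names and values are just illustrative) would exercise exactly what we are missing:

```python
# test_inc.py (continued) -- hypothetical additional tests
from inc import inc, dec

def test_inc_negative():
    # covers the x < 0 branch of inc
    assert inc(-3) == 0

def test_dec():
    # covers dec, which no test currently calls
    assert dec(3) == 2
```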
In this small example it's easy to spot by eye what we are missing: we don't test the if branch, and we don't test the dec function at all. If your project is larger and more complex, it's much harder to figure out whether you covered all the edge cases, though. That's where test coverage tools come in handy.

---
# Test coverage

.smaller[
.left-column[
```python
# inc.py
def inc(x):
    if x < 0:
        return 0
    return x + 1

def dec(x):
    return x - 1
```
]
.right-column[
```python
# test_inc.py
from inc import inc

def test_inc():
    assert inc(3) == 4
```
]]

.reset-column[
![:scale 50%](images/pytest_coverage2.png)
]

???
So instead of just calling py.test, we specify --cov inc, which means we want to test the coverage of the inc module or file. Now we get a coverage report that tells us that out of the 6 statements in inc.py, two were not covered, resulting in 67% coverage. There are several ways to figure out which lines we missed; I like the HTML report the most.

---
# HTML report

.larger[
```bash
$ py.test --cov inc --cov-report=html
```
]

.left-column[![:scale 100%](images/pytest_html_report1.png)]
.right-column[![:scale 100%](images/pytest_html_report2.png)]

???
So we specify --cov-report=html, and py.test will create an HTML report for us. We get an overview that contains the same information as before, but we'll also get a detailed view of inc.py, which shows us which lines we covered and which lines we missed. And now we can clearly see that we never reached the return 0 line in inc and we never tested dec.
Whenever you write tests, you should make sure they cover all cases in your code, in particular the different algorithmic pieces. Covering all error messages is possibly not as important. You should usually aim for coverage in the high nineties.
If you look at projects on github, you can often see a badge that says "coverage X %", showing the code coverage. These badges are actually automatically generated, and next I want to talk about how that's done.

---
class: center, middle

# Continuous integration (with GitHub)

???
This is the magic of continuous integration, which I mentioned already last time. Continuous integration is a general paradigm in software engineering, of automatically running integration tests whenever you change your software. I want to talk about it in particular in the context of github, because that's what you'll likely be using, both here and later in industry – at least in a startup. If you go to google or facebook or amazon, they all have their own frameworks, but the same principles apply. They often have even more involved systems, like continuous deployment, which actually automatically puts new code into production if it passes tests. We're not gonna talk about that here. Continuous deployment gets somewhat trickier with machine learning algorithms.

---
# What is Continuous integration?

- Run command on each commit (or each PR).
- Unit testing and integration testing.
- Can act as a build-farm (for binaries or documentation).
- Requires clear declaration of dependencies.
- Build matrix: Can run on many environments.
- Standard services: TravisCI, Jenkins and CircleCI

???
Ok so what's continuous integration in more detail? It runs some sequence of commands on a cloud machine every time a commit is made, either just for a particular branch, or for each pull request. Usually the command is just running the test suite, but you can also check coverage or style or build binaries or rebuild your documentation.
This is done on a clean cloud machine, so you need to be very explicit about your dependencies, which is good. You can specify a build matrix of different systems you want to run, such as linux and os X, different versions of Python and different versions of the dependencies, like older and newer numpy versions. There are some standard services out there that are free for open source and educational purposes, travis, jenkins and circle are some of them. We’ll use travis for this course. --- # Benefits of CI - Can run on many systems - Can’t forget to run it - Contributor doesn’t need to know details - Can enforce style - Can provide immediate feedback - Protects the master branch (if run on PR) ??? So what’s the benefits of doing this? You can run it on many systems, and in parallel. You might not have all the operating systems that you want it to run on, and you might not go through the hassle of trying out every patch on every machine. Because CI is automatic, you can’t forget to run it. There’s github integration that will give you a red x or a green check mark, and everybody will know whether your commit was ok or not. You can make a change to a package without knowing all the requirements and even without knowing how to run the tests. The CI will complain if you broke something. And you get immediate feedback while you’re working, because tests are run every time you commit! Most projects will only merge pull requests that pass CI, and that means the master branch can not break and will always be in working condition, which is important. --- # What does it do? - Triggered at each commit / push - Sets up a virtual machine with your configuration. - Pulls the current branch. - Runs command. Usually: install, then test. - Reports success / Failure to github. ??? Ok so let’s go through the steps that the CI performs. Let’s say you configured travis for one of your branches on github. If you push to that branch, travis will be triggered. A virtual machine will be set up with the configuration that you specified. The machine pulls your current version of the project. If it’s a pull request, it might also use the code that would be the result of a merge, so it would be your changes applied to the current project. Then some commands are run that you can specify. Usually these are installing the package, possibly including dependencies, and then running the test suite. After the test-suite runs, it will report back to github. --- # Setting up TravisCI .smaller[ - Create account linked to your github account: http://travis-ci.org
![:scale 60%](images/travis_activate.png) - Check out docs at https://docs.travis-ci.com ```yaml language: python python: - "2.7" - "3.4" - "3.5" - "3.6" - "3.7" - "nightly" # command to install dependencies install: - pip install -r requirements.txt # command to run tests script: - pytest # or py.test for Python versions 3.5 and below ``` ] ??? So in general to set up travis you have to create an account that is linked to your github account, and enable travis-ci on the repository you want. For public repositories and students its free, otherwise you have to pay for the service. For the homework you'll need to enable this for your repo. There a docs online at this website, and the only thing you need to do is create a .travis.yml file with the configuration. Here, we specify the versions of python we want to run, the install command, and the final command. This will run pip install with a requirements file. You could also run python setup.py install,or specify requirements here directly. --- # Using Travis - Triggered any time you push a change - Integrated with Pull requests - Try a pull request on your own repository! ??? Travis is run automatically each time you push a change, so in the homework, push a change and then see the status page. You can also do a pull request on your own repository, so you can see the pull request integration. It’s pretty cool and you should try it out. --- class: center, middle # Documentation ??? So now, we come to the last bit that is about general software development, documentation. We talked earlier about how important it is to write readable code. But no code is perfectly documenting itself. You definitely need to write additional documentation. Also, many people don’t like digging through code, so some documentation that’s more easily accessible is helpful. I decided to take this off the homework this year, because people were struggling a lot with setting up the infrastructure last year. That's a bit unfortunate, because it's really cool once you see all the pieces coming together. --- # Why document? - Allow people to use the code without reading it. - Your code is harder to understand than you think. - Input types and output types are unclear in dynamic languages. - Often implicit assumptions about input. ??? So why do we document? The most obvious reason is that you want other people to use your code, and most people don't want to read the full implementation of a function before they can use it. You can document the functionality, interface and assumptions, and anyone can use your code, without having to dig through all the details. Even if someone wants to read the code, documentation makes reading the code easier. Others will thank you, including your future self. While you should document any function or class in any language, in dynamically typed languages in particular, it’s often unclear what the assumptions about the input are, and what the type of the output is. So these are really crucial to document. And if the input is some complex object like a dictionary or a custom class, often you make assumptions about the content of this object that are not obvious from the code. --- # Python documentation standards .compact[ .smaller[ - PEP 257 for docstrings for class, methods and functions ```python def inc(x): """Add one to a number. This function takes as argument a number, and adds one to it. The result is returned. 
""" if x < 0: return 0 return x + 1 ``` - Additional inline documentation ```python if x < 0: # x is less than zero return 0 return x + 1 ``` ]] ??? Python has some standards for documentation, described in pep 257. Every class, method or function should have a docstring, at the very least all public ones. These are particularly helpful because they can be easily viewed inside a python session or with Jupyter Notebook. For docstrings we use triple quotes, with a single line Explanation in the first line, then an empty line, then a more detailed explanation. This docstring is stored in the `__doc__` attribute of the object. In addition to that, you can use inline documentation when you think it might be useful, using the pound. --- # NumpyDoc format See .smaller[https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt] ```python """ Parameters ----------- x : type Description of parameter x. """ ``` ??? You might have noticed that the python folks like standards. In the data science community there’s a stricter standard, which is the numpydoc format. It’s what you’ll see in numpy, scipy, pandas and scikit-learn for example. The documentation has several sections for Parameters, return values, examples, notes and more. The most important parts are the parameters and return values, which you should document as you can see here. A detailed description of the documentation format is in the doc I linked to. You should always document all parameters and all return values. This documentation is in restructured text format, which we’ll come to in a bit. This format is a bit idiosyncratic with a space before the colon, but it’s important that you actually stick to this. --- # NumpyDoc example .left-column[ ![:scale 95%](images/numpydoc_multinomial_nb.png) ] .right-column[ ![:scale 95%](images/numpydoc_fit.png) ] ??? Here you can see two examples from scikit-learn. On the left is the docstring for the MultinomialNB class. The class docstring follows the class definition, and the parameters are the parameters to the init function. You can see parameters, attributes, examples, notes and references. Next to it is the fit method, which just has a one-line description and then the parameters and return values. You can see that the type descriptions can get pretty specific, giving the matrix shapes. You might ask why we’re using this specific format and why we’re using restructured text. And the answer to this is sphinx. --- # Sphinx and reStructured Text .left-column[ ![:scale 95%](images/sphinx_svm_rst.png) ] .right-column[ ![:scale 95%](images/sphinx_svm_website.png) ] ??? So what’s this sphinx? It’s basically a static page generator that takes restructured text and produces HTML and pdf output. This is how the scikit-learn or numpy or pandas documentation is generated for example. You can see the rst code on the left and the generated website on the right. Obviously there’s also a theme file that governs the layout. RST is similar to markdown but it also allows internal references, figures and simple ways to extend the format with custom patterns. Sphinx can also automatically generate html and pdf documentation from the docstrings itself, using a module called autodoc. That’s how the API documentation is generated, and that’s where numpydoc comes in. --- class: center, middle # Autodoc and NumpyDoc ![:scale 50%](images/sphinx_autodoc.png) ??? Here is the documentation of the DummyClassifier as rendered by Sphinx using autodoc. You can see the different sections here. 
There’s more sections, and also documentation for the methods down below. Sphinx also adds in links to the source code on github if you like. So now obviously you want to show this to people interested in your project, and you don’t want them to install sphinx and build the docs before they can read them. One option would be to build this locally and then use github pages to host the site. That’s fine, but also a little cumbersome. --- class: center, middle # Principles of Data Visualization ??? For the rest of today I want to talk about visualization and matplotlib. We will not talk about numpy and pandas, even though they are very important for doing data science with python. There are two reasons: I find it easier to talk about visualization in the abstract, while pandas usage is often very case specific. Also, there is not a lot of material out there about matplotlib, unfortunately, while there is a lot of great material out there on pandas. I highly recommend the book by Wes McKinney on "Python data analysis", which is just a pandas handbook. The best guide to all the python libraries that we need, including matplotlib, is the "Python Data Science Handbook" that I mentioned earlier, and you should really check it out. In terms of visualization, all the DSI folks in this course are also taking the Exploratory data analysis and visualization course, which will go into way more detail than we will here - and we'll focus a bit more on the python interfaces. --- class: center, padding-top, spacious # Why? ??? So first, I want to ask why we might want to visualize data. And I think there are two main reasons. The first is for ourselves, to explore the data, make hypothesis and find interesting trends. The other is to communicate either the whole data, or particular aspects or findings about the data. I think both are important, but for the matter of this class, I think the exploration is more important. In general, you should think about what the point of any visualization is, and whether the method you chose is the best for this purpose. In many of my capstone reports last semester I had people show all kinds of charts and graphs and often the description was "here is a bar-plot of quantity X". So what? Why are you showing me this? What's the conclusion? Ideally a visualization should answer a question. In exploratory analysis, the answer might be "there's nothing interesting here", but you should still have some idea what you're looking for. So I want to briefly talk about some very basics of data visualization. I’ll post two great books as references if you’re curious, and I linked to a PhD thesis which has a great introduction, and from which I stole some illustrations. -- # Explore # Communicate --- class: padding-top, spacious Above else, show the data.
Maximize the data-ink ratio. .quote_author[E. Tufte] ??? Here two quotes I want to start with. The first two are from Edward Tufte, the godfather of statistical visualization. He says “above else, show the data”. So the data should be in focus, and not any fancy method you use to show it. He also says “Maximize the data-ink ratio”. What he means here is that all the ink you’re using (even if it’s virtual ink) should show the data, and should vary with the data. You want fewer static elements in your graphic, and more elements that are actually important to show the data. The other quote, from William Cleveland is: Tools matter. I think for him the tools are more the visualization methods, but I think also the software tools matter, which is why I want to talk about them. -- Tools matter. .quote_author[W. S. Cleveland] --- class: center, middle # Visual Channels ![:scale 60%](images/visual_channels.png) ??? Before we start with plotting, here’s a quick summary of visual channels that I took from the thesis “Systematising glyph design for visualization”. These are ways in which you can show information. There is obviously length, as in bar-charts, angle, curvature, shape as in graphs and scatter plots. There’s area, which is often useful, but it’s important to say whether you’re using area or distance to show information. For circles that’s often unclear. There is 3d volume, which is really something you probably want to avoid. Then there’s color. There’s lightness, saturation, transparency and hue, and they all work very differently. It’s generally accepted that using hues for quantitative changes is not a great idea, and lightness or saturation works much better. Textures are something I also try to avoid, position is clearly one of the most important channels, while the rest are not something I use very often - though containment and connection can be helpful. --- class: center, middle # Picking Channels .left-column[ ![:scale 50%](images/visual_channels2.png) ] .right-column[ ![:scale 100%](images/visual_channels3.png) ] ??? There have been studies of which of these are good ways to relate quantities. The winner is position, closely followed by length. These are clearly the most accurate ways to depict a value. Then angle, area, volume, texture, saturation and finally hue. So hue is basically the worst way to encode any quantitative information. Unfortunately these studies didn’t include brightness. It’s also good to be aware of which of these channels have pop-out effect, which allows you to find particular items very quickly. Here you can see color, orientation, curvature, size, symbol, movement, blur and contrast all working well - actually you can’t see the movement, but I think you would agree. It’s important to know that the number of different values are important to how much something pops out, and if you use too many orientations or sizes or colors, this effect is lost. --- class: center, middle # Colormaps .left-column[ ![:scale 90%](images/sequential_color_maps.png)
![:scale 90%](images/qualitative_color_maps.png) ] .right-column[ ![:scale 90%](images/diverging_color_maps.png)
![:scale 90%](images/misc_color_maps.png)
]
.reset-column[
![:scale 50%](images/perceptually_uniform_colormaps.png)
]

???
But let's talk more about color, in particular colormaps. Colormaps are ways to map a continuous quantity to a color, and they are used whenever you plot something with color - or even with grayscale. There are several different types that are important to distinguish.
Sequential colormaps usually go from one hue or saturation to another, while also changing lightness. They have two extremes and interpolate between them. That works well if you're particularly interested in the extreme values, but they might not offer a lot of contrast in the middle.
There are also diverging colormaps, which have a focus point in the middle; the ones here have either grey or white there, and then have different hues going in either direction. This is great if you have a neutral point and then deviations from that, like profits that can be zero or positive and negative, and you can easily see which side of zero any point is on.

---
class: center, middle

# Colormaps

.left-column[
![:scale 90%](images/sequential_color_maps.png)
![:scale 90%](images/qualitative_color_maps.png) ] .right-column[ ![:scale 90%](images/diverging_color_maps.png)
![:scale 90%](images/misc_color_maps.png)
]
.reset-column[
![:scale 50%](images/perceptually_uniform_colormaps.png)
]

???
Qualitative colormaps are meant not to show a continuum, but just some discrete values, and they are designed in such a way as to have optimum contrast for a particular number of discrete values.
Finally, there are the miscellaneous colormaps shown here, and you can see them really everywhere, in particular rainbow colormaps like jet and rainbow. Don't use them. Really, never ever use them. I'll explain why in a second, but if I see it in this course, not only will I take points off your homework, I'll also look at you really disappointedly.
And there are other colormaps that have more than two hues, but that don't have the problems that these very colorful ones have, and these are the perceptually uniform ones down here. These are carefully designed to ideally show quantitative information. So let's talk about how they are different.

---
class: center

![:scale 35%](images/colormap_evaluation_gray.png)

???
So what's the problem with these other colormaps? This is an evaluation of the grey color map. Let's look at the panels on the right first. There are three heat maps here. Each pixel here represents a floating point value on some scale. On the left you see at the top the perceived changes in the color map, using a model of the human visual system. You can see that it's pretty constant. And at the bottom left you can see the color map as a path in the 3d color space. This space is not RGB, it's a CIE color space that corresponds more to how humans perceive color. In the RGB space, humans can detect finer nuances in green than in red, for example. Each color map is a path in this 3d space. And ideally, you'd like this to be a smooth path, and you'd like the speed to be constant. The perceptual contrast at the top is something like the derivative, or speed, of the color map going through the perceptual space.

---
class: center

![:scale 35%](images/colormap_evaluation_gray.png)

.left-column[
![:scale 80%](images/colormap_evaluation_jet.png)
]
.right-column[
![:scale 80%](images/colormap_evaluation_viridis.png)
]

???
Now let's look at this in the jet color map. You see that green ring that looks like it has some boundary? Or the red core? Where are these in the greyscale? They are not there, because they are not in the data. So why does it look like there are edges? Here is the color map, and this shows the differences in perceived color. You see these spikes? That's where we perceive edges, because the color map has edges! Also, here is the colormap converted to greyscale, say if someone printed it, and below are the brightness deltas. Do you see something? It's not monotonic. It doesn't go from dark to bright, it backs up on itself.

--

???
And here's one of the perceptually uniform colormaps in comparison. You can see that there's a bit more detail than in the gray one, but no artificial contours appear. That's because there are no perceptual edges. Also, the change in brightness is constant, so if you convert it to greyscale, you just get the greyscale colormap back. The other plots here show how it looks for various forms of color-blindness, and the path of the colormap in a perceptual 3d color space. I posted a cool video explaining way more details on this on the course website.

---
class: center, middle

# matplotlib

???
Which brings us to matplotlib.
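As a first taste of the API: choosing one of these colormaps in matplotlib is just a keyword argument. Here is a minimal sketch with made-up data (any 2D array would do):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.RandomState(0).randn(30, 30)  # made-up 2D data

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].imshow(data, cmap='viridis')  # perceptually uniform (the v2 default)
axes[1].imshow(data, cmap='gray')     # plain sequential grayscale
axes[2].imshow(data, cmap='RdBu')     # diverging map, useful around a neutral zero
```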
So matplotlib is the python plotting library on which basically all python plotting libraries build, and because it's so important, we're gonna go through it in some detail. Matplotlib was designed as a general plotting library and not really with data analysis in mind, which is why it is a bit cumbersome in some cases, and not as elegant for data science as you might be used to if you're coming from R. But I think it's still good to know how to effectively create graphics in python if you're gonna work with python, which we will. Matplotlib is very powerful, and any figure you can imagine, you can create with matplotlib. Unfortunately sometimes it takes a bit of code, though.

---
class: center, middle

# matplotlib v2

update now!

(you can enable classic style if you really want)

???
So matplotlib version 2 has been out for about a year now, though I still see people using matplotlib v1. I would really highly encourage you to upgrade. Version two changes some of the default styles, but you can still enable the classic styles if you want. In particular, version two uses viridis as the default color map. If you create plots in version 1.X and show them in 2.X, they might look very different or might not be comprehensible at all.

---
# Other libs

- pandas plotting - convenience
- seaborn - ready-made stats plots
- bokeh - alternative to matplotlib for in-browser
- several ggplot translations / interfaces
- bqplot
- plotly

???
There are some other interfaces you should look into which I won't discuss in class. Pandas has some built-in plotting on top of matplotlib. It's basically shortcuts to do some operations more easily on dataframes, and I highly encourage you to use it. You'll get the most out of it if you actually understand matplotlib, though. Similarly, seaborn has many tools for statistical analysis and visualization, also using pandas dataframes, also based on matplotlib. Seaborn is really great for more complex statistical plots. There is a completely separate project, bokeh, which uses javascript (but not d3) to do visualizations in the browser. I'm not entirely convinced yet, though. There are also several reimplementations of and interfaces to ggplot, if that's what you're into, and bqplot, a tool for time series visualization in the jupyter notebook. And finally there's a startup called plotly that also provides a plotting interface.

---
# Imports

```python
from matplotlib.pyplot import *
from pylab import *
from numpy import *
```

???
So before we start, a word on imports. What do you think about these imports? Why are they bad? They hide where functionality came from. Is this square root from numpy or math or your own function? Is this plot from seaborn or pandas or matplotlib or something else? Never use import *. It makes your code less readable. And we don't want that, right? Always use explicit imports and standard naming conventions. I haven't decided if I'll subtract points for that, but if I see it, I will know that I failed.

--

NO!

```python
import matplotlib.pyplot as plt
import numpy as np
```

YES!

---
# matplotlib & Jupyter

.left-column[
`%matplotlib inline`

- sends png to browser
- no panning or zooming
- new figure for each cell
- no changes to previous figures
]

.right-column[
`%matplotlib notebook`

- interactive widget
- all figure features
- need to create separate figures explicitly
- ability to update figures
- doesn't work with jupyter lab (right now)
]

???
I highly recommend that you use jupyter to play around with matplotlib figures and visualizations.
To show matplotlib figures embedded in jupyter, you have to call one of two magics. Jupyter magics are those that start with a percent sign. There's matplotlib inline and matplotlib notebook, and you should run one of them, and only one of them, at the very start of the notebook. Both will allow you to show the figures inside the browser, but they are quite different. [read/explain table]

---
# Figures and Axes

figure = one window or one image file

axes = one drawing area with coordinate system

![:scale 40%](images/figure_axis.png)

by default: each figure has one axes

???
Now let's start with the actual matplotlib. The most basic concepts in matplotlib are Figures and axes. A figure is a single window or a single output file. A figure can contain multiple axes, which are basically subplots. Axes are the things you actually draw on, and usually each one has a single coordinate system. By default, each figure has one axes. So here is a figure with two axes from the scikit-learn examples.

---
class: some-space

# Creating Figures and Axes

.smaller[
.wide-left-column[
1st way: don't.
Creates figure with axes on plot command 2nd way: `fig = plt.figure()`
Creates a figure with axes, sets current figure. Can add more / different axes later. 3rd way: `fig, ax = plt.subplots(n, m)`
Creates a figure with a regular grid of n x m axes.
]
.narrow-right-column[
![:scale 100%](images/subplots.png)
]
]

???
There are several ways to create figures and axes. The first one is that you don't. With your first plotting command, matplotlib will automatically create a figure, and that figure will contain axes, and that's what you'll draw on. Once you've created a figure, each subsequent plot command will apply to the current axes, which is usually the last axes you created.
So if you want a second figure, you'll have to create it explicitly, for example using the plt.figure command. That creates the single default axes, but you can also add more later.
You can also create a figure with a grid of axes by using the subplots command (note the s at the end). You can see how that looks on the right for a 2 by 2 grid. I use that a lot. fig is an object representing the figure, and ax is a numpy array of size n times m, with each entry being one axes object.

---
class: spacious

# More axes via subplots

.smaller[
.wide-left-column[
```python
ax = plt.subplot(n, m, i)  # or plt.subplot(nmi)
# places ax at position i in n x m grid
# (1-based index)

ax11 = plt.subplot(2, 2, 1)
ax21 = plt.subplot(2, 2, 2)
ax12 = plt.subplot(2, 2, 3)
ax22 = plt.subplot(2, 2, 4)
```

equivalent:

```python
fig, axes = plt.subplots(2, 2)
ax11, ax21, ax12, ax22 = axes.ravel()
```
]
.narrow-right-column[
![:scale 100%](images/subplots2.png)
]]

???
If you just created a figure but you want more axes, you can also add them one subplot at a time with the subplot command, this time without the s. The subplot command takes three numbers - you can leave out the comma between them if they are single digits, but please don't. The first two numbers specify an imaginary grid for the whole figure, let's say I want a 2 by 2 grid. The third number is at which position in that grid I want my axes created. The numbers start at one and go left to right along each row, then on to the next row. I created the 2 by 2 grid here and labeled the axes according to the variable names. You could create all of them at the same time with the subplots command, but there are two reasons why you might choose not to.

---
# More axes via subplots

```python
ax = plt.subplot(n, m, i)  # or plt.subplot(nmi)
# places ax at position i in n x m grid
# (1-based index)
```

.left-column[
```python
ax11 = plt.subplot(2, 2, 1)
ax21 = plt.subplot(2, 2, 2)
ax2 = plt.subplot(2, 1, 2)
```
]
.right-column[
![:scale 90%](images/complex_subplots.png)
]

???
The first one is that the grid you specify doesn't need to be the actual grid. It just tells the single subplot where it should be. So you can create all kinds of different layouts. For example, I can create a 2 by 2 plot where the second row is a single axes, by telling it that it should be at the second position of a 2 by 1 plot. This allows me to create irregular grids, which is often quite useful.

---
# Two Interfaces

Stateful interface - applies to current figure and axes

object oriented interface - explicitly use object

.wide-left-column[
```python
sin = np.sin(np.linspace(-4, 4, 100))

plt.subplot(2, 2, 1)
plt.plot(sin)
plt.subplot(2, 2, 2)
plt.plot(sin, c='r')

fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(sin)
axes[0, 1].plot(sin, c='r')
```
]
.narrow-right-column[
![:scale 70%](images/subplots_sin_last.png)
![:scale 70%](images/subplots_sin_first.png)
]

???
So now is the part where things become a bit tricky.
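Before we get to that, here is a compact recap of the three ways of getting figures and axes as a runnable sketch (the data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

sin = np.sin(np.linspace(-4, 4, 100))

# 1st way: don't. The first plot command creates a figure with one axes.
plt.plot(sin)

# 2nd way: create a figure explicitly; the next plot command draws into it.
fig = plt.figure()
plt.plot(sin)

# 3rd way: create a figure together with a regular n x m grid of axes.
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(sin)
```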
There are basically two different ways to use matplotlib, two interfaces if you want: one is the stateful one, and one is the object oriented interface. If you're using the stateful interface, you just call the module-level plot functions like plt.plot here, and it will plot into the current axes. What do you think will happen here? In the object oriented interface, you are explicit about which axes you want to draw into, so you hold onto your axes objects, and call the plotting commands on those! What do you think will happen here?
This is another reason why some people like to add subplots one by one: they use the stateful interface. They add a subplot, plot into it, then add the next one and so on. So the plot commands are exactly the same for both interfaces, but there are some differences.

---
# Differences Between the Interfaces

.smaller[
.left-column[
`plt.title`

`plt.xlim, plt.ylim`

`plt.xlabel, plt.ylabel`

`plt.xticks, plt.yticks`
]
.right-column[
`ax.set_title`

`ax.set_xlim, ax.set_ylim`

`ax.set_xlabel, ax.set_ylabel`

`ax.set_xticks, ax.set_yticks (& ax.set_xticklabels)`
]]

.reset-column[
```python
ax = plt.gca()  # get current axes
fig = plt.gcf()  # get current figure
```
]

???
The formatting options in the object oriented interface all have a "set_" in front of them, while they don't in the stateful interface. Also, for setting the tick marks, there are separate commands for the locations and the labels in the object oriented interface, but not in the stateful interface.
In general, the object oriented interface provides more functionality and is more explicit and powerful, but the stateful interface is a bit terser. If you started plotting but then you want to use something that's only part of the object oriented interface, you can always get the current axes or current figure with the gca or gcf commands.
I use the stateful interface if I have a single axes and I don't need any of the advanced functionality of the object oriented interface, so basically for simple plots. Whenever I have more than one axes object, so for any grid, I always use the object oriented interface - but that's just my personal preference.

---
# Plotting commands

- Gallery: http://matplotlib.org/gallery.html
- Plotting commands summary: http://matplotlib.org/api/pyplot_summary.html

???
I hope that sufficiently obfuscated everything. So let's finally start with the plotting. matplotlib has two pages that are helpful for figuring out how to plot something: the gallery and the plotting commands summary. I will go through only some of the commands from the summary that I think are particularly important and their most important aspects.

---
class: center

# plot

![:scale 70%](images/matplotlib_plot.png)

???
Clearly plot is the most important one. Plot allows you to do line plots and scatter plots. If you give a single variable, it will plot it against its index; if you provide two, it will plot them against each other. By default, plot creates a line plot, but you can also use "o" to create a scatterplot. You can change the appearance of the line in many ways, including width, color, dashing and markers.

---
class: center

# plot

![:scale 70%](images/matplotlib_plot_figsize_why.png)

???
One thing that I find slightly annoying and that always trips me up is that in subplots, it's rows, then columns, while the figure size is width, then height.

---
class: center, middle

# scatter

![:scale 60%](images/matplotlib_scatter.png)

???
While plot can create scatter plots, those are quite limited.
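For reference, the basic plot calls look roughly like this (the data here is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = np.sin(x)

plt.plot(x, y)             # line plot of y against x
plt.plot(y)                # a single argument is plotted against its index
plt.plot(x, y, 'o')        # the same data as a plain scatter plot
plt.plot(x, y, '--', color='k', linewidth=3)  # dashing, color and width
```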
The scatter function allows you to do scatter plots that not only encode the position of the points, but also visualize additional variables via color or size. In the bottom left, I colored the points by their distance to the diagonal; in the bottom right, I also added random size variations. Here I also used "subplot_kw", which allows you to specify some arguments for all subplots in a figure. I use it here to remove the ticks.

---
class: center, middle

# histogram

![:scale 90%](images/matplotlib_histogram.png)

???
Histograms are clearly also important. By default, histograms have ten bins, which is never right. You can adjust the number of bins, and I usually try to find the point at which it becomes too fine. There's also a heuristic for finding the bin size, which you can use with bins="auto".

---
class: center, middle

# bars

.left-column[
![:scale 100%](images/matplotlib_bar.png)
]
.right-column[
![:scale 100%](images/matplotlib_barh.png)
]

???
For bar charts, you always need to provide the position of the bar as well as the length. Usually that's done via a range. If you use ticklabels, it's usually a good idea to rotate them so you can actually read them. But I don't really like tilting my head, so I often prefer horizontal bar charts, which work the same way. The way I specified the positions here, the first bar is at the bottom. We could flip the axes or change the positions if we wanted it at the top.

---
class: center, middle

# heatmaps

![:scale 50%](images/matplotlib_heatmap.png)

???
You can plot heatmaps with the imshow command. Prior to matplotlib v2 this automatically enabled interpolation, which you can see at the top right. Interpolation might hide data or might give the impression of more data than there is, by showing a smooth transition. You should always disable interpolation for heatmaps.
At the bottom you can see some results with a gray colormap and with a diverging color map. Here, the background is zero and it makes sense to represent the neutral value differently. So I ensured that white is mapped to zero, and we can clearly distinguish positive from negative entries, which is much harder with the other color maps.
Doing colorbars on multiple axes can be tricky. You need to store the matplotlib image that is returned by imshow in an object, and then call the colorbar command with the image and the axes to which you want to attach the colorbar.

---
class: center

# ![:scale 35%](images/matplotlib_overplotting.png)

???
A command that I discovered way too late is hexbin, which draws hexgrids. Hexgrids basically allow two-dimensional density maps. If you have a lot of points, a scatterplot can become too crowded to understand what's going on. You can work around that a bit by using the alpha value, but that often throws away a lot of information. A better way is to use a hexgrid and plot the density in each grid cell directly. That also allows the use of arbitrary colormaps. Using hexgrids also allows you to transform the density, for example using a logarithm, if the differences in density are very large between different regions.

---
class: center

# hexgrids

![:scale 35%](images/matplotlib_overplotting.png)
![:scale 35%](images/matplotlib_hexgrid.png)

???

---
class: center, middle

# twinx, twiny

![:scale 60%](images/matplotlib_twinx.png)

???
The last thing I want to mention is twin axes. Here I show two time series, the number of math PhDs awarded in the US and the revenue made by arcades in the US. If I plot them both in the same coordinate system, the number of PhDs will look just flat, because it lives on a completely different scale. Using the object oriented interface, I can call twinx to create a second y-axis that shares the x-axis, so I can show both time series with their own scales.
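A minimal sketch of that pattern, with made-up numbers standing in for the two series:

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2000, 2010)
phds = 1000 + 50 * np.random.RandomState(0).rand(10)     # made-up "math PhDs"
revenue = 1e9 * (1 + np.random.RandomState(1).rand(10))  # made-up "arcade revenue"

fig, ax1 = plt.subplots()
ax1.plot(years, phds, color='C0')
ax1.set_ylabel('math PhDs awarded', color='C0')

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(years, revenue, color='C1')
ax2.set_ylabel('arcade revenue', color='C1')
```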
---
class: middle

# Questions ?