Next-generation Python Big Data Tooling, powered by Apache Arrow
Dealing with Data, Intermediate
Description: The Python data stack has struggled to interoperate well with big data systems. Apache Arrow provides standard in-memory columnar data structures that will enable Python programmers to participate in big data problems in a more natural and performant way. This talk will discuss the Apache Arrow project itself and the state of the new tools being created to help Python work better with Apache Hadoop and Apache Spark.
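To make the idea of a columnar in-memory layout concrete, here is a minimal sketch that uses only Python's standard library. It does not use Apache Arrow or its API; the field names (user_id, price) and the to_columns helper are hypothetical and exist only for illustration.

```python
from array import array

# Row-oriented records: each record keeps all of its fields together.
rows = [
    {"user_id": 1, "price": 9.5},
    {"user_id": 2, "price": 12.0},
    {"user_id": 3, "price": 7.25},
]


def to_columns(records):
    """Pivot row-oriented records into one typed, contiguous array per field."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),  # 64-bit signed ints
        "price": array("d", (r["price"] for r in records)),      # 64-bit floats
    }


columns = to_columns(rows)

# A column-wise aggregate reads one contiguous buffer and never touches the other fields.
total_price = sum(columns["price"])
```

The point of the sketch is only the layout: values of one field sit next to each other in a typed buffer, which is the property a shared columnar format standardizes across systems.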
Bio: Wes McKinney is a software engineer at Cloudera. He is the creator of Python’s pandas library and the Ibis project, a committer to the Apache Parquet and Apache Arrow projects, and the author of the O'Reilly Media book, Python for Data Analysis. Previously, Wes was the founder and CEO of DataPad.
Exploring complex data with Elasticsearch and Python
Dealing with Data, Intermediate
Description: Elasticsearch is a powerful open-source search and analytics engine with applications that stretch far beyond adding text-based search to a website. Learn how Elasticsearch can be used with Python and Django to crunch through complex datasets and quickly build powerful interfaces for exploring information.
Bio: Simon Willison is an engineering director at Eventbrite, a Bay Area ticketing company working to bring the world together through live experiences. Simon works as part of a small product research and prototyping lab helping develop new concepts for Eventbrite products and features. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010. He is a co-creator of the Django Web Framework.
Abstract: Jupyter is an open source, language agnostic, interactive computing platform used in scientific computing and data science that provides multiple tools tailored for different workflows, from traditional terminal-style control to the popular web-based Notebook. The Jupyter Notebook is a web application that allows users to create and share documents that contain live code, equations, visualizations and explanatory text. Jupyter is the evolution of the original ideas in the IPython interactive shell, as we generalized them into a language agnostic protocol that has now been implemented in over 50 separate languages.
One project within the Jupyter ecosystem, JupyterHub, is a multi-user environment for Jupyter Notebooks that runs off a central server and that can be used to serve Notebooks to classes of students, corporate workgroups, or scientific research groups. JupyterHub is the backbone for UC Berkeley’s new Undergraduate Data Science Education Program, an ambitious program that aims to provide every freshman with core knowledge and skills in data science.
In this talk we will discuss and demonstrate the many development activities underway at Project Jupyter, including IPython 5.0, JupyterHub, and JupyterLab, and how these tools are used in data science, industry, scientific research, and education.
Bio: Jamie Whitacre is the technical project manager for Project Jupyter, an open-source scientific computing and data science ecosystem used extensively in academia and industry. Project Jupyter operates out of the Berkeley Institute for Data Science (BIDS) at UC Berkeley. Matthias Bussonnier is a postdoctoral researcher at BIDS and a core developer for Jupyter and IPython.
Interactive Data Visualization Applications for the Browser with Bokeh
Bryan Van de Ven
Abstract: With support from the DARPA XDATA Initiative, commercial engagements, and contributions from over 150 community members, the Bokeh visualization library (http://bokeh.pydata.org) has grown into a large, successful open source project with heavy interest and following on GitHub (https://github.com/bokeh/bokeh). The principal goal of Bokeh is to enable developers and domain experts to easily create and share interactive, versatile, and powerful visualizations that extract insight from data sets that may be remote, large, or streaming. Bokeh provides a platform for anyone to create interactive data and visualization applications in the browser for themselves, their colleagues, or for a wider audience.
This talk will give a quick overview of recent developments, and demonstrate some of the newest capabilities of Bokeh including:
* Bokeh applications and the second generation Bokeh server (that is more performant, better documented, and much simpler to use and deploy)
* APIs for streaming data (both in the notebook and Bokeh applications)
* The ability to extend Bokeh with your own custom functionality (for example to create 3D plots or network graphs)
* Recent GIS features such as support for GeoJSON and tiled map data sources
* The new Datashader library that can be used together with Bokeh to visualize billions of data points.
Finally, the talk will discuss near-term plans for the project, its governance, and community development.
Bio: Bryan studied undergraduate CS and Math at UT Austin, and graduate Physics at UCLA. Currently he leads the technical effort for work done on the Bokeh project at Continuum Analytics. Previously, he has worked on feature detection and classification systems for submarine platforms, automated tools for financial risk modeling, and workflow optimization for fluid mixing simulations. He has also taught Basic, Advanced, and Scientific Python courses to more than 1500 students in the last four years.
Caffe + Jupyter + Pandas It’s not rocket science, well sorta.
Dealing with Data, Intermediate
Description: In this talk I will walk the users through the entire process of building a convolutional neural network for image classification. The process starts with a Flask application to label your data, followed by characterizing, training, and evaluating the CNN using Pandas, Jupyter Notebooks, and Bokeh plots. Finally we show how the CNN can be deployed and used in real-world applications.
Abstract: Convolutional Neural Networks: they’re new, they’re big, they’re complex, they’re poorly documented, and accordingly they are a little scary. At Planet we will image the entire earth every day, and to deliver that data to customers we need to analyze images without them ever being seen by human eyes. In this talk we’ll cover how to build, train, and characterize a neural net for image classification all from the comfort and safety of a Jupyter notebook. This talk will serve as a template for building and using your very own CNN.
Bio: Katherine Scott is a senior software engineer at Planet working on image classification. Prior to Planet, Ms. Scott was the co-founder and CTO of Tempo Automation and a co-founder at Sight Machine. Katherine is currently the Program Chair for the Open Source Hardware Association.
Description: This talk will present strategies in Python for handling data that is too large to fit in memory and/or too slow to process in one thread, but small enough to still fit in one machine.
Abstract: Unless you work at a large internet company, you probably don't have BIG data, but you might have LARGE data. Large data consume an unacceptable amount of time and memory when medium strategies are used, but also incur unnecessary financial and latency costs when big strategies are used. Two basic strategies for handling large data, chunking and parallelization, will be discussed with live coded examples in Python.
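The two strategies named in the abstract, chunking and parallelization, can be sketched with the standard library alone. This is an illustrative sketch, not the speaker's live-coded material: the helper names (iter_chunks, sum_of_squares, chunked_total, parallel_total) and the sum-of-squares workload are assumptions chosen to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def sum_of_squares(chunk):
    """Per-chunk work: return a small partial result, not the chunk itself."""
    return sum(x * x for x in chunk)


def chunked_total(values, chunk_size):
    """Chunking: only one chunk is held in memory at a time."""
    return sum(sum_of_squares(chunk) for chunk in iter_chunks(values, chunk_size))


def parallel_total(values, chunk_size, workers=4):
    """Parallelization: hand chunks to a worker pool and combine partial results.

    Executor.map submits every chunk up front, so this variant trades the
    memory savings of chunked_total for concurrency.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, iter_chunks(values, chunk_size)))
```

Threads are used here only to keep the sketch simple. For CPU-bound work in CPython, a process pool such as concurrent.futures.ProcessPoolExecutor is the usual substitute, at the cost of pickling each chunk to send it to a worker.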
Bio: I'm a research scientist currently living in the Bay Area and working in neuroethology, human evolution, and natural language processing. I currently work at D-Lab, where I help researchers apply advances in computation to their research paradigms.
Description: Moving data through transformations and from one place to another is a big part of data science and engineering. We’ve been using Airflow for several months at Clover Health and have learned a lot about its strengths and weaknesses. We will use this talk to give a practical introduction to Airflow that gives people the information they need to decide whether Airflow is right for them and how to get started.
Abstract: Airflow is a popular pipeline orchestration tool for Python that allows users to configure complex (or simple!) multi-system workflows that are executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others. Airflow is written in Python and users can add their own operators with custom functionality, doing anything Python can do.
At Clover Health, we’ve been pushing Airflow’s limits, digging into the source code, and contributing patches upstream. In this talk, we’ll cover the basics of Airflow so you can use what we’ve learned to start your Airflow journey on the right foot. This talk aims to answer questions such as: What is Airflow useful for? How do I get started? What do I need to know that’s not in the docs?
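The abstract's point that dependencies determine which tasks can run in parallel and which must wait can be illustrated without Airflow. The sketch below uses the standard library's graphlib module (Python 3.9+), not Airflow's API; the task names and the parallel_batches helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_a": set(),
    "extract_b": set(),
    "transform": {"extract_a", "extract_b"},
    "load": {"transform"},
}


def parallel_batches(deps):
    """Group tasks into batches; tasks in one batch have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark finished so dependents become ready
    return batches


batches = parallel_batches(dependencies)
```

Each inner list is a set of tasks a scheduler could dispatch to workers at the same time. Here the two extract tasks form the first batch, and transform and load each wait for the batch before them.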
Bio: I have been a scientific Python developer since 2008. I’ve worked in atmospheric science, astronomy, urban planning, web applications, and healthcare. I maintain several open source Python libraries and am currently a data engineer at Clover Health.
Caravel - A data visualization, exploration and dashboarding platform
Dealing with Data, Intermediate
Description: Airbnb developed Caravel to provide all employees with interactive access to data while minimizing friction. Caravel's main goal is to make it easy to slice, dice and visualize data. It empowers each and every user to perform analytics at the speed of thought.
Abstract: Topics include:
* Intuitively visualizing datasets while filtering, pivoting, and changing views
* Creating and sharing simple dashboards
* Caravel's rich set of visualizations
* Caravel's extensible, high-granularity security/permission model allowing intricate rules on who can access individual features and the dataset
* Caravel's enterprise-ready authentication with integration with major authentication providers (database, OpenID, LDAP, OAuth, and REMOTE_USER through Flask AppBuilder)
* Caravel's simple semantic layer, allowing users to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function metrics are made available to the user
* Caravel’s deep integration with Druid
* Caravel’s integration with most RDBMS through SQLAlchemy
Bio: Maxime Beauchemin works at Airbnb as part of the Data Tools team, developing open source products that reduce friction and help generate insight from data. He is the creator and a leading maintainer of Apache Airflow [incubating] (a workflow engine) and Caravel (a data visualization platform). Before Airbnb, Maxime worked at Facebook on computation frameworks around engagement and growth analytics, at Yahoo! on social properties analytics, and at Ubisoft as a data warehouse architect.
Data in a dynamic system: Strategies for backwards compatibility
Dealing with Data, Intermediate
Description: There are several unanswered questions in deploying huge schema or logic changes: How do you modify systems with zero downtime or service interruption? How do you optimize online data migrations to allow for fallbacks? Any changes in schema or code in dynamic systems may cause existing users to experience downtime. The talk focuses on strategies to ensure backwards compatibility and prevent breaking data integrity.
Abstract: In an ideal scenario, feature development is easy. Just replace the old code with new code, and you’re done. This is, in fact, true for a system in a state of inertia. However, in a dynamic system, with constantly moving pieces of business logic, this presents a hard problem. There are several unanswered questions while deploying huge schema or logic changes: How do you make code and schema changes with zero downtime or service interruption? How do you optimize online migrations of data to allow for fallbacks? Any changes in schema or code in dynamic systems may cause existing users to experience downtime. The talk focuses on strategies to ensure backwards compatibility and prevent breaking data integrity.
Bio: Trisha works as a Software Engineer at Affirm, a take on modern banking started by Max Levchin. At Affirm, Trisha has worked on several projects including the creation of the underlying financial system, architecture of systems for underwriting data processing, and several other product features. She graduated from the University of Pennsylvania studying Computer Science.
Description: Image acquisition and processing have become a standard method for qualifying and quantifying experimental measurements in many fields of science and engineering. Python provides many computational tools that can be used to perform image processing. In this talk, we will walk through the most common workflow in image processing along with examples.
Abstract: Image acquisition and processing have become a standard method for qualifying and quantifying experimental measurements in many fields of science and engineering. Python offers the following advantages: simple syntax, powerful libraries and modules that focus on increasing productivity, and, most importantly, it is free and open source.
We will learn image processing through a simple and common workflow. We will read a high-resolution image of a mouse. We will filter the image to reduce noise and improve the quality of the image. We will then segment the image, so that we obtain only the bones. We will clean up the over-segmented regions using morphological operations. We will perform measurements on the segmented image. Finally, we will discuss the workflow with Python code.
Bio: Ravi Chityala is a Senior Engineer at Elekta Inc. He has more than 12 years of experience in image processing and scientific computing. He is also a part-time instructor at the UCSC Extension, San Jose, CA, where he teaches advanced Python to programmers. He uses Python for web development, scientific prototyping and computing, and as glue to automate processes. He is the co-author of the book, "Image Processing and Acquisition using Python" published by CRC Press.
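The workflow steps above (filter, segment, clean up with morphological operations, measure) can be sketched in pure Python on an image stored as a list of rows. This is a dependency-free illustration, not the speaker's code: a real workflow would normally rely on an image processing library, and the 3x3 window, the function names, and the pixel-count measurement are assumptions made for the example.

```python
from statistics import median_low


def neighborhood(image, row, col):
    """Return the values in the 3x3 window around (row, col), clipped at the borders."""
    height, width = len(image), len(image[0])
    return [
        image[r][c]
        for r in range(max(0, row - 1), min(height, row + 2))
        for c in range(max(0, col - 1), min(width, col + 2))
    ]


def apply_window(image, func):
    """Build a new image by applying func to every pixel's 3x3 neighborhood."""
    return [
        [func(neighborhood(image, r, c)) for c in range(len(image[0]))]
        for r in range(len(image))
    ]


def median_filter(image):
    """Noise reduction: replace each pixel with the (low) median of its neighborhood."""
    return apply_window(image, median_low)


def threshold(image, level):
    """Segmentation: mark pixels at or above level as foreground (1), others as 0."""
    return [[1 if value >= level else 0 for value in row] for row in image]


def opening(mask):
    """Morphological cleanup: erosion (window minimum) followed by dilation (window maximum)."""
    return apply_window(apply_window(mask, min), max)


def area(mask):
    """Measurement: count the foreground pixels."""
    return sum(map(sum, mask))
```

Chained together, the steps read as area(opening(threshold(median_filter(image), level))), where image is a list of equal-length rows of pixel intensities and level is a threshold chosen for the data at hand.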
Make sense of Deep Neural Networks using TensorBoard
Dealing with Data, Intermediate
Description: In this talk we look at some ways in which the TensorBoard utility can be used to better understand the structure of Deep Neural Networks and how they function. Best practices on how to use the TensorFlow Python API to make your models and results more interpretable are discussed.
Abstract: Deep Neural Networks are fast becoming the face of modern Machine Learning. But understanding how they work can be a real challenge, especially while you are trying to build a model. Google's recently published library, TensorFlow, includes a lesser-used utility called TensorBoard that can be used to visualize the structure of your neural network model and inspect how data flows through it. This talk will demonstrate some techniques which will help you use TensorBoard more effectively, and better understand how TensorFlow computations work. Code walkthroughs will be done in IPython notebooks, which will be made available to attendees.
Bio: Arpan likes to find computing solutions to everyday problems. He is interested in human-computer interaction, robotics and cognitive science. He obtained his PhD from North Carolina State University, focusing on biologically-inspired computer vision. Working at Udacity, he develops content for artificial intelligence and machine learning courses.
Description: This talk will be about walking through the steps to put a TensorFlow project into production on the web with Flask and Heroku. The goal is to introduce the project and show how TensorFlow can be used online for real data tasks, and discuss other considerations for deployment of a TensorFlow project.
Abstract: TensorFlow is a deep learning library with Python and C++ bindings that was released in 2015. The talk will start with a brief intro to TensorFlow, and then dive into the specific steps to set up a simple project that can be served online.
Bio: Kendall is a lead software engineer at YesGraph, where he uses machine learning and Flask to power better invite flows for mobile and web apps. Previously he worked as an independent software consultant for four years, and before that he was a hardware designer at Qualcomm in San Diego for three years. Kendall was an organizer of the San Diego Python Users Group, where he helped plan six one-day workshops on various Python topics.
Bio2: David Clark has a background in astrophysics, where he used Python extensively to analyze astronomical data. He recently transitioned careers to data science. Currently he is doing consulting for two startups. At Palo Alto Scientific, Inc., he uses the machine learning library TensorFlow to model sensor data from a wearable and infer a runner’s performance. He is also doing work for Quantea, Inc., making a dashboard using the Python libraries Bokeh and Pandas.
Description: During this talk the attendees will have an opportunity to use the ELK (Elasticsearch, Logstash, Kibana) stack to visualize their complex log data.
Abstract: Data is the new bacon. For all industries, including health, security, entertainment, etc., it is impossible for anyone to store and analyze data without using an automated platform. A unified platform is needed to provide data visualization and extract intelligence.
Elasticsearch is a distributed, real-time search and analytics platform. With the help of a RESTful API, Elasticsearch saves data and automatically indexes the parsed data.
During our talk, we will walk attendees through configuring the ELK stack and visualizing datasets in Kibana.
Bio: Varang Amin is working as a Sr Staff Engineer at Palo Alto Networks.
Bio2: Darlene Wong is working as a Sr Staff Engineer at Palo Alto Networks. Before PAN, she worked in development roles at Juniper Networks and Cisco Systems.
Description: Commonly used in image recognition, speech-to-text, and text analysis, Principal Components Analysis (or PCA) separates the signal from the noise in your data and reduces your dimensionality so that meaningful analyses can be performed.
Abstract: PCA is vital for reducing high dimensional models with sparsity issues, without sacrificing the information contributed by each feature. In this talk, I will be explaining what happens under the hood during PCA, making the code and math accessible and interpretable.
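As a rough illustration of what happens under the hood, the sketch below centers the data, builds the covariance matrix, and finds the first principal component by power iteration. It is a dependency-free approximation rather than the speaker's code: numerical libraries typically compute PCA through an eigendecomposition or SVD, and the function names and the fixed starting vector are assumptions made for the example.

```python
import math


def center(data):
    """Subtract each feature's mean so the data cloud is centered on the origin."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    """Sample covariance between every pair of features (needs at least two samples)."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols] for ci in cols]


def first_component(cov, iterations=200):
    """Power iteration: apply cov and renormalize until the vector settles on the
    direction of greatest variance. Assumes the data has nonzero variance and that
    the starting vector is not orthogonal to that direction."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]  # deliberately asymmetric start
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def project(centered, component):
    """Reduce each sample to a single coordinate along the component."""
    return [sum(x * w for x, w in zip(row, component)) for row in centered]
```

For points that lie on the line y = x, such as [[1, 1], [2, 2], [3, 3], [4, 4]], the first component points along the diagonal, and projecting onto it keeps all of the variance in a single coordinate.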
Bio: Rumman comes to data science from a quantitative social science background. Prior to joining Metis, she was a data scientist at Quotient Technology, where she used retailer transaction data to build an award-winning media targeting model. Her industry experience ranges from public policy to economics and consulting. Her prior clients include the World Bank, the Vera Institute of Justice, and the Los Angeles County Museum of the Arts. She holds two undergraduate degrees from MIT, a Masters in Quantitative Methods of the Social Sciences from Columbia, and she is currently finishing her Political Science PhD at the University of California, San Diego. Her dissertation uses machine learning techniques to determine whether single-industry towns have a broken political process. Her passion lies in teaching and learning from teaching. In her spare time, she teaches and practices yoga, reads comic books, and works on her podcast.