Musings from rstudio::conf2020

I recently had the opportunity to attend rstudio::conf2020 in San Francisco, and it was such an enjoyable event. I got to meet some of the best and brightest minds in the R community and reaffirm my appreciation of RStudio and the community they have fostered. I wanted to spend a little time reflecting on the event.

Keras and TensorFlow in R

Prior to the conference, I elected to take a couple of additional days off work to attend a deep learning workshop with my good friend and coworker Jen. I wanted to take this workshop because I don’t have the opportunity to tackle many deep learning problems in my day job. I have done a little work with Keras and TensorFlow in Python, but I wanted to see what the API was like in R. It also seemed like the workshop with the biggest learning opportunity.

In the 14-hour workshop, we managed to cover the following topics:

  • The maths behind multilayer perceptrons and several flavors of gradient descent.
  • Backpropagation and MLPs.
  • The keras framework.
  • The keras functional API for combining models.
  • Convolutional neural networks and image recognition.
  • Transfer learning and the various prefabricated models on TensorFlow Hub.
  • NLP for topic modeling.
  • Recurrent neural networks and long short-term memory (LSTM) models.

It was a lot to cover, and the lead instructor, Bradley Boehmke, did a great job keeping the class engaged and making the material understandable.

Here are my big takeaways:

  1. Keras in R is a very thin wrapper around the Python library (see the sketch after this list):
  • It is zero indexed.
  • It is object oriented.
  • It leverages the exact same verbs as the Python library.
  2. It is computationally brutal. They spun up an AWS server with GPUs for us, and the LSTM modeling still took a considerable amount of time even with toy data.
  3. It is extremely interesting, and I wish I had more opportunities (and resources) to apply these learnings.
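
To make the “thin wrapper” point concrete, here is a minimal sketch of what a keras model looks like in R. The toy data and layer sizes are placeholders I made up, but the verbs (keras_model_sequential, layer_dense, compile, fit) mirror the Python API almost one-to-one:

```r
library(keras)

# Toy data: 1,000 observations, 10 features, binary outcome
x_train <- matrix(rnorm(1000 * 10), ncol = 10)
y_train <- sample(0:1, 1000, replace = TRUE)

# Build the model with the same verbs the Python library uses
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# Hold out 20% of the toy data for validation while fitting
history <- model %>% fit(
  x_train, y_train,
  epochs = 5, batch_size = 32,
  validation_split = 0.2
)
```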

I am good at self-teaching, but it was very refreshing to take a course with instructors who have extensive knowledge of these topics.

If you are curious about the workshop, you can clone the repo here. The repository contains all of the coursework and slides used for the workshop.

Conference

After the workshops, the rest of the attendees who did not take a workshop arrived and the conference began. I got to meet several coworkers whom I work with frequently but had yet to meet in person, as well as some new coworkers from a recent acquisition. It was great to hear about the new technologies that we will get to work with and the expertise that comes with a large acquisition.

Day 1 Keynotes

JJ Allaire and Open Source Software

This was one of the highlights of the conference. The first keynote was given by JJ Allaire, the CEO of RStudio. During his address, he spoke about the importance of open source and RStudio’s goals. The big announcement was that RStudio has been approved as a Public Benefit Corporation (B Corp). If you are not familiar with B corporations, here is a snippet from their site:

Certified B Corporations are a new kind of business that balances purpose and profit. They are legally required to consider the impact of their decisions on their workers, customers, suppliers, community, and the environment. This is a community of leaders, driving a global movement of people using business as a force for good.

This announcement was followed by a standing ovation. And rightly so. RStudio’s commitment to their community and stakeholders (not shareholders) is heartwarming.

Google Brain

After JJ’s speech, two research scientists from the Google Brain project, Fernanda Viégas and Martin Wattenberg, shared some general musings on the future of data science and the importance of data visualization. Some of the big topics were: debug your data first, loss functions are a part of UX, and the UMAP algorithm.

Debugging Data

As much as we want to believe that other industries spend less time cleaning data, it simply isn’t true. In my work, I handle a high volume of bad data. We handle customer tracking data from dozens of implementers, with very little QC and no data protocol. Those in tech might deal with less bad data, but they are swamped by the sheer velocity of streaming data. Data quality is always the problem. The speakers showed how one of their computer vision models was struggling with a couple of edge cases in ImageNet. It was positive that an image labeled as a cat was actually a frog. Lo and behold, ImageNet had some misclassified data!
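
That cat-versus-frog anecdote suggests a simple data-debugging trick: look at the cases where a model confidently disagrees with its label. Here is a rough sketch of that idea in R; the predictions table and the 0.9 confidence cutoff are hypothetical, just to show the shape of the check:

```r
library(dplyr)

# Hypothetical model output: one row per image with its label,
# the model's predicted class, and the predicted probability
predictions <- tibble::tibble(
  image_id   = 1:5,
  label      = c("cat", "cat", "frog", "dog", "cat"),
  predicted  = c("cat", "frog", "frog", "dog", "cat"),
  confidence = c(0.95, 0.97, 0.88, 0.91, 0.60)
)

# Flag confident disagreements as candidate labeling errors to review
label_suspects <- predictions %>%
  filter(predicted != label, confidence > 0.9) %>%
  arrange(desc(confidence))

label_suspects
```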

Loss Function is a UX decision

This is a topic we deal with a lot in my job. We tend to think of loss as a math problem to solve: if you specify the loss properly, you will get a better model, right? Well, not always. Think about your metrics, who you are trying to reach, and retention rates. Suddenly this looks like a user experience problem. For example, for an electric car charging model we built, we typically report several models using different optimization metrics. If you optimize for income equality, the model will try to find optimal charger locations that are uniform across the income distribution, whereas if you optimize for gross revenue, the charger locations cluster near high-income areas. You might also optimize for electric grid health, which adds yet another outcome. These are all great models that report very different results.
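
As a toy illustration of that point (not our actual charger model), here is a sketch in R where the only thing that changes is the objective. The candidate sites, income figures, and demand numbers are all simulated; ranking the same sites by projected revenue versus by even coverage across income groups picks noticeably different locations:

```r
library(dplyr)

set.seed(42)

# Hypothetical candidate charger sites; demand is loosely tied to local income
sites <- tibble::tibble(
  site_id       = 1:100,
  median_income = rlnorm(100, meanlog = 11, sdlog = 0.4)
) %>%
  mutate(projected_sessions = rpois(n(), lambda = 10 + median_income / 5000))

# Objective 1: gross revenue -> simply take the 10 busiest sites
by_revenue <- sites %>%
  top_n(10, projected_sessions)

# Objective 2: income equity -> take the 2 busiest sites in each income quintile
by_equity <- sites %>%
  mutate(income_quintile = ntile(median_income, 5)) %>%
  group_by(income_quintile) %>%
  top_n(2, projected_sessions) %>%
  ungroup()

# Same candidate sites, different objective, mostly different picks
length(intersect(by_revenue$site_id, by_equity$site_id))
```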

Uniform Manifold Approximation and Projection

Okay, on to the nerdier part of the talk. Along the lines of data visualization, Fernanda and Martin showed off a new algorithm making waves in the deep learning world, UMAP. UMAP can be thought of as a replacement for the slow t-SNE algorithm for viewing embeddings. The big change is that UMAP is fast. Very fast. The benchmarks I have seen show at least an 8x speedup over the scikit-learn implementation of t-SNE. If you are interested in a demo, definitely check out the Embedding Projector from Google Brain. It is pretty amazing that we can use these dimensionality reduction algorithms to collapse extremely high-dimensional data down to three dimensions.
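
If you want to try UMAP from R, here is a minimal sketch using the uwot package on a small built-in dataset. In practice you would feed it embeddings or other high-dimensional features instead of iris, and the n_neighbors and min_dist values below are just common starting points, not a recommendation:

```r
library(uwot)

# Collapse the four iris measurements down to three dimensions
iris_umap <- umap(
  iris[, 1:4],
  n_components = 3,   # project to 3D, like the Embedding Projector view
  n_neighbors  = 15,
  min_dist     = 0.1
)

# The result is a matrix of 3D coordinates, one row per observation
head(iris_umap)
```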

Meetups

During the conference, I had the opportunity to meet so many wonderful and brilliant people. Fernanda Viégas and Martin Wattenberg were a pleasure to talk with about the importance of visualization in data science. I got to meet Hadley Wickham, which was incredible. We got to chatting about the future of the tidyverse (keep a look out for dplyr 1.0.0). Finally, I also met with Jenny Bryan, the author of googlesheets and readxl. We talked about debugging, teaching, and the future of readxl (readr and readxl will have a unified API shortly!).

Wrap Up

I came back from rstudio::conf with so much optimism and energy. I went to a lot of talks, met wonderful people, and ate some wonderful food (more on that in another post). Typically, conferences are extremely draining, but even though I am tired, I am so inspired to take these learnings back to my work. Thank you, RStudio!