Office Hours

3 years spent eating, sleeping, and breathing data science

Banner vector created by vectorpocket

When I got my first corporate “data scientist” gig a few years ago, I barely knew what a decision tree was, I had no clue why people kept talking about random forests, and I had no grasp of what people actually meant when they talked about “AI,” which I primarily associated with dystopian movies. I was quite overwhelmed, to put it mildly.

These days I feel a lot more at ease, comfortably reading research papers on a wide range of topics within AI and ML, giving keynote talks, being a lead data scientist in a corporate pharma…

Hands-on Tutorials

A comprehensive MLOps tutorial with open source tools

People vector created by pch.vector

Getting machine learning (ML) models into production is hard work. Depending on the level of ambition, it can be surprisingly hard. In this post, I’ll go over my personal thoughts (with implementation examples) on principles suitable for putting ML models into production within a regulated industry, i.e., when everything needs to be auditable, compliant, and in control. In that situation, a hacked-together API deployed on an EC2 instance is not going to cut it.

Machine Learning Operations (MLOps) refers to an approach where a combination of DevOps and software engineering is leveraged in a…

State of the art results using Chemprop & graph neural networks

Picture of experiments and machine learning working together
Designed by macrovector

In this post, we use machine learning / AI to predict the properties of small molecules (a task known as QSAR, short for quantitative structure-activity relationship modeling). This is done by using state-of-the-art graph neural networks from the open-source library Chemprop.

Typical pharmaceuticals come in the form of small molecules that can regulate some biological processes in our bodies. Unfortunately, an unimaginable number of things can go wrong in this process; the compounds can be toxic, clear very slowly from our bodies, interact with other, unintended molecules, etc. We therefore want to test these small molecules very carefully before they ever get injected into anyone.

A tutorial on simplifying MLOps using open source tools

People vector created by pch.vector & pikisuperstar

Figuring out what people mean when they say “MLOps” is hard. Figuring out how to properly do MLOps, even for a technical person, is perhaps even more difficult. How difficult must doing MLOps be, then, for a citizen data scientist who knows nothing of web technologies, Kubernetes, monitoring, cloud infrastructure, etc.? Here I continue exploring how to set up an open-source MLOps framework for this purpose: specifically, I outline and show how a combination of Databricks, MLflow, and BentoML can potentially provide a compelling, extensible, and easy-to-use MLOps workflow for end-users.

I have previously discussed that an MLOps framework must…

Trend detection on Towards Data Science posts

Screenshot by author — live visualization at end of post

Today I scraped 41,739 Medium post titles from the publication Towards Data Science. Based on these titles, I wanted to try to find out which topics have been trending over time, as well as which topics are trending today.

To perform the trend detection analysis, I wanted to use some clever way of giving context to all this data; essentially, a way to convert all the blog post titles into a numerical format that preserves the context of the titles. A good way to go about this kind of challenge is usually to use a pre-trained model from…

Hands-on Tutorials

Can we visualize the context being used for search?

Photo by henry perks on Unsplash

Using transformer-based models for searching text documents is awesome; nowadays it is easy to implement using the Hugging Face library, and results are often very impressive. Recently I wanted to understand why a given result was returned. My initial thoughts went to various papers and blog posts about digging into the attention mechanisms inside the transformers, which seems a bit involved. In this post I test out a very simple approach, using some simple vector math, to get a glimpse into the context similarities picked up by these models when doing contextual search. Let’s try it out.

For the purpose of…

Making Sense of Big Data

Use PaCMAP instead for increased interpretability & speed

Plot created by author

Dimensionality reduction techniques such as t-SNE¹, UMAP², and TriMap³ are ubiquitous within the field of data science, and given their impressive visual performance (combined with ease of use), they can rightfully be found in any complete exploratory data analysis of a given dataset. However, these specific methods (t-SNE, UMAP, and TriMap) likely should not be your first go-to option for dimensionality reduction. …

Mathias Gruber

Chief Data Scientist & Full Stack Developer
