Office Hours

3 years spent eating, sleeping, and breathing data science

Banner vector created by vectorpocket — www.freepik.com

When I got my first corporate “data scientist” gig a few years ago, I barely knew what a decision tree was, I had no clue why people kept talking about random forests, and I had no grasp of what people actually meant when they talked about “AI,” which to me was primarily associated with dystopian movies. I was quite overwhelmed, to put it mildly.

These days I feel a lot more at ease, comfortably reading research papers from a wide range of topics within AI and ML, giving keynote talks, being a lead data scientist in a corporate pharma…


Advice for anyone looking or hiring for data science jobs

Image representing that these are my reflections on what makes a perfect data science candidate.
Office vector created by macrovector — www.freepik.com

Data science is as popular as ever but, paradoxically, also seems more fragmented and ill-defined than ever before. It can be quite difficult for newcomers to figure out how to break into the field, and perhaps even more difficult for managers to figure out how to hire for positions unless they know exactly what they’re looking for.

In this post, I summarize my reflections on what I look for in data science candidates. Disclaimer: these are reflections based on my time working in biotech and pharma companies where data science is a supporting function and not a…


A simple trick for improving model performance when labels are ordered.

Abstract vector created by vectorjuice — www.freepik.com

When you have a multiclass classification problem and there is an order to the classes, it is known as an “ordinal regression” problem. An example could be the classification of a student’s performance into categories A > B > C > D > E. The issue in solving this kind of problem using a normal classifier is that the model will assume that the error of misclassifying an A as a D is just as bad as misclassifying A as a B — this is obviously not true since the difference between A and D is much bigger than between…
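To make the asymmetry described above concrete, here is a minimal sketch in Python. It is not the trick from the post itself, only an illustration of the problem: a plain 0/1 error scores every misclassification the same, while an error based on the distance between class ranks does not. The function names and the choice of absolute rank difference are assumptions made for this illustration.

```python
# Grades in order from best to worst, as in the example above.
GRADES = ["A", "B", "C", "D", "E"]
RANK = {grade: position for position, grade in enumerate(GRADES)}


def zero_one_error(true_grade, predicted_grade):
    """Error as seen by an order-unaware classifier: any miss costs 1."""
    return 0 if true_grade == predicted_grade else 1


def ordinal_error(true_grade, predicted_grade):
    """Illustrative order-aware error: how many steps apart the grades are."""
    return abs(RANK[true_grade] - RANK[predicted_grade])


# Compare predicting B versus D when the true grade is A.
for predicted in ["B", "D"]:
    print(
        predicted,
        zero_one_error("A", predicted),
        ordinal_error("A", predicted),
    )
```

With the 0/1 error both mistakes cost the same, whereas the rank-based error penalizes predicting D for a true A three times as much as predicting B.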


When data science becomes an obsession.

Technology photo created by freepik — www.freepik.com

For the longest time, I have wondered why some data scientists spend every waking moment of their day obsessively consuming knowledge, honing their skills, participating in competitions, creating hobby projects, and generally expanding their horizons. Meanwhile, others are more comfortable with the status quo and exploiting current skills to solve the problems that come their way.

Of course, any data scientist worth their salt will be expanding and improving their skills continuously; that is one of the prerequisites for becoming a good data scientist in the first place. This is not what I’m talking about here. I’m talking about the…


State of the art results using Chemprop & graph neural networks

Picture of experiments and machine learning working together
Designed by macrovector — www.freepik.com

In this post, we use Machine Learning / AI to predict the properties of small molecules (a task known as QSAR). This is done by using state-of-the-art graph neural networks from the open-source library Chemprop.

Typical pharmaceuticals come in the form of small molecules that can regulate some biological processes in our bodies. Unfortunately, an unimaginable number of things can go wrong in this process: the compounds can be toxic, clear very slowly from our bodies, interact with unintended other molecules, etc. We therefore want to test these small molecules very carefully before they ever get injected into anyone.


A tutorial on simplifying MLOps using open source tools

People vector created by pch.vector & pikisuperstar — www.freepik.com

Figuring out what people mean when they say “MLOps” is hard. Figuring out how to properly do MLOps, even for a technical person, is perhaps even more difficult. How difficult must doing MLOps be, then, for a citizen data scientist who knows nothing of web technologies, Kubernetes, monitoring, cloud infrastructure, etc.? Here I continue exploring how to set up an open-source MLOps framework for this purpose: specifically, I outline and show how a combination of Databricks, MLflow, and BentoML can potentially provide a compelling, extensible, and easy-to-use MLOps workflow for end-users.

I have previously discussed that an MLOps framework must…


Trend detection on Towards Data Science posts

Screenshot by author — live visualization at end of post

Today I scraped 41,739 Medium post titles from the publication Towards Data Science. Based on these titles, I wanted to try and find out which topics have been trending over time, as well as which topics are trending today.

In order to perform the trend detection analysis I wanted to use some clever way of giving context to all this data; essentially a way to convert all the blog post titles into a numerical format which preserves the context of the titles. A good way to go about this kind of challenge is usually to use a pre-trained model from…


Hands-on Tutorials

A comprehensive MLOps tutorial with open source tools

People vector created by pch.vector — www.freepik.com

Getting machine learning (ML) models into production is hard work. Depending on the level of ambition, it can be surprisingly hard, actually. In this post, I’ll go over my personal thoughts (with implementation examples) on principles suitable for the journey of putting ML models into production within a regulated industry; i.e., when everything needs to be auditable, compliant, and in control — a situation where a hacked together API deployed on an EC2 instance is not going to cut it.

Machine Learning Operations (MLOps) refers to an approach where a combination of DevOps and software engineering is leveraged in a…


Hands-on Tutorials

Can we visualize the context being used for search?

Photo by henry perks on Unsplash

Using transformer-based models for searching text documents is awesome; nowadays it is easy to implement using the huggingface library, and results are often very impressive. Recently I wanted to understand why a given result was returned. My initial thoughts went to various papers and blog posts on digging into the attention mechanisms inside the transformers, which seems a bit involved. In this post I test out a very simple approach, using some simple vector math, to get a glimpse into the context similarities picked up by these models when doing contextual search. Let’s try it out.
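As background for the kind of vector math involved, here is a minimal sketch of how a search result can be ranked by cosine similarity between a query vector and document vectors. This is an assumption about the general setup, not the approach of the post itself: the three-dimensional vectors are made-up stand-ins for real transformer embeddings, and the names `cosine_similarity` and `rank_documents` are hypothetical helpers written for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {
        name: cosine_similarity(query_vec, vec)
        for name, vec in doc_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors standing in for embeddings produced by a transformer model.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(name, round(score, 3))
```

In a real setting the vectors would come from an embedding model, and the same scoring could be applied to parts of a document to see which ones contribute most to a match.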

For the purpose of…

Mathias Gruber

Chief Data Scientist & Full Stack Developer
