**Beyond the Buzzwords: Artificial Intelligence & Machine Learning in Healthcare**

It’s no secret that the number of artificial intelligence (AI) startups in healthcare is exploding. Data aggregator CB Insights reports that nearly 70 received funding in 2016, up from fewer than 20 in 2012. Indeed, their latest market map of the space shows 106 companies applying advanced statistical techniques to try to solve healthcare’s most intractable challenges. As interest in the space grows, so does the hype; articles with headlines like “‘AI-powered’ is tech’s meaningless equivalent of ‘all-natural’” aren’t hard to come by. So what’s really going on?

This article, part one of two, seeks first to unpack the question: what *is* AI? With a foundation established, we’ll look at the way it’s changing the healthcare industry in a forthcoming follow-up.

*Artificial Intelligence? Machine Learning?*

AI is not a new term. The field was born in the 1950s, and much of the early work focused on strong (or general) AI – technology whose intellectual capability is functionally equal to a human’s. Unsurprisingly, progress was slow and attention quickly shifted to weak (or applied) AI – technology focused on addressing a narrowly defined task. All modern AI is weak AI at best, even well-known examples such as Apple’s Siri, Google’s AlphaGo, or IBM’s Watson.

Machine learning is a means to the end of powering weak AI: it aims to give “computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959). It rose to prominence in the 1980s, is proving to be the most successful approach to powering modern AI, and is where most of today’s advancements lie, including in related subfields such as neural networks and deep learning.

Given that machine learning underpins modern AI, understanding it is critical to understanding why companies that harness it hope to make such an impact on the practice of medicine. Much of machine learning can be broken down into two disciplines: **supervised learning** and **unsupervised learning**. Unsupervised learning techniques are often used to organize and sanitize large datasets before using a supervised learning algorithm to discover new insights.

*Unsupervised Learning*

Say we have a massive dataset with thousands of health record data fields for millions of patients and we’re interested in making predictions using the data that could inform treatment decisions. Analyzing all of the patients across all of the dimensions is impractical, but unsupervised learning techniques allow us to group together similar patients through a technique known as clustering and use something called ‘dimensionality reduction’ to effectively compress the data.

Dimensionality reduction, though important and interesting, relies too heavily on linear algebra to explore here. Clustering is more approachable, and there are two common techniques:

**K-Means Clustering:** let the data organize themselves into a set number (k) of groups. To organize our millions of patients into just 15 groups, the algorithm would pick 15 patients as starting centroids and assign every other patient in the data set to the centroid (patient) to whom they “look” most similar based on the known data about them. Each centroid is then redefined as the average of all the data points (patients) clustered around it, and patients are reassigned to whichever updated centroid is now closest. The algorithm repeats until the centroids no longer “move” very far with each iteration. If you know the desired number of groups in advance, this represents a simple and powerful means of organizing data accordingly.
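To make the loop concrete, here is a minimal sketch in plain Python. It assumes each patient is reduced to a tuple of two numbers (age and tumor size in millimetres, both invented for illustration) and that “looking similar” means a small squared Euclidean distance; real health records would need far more careful feature preparation. The function names and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen patients as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assign every patient to the nearest current centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Redefine each centroid as the average of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        shift = max(squared_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        # Stop once no centroid "moves" more than the tolerance.
        if shift <= tol:
            break
    return centroids, clusters

# Toy data: (age, tumor size in mm) for six invented patients.
patients = [(30, 10), (32, 12), (31, 11), (70, 40), (72, 42), (71, 41)]
centroids, clusters = k_means(patients, k=2)
print(centroids)
print(clusters)
```

On well-separated toy data like this the two groups are recovered regardless of which patients are picked first; on real data the result can depend on the starting centroids, which is why practical implementations often run the algorithm several times.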

**Hierarchical Clustering:** let the data organize themselves into any number of groups. Hierarchical clustering allows the number of groups to be changed after the analysis is performed, offering more flexibility. To organize our patients using this tool, the algorithm would start with one cluster for each patient (n clusters in total), group together the two clusters (patients) that “look” most similar, and recalculate the similarity between all of the clusters (now n – 1). The algorithm repeats until every patient has been merged into a single cluster, and the sequence of merges can be plotted as a tree. By “cutting” the tree at various points, we can end up with 2, 3, 4, etc. clusters of similar patients.
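The bottom-up merging described above can be sketched in a few lines of plain Python. This version assumes similarity is measured by the average Euclidean distance between the members of two clusters (one common choice among several), and it stops merging once a requested number of clusters remains, which is equivalent to cutting the tree at that level. The function names and patient tuples are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(c1, c2):
    """Average distance between every pair of members across two clusters."""
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(points, n_clusters):
    # Start with one cluster per patient.
    clusters = [[p] for p in points]
    merges = []  # record of each merge, i.e. the branches of the tree
    while len(clusters) > n_clusters:
        # Find the two clusters that "look" most similar.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Toy data: (age, tumor size in mm) for six invented patients.
patients = [(30, 10), (32, 12), (31, 11), (70, 40), (72, 42), (71, 41)]
two_groups, merges = agglomerate(patients, n_clusters=2)
print(two_groups)
```

Comparing every pair of clusters at every step is slow for millions of patients; this sketch favors readability over the optimizations a production library would use.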

*Supervised Learning*

Now that the data are more organized, we can tackle more interesting questions, such as predicting the lifespan of a patient with cancer, using supervised learning techniques. Supervised learning algorithms build models of the world using a large quantity of known examples (our patients and their health records) that are labeled correctly (their lifespans post-diagnosis are known). All supervised learning algorithms work to uncover the relationship between variables in a dataset and the variable of interest, de novo. An appropriate algorithm, fit well on our training dataset of past patients, could allow us to predict a future patient’s outcome in real time based solely on the contents of their medical record.

Supervised learning is used for two tasks: regression and classification. Regression predicts continuous variables (e.g., how long will this patient with cancer live?) and classification groups data into classes (e.g., what kind of cancer does this patient have?). Here’s an introduction to some common algorithms.

**Regression – Ordinary Least Squares (Linear):** with only a few variables to analyze, simplicity reigns. In our example, this algorithm would define an equation specifying a patient’s lifespan by defining weights for each data field in a patient’s health record and then assigning values to the weights such that the error\* (the discrepancy between predicted and actual lifespan) is as small as possible. If all we had access to for a cancer patient was the size of their tumor, how old they were, and what medications they were taking, this approach might work well – you can define the equation and solve it with calculus. As you add variables, however, solving the equation directly becomes increasingly impractical.
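As a rough illustration of “define the equation and solve it,” the sketch below fits a linear model by solving the so-called normal equations directly, using plain Python and a small Gaussian elimination routine. The two features (tumor size in centimetres and age in years) and the lifespans are synthetic: they were generated from the made-up rule lifespan = 20 − 2 × size − 0.1 × age so that the recovered weights can be checked. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_ols(features, targets):
    """Return [intercept, weight_1, ...] minimizing the sum of squared errors."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(weights, feature_row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))

# Synthetic patients: (tumor size in cm, age in years) and lifespan in years.
features = [(1.0, 50), (2.0, 60), (3.0, 55), (4.0, 70), (2.5, 40)]
lifespans = [13.0, 10.0, 8.5, 5.0, 11.0]
weights = fit_ols(features, lifespans)
print(weights)
print(predict(weights, (2.0, 45)))
```

The work of building and solving this system grows quickly with the number of fields, which is one reason iterative methods become attractive for wide datasets.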

**Regression – Gradient Descent:** with a lot of variables, let the computer figure out the right model by trial and error. To start, put *all*\*\* of the fields of the health record in a long equation akin to linear regression and randomly assign a weight to each variable. From there, take partial derivatives with respect to the weight on each one in order to answer the question: if I increase the weight I assign to, for example, the size of the tumor by a tiny amount, does my prediction of lifespan get better or worse? Nudge each weight in the direction that makes the prediction better, then rinse and repeat for each variable, stopping once a small change to any weight no longer improves the accuracy of the prediction on the training data – that’s the optimal equation.
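Here is a minimal sketch of that trial-and-error loop in plain Python. It assumes the error being minimized is the mean squared error on the training data, that every weight is nudged at once on each pass (rather than one variable at a time), and that the features have already been rescaled to the 0–1 range so a single step size works for all of them. The data are synthetic, generated from the made-up rule lifespan = 10 − 4 × size − 3 × age (on the rescaled features), and the function names and step size are illustrative choices.

```python
import random

def predict(weights, row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

def gradient(weights, features, targets):
    """Partial derivatives of the mean squared error with respect to each weight."""
    n = len(targets)
    grad = [0.0] * len(weights)
    for row, target in zip(features, targets):
        error = predict(weights, row) - target
        grad[0] += 2 * error / n  # intercept
        for j, x in enumerate(row, start=1):
            grad[j] += 2 * error * x / n
    return grad

def gradient_descent(features, targets, learning_rate=0.1,
                     max_steps=20000, tol=1e-10, seed=0):
    rng = random.Random(seed)
    # Randomly assign a starting weight to the intercept and each variable.
    weights = [rng.uniform(-1, 1) for _ in range(len(features[0]) + 1)]
    for _ in range(max_steps):
        grad = gradient(weights, features, targets)
        # Stop when a tiny change to any weight no longer changes the error.
        if max(abs(g) for g in grad) < tol:
            break
        # Move every weight a small step in the direction that reduces the error.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights

# Synthetic, rescaled patients: (tumor size, age), each between 0 and 1.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2), (0.2, 0.8)]
lifespans = [10.0, 6.0, 7.0, 3.0, 7.4, 6.8]
weights = gradient_descent(features, lifespans)
print(weights)
```

The step size (`learning_rate`) matters: too large and the weights overshoot and diverge, too small and the loop takes a very long time to settle.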

**Classification – Random Forests:** a decision tree can be powerful; a virtual “forest” of them immensely so. The goal in building decision trees is to maximize the amount of knowledge gained by each split in the tree. In our specific example, a powerful tree might split on age a couple of times, then cancer stage, followed by response to treatment, etc. Decision trees have the benefit of being relatively easy to define and very easy to use once they are. One downside, however, is that a single tree can be very sensitive to quirks in the data it was built from, such as outliers. Statisticians have circumvented this by creating a super-decision tree called a random forest which, as the name implies, is made up of a bunch of individual decision trees whose predictions are averaged (or, for classification, put to a vote). The result is a more robust and powerful predictive tool.
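For the curious, a toy random forest fits in well under a hundred lines of plain Python. This sketch makes several assumptions that are not from the article: “knowledge gained” is measured with Gini impurity, each tree is trained on a random resample of the patients, each split considers only a random subset of the fields, and the forest classifies by majority vote. The patients, labels, and function names are all invented, and labels are assumed to be strings.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, feature_indices):
    """Find the (feature, threshold) that leaves the purest two groups."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, rng, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority  # leaf: predict the most common label
    n_features = len(rows[0])
    # Each split only looks at a random subset of the fields.
    candidates = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    split = best_split(rows, labels, candidates)
    if split is None:
        return majority
    _, f, threshold = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= threshold]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > threshold]
    return (
        f, threshold,
        build_tree([r for r, _ in left], [lab for _, lab in left], rng, max_depth - 1),
        build_tree([r for r, _ in right], [lab for _, lab in right], rng, max_depth - 1),
    )

def tree_predict(node, row):
    # Internal nodes are tuples; leaves are plain labels.
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def build_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Each tree sees a random resample (with replacement) of the patients.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented patients: (age, cancer stage, responded to treatment: 1 or 0).
rows = [(35, 1, 1), (42, 1, 1), (38, 2, 1), (45, 2, 1), (40, 1, 1),
        (68, 3, 0), (72, 4, 0), (65, 3, 0), (75, 4, 0), (70, 4, 0)]
labels = ["favorable"] * 5 + ["unfavorable"] * 5
forest = build_forest(rows, labels)
print(forest_predict(forest, (33, 1, 1)))
print(forest_predict(forest, (74, 4, 0)))
```

Because each tree sees a slightly different slice of the data, an oddity that misleads one tree is unlikely to mislead most of them, which is where the forest’s robustness comes from.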

*So What?*

Machine learning and, by extension, applied AI are transforming a growing number of industries at an accelerating pace. The goal of this article, the first of two in the series, was to go beyond the buzzwords and provide a flavor of just some of the statistical techniques reshaping the practice of medicine. Next time, we’ll explore how these techniques are being applied to real-world scenarios by cutting-edge companies.

*\*Technically, the sum of the squared errors.*

*\*\*But be careful not to overfit the model to the training data. After a certain point, adding more variables may improve the model’s performance on the training data but may actually hurt the model’s performance with never-before-seen real-world data. Managing this tradeoff effectively is critical to the successful application of machine learning and could merit an entire article of its own.*