
Machine Learning Series Day 4 (K-NN)

I promise it’s not just another “ML Article.”

Terminology:

  1. Euclidean Distance: A distance between two data points (p1, p2, …, pn). It is the square root of the sum of the squared differences between the points' corresponding coordinates.
  2. Manhattan Distance: A distance between two data points (p1, p2, …, pn). It is the sum of the absolute differences between the points' corresponding coordinates.
  3. Chebyshev Distance: Unlike the previous two, it is the maximum of the absolute differences between the points' corresponding coordinates.
  4. K: Or the number of neighbors. It's the core parameter of K-Nearest Neighbors: it determines how many neighboring data points the model consults when making a prediction.

Concept:

You have probably had a thought process similar to the K-Nearest Neighbor (KNN) algorithm. Let's say you're a baseball scout (I read that the Oakland A's used K-Nearest Neighbors to discover undervalued players). As a baseball scout, you are responsible for finding the next superstar, the next Mike Trout. But in a sport filled with great players, how can you make such a distinction?

You go to baseball camps and are stunned by the level of talent on the field. But there's one player who stands out. This young player has a smooth and powerful swing. His fielding is impeccable. He reminds you of Carlos Correa and Nolan Arenado, both MLB superstars. Notice your thought process: you are comparing the characteristics of previous players to help you assess the future of this young superstar. It's very similar to a KNN model!

You find the similarities between your prospect and previous players: players who never made it, current MLB players, and retired MLB players. You bucket potential players into three categories: desired players, mediocre players, and superstars. Then you categorize your prospect by how similar he is to previous players. Simple enough?

Details:

Supervised/Unsupervised: Supervised

Regression/Classification: Classification

Unlike many other classification algorithms, KNN does not build a model during a separate training phase. Your training set is the model: the algorithm simply stores it and defers all computation until prediction time (which is why KNN is often called a "lazy learner").

You must first decide the total number of neighbors (symbolically represented by k). In our baseball analogy, this is the total number of similar players that you will compare with your prospect. You can choose 1; however, I would recommend against it.

There’s a bias in only using 1 neighbor.

  • Quick Note: I heard a podcast with Daniel Kahneman on "Conversations with Tyler." Kahneman discusses the effect of biases. If the temperature of the room, hunger, the day of the week, or any other external factor can sway our decisions, it's arduous to make consistent decisions throughout life.
  • Kahneman suggests involving more people to reduce individual biases. Google uses four interviewers to correct for systematic biases.
  • Four people (including yourself), according to Daniel Kahneman, have been shown to reduce bias by 50%. I can't find the study he cited on Tyler's podcast, so that's a bummer. Let me know if you know the study Kahneman is referring to!
  • Sorry that I got a bit sidetracked, but the parallel between machine learning and psychology is impressive!

You should also not choose an even number of neighbors.

If you decide to use four nearest neighbors (baseball players), and two of them are MLB Hall of Famers while the other two are 30-year-old minor league players, how do you break the tie?

Hence, it’s suggested to use an odd number.
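
As a quick illustration of the tie problem (a hypothetical sketch, not part of the original article; the labels are invented), counting the votes of four neighbors can leave you stuck:

    # Hypothetical vote count among k = 4 nearest neighbors (labels are illustrative).
    from collections import Counter

    neighbor_labels = ["Hall of Famer", "Hall of Famer", "minor leaguer", "minor leaguer"]
    votes = Counter(neighbor_labels)
    print(votes.most_common())  # [('Hall of Famer', 2), ('minor leaguer', 2)] -> a tie
    # With an odd k (say 3 or 5) and two classes, a strict majority is guaranteed.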

Visual:

The image above describes the objective of the K-Nearest Neighbor algorithm. Notice that we are deciding whether a new example belongs to "Class 1" or "Class 2" based on its neighbors. If K = 3, two of the neighboring data points suggest "Class 2," while only one suggests "Class 1." Hence, the majority wins, and the new example is assigned to "Class 2."
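
To make the voting concrete, here is a minimal classification sketch using scikit-learn's KNeighborsClassifier (my choice of library, not one the article prescribes; the toy data is invented for illustration):

    # A minimal KNN classification sketch; the player statistics below are made up.
    from sklearn.neighbors import KNeighborsClassifier

    # Two features per player (e.g., batting average, fielding percentage).
    X_train = [[0.310, 0.990], [0.295, 0.985], [0.180, 0.940], [0.200, 0.930], [0.320, 0.995]]
    y_train = ["superstar", "superstar", "mediocre", "mediocre", "superstar"]

    # k = 3 (an odd number, as recommended above); the default metric is Euclidean.
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)

    prospect = [[0.305, 0.988]]
    print(model.predict(prospect))  # the majority class among the 3 nearest training players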

Mathematics/Statistics:

The distance formula used in KNN models:

1. Euclidean distance

Typically, KNN models use the Euclidean distance. The image above shows a dataset with two dimensions. However, as you probably know, most datasets are not two-dimensional!

Formula (for points p = (p1, …, pn) and q = (q1, …, qn)): d(p, q) = √((p1 – q1)² + (p2 – q2)² + … + (pn – qn)²)

For simplicity, let's say we have a three-dimensional dataset. How do we calculate the Euclidean distance?

  • Observation_1: [1, 7, 9]
  • Observation_2: [11, 21, 4]
  • (1–11)² + (7–21)² + (9–4)² = 100 + 196 + 25 = 321
  • The square root of 321 ≈ 17.92
  • The distance between these two observations is about 17.92!
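
If you want to double-check the arithmetic, here is the same calculation as a short NumPy sketch (NumPy is my choice for illustration; the observation values come from the example above):

    # Euclidean distance between the two observations from the example.
    import numpy as np

    observation_1 = np.array([1, 7, 9])
    observation_2 = np.array([11, 21, 4])

    # Square root of the sum of the squared coordinate-wise differences.
    euclidean = np.sqrt(np.sum((observation_1 - observation_2) ** 2))
    print(round(euclidean, 2))  # 17.92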

2. Manhattan distance

Another method, though used less frequently, is the Manhattan distance.

Formula: d(p, q) = |p1 – q1| + |p2 – q2| + … + |pn – qn|

Using the vectors from the example above, the calculation would be:

  • Observation_1: [1, 7, 9]
  • Observation_2: [11, 21, 4]
  • |1–11| + |7–21| + |9–4| = 10 + 14 + 5 = 29
  • With the Manhattan distance, the distance between the two observations is now 29!
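
The same two observations, this time in a short NumPy sketch of the Manhattan distance (again, just an illustration):

    # Manhattan distance: the sum of the absolute coordinate-wise differences.
    import numpy as np

    observation_1 = np.array([1, 7, 9])
    observation_2 = np.array([11, 21, 4])

    manhattan = np.sum(np.abs(observation_1 - observation_2))
    print(manhattan)  # 29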

3. Chebyshev distance

Another option is the Chebyshev distance, also known as the chessboard distance because "in the game of chess the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares."

Formula: d(p, q) = max(|p1 – q1|, |p2 – q2|, …, |pn – qn|)

The formula states that the distance is the largest of the absolute differences between corresponding coordinates (please refer to the computation below if my explanation is unclear).

  • Observation_1: [1, 7, 9]
  • Observation_2: [11, 21, 4]
  • max(|1–11|, |7–21|, |9–4|) = max(10, 14, 5) = 14
  • With the Chebyshev distance, the distance between the two observations is now 14!
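
And the corresponding NumPy sketch for the Chebyshev distance (illustrative only):

    # Chebyshev distance: the maximum absolute coordinate-wise difference.
    import numpy as np

    observation_1 = np.array([1, 7, 9])
    observation_2 = np.array([11, 21, 4])

    chebyshev = np.max(np.abs(observation_1 - observation_2))
    print(chebyshev)  # 14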

Thoughts:

Geometrically, the three distance formulas are illustrated below.

In practice, the Euclidean distance is used most often.

Final Thoughts:

KNN is a good model because it doesn't rely on any assumptions about the data. It is also easy to understand.

However, it's computationally expensive: it has to store all of the training data and compute distances to every stored point at prediction time. It also suffers from what is known as the Curse of Dimensionality.

The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.

Intuitively, there's a great Quora post that describes this phenomenon. Imagine you lose a coin along a 100-yard line. Finding it is difficult, but now imagine your buddy loses a coin on a 100-yard by 100-yard field. He has far more ground to cover. Now imagine another buddy loses a coin somewhere in a stadium that is also 100 yards tall. Notice the difficulty?

As we include more features/variables, the complexity of the problem increases exponentially.
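
To make that sparsity argument concrete, here is a small, illustrative NumPy experiment (my own sketch, not from the article): for randomly generated points, the gap between the nearest and farthest neighbor shrinks relative to the distances themselves as the number of features grows, which is exactly what makes "nearest" less meaningful.

    # Illustration of the Curse of Dimensionality: distances concentrate as features grow.
    import numpy as np

    rng = np.random.default_rng(0)

    for n_features in [2, 10, 100, 1000]:
        points = rng.random((500, n_features))   # 500 random observations
        query = rng.random(n_features)           # one new observation
        distances = np.sqrt(np.sum((points - query) ** 2, axis=1))  # Euclidean distances
        relative_gap = (distances.max() - distances.min()) / distances.min()
        print(f"{n_features:>4} features -> relative gap {relative_gap:.2f}")
    # The relative gap shrinks with more features, so the "nearest" neighbor stands out less.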

WANT MORE…

If so, I suggest following my Instagram page. I post summaries and thoughts on the books I have read and am currently reading.

Instagram: Booktheories, Personal

Follow me on: Twitter, GitHub, and LinkedIn

AND if you liked this article, I'd appreciate it if you clicked the like button below. THANKS!
