PokeML: Understanding Clustering through Pokemon Data

Patrick Martin · 13 min read · Sep 19, 2021

Anyone who has played Pokemon has noticed that there are certain tropes that Pokemon fall into: Pikachu, Plusle, Pachirisu, and Dedenne are all cute Electric-type Pokemon; there are a variety of “big monster” Pokemon like Rhydon, Nidoking, Tyranitar, and Aggron; and many other classes one might notice. As data scientists, we ask ourselves: can we identify these groupings using the data? What other groupings can we find?

The algorithmic process of finding unknown groupings, or clusters, in data is called clustering. Clustering is generally unsupervised: we do not know in advance what the clusters should be, although we might have opinions on what they should look like. The supervised version of clustering, in which we have cluster labels that we want to predict, is called classification, and it will be the subject of a future PokeML article.


We might believe, however, that there are well-defined clusters in the data: that this could be a classification problem that is merely missing its labels. The existence of a classification function then allows us to reason mathematically about what our data must look like. In particular, if the classification function is continuous, meaning that similar Pokemon belong to the same class, then a fact from topology tells us that the data must be disconnected: there must be clusters!

Continuity and disconnectedness are fundamentally topological statements, so we have to make sure that our encoding of the data is compatible with the intrinsic topology of the problem. This is why we went to great pains to encode aspects of Pokemon as bit-vectors and used fancy techniques to generate alternative encodings of Pokemon moves and abilities. Now that we have those, we can perform the clustering.

The most basic clustering algorithm is k-Means, which works when the data lies in a vector space. We first pick the number of clusters k we wish to find (a downside to this type of algorithm) and randomly place k points, called centroids, in the space, each representing a single cluster. The algorithm then iterates the following two steps:

  1. Assign each datapoint to its nearest centroid
  2. Recompute the centroid as the mean of the datapoints assigned to it
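In code, the loop is only a few lines. Here is a minimal NumPy sketch (the variable names and the convergence test are my own choices, not taken from any particular codebase):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize: randomly pick k datapoints as the starting centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 1: assign each datapoint to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: recompute each centroid as the mean of its datapoints
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):  # stabilized
                break
            centroids = new
        return labels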

After this process stabilizes, we have cluster assignments. While simple, this process has a few drawbacks. k-Means tends to look for circular (or really "Voronoi-esque") clusters, which are not the only shape clusters can take; moreover, as a randomized and iterative algorithm, it is sensitive to its initialization and may stabilize in a local minimum. Finally, while our original Pokemon matrix lives in a vector space, as the Pokemon are literally represented as vectors, it is not obvious how we should incorporate the topic model information. For the following, we will replace the ability and move bit-vectors with the average of the topic distributions of each Pokemon's abilities and moves (computed separately). Additionally, we will standardize the columns of the Pokemon matrix to have unit variance, and look for 30 clusters.
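In scikit-learn terms, the pipeline might look like the sketch below; the file name is a placeholder for however you have stored the feature matrix:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical feature matrix: one row per Pokemon, with stats, type bits,
    # and the averaged move/ability topic distributions described above.
    X = np.load("pokemon_features.npy")

    X_std = StandardScaler().fit_transform(X)   # unit variance per column

    labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(X_std)
    print(np.sort(np.bincount(labels)))         # distribution of cluster sizes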

Doing this on the original Pokemon matrix, we get some small clusters with fewer than 20 members, several with 20–60 members, and one with over 100 members! In the topic model matrix, the clusters are more polarized, with sizes mostly below 20 or above 60. The real check is done by hand, though, in which we look at the clusters themselves and see if things make sense. For brevity, we'll look at a random selection of the clusters.

Original Pokemon Matrix clusters
#4 (4): Baltoy, Claydol, Bronzor, Shuckle
#11 (83): Marshtomp, Poliwhirl, Drizzile, Kingdra, Quagsire ...
#14 (5): Morelull, Shiinotic, Foongus, Vileplume, Amoonguss
#17 (70): Azelf, Lunatone, Uxie, Calyrex, Solrock ...
#26 (2): Zekrom, Kyurem-Black
#28 (62): Shelgon, Bagon, Vibrava, Salamence, Flygon ...
Topic Model Pokemon Matrix clusters
#0 (9): Conkeldurr, Darmanitan-G, Kangaskhan, Darmanitan, Rhydon ...
#4 (13): Rhyhorn, Swinub, Shelmet, Spheal, Seedot ...
#7 (1): Heatmor
#12 (62): Moltres, Flygon, Arcanine, Dracozolt, Arctozolt ...
#26 (73): Tangela, Bellossom, Grookey, Lilligant, Thwackey ...
#29 (2): Urshifu, Urshifu-Rapid Strike

I would say these clusters are decent. Some small clusters do pick out rather unique Pokemon — Heatmor, the Urshifus, Zekrom & Kyurem — and the larger clusters have some rhyme and reason. There are some iffy results, though, like whether Bronzor truly belongs in a cluster with Shuckle and Claydol but without Bronzong or what Cluster #4 from the topic model data is supposed to mean.

One benefit of k-Means being both simple and randomized is that we can use its results to describe the Pokemon themselves. We might think that Pokemon assigned to small clusters are more "unique"; by running k-Means multiple times and tracking the size of the cluster each Pokemon lands in, we can measure this quality. Since cluster size varies a lot from run to run, the mean is not too instructive. Instead, we consider the 75th percentile of a Pokemon's cluster size, which is sensitive to both the typical size and the variance of the clusters it lands in.
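A sketch of this measurement, reusing the standardized matrix from above (the run count and percentile are the knobs to play with):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_size_percentile(X, n_runs=50, k=30, q=75):
        """For each row of X, the q-th percentile, over repeated k-Means runs,
        of the size of the cluster that row is assigned to."""
        sizes = np.empty((n_runs, X.shape[0]))
        for run in range(n_runs):
            labels = KMeans(n_clusters=k, n_init=1, random_state=run).fit_predict(X)
            counts = np.bincount(labels, minlength=k)
            sizes[run] = counts[labels]          # size of each point's cluster
        return np.percentile(sizes, q, axis=0)

    uniqueness = cluster_size_percentile(X_std)  # small values = more "unique"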

Original matrix:
Most "unique": Wobuffet, Zygarde 10%, Cosmog, Cosmoem, Zacian-Crowned
Least "unique": Toxel, Mawile, Dunsparce, Dreepy, Dubwool
Topic model matrix:
Most "unique": Necrozma Dawn Wings, Cosmog, Cosmoem, Rotom-Heat, Rotom-Wash
Least "unique": Seadra, Wartortle, Drakloak, Popplio, Mudkip

These also seem reasonable: Cosmog and Cosmoem are certainly unique Pokemon and the more typical Pokemon identified by the topic model matrix are generally not-fully-evolved Water types, which is the most common type. The more typical Pokemon from the original matrix are somewhat surprising — aside from Dubwool, these Pokemon don’t feel particularly standard.

This leads to another issue with k-Means clustering, shared by a broad suite of clustering algorithms: every element is assigned to a cluster, even if it doesn't really belong anywhere. A datapoint that is well isolated from a cluster may still be assigned to it; in k-Means, for example, so long as that point does not influence the centroid too much (by being distant, or by being balanced against other outliers), it will simply be assigned to the nearest cluster. For this reason it can be misleading to look at 'random' members of a cluster; the members closest to the centroid (or the centroid itself!) are more representative.

There are various ways of mitigating this issue of ill-assigned cluster membership, and we will discuss two of them in the remainder of this article. The first is agglomerative clustering, which forms clusters by grouping elements together sequentially (agglomeratively) in order of increasing “distance”, stopping when a desired criterion is met. As we are concerned about Pokemon that are far from all of the true clusters, we might hope that at some point towards the end of the process the distance between two merged clusters begins to spike, indicating that we are merging objects that do not belong together.


Indeed, that is precisely what we see. We can also use a neat trick to identify the elbow: by rotating the plot so that the first and last datapoints are horizontally aligned, the elbow lies near the lowest point of the rotated plot. By stopping our clustering at that point, in this case with 31 clusters, we can hope that the clusters are fairly well-formed. A sketch of the trick is below, and then we'll take a look at the clusters!
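Assuming merge distances from a SciPy linkage, the rotation can be implemented as the equivalent "furthest below the chord" test; the window size and the off-by-one bookkeeping here are my own choices:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def elbow_index(y):
        """Index of the point furthest below the chord joining the first and
        last points: the 'lowest point of the rotated plot'."""
        y = np.asarray(y, dtype=float)
        chord = np.linspace(y[0], y[-1], len(y))
        return int(np.argmin(y - chord))

    Z = linkage(X_std, method="ward")
    heights = Z[:, 2]              # distance at each successive merge (increasing)
    window = 60                    # inspect only the final merges
    g = len(heights) - window + elbow_index(heights[-window:])
    n_clusters = len(X_std) - g    # stopping before merge g leaves n - g clusters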

Original Pokemon Matrix clusters
#5 (88): Piloswine, Boldore, Sealeo, Armaldo, Swinub ...
#9 (9): Clefairy, Igglybuff, Wigglytuff, Cleffa, Clefable ...
#13 (2): Zygarde, Zygarde-Complete
#14 (46): Gloom, Petilil, Budew, Treecko, Grovyle ...
#17 (2): Xerneas, Yveltal
#19 (29): Ponyta, Darumaka, Growlithe, Arcanine, Rapidash ...
#20 (1): Comfey
#21 (2): Toxtricity, Toxtricity-Low Key

Here we see some pretty well-formed clusters — the X&Y Box Legendaries are grouped together (#17), as are the Clefairy/Jigglypuff/Chansey trios (#9). The one-offs also make sense: Comfey, Toxtricity, and Zygarde are all reasonably unique.

One additional benefit of agglomerative clustering, and of many other non-k-Means methods, is that it only requires our datapoints to live in a metric space, not a vector space. In short, this means we need only know the distance between any two of our points, not where in space each point lies. This is extremely useful for our Topic Model matrix, given how much we struggled above to encode the varied topic distributions in a way that made sense for k-Means; we merely averaged the distributions, which destroyed a lot of information. Using that averaged encoding for agglomerative clustering, we get one cluster with 306 members, nearly half of the Pokemon! If we want better clustering, we're going to have to change our encoding.

Instead, we may compute a distance between Pokemon Movesets and Abilities by matching the topic distributions of the Moves and Abilities between the two Pokemon. Part of this is simple: Moves and Abilities that are shared between the two Pokemon do not increase the distance. However, for example, how should we measure the distance between Magikarp and Feebas?

Magikarp
Abilities: Swift Swim, Rattled
Moves: Bounce, Flail, Hydro Pump, Splash, Tackle
Feebas
Abilities: Swift Swim, Oblivious, Adaptability
Moves: Dive, Dragon Pulse, Flail, Splash, Tackle, (and 32 others)

As someone with experience with Pokemon, I might want to pair Bounce with Dive (as both are two-turn physical attacks) and Hydro Pump with Dragon Pulse (as both are single-target special attacks), but I don't want to have to do this by hand. I also need to be careful (perhaps pairing Hydro Pump with Dive would be better, as both are single-target Water-type attacks) and make sure I find the optimal pairing each time. The problem is that there are 34⋅33⋅32 ≈ 36,000 ways to pair Magikarp's three unmatched moves with three of Feebas' 34 unmatched moves, and the count grows factorially with moveset size. For the two Pokemon with the most disjoint Movesets, Charizard and Calyrex-Ice, learning 60 and 79 Moves the other does not, respectively, the number of pairings has 100 digits!


What we have is a case of the linear sum assignment problem: which matching produces the lowest total distance? This thankfully has a relatively fast solution, the Hungarian Algorithm, which reduces the computational cost from the naive O(n!) to a much more manageable O(n³). Computing these distances for all 734 Pokemon was still slow, with my non-optimized Python implementation taking a little over an hour.
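SciPy ships an implementation, so a sketch of the per-pair distance might look like the following. The Euclidean metric between topic distributions is my assumption, and the treatment of unmatched moves may differ from the author's: linear_sum_assignment on a rectangular cost matrix simply leaves the extras of the larger set unmatched.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def moveset_distance(topics_a, topics_b):
        """Minimum-cost matching between two sets of move topic
        distributions (one row per move)."""
        cost = cdist(topics_a, topics_b)          # pairwise move-to-move distances
        rows, cols = linear_sum_assignment(cost)  # Hungarian-style optimal matching
        return cost[rows, cols].sum()             # shared moves match at zero cost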

Replacing the vector encodings with a pairwise distance matrix does come at an additional cost, however. In the previous agglomerative clustering, the merge criterion we used was cluster variance: at each step, the pair of clusters with minimal combined total variance was merged. This criterion is special in two ways: one, it often works pretty well, and two, it can be computed efficiently using Ward's method. Ward linkage requires a vector space, though, so for a precomputed distance matrix sklearn provides three other linkages; it turns out that for our dataset the "complete" linkage works best, as "single" and "average" both produce a single cluster containing over 700 Pokemon, which is not particularly useful.
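With recent scikit-learn (older versions call the keyword affinity instead of metric), a sketch, assuming D is the 734×734 distance matrix computed above:

    from sklearn.cluster import AgglomerativeClustering

    agg = AgglomerativeClustering(
        n_clusters=34,
        metric="precomputed",   # consume our pairwise distances directly
        linkage="complete",     # "single"/"average" collapse into one giant cluster
    )
    labels = agg.fit_predict(D)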

Topic Model Pokemon Matrix Clusters (34)
#0 (179): Marshtomp, Mudkip, Corsola-Galar, Meowth-Galar ...
#3 (106): Tangela, Grookey, Bellossom, Lilligant, Vileplume ...
#5 (25): Smoochum, Mime Jr, Cutiefly, Pichu, Igglybuff ...
#10 (2): Celesteela, Stakataka
#11 (3): Solgaleo, Lunala, Necrozma
#22 (1): Bunnelby
#24 (1): Zeraora
#31 (1): Mew

This appears to have worked pretty well too! Certainly, the distribution of cluster sizes is more concentrated than in the original matrix, with two clusters containing over a third of the Pokemon, but the clusters generally make sense. The Alolan Legendaries are grouped together in #11, and Baby Pokemon appear to be grouped in #5. On the other hand, Bunnelby does not particularly feel like it belongs in a cluster all by itself.

Spectral clustering is another clustering algorithm we can use, and it works directly with a pairwise distance matrix. The "spectral" in the name comes from the spectrum in linear algebra, the set of eigenvalues of a matrix: the values λ for which there is a nonzero vector v (an eigenvector) satisfying Mv = λv.

Imagine if we had a perfect adjacency matrix for our clustered data: an n×n matrix where an entry contains a 1 if the elements corresponding to the row and to the column for that entry were in the same cluster, and a 0 otherwise. We can immediately determine some eigenvectors and eigenvalues for this system: a characteristic vector of a cluster, with 1s in the rows whose elements belong to the cluster and 0s elsewhere, is an eigenvector whose eigenvalue is the number of elements in the cluster.


Moreover, if we were to subtract from each entry on the diagonal of this matrix the size of the corresponding cluster (also called the degree), each of these characteristic vectors would have an eigenvalue of zero. Conversely, if we look at the eigenvectors corresponding to the zero eigenvalue, they should look like characteristic vectors! This property is useful enough that this special matrix has a name: the Laplacian (conventionally written with the opposite sign, L = D − A).
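A tiny NumPy demonstration of the perfect case, with two clusters of sizes 3 and 2 (the construction is mine, just to exhibit the zero eigenvalues):

    import numpy as np

    # Perfect adjacency matrix: two clusters, of sizes 3 and 2
    A = np.block([
        [np.ones((3, 3)), np.zeros((3, 2))],
        [np.zeros((2, 3)), np.ones((2, 2))],
    ])
    D = np.diag(A.sum(axis=1))     # degrees: here, the cluster sizes
    L = D - A                      # the graph Laplacian

    eigvals, eigvecs = np.linalg.eigh(L)
    print(np.round(eigvals, 10))   # [0, 0, 2, 3, 3]: one zero per cluster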

Of course, we do not have a perfect adjacency matrix. But, we might hope that if we construct a matrix such that an entry is close to 1 when the corresponding elements are close and is close to 0 when the corresponding elements are far, that we might get something that works well enough. It turns out that this is the case!

We often compute this adjacency matrix by applying, element-wise to the pairwise-distance matrix, a kernel satisfying k(0)=1 and k(∞)=0; k(x)=exp(−γx²) is a popular choice, with γ>0 a tuning parameter. Additionally, the difference between our constructed adjacency matrix and the "true" adjacency matrix means that interpreting the eigenvectors as characteristic vectors can be messy. Since the eigenvectors should still roughly look like characteristic vectors, we can build a representation of each datapoint from its entries in these bottom eigenvectors, and simply use k-Means on that representation to assign each datapoint to a cluster.
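scikit-learn packages this whole recipe (kernel eigenvectors, then k-Means) as SpectralClustering. A sketch with our distance matrix D, where γ and the cluster count are placeholder values to tune:

    import numpy as np
    from sklearn.cluster import SpectralClustering

    gamma = 0.5                            # placeholder tuning parameter
    affinity = np.exp(-gamma * D**2)       # Gaussian kernel on the distances

    sc = SpectralClustering(
        n_clusters=30,
        affinity="precomputed",            # we supply the adjacency matrix
        random_state=0,
    )
    labels = sc.fit_predict(affinity)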

Because each cluster contributes a vector to the kernel of the Laplacian, we can examine the distribution of eigenvalues to determine the number of clusters. As usual, we are looking for a sharp cliff; even if we don't find one, we can still take the lowest eigenvectors anyway and see what we get. We can also adjust the shape of the eigenvalue curve by adjusting γ, which sets a soft cutoff for how far apart two objects can be and still count as being 'nearby'. The average row-sum of the adjacency matrix also gives a good indication of whether γ is appropriate; a value around half the expected average cluster size is a good rule of thumb.
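That diagnostic is a quick scan (the candidate values of γ here are arbitrary):

    import numpy as np

    # Mean row-sum of the adjacency matrix for a few candidate gammas;
    # aim for roughly half the expected average cluster size.
    for gamma in (0.1, 0.5, 1.0, 2.0):
        A = np.exp(-gamma * D**2)
        print(gamma, A.sum(axis=1).mean())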


It turns out, however, that both the original Pokemon matrix and the Topic Model matrix give terrible spectra, with no reasonable cliff for any value of γ and with clustering results that put all but a handful of Pokemon into a single cluster. To salvage this, we can change the kernel function; while the Gaussian kernel from before is popular, the k-nearest-neighbor kernel is also widely used. For this kernel, an entry is 1 if the corresponding datapoints are each among the k closest datapoints to the other, and zero otherwise. Taking k to be 40, we get reasonable cluster sizes. But how are the memberships?
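Before looking at those memberships, here is a minimal sketch of that mutual k-nearest-neighbor kernel; the "and" symmetrization follows the description above (some libraries use "or" or an average instead), and the output can be fed straight into SpectralClustering with affinity="precomputed":

    import numpy as np

    def mutual_knn_adjacency(D, k=40):
        """1 where two points are each among the other's k nearest neighbors."""
        n = D.shape[0]
        D = D.copy()
        np.fill_diagonal(D, np.inf)               # exclude self-neighbors
        nn = np.argsort(D, axis=1)[:, :k]         # indices of the k nearest
        near = np.zeros((n, n), dtype=bool)
        near[np.repeat(np.arange(n), k), nn.ravel()] = True
        return (near & near.T).astype(float)      # keep only mutual neighbors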

Original Pokemon Matrix clusters
#0 (29): Anorith, Walrein, Kabuto, Mamoswine, Kabutops ...
#8 (5): Lilligant, Steenee, Bounsweet, Cutiefly, Slurpuff
#12 (7): Mienfoo, Mienshao, Tyrogue, Riolu, Hitmontop ...
#19 (2): Ditto, Stunfisk
#20 (13): Qwilfish, Corsola, Skrelp, Tentacool, Tentacruel ...
#27 (6): Aggron, Dragonite, Krookodile, Nidoqueen, Charizard ...
#29 (16): Pawniard, Meowth-Galar, Bisharp, Impidimp, Nuzleaf ...
#33 (10): Meltan, Magnezone, Registeel, Magneton, Steelix ...
Topic Model Pokemon Matrix clusters
#4 (4): Thwackey, Impidimp, Fraxure, Amaura
#10 (5): Beldum, Karrablast, Rolycoly, Sinistea, Drampa
#12 (13): Sneasel, Azumarill, Lickitung, Weavile, Beartic ...
#20 (55): Shelgon, Bagon, Drizzile, Larvesta, Gloom ...
#25 (6): Mr. Mime-Galar, Grubbin, Mr. Rime, Hippopotas ...
#26 (13): Mesprit, Kirlia, Gardevoir, Togetic, Togekiss ...
#35 (1): Kubfu
#39 (8): Sealeo, Walrein, Spheal, Poliwhirl, Gible ...

The clusters for the original matrix seem pretty decent; #27 in particular is possibly impressive for clustering together some decent 'monster' Pokemon, and the other clusters generally represent Type-based groupings. The topic model clusters, however, are pretty bad here, with most of them showing no discernible pattern among their members.

This is an important lesson to learn about clustering and unsupervised methods in general. In this article, we, as data scientists, have made numerous active choices in how we consider our data; from the encoding of our data, to the distance metric used, to the clustering algorithm, and the parameters used for the algorithm. The results we get, hence, are not purely data-driven. Data-informed, yes, but we must be cognizant of our fingers on the scale.

It is common for people to present algorithmic results as being “unbiased” and to place them on a pedestal above more curated descriptions. One area of particularly high stakes where this comes up is in the United States redistricting process. In an environment where undemocratic gerrymandering is left unchecked, blind algorithmic approaches are often promoted as a viable solution. As data scientists, however, we must recognize that the construction of the algorithm necessarily involves making decisions that may result in undesirable results, just like our spectral clustering of the topic model matrix. Moreover, because these results arise through an algorithm, it can be difficult to explain what went wrong.

There are many more clustering algorithms out there, each with its own benefits and drawbacks. Density-based methods, like DBSCAN, explicitly label some datapoints as noise if too few neighbors lie nearby, which can help avoid the universal-membership problem of k-Means. Alternative metrics and kernels can also be used, especially the minimax path distance, which helps identify non-convex and/or non-compact clusters in the data. However, as always, it is better to first try the simpler and cheaper methods on a dataset, and adjust to fancier methods as the problem becomes better understood.
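As a final sketch, DBSCAN also consumes a precomputed distance matrix like our D; eps and min_samples below are placeholders that would need tuning for this dataset:

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=3.0, min_samples=5, metric="precomputed")
    labels = db.fit_predict(D)                     # noise points get label -1
    print((labels == -1).sum(), "Pokemon labeled as noise")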

We started this article by talking about clustering as solving an unknown classification problem. In the next article, we’ll take it to the next step: how do we do classification? We’ll finally use some “real-world” Pokemon data and try to make some predictions — I hope to see you there!
