This notebook investigates a genetic algorithm approach to a clustering problem posted on OR Stack Exchange.
We start by generating a random problem instance.
# Set the problem dimensions.
n_points <- 500 # number of points
n_clusters <- 8 # number of clusters
max_size <- 70 # maximum size of any cluster
# Generate the required number of points randomly in the unit square.
set.seed(123)
points <- matrix(data = runif(2 * n_points), ncol = 2)
# The weight of the edge between any two points will be the $e^-d$ where $d$ is the distance between the points.
weight <- matrix(0, nrow = n_points, ncol = n_points)
for (i in 1:499) {
for (j in (i + 1):500) {
w <- exp(-norm(points[i,] - points[j,], type = "2"))
weight[i, j] <- w
weight[j, i] <- w
}
}
We will use the GA library to build and solve the genetic algorithm.
library(GA)
Loading required package: foreach
Loading required package: iterators
____ _
/ ___| / \ Genetic
| | _ / _ \ Algorithms
| |_| |/ ___ \
\____/_/ \_\ version 3.2.3
Type 'citation("GA")' for citing this R package in publications.
Attaching package: ‘GA’
The following object is masked from ‘package:utils’:
de
Let \(N\), \(G\) and \(M\) be respectively the number of points, number of groups and maximum group size. The minimum size for any group is \(m=N-(G-1)M.\)
The chromosome will consist \(G+N\) values in the interval (0, 1). The first \(G\) values will be used to determine the sizes \(n_1,\dots,n_G\) for the clusters, as described below. The remaining \(N\) values will be used to determine a permutation of the indices \(1,\dots,N\) of the points. The first \(n_1\) indices in the permutation will be used to select the first group, and so on.
Computation of the group sizes is as follows. Let \(a=m/N\) and \(b=M/N\) be the minimum and maximum fractions of the population to be assigned to any cluster, noting that \(m=N-(G-1)M\implies a=1 - (G-1)b.\) Now let \(x_1,\dots,x_G\in (0,1)\) be the first \(G\) entries in the chromosome, and set \(y_i = 1 - x_i/\sum_j x_j \in (0,1).\) Observe that \(\sum_i y_i = G-1.\) Finally, set \(\pi_i = a + (b-a) y_i \in (a, b).\) We have \(\sum_i \pi_i = G\cdot a + (b-a) \sum_i y_i = G\cdot a + (b-a)(G-1) = (G-1)b + a = 1.\) Our target cluster sizes will be \(n_i \approx N \cdot \pi_i.\)
We have to account for rounding error in the calculation of the integer cluster sizes. If the sum of the rounded sizes is less than \(N,\) we add 1 to the smallest cluster size and recheck. If the sum is greater than \(N,\) we subtract 1 from the largest cluster size and recheck. We continue this until \(\sum_i n_i = N.\)
The values of \(a\) and \(b\) are calculated below.
min_size <- n_points - (n_clusters - 1) * max_size # m
min_frac <- min_size / n_points # a
max_frac <- max_size / n_points # b
The following function takes a vector of \(G\) values between 0 and 1 and returns a vector of cluster sizes.
# Input: vector of n_clusters values between 0 and 1.
# Output: vector of n_clusters cluster sizes summing to n_points.
clusterSizes <- function(x) {
y <- 1 - x / sum(x)
p <- min_frac + (max_frac - min_frac) * y
n <- round(p * n_points) # initial group sizes
s <- sum(n)
# Correct for rounding overage.
while (s > n_points) {
i <- which.max(n)
n[i] <- n[i] - 1
s <- s - 1
}
# Correct for rounding underage.
while (s < n_points) {
i <- which.min(n)
n[i] <- n[i] + 1
s <- s + 1
}
# Return the cluster sizes.
n
}
We now define a function that decodes a chromosome into a list of clusters.
# Input: a chromosome (vector of n_points + n_clusters values in the interval (0,1))
# Output: a list of clusters
decode <- function(chromosome) {
# Convert the first n_clusters values into a vector of cluster sizes.
size <- clusterSizes(chromosome[1 : n_clusters])
# Isolate the remaining chromosome entries and get the permutation index.
x <- chromosome[-(1 : n_clusters)]
indices <- sort(x, index.return = TRUE)$ix
# Split the indices into clusters.
clusters <- list(sort(indices[1 : size[1]]))
indices <- indices[-(1 : size[1])]
for (i in 2 : n_clusters) {
clusters[[i]] <- sort(indices[1 : size[i]])
indices <- indices[-(1 : size[i])]
}
# Return the clusters.
clusters
}
The next function takes a list of indices of points and computes the sum of the edge weights for all edges between the points.
# Input: a vector of indices of points
# Output: the sum of all edge weights for edges connecting the points
edgeWeights <- function(points) {
points |>
# Find all combinations of pairs of indices (as a 2 x n matrix).
combn(2) |>
# Convert each column to the weight for the pair of indices.
apply(2, function(x) weight[x[1], x[2]]) |>
# Sum the results.
sum()
}
The fitness function takes chromosome and computes a score by creating the associated clusters and then summing within clusters the weights of all the edges between points in the cluster.
# Input: a chromosome
# Output: a fitness score
score <- function(chromosome) {
chromosome |>
# Convert to a list of clusters.
decode() |>
# Convert each cluster to a score.
sapply(edgeWeights) |>
# Sum the cluster scores.
sum()
}
We now have the necessary components to create and solve a genetic algorithm.
heuristic <- ga(type = "real-valued", fitness = score, lower = rep.int(0, n_points + n_clusters), upper = rep.int(1, n_points + n_clusters))
GA | iter = 1 | Mean = 9459.460 | Best = 9537.662
GA | iter = 2 | Mean = 9461.318 | Best = 9544.804
GA | iter = 3 | Mean = 9457.868 | Best = 9544.804
GA | iter = 4 | Mean = 9458.149 | Best = 9547.672
GA | iter = 5 | Mean = 9457.409 | Best = 9547.672
GA | iter = 6 | Mean = 9459.330 | Best = 9547.672
GA | iter = 7 | Mean = 9465.291 | Best = 9547.672
GA | iter = 8 | Mean = 9468.206 | Best = 9547.672
GA | iter = 9 | Mean = 9473.636 | Best = 9547.672
GA | iter = 10 | Mean = 9479.253 | Best = 9547.672
GA | iter = 11 | Mean = 9488.914 | Best = 9547.672
GA | iter = 12 | Mean = 9501.866 | Best = 9554.068
GA | iter = 13 | Mean = 9509.117 | Best = 9554.068
GA | iter = 14 | Mean = 9519.208 | Best = 9554.068
GA | iter = 15 | Mean = 9521.152 | Best = 9554.068
GA | iter = 16 | Mean = 9526.567 | Best = 9554.068
GA | iter = 17 | Mean = 9529.371 | Best = 9559.814
GA | iter = 18 | Mean = 9531.218 | Best = 9559.814
GA | iter = 19 | Mean = 9534.709 | Best = 9559.814
GA | iter = 20 | Mean = 9536.798 | Best = 9559.814
GA | iter = 21 | Mean = 9542.304 | Best = 9571.527
GA | iter = 22 | Mean = 9542.107 | Best = 9571.527
GA | iter = 23 | Mean = 9543.754 | Best = 9571.527
GA | iter = 24 | Mean = 9544.826 | Best = 9571.527
GA | iter = 25 | Mean = 9545.350 | Best = 9571.527
GA | iter = 26 | Mean = 9546.201 | Best = 9571.527
GA | iter = 27 | Mean = 9547.011 | Best = 9571.527
GA | iter = 28 | Mean = 9550.285 | Best = 9571.527
GA | iter = 29 | Mean = 9553.20 | Best = 9572.06
GA | iter = 30 | Mean = 9554.331 | Best = 9572.060
GA | iter = 31 | Mean = 9557.195 | Best = 9572.060
GA | iter = 32 | Mean = 9557.124 | Best = 9572.060
GA | iter = 33 | Mean = 9560.374 | Best = 9572.060
GA | iter = 34 | Mean = 9560.747 | Best = 9572.060
GA | iter = 35 | Mean = 9561.721 | Best = 9572.060
GA | iter = 36 | Mean = 9563.976 | Best = 9578.507
GA | iter = 37 | Mean = 9565.592 | Best = 9578.507
GA | iter = 38 | Mean = 9566.327 | Best = 9578.507
GA | iter = 39 | Mean = 9569.979 | Best = 9580.699
GA | iter = 40 | Mean = 9571.661 | Best = 9580.699
GA | iter = 41 | Mean = 9573.251 | Best = 9582.615
GA | iter = 42 | Mean = 9574.114 | Best = 9582.701
GA | iter = 43 | Mean = 9577.751 | Best = 9582.795
GA | iter = 44 | Mean = 9579.734 | Best = 9583.895
GA | iter = 45 | Mean = 9580.713 | Best = 9586.414
GA | iter = 46 | Mean = 9581.290 | Best = 9586.414
GA | iter = 47 | Mean = 9582.012 | Best = 9586.414
GA | iter = 48 | Mean = 9582.743 | Best = 9586.930
GA | iter = 49 | Mean = 9582.005 | Best = 9588.262
GA | iter = 50 | Mean = 9582.488 | Best = 9588.262
GA | iter = 51 | Mean = 9581.834 | Best = 9588.262
GA | iter = 52 | Mean = 9582.955 | Best = 9588.262
GA | iter = 53 | Mean = 9583.483 | Best = 9588.262
GA | iter = 54 | Mean = 9582.507 | Best = 9588.262
GA | iter = 55 | Mean = 9582.812 | Best = 9588.262
GA | iter = 56 | Mean = 9582.59 | Best = 9590.41
GA | iter = 57 | Mean = 9583.376 | Best = 9590.410
GA | iter = 58 | Mean = 9584.265 | Best = 9591.811
GA | iter = 59 | Mean = 9584.689 | Best = 9591.811
GA | iter = 60 | Mean = 9585.394 | Best = 9591.811
GA | iter = 61 | Mean = 9586.077 | Best = 9592.063
GA | iter = 62 | Mean = 9586.476 | Best = 9592.063
GA | iter = 63 | Mean = 9586.858 | Best = 9592.063
GA | iter = 64 | Mean = 9587.191 | Best = 9592.063
GA | iter = 65 | Mean = 9588.644 | Best = 9592.063
GA | iter = 66 | Mean = 9588.875 | Best = 9594.215
GA | iter = 67 | Mean = 9590.171 | Best = 9594.215
GA | iter = 68 | Mean = 9590.590 | Best = 9597.214
GA | iter = 69 | Mean = 9590.389 | Best = 9597.214
GA | iter = 70 | Mean = 9592.026 | Best = 9598.226
GA | iter = 71 | Mean = 9591.353 | Best = 9598.226
GA | iter = 72 | Mean = 9592.764 | Best = 9598.226
GA | iter = 73 | Mean = 9593.816 | Best = 9599.177
GA | iter = 74 | Mean = 9595.249 | Best = 9601.117
GA | iter = 75 | Mean = 9595.638 | Best = 9601.117
GA | iter = 76 | Mean = 9595.962 | Best = 9601.117
GA | iter = 77 | Mean = 9596.763 | Best = 9601.117
GA | iter = 78 | Mean = 9598.100 | Best = 9601.117
GA | iter = 79 | Mean = 9598.703 | Best = 9601.117
GA | iter = 80 | Mean = 9597.555 | Best = 9601.117
GA | iter = 81 | Mean = 9599.113 | Best = 9601.117
GA | iter = 82 | Mean = 9599.006 | Best = 9601.117
GA | iter = 83 | Mean = 9600.180 | Best = 9603.731
GA | iter = 84 | Mean = 9599.674 | Best = 9603.731
GA | iter = 85 | Mean = 9600.682 | Best = 9604.041
GA | iter = 86 | Mean = 9601.003 | Best = 9604.041
GA | iter = 87 | Mean = 9601.656 | Best = 9604.041
GA | iter = 88 | Mean = 9601.918 | Best = 9605.122
GA | iter = 89 | Mean = 9601.751 | Best = 9605.122
GA | iter = 90 | Mean = 9602.253 | Best = 9605.122
GA | iter = 91 | Mean = 9602.961 | Best = 9605.439
GA | iter = 92 | Mean = 9602.703 | Best = 9607.646
GA | iter = 93 | Mean = 9602.281 | Best = 9607.646
GA | iter = 94 | Mean = 9604.136 | Best = 9609.038
GA | iter = 95 | Mean = 9605.266 | Best = 9609.459
GA | iter = 96 | Mean = 9605.377 | Best = 9609.459
GA | iter = 97 | Mean = 9606.405 | Best = 9609.459
GA | iter = 98 | Mean = 9607.098 | Best = 9612.844
GA | iter = 99 | Mean = 9607.044 | Best = 9612.844
GA | iter = 100 | Mean = 9607.554 | Best = 9612.844
The final clustering is obtained by decoding the best solution found.
clusters <- heuristic@solution[1,] |> decode()
clusters
[[1]]
[1] 5 16 19 27 31 46 51 62 69 79 89 103 109 111 113 140 142 145 148 159 180 191
[23] 193 197 217 223 233 243 246 267 289 302 303 307 321 328 344 353 364 367 371 373 380 393
[45] 398 409 418 423 429 450 456 457 462 469 472 479 481 493 500
[[2]]
[1] 20 41 50 58 72 117 120 121 133 139 150 151 154 157 167 170 187 192 200 215 216 241
[23] 244 248 249 261 263 268 272 274 280 298 300 305 306 313 318 320 334 343 354 365 368 407
[45] 410 415 425 430 442 447 461 467 483 490
[[3]]
[1] 25 28 33 35 37 39 42 45 48 56 66 67 68 70 81 93 98 114 116 122 123 152
[23] 160 163 165 171 173 178 181 182 209 220 239 242 258 260 265 271 279 282 284 285 287 311
[45] 312 323 325 326 330 335 339 342 351 355 369 375 385 421 437 439 463 468 480 484 488 497
[67] 498
[[4]]
[1] 4 11 13 15 21 24 26 29 55 83 87 88 90 92 94 97 102 104 106 107 126 132
[23] 143 153 161 168 179 186 202 218 221 222 235 237 240 245 253 254 264 266 286 293 294 295
[45] 296 310 319 324 327 332 340 358 378 381 394 395 399 411 412 428 436 438 443 453 455 458
[67] 485
[[5]]
[1] 6 18 23 47 74 78 80 91 96 99 110 112 118 129 135 141 156 164 177 183 184 203
[23] 207 212 214 219 227 234 255 256 288 316 331 341 347 350 377 383 400 401 417 426 444 466
[45] 471 474 476 486 491
[[6]]
[1] 1 3 7 10 14 30 36 49 53 54 63 71 73 77 82 108 115 119 124 128 131 138
[23] 147 169 185 189 194 196 198 199 206 213 224 236 262 273 281 292 308 315 337 345 348 352
[45] 356 357 366 370 382 386 387 389 392 402 414 419 433 441 445 449 454 459 473 478 489 492
[67] 495
[[7]]
[1] 2 8 12 22 32 38 40 59 60 61 64 76 84 85 86 95 100 105 127 130 137 149
[23] 155 158 166 172 174 175 188 201 204 208 225 226 228 230 232 251 257 269 270 276 277 290
[45] 297 304 314 333 336 346 359 361 372 374 390 396 406 413 416 422 424 432 435 440 448 460
[67] 464 475 494 496
[[8]]
[1] 9 17 34 43 44 52 57 65 75 101 125 134 136 144 146 162 176 190 195 205 210 211
[23] 229 231 238 247 250 252 259 275 278 283 291 299 301 309 317 322 329 338 349 360 362 363
[45] 376 379 384 388 391 397 403 404 405 408 420 427 431 434 446 451 452 465 470 477 482 487
[67] 499