I have a algorithm that is spitting out a list of scores:
list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]
in the above list, there is a ‘cluster’ of scores around 100, another ‘cluster’ around 42, and another ‘cluster’ around 22.
Score lists will always contain 1 or more clusters.
I need to summarize the scores by creating a list containing 1 member from each ‘cluster’ of scores. the exact median score of the cluster is not important – i just need to have a score (an actual element in the list) from each ‘cluster’.
Anyone have any thoughts on how to accomplish this in python?
edit: i won’t know the number of potential clusters beforehand or the spread of the cluster itself:
-
one list may contain 3 clusters of data, another list may contain
only 1 cluster in that all of the data may be close to one value. -
one list may contain 2 clusters, each member of the cluster is within
+/- 10 around the center point, whereas the other cluster, each member of the cluster may be within +/- 25 of the center point
2
Answers
I think i finally got this to work! i figured i'd post my solution in case anyone else may search on this later...
since i didn't know how many clusters or the spread of values within each cluster, i made an initial assumption that values should be within +/- 5% of the total distance from list max to list min.
if i can't find a group for a specific score, i double the spread and try again. if i still can't find it, it gets a new group
finally, in case that a score may appear to belong to more than one group given my assumption of the size of the spread, i place the score in the group with the shortest distance from score-in-question to the median of said group
the starting list:
the results:
the code:
Based on the example values shown,
this may be a good candidate for
ruptures
change point detection.
It defines a loss function and then finds a global minimum.
You can specify that the algorithm shall find K break points,
similar to K-means. Or you can let it decide what number
of break points best explains changes in the generating process.