How can I find population clusters in a list using Python and MySQL?

cotfessi
May 31, 2023
2 Answers

I have an algorithm that is spitting out a list of scores:

list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]

In the above list, there is a ‘cluster’ of scores around 100, another ‘cluster’ around 42, and another ‘cluster’ around 22.

Score lists will always contain 1 or more clusters.

I need to summarize the scores by creating a list containing one member from each ‘cluster’ of scores. The exact median score of the cluster is not important; I just need a score (an actual element of the list) from each ‘cluster’.

Does anyone have any thoughts on how to accomplish this in Python?

Edit: I won’t know the number of potential clusters beforehand, or the spread of each cluster:

- One list may contain 3 clusters of data; another list may contain only 1 cluster, in that all of the data is close to one value.
- One list may contain 2 clusters: in one, each member is within +/- 10 of the center point, whereas in the other, each member may be within +/- 25 of the center point.

Answers

Chosen as BEST ANSWER

I think I finally got this to work! I figured I'd post my solution in case anyone else searches for this later.

Since I didn't know how many clusters there would be, or the spread of values within each cluster, I made an initial assumption: a score belongs to a group if it is within one 'spread' of that group's median, where the spread is 5% of the total distance from list max to list min.

If I can't find a group for a specific score, I double the spread and try again. If I still can't find one, the score starts a new group.

Finally, in case a score appears to belong to more than one group given my assumed spread, I place the score in the group whose median is closest to it.

the starting list:

scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]

the results:

RESULTS: (completed in 0.000287 seconds)
-----> group [0]: [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763]
-----> group [1]: [8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130]
-----> group [2]: [4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]

the code:

import time
import statistics


def analyze_list(list_to_analyze):
    # setup
    groupings = []
    list_max = max(list_to_analyze)
    list_min = min(list_to_analyze)
    spread = int((list_max - list_min) * 0.05)
    print('Starting with max = [{}], min = [{}], spread = [{}]'.format(list_max, list_min, spread))

    count = 0
    for tmp_score in list_to_analyze:
        print('-> checking [{}]...'.format(tmp_score))

        # special case for the first one, it gets placed in the first group
        if count == 0:
            tmp_list = [tmp_score]
            groupings.append(tmp_list)
            print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, 0))
        else:
            # check tmp_score against the median score of each group
            # don't place them in the group, just add them to the 'tmp_placement_list'
            # dictionary with the distance to the median
            # -------------------------------------------------------------
            count_group = 0
            score_placed = False
            tmp_placement_list = {}
            for tmp_group_list in groupings:
                tmp_median = int(statistics.median(tmp_group_list))

                # calculate how close the score to check is to the median score in the current group
                distance_from_median_value = abs(tmp_median - tmp_score)
                print('---> group [{}]  range = ({}..{}) - distance to median = [{}] '.format(count_group,
                                                                                              tmp_median - spread,
                                                                                              tmp_median + spread,
                                                                                              distance_from_median_value))

                if distance_from_median_value <= spread:
                    tmp_placement_list[count_group] = distance_from_median_value

                # increment counter
                count_group = count_group + 1

            # analyze 'tmp_placement_list' dictionary - place score in
            # group with the smallest distance to the median
            # -------------------------------------------------------------
            min_distance = float('inf')  # any in-range candidate beats this, even when spread == 0
            winning_group = -1
            for tmp_group_number in tmp_placement_list.keys():
                print('-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                if tmp_placement_list[tmp_group_number] < min_distance:
                    min_distance = tmp_placement_list[tmp_group_number]
                    winning_group = tmp_group_number

            if winning_group > -1:
                groupings[winning_group].append(tmp_score)
                score_placed = True
                print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))

            if not score_placed:
                # if we got here 'tmp_score' isn't within the current 'spread' of any of the groups.
                # double the spread and try again
                # -------------------------------------------------------------
                second_chance_spread = spread * 2
                tmp_placement_list = {}
                count_group = 0
                for tmp_group_list in groupings:
                    tmp_median = int(statistics.median(tmp_group_list))

                    # calculate how close the score to check is to the median score in the current group
                    distance_from_median_value = abs(tmp_median - tmp_score)
                    print('---> 2nd chance - group [{}]  range = ({}..{}) - dist median = [{}] '.format(count_group,
                                                                                                        tmp_median - second_chance_spread,
                                                                                                        tmp_median + second_chance_spread,
                                                                                                        distance_from_median_value))

                    if distance_from_median_value <= second_chance_spread:
                        tmp_placement_list[count_group] = distance_from_median_value

                    # increment counter
                    count_group = count_group + 1

                # analyze results - place score in group with the smallest distance to the median
                # -------------------------------------------------------------
                min_distance = float('inf')  # a distance of exactly 2x spread must still be able to win
                winning_group = -1
                for tmp_group_number in tmp_placement_list.keys():
                    print(
                        '-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                    if tmp_placement_list[tmp_group_number] < min_distance:
                        min_distance = tmp_placement_list[tmp_group_number]
                        winning_group = tmp_group_number

                if winning_group > -1:
                    groupings[winning_group].append(tmp_score)
                    score_placed = True
                    print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))

            if not score_placed:
                # if we got this far, then 'tmp_score' isn't within the 2x the original 'spread' of any of the
                # medians of any of the groups
                # assumption: this score is the start of a new group
                # -------------------------------------------------------------
                tmp_list = [tmp_score]
                groupings.append(tmp_list)

        # increment counter
        count = count + 1

    return groupings


if __name__ == "__main__":

    # the initial list of scores
    scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173,
              9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]

    # time the execution of the analysis
    time_start = time.time()
    score_groupings = analyze_list(scores)
    time_end = time.time()

    # print the results
    print('\n\nRESULTS: (completed in {} seconds)'.format(format(time_end - time_start, ".6f")))
    count = 0
    for tmp_group in score_groupings:
        print('-----> group [{}]: {}'.format(count, tmp_group))
        count = count + 1
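
The question asked for one actual score per cluster, while the code above returns the full groups. A minimal sketch of that last step might look like the following. `pick_representatives` is a hypothetical helper name (not part of the code above), and choosing the member closest to the group's median is an assumption; any member would satisfy the question.

```python
import statistics


def pick_representatives(groupings):
    """Return one actual score per group: the member closest to that group's median."""
    representatives = []
    for group in groupings:
        median = statistics.median(group)
        # min() returns an element of the group, so the result is always a real score
        representatives.append(min(group, key=lambda score: abs(score - median)))
    return representatives


# groups as analyze_list() might return them for the list in the question (assumed here)
example_groups = [[100, 101, 102, 101], [43, 42, 42, 41], [20, 21, 22, 23]]
representatives = pick_representatives(example_groups)
print(representatives)
```

With the code above, this would be called as `pick_representatives(analyze_list(scores))`.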


- J_H
- May 30, 2023 at 10:50 pm
Based on the example values shown, this may be a good candidate for ruptures change point detection. It defines a loss function and then finds a global minimum.

You can specify that the algorithm shall find K break points, similar to K-means. Or you can let it decide what number of break points best explains changes in the generating process.
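
To make the "K break points that minimize a loss" idea concrete, here is a stdlib-only sketch. It does not use the ruptures API. `segment_cost` and `best_breakpoints` are hypothetical names, the loss is assumed to be the sum of squared deviations from each segment's mean, and the scores are assumed to arrive already grouped in sequence, as in the question's example (sort them first otherwise).

```python
def segment_cost(values, start, end):
    """Sum of squared deviations from the mean over values[start:end]."""
    chunk = values[start:end]
    mean = sum(chunk) / len(chunk)
    return sum((v - mean) ** 2 for v in chunk)


def best_breakpoints(values, n_breaks):
    """Return the n_breaks split indices with the lowest total cost (dynamic programming)."""
    n = len(values)
    # best[k][end] = (cost, breakpoints) for splitting values[:end] into k + 1 segments
    best = [{end: (segment_cost(values, 0, end), []) for end in range(1, n + 1)}]
    for k in range(1, n_breaks + 1):
        layer = {}
        for end in range(k + 1, n + 1):
            candidates = []
            for split in range(k, end):
                prev_cost, prev_breaks = best[k - 1][split]
                cost = prev_cost + segment_cost(values, split, end)
                candidates.append((cost, prev_breaks + [split]))
            layer[end] = min(candidates)
        best.append(layer)
    return best[n_breaks][n][1]


scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]
breaks = best_breakpoints(scores, n_breaks=2)
# take the first score of each segment as that cluster's representative
representatives = [scores[i] for i in [0] + breaks]
print(breaks, representatives)
```

This sketch needs K up front, which the question says is not known in advance, so it only illustrates the fixed-K variant mentioned above. It also tries every split position, which is fine for short lists like these but grows quickly with list length.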
