I have an algorithm that is spitting out a list of scores:

list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]

In the above list, there is a ‘cluster’ of scores around 100, another ‘cluster’ around 42, and another ‘cluster’ around 22.

Score lists will always contain 1 or more clusters.

I need to summarize the scores by creating a list containing 1 member from each ‘cluster’ of scores. The exact median score of the cluster is not important; I just need a score (an actual element of the list) from each ‘cluster’.

Does anyone have any thoughts on how to accomplish this in Python?

Edit: I won’t know the number of clusters beforehand, or the spread of each cluster:

  • one list may contain 3 clusters of data, while another list may contain
    only 1 cluster because all of the data is close to one value.

  • one list may contain 2 clusters: in one, each member is within
    +/- 10 of the center point, whereas in the other, each member may be within +/- 25 of the center point.



  1. Chosen as BEST ANSWER

    I think I finally got this to work! I figured I'd post my solution in case anyone else searches for this later...

    Since I didn't know how many clusters there would be or the spread of values within each cluster, I made an initial assumption: a score belongs to a group if it is within +/- 'spread' of that group's median, where 'spread' is 5% of the total distance from list max to list min.

    If I can't find a group for a specific score, I double the spread and try again. If I still can't find one, the score starts a new group.

    Finally, in case a score appears to belong to more than one group given my assumed spread, I place the score in the group whose median is closest to it.

    The starting list:

    scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]

    The results:

    RESULTS: (completed in 0.000287 seconds)
    -----> group [0]: [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763]
    -----> group [1]: [8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130]
    -----> group [2]: [4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]

    The code:

    import time
    import statistics
    def analyze_list(list_to_analyze):
        # setup
        groupings = []
        list_max = max(list_to_analyze)
        list_min = min(list_to_analyze)
        spread = int((list_max - list_min) * 0.05)
        print('Starting with max = [{}], min = [{}], spread = [{}]'.format(list_max, list_min, spread))
        count = 0
        for tmp_score in list_to_analyze:
            print('-> checking [{}]...'.format(tmp_score))
            # special case for the first one, it gets placed in the first group
            if count == 0:
                tmp_list = [tmp_score]
                groupings.append(tmp_list)
                print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, 0))
            else:
                # check tmp_score against the median score of each group
                # don't place them in the group, just add them to the 'tmp_placement_list'
                # dictionary with the distance to the median
                # -------------------------------------------------------------
                count_group = 0
                score_placed = False
                tmp_placement_list = {}
                for tmp_group_list in groupings:
                    tmp_median = int(statistics.median(tmp_group_list))
                    # calculate how close the score to check is to the median score in the current group
                    distance_from_median_value = abs(tmp_median - tmp_score)
                    print('---> group [{}]  range = ({}..{}) - distance to median = [{}] '.format(
                        count_group, tmp_median - spread, tmp_median + spread, distance_from_median_value))
                    if distance_from_median_value <= spread:
                        tmp_placement_list[count_group] = distance_from_median_value
                    # increment counter
                    count_group = count_group + 1
                # analyze 'tmp_placement_list' dictionary - place score in
                # group with the smallest distance to the median
                # -------------------------------------------------------------
                min_distance = spread * 2
                winning_group = -1
                for tmp_group_number in tmp_placement_list.keys():
                    print('-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                    if tmp_placement_list[tmp_group_number] < min_distance:
                        min_distance = tmp_placement_list[tmp_group_number]
                        winning_group = tmp_group_number
                if winning_group > -1:
                    # actually place the score in the closest group
                    groupings[winning_group].append(tmp_score)
                    score_placed = True
                    print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))
                if not score_placed:
                    # if we got here 'tmp_score' isn't within the current 'spread' of any of the groups.
                    # double the spread and try again
                    # -------------------------------------------------------------
                    second_chance_spread = spread * 2
                    tmp_placement_list = {}
                    count_group = 0
                    for tmp_group_list in groupings:
                        tmp_median = int(statistics.median(tmp_group_list))
                        # calculate how close the score to check is to the median score in the current group
                        distance_from_median_value = abs(tmp_median - tmp_score)
                        print('---> 2nd chance - group [{}]  range = ({}..{}) - dist median = [{}] '.format(
                            count_group, tmp_median - second_chance_spread, tmp_median + second_chance_spread,
                            distance_from_median_value))
                        if distance_from_median_value <= second_chance_spread:
                            tmp_placement_list[count_group] = distance_from_median_value
                        # increment counter
                        count_group = count_group + 1
                    # analyze results - place score in group with the smallest distance to the median
                    # -------------------------------------------------------------
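                    # start above the largest distance the 2nd chance check accepts,
                    # so a score exactly 'second_chance_spread' away can still win
                    min_distance = second_chance_spread + 1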
                    winning_group = -1
                    for tmp_group_number in tmp_placement_list.keys():
                        print('-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                        if tmp_placement_list[tmp_group_number] < min_distance:
                            min_distance = tmp_placement_list[tmp_group_number]
                            winning_group = tmp_group_number
                    if winning_group > -1:
                        # actually place the score in the closest group
                        groupings[winning_group].append(tmp_score)
                        score_placed = True
                        print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))
                if not score_placed:
                    # if we got this far, then 'tmp_score' isn't within the 2x the original 'spread' of any of the
                    # medians of any of the groups
                    # assumption: this score is the start of a new group
                    # -------------------------------------------------------------
                    tmp_list = [tmp_score]
                    groupings.append(tmp_list)
            # increment counter
            count = count + 1
        return groupings
    if __name__ == "__main__":
        # the initial list of scores
        scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173,
                  9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]
        # time the execution of the analysis
        time_start = time.time()
        score_groupings = analyze_list(scores)
        time_end = time.time()
        # print the results
        print('\n\nRESULTS: (completed in {} seconds)'.format(format(time_end - time_start, ".6f")))
        count = 0
        for tmp_group in score_groupings:
            print('-----> group [{}]: {}'.format(count, tmp_group))
            count = count + 1
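
The question asks for a list with one actual score per cluster, so once the scores are grouped, a last step is still needed. A minimal sketch of that step follows. The helper name `pick_representatives` is hypothetical (it is not part of the code above), and choosing the member closest to the group median is just one reasonable assumption, since the question says any member will do. The groups here are the question's example list, split by hand.

```python
import statistics


def pick_representatives(groups):
    """Return one actual member from each group: the one closest to the group's median."""
    representatives = []
    for group in groups:
        target = statistics.median(group)
        # min() returns the first member with the smallest distance to the median
        representatives.append(min(group, key=lambda score: abs(score - target)))
    return representatives


# the question's example list, already grouped by hand for illustration
example_groups = [[100, 101, 102, 101], [43, 42, 42, 41], [20, 21, 22, 23]]
print(pick_representatives(example_groups))  # prints [101, 42, 21]
```

The same call should work on the list of lists that a grouping function returns, as long as no group is empty.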

  2. Based on the example values shown,
    this may be a good candidate for
    change point detection.
    It defines a loss function and then finds a global minimum.

    You can specify that the algorithm shall find K break points,
    similar to K-means. Or you can let it decide what number
    of break points best explains changes in the generating process.
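
To make the idea concrete, here is a rough sketch of the fixed-K variant, using only the standard library. It makes several assumptions that are not stated in the answer: the loss is the sum of squared deviations from each segment's mean, the global minimum is found by dynamic programming, and the scores arrive already ordered by cluster, as in the question's example (otherwise sort them first). The function name `find_break_points` is made up for illustration. Letting the algorithm choose the number of break points is not covered here.

```python
def find_break_points(values, n_breaks):
    """Return the n_breaks indices that split `values` into contiguous segments
    with the smallest total squared deviation from each segment's mean."""
    n = len(values)
    if n_breaks < 0 or n_breaks >= n:
        raise ValueError("n_breaks must be between 0 and len(values) - 1")

    # prefix sums let us compute the loss of any segment in constant time
    prefix = [0.0]
    prefix_sq = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def loss(i, j):
        # sum of squared deviations from the mean of values[i:j]
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    n_segments = n_breaks + 1
    inf = float("inf")
    # best[k][j]: smallest loss when values[:j] is split into k segments
    # back[k][j]: start index of the last segment in that best split
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + loss(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # walk backwards from the end to recover the break indices
    breaks = []
    j = n
    for k in range(n_segments, 1, -1):
        j = back[k][j]
        breaks.append(j)
    return breaks[::-1]


list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]
breaks = find_break_points(list_of_scores, 2)
print(breaks)  # prints [4, 8]

# one actual member per segment: here, simply the first score of each segment
starts = [0] + breaks
print([list_of_scores[i] for i in starts])  # prints [100, 43, 20]
```

This sketch runs in roughly O(K * n^2) time, which is fine for short score lists. For longer ones, a dedicated change point library would be the more sensible choice.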

