
I have an algorithm that is spitting out a list of scores:

list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]

In the above list, there is a ‘cluster’ of scores around 100, another ‘cluster’ around 42, and another ‘cluster’ around 22.

Score lists will always contain 1 or more clusters.

I need to summarize the scores by creating a list containing 1 member from each ‘cluster’ of scores. The exact median score of the cluster is not important; I just need a score (an actual element in the list) from each ‘cluster’.

Does anyone have any thoughts on how to accomplish this in Python?

Edit: I won’t know the number of clusters beforehand, or the spread of each cluster:

  • one list may contain 3 clusters, while another may contain
    only 1 because all of its data is close to a single value.

  • one list may contain 2 clusters where every member of the first is within
    +/- 10 of its center point, while every member of the second is within +/- 25 of its center point.


Answers


  1. Chosen as BEST ANSWER

    I think I finally got this to work! I figured I'd post my solution in case anyone else searches for this later...

    Since I didn't know how many clusters there would be, or the spread of values within each cluster, I made an initial assumption: a score belongs to a group if it is within +/- `spread` of that group's median, where `spread` is 5% of the total distance from list max to list min. For the list below, that is int((9944 - 3696) * 0.05) = 312.

    If I can't find a group for a specific score, I double the spread and try again. If I still can't find one, the score starts a new group.

    Finally, in case a score appears to belong to more than one group under my assumed spread, I place it in the group whose median is closest to the score.

    the starting list:

    scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]
    

    the results:

    RESULTS: (completed in 0.000287 seconds)
    -----> group [0]: [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763]
    -----> group [1]: [8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130]
    -----> group [2]: [4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]
    

    the code:

    import time
    import statistics
    
    
    def analyze_list(list_to_analyze):
        # setup
        groupings = []
        list_max = max(list_to_analyze)
        list_min = min(list_to_analyze)
        spread = int((list_max - list_min) * 0.05)
        print('Starting with max = [{}], min = [{}], spread = [{}]'.format(list_max, list_min, spread))
    
        count = 0
        for tmp_score in list_to_analyze:
            print('-> checking [{}]...'.format(tmp_score))
    
            # special case for the first one, it gets placed in the first group
            if count == 0:
                tmp_list = [tmp_score]
                groupings.append(tmp_list)
                print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, 0))
            else:
                # check tmp_score against the median score of each group
                # don't place them in the group, just add them to the 'tmp_placement_list'
                # dictionary with the distance to the median
                # -------------------------------------------------------------
                count_group = 0
                score_placed = False
                tmp_placement_list = {}
                for tmp_group_list in groupings:
                    tmp_median = int(statistics.median(tmp_group_list))
    
                    # calculate how close the score to check is to the median score in the current group
                    distance_from_median_value = abs(tmp_median - tmp_score)
                    print('---> group [{}]  range = ({}..{}) - distance to median = [{}] '.format(count_group,
                                                                                                  tmp_median - spread,
                                                                                                  tmp_median + spread,
                                                                                                  distance_from_median_value))
    
                    if distance_from_median_value <= spread:
                        tmp_placement_list[count_group] = distance_from_median_value
    
                    # increment counter
                    count_group = count_group + 1
    
                # analyze 'tmp_placement_list' dictionary - place score in
                # group with the smallest distance to the median
                # -------------------------------------------------------------
                # start from infinity so a candidate always wins, even when spread == 0
                min_distance = float('inf')
                winning_group = -1
                for tmp_group_number in tmp_placement_list.keys():
                    print('-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                    if tmp_placement_list[tmp_group_number] < min_distance:
                        min_distance = tmp_placement_list[tmp_group_number]
                        winning_group = tmp_group_number
    
                if winning_group > -1:
                    groupings[winning_group].append(tmp_score)
                    score_placed = True
                    print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))
    
                if not score_placed:
                    # if we got here 'tmp_score' isn't within the current 'spread' of any of the groups.
                    # double the spread and try again
                    # -------------------------------------------------------------
                    second_chance_spread = spread * 2
                    tmp_placement_list = {}
                    count_group = 0
                    for tmp_group_list in groupings:
                        tmp_median = int(statistics.median(tmp_group_list))
    
                        # calculate how close the score to check is to the median score in the current group
                        distance_from_median_value = abs(tmp_median - tmp_score)
                        # report the doubled range, since that is what the check below uses
                        print('---> 2nd chance - group [{}]  range = ({}..{}) - dist median = [{}] '.format(
                            count_group,
                            tmp_median - second_chance_spread,
                            tmp_median + second_chance_spread,
                            distance_from_median_value))
    
                        if distance_from_median_value <= second_chance_spread:
                            tmp_placement_list[count_group] = distance_from_median_value
    
                        # increment counter
                        count_group = count_group + 1
    
                    # analyze results - place score in group with the smallest distance to the median
                    # -------------------------------------------------------------
                    # start from infinity so a score exactly 2x spread away can still be placed
                    min_distance = float('inf')
                    winning_group = -1
                    for tmp_group_number in tmp_placement_list.keys():
                        print(
                            '-----> result - group [{}] distance [{}]'.format(tmp_group_number, tmp_placement_list[tmp_group_number]))
                        if tmp_placement_list[tmp_group_number] < min_distance:
                            min_distance = tmp_placement_list[tmp_group_number]
                            winning_group = tmp_group_number
    
                    if winning_group > -1:
                        groupings[winning_group].append(tmp_score)
                        score_placed = True
                        print('-------> WAHOO! adding [{}] to group [{}]'.format(tmp_score, winning_group))
    
                if not score_placed:
                    # if we got this far, then 'tmp_score' isn't within the 2x the original 'spread' of any of the
                    # medians of any of the groups
                    # assumption: this score is the start of a new group
                    # -------------------------------------------------------------
                    tmp_list = [tmp_score]
                    groupings.append(tmp_list)
    
            # increment counter
            count = count + 1
    
        return groupings
    
    
    if __name__ == "__main__":
    
        # the initial list of scores
        scores = [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763, 8967, 8988, 9000, 8836, 8968, 8996, 9173,
                  9027, 8896, 9130, 4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325]
    
        # time the execution of the analysis
        time_start = time.time()
        score_groupings = analyze_list(scores)
        time_end = time.time()
    
        # print the results
        print('\n\nRESULTS: (completed in {} seconds)'.format(format(time_end - time_start, ".6f")))
        count = 0
        for tmp_group in score_groupings:
            print('-----> group [{}]: {}'.format(count, tmp_group))
            count = count + 1
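
The original question asked for one actual member per cluster, while `analyze_list` returns the whole groups. A minimal sketch of that last step, assuming the groups look like the results above: `statistics.median_low` always returns an element that is present in the data, so no rounding or nearest-value search is needed. `pick_representatives` is a hypothetical helper name, not part of the code above.

```python
import statistics


def pick_representatives(groups):
    """Return one actual member from each group (its low median)."""
    return [statistics.median_low(group) for group in groups]


# the three groups printed in the results above
groups = [
    [9944, 9555, 9772, 9707, 9623, 9724, 9717, 9751, 9717, 9763],
    [8967, 8988, 9000, 8836, 8968, 8996, 9173, 9027, 8896, 9130],
    [4301, 3806, 3696, 4372, 4203, 4702, 4290, 4552, 4113, 4325],
]
representatives = pick_representatives(groups)
print(representatives)  # [9717, 8988, 4290]
```

Since the question says any member will do, `group[0]` would work just as well; the low median is simply less sensitive to which score happened to arrive first.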
    

  2. Based on the example values shown, this may be a good candidate for
    ruptures change point detection.
    It defines a loss function and then finds a global minimum.

    You can specify that the algorithm shall find K break points,
    similar to K-means. Or you can let it decide what number
    of break points best explains changes in the generating process.
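
To make the "find K break points" idea concrete, here is a rough standard-library sketch. It is not the ruptures API; it is a small dynamic program that assumes a squared-error loss and treats the list order as the sequence, which fits the example because its scores already arrive grouped. `find_breakpoints` is a hypothetical name chosen for illustration.

```python
def find_breakpoints(values, n_breaks):
    """Split `values` (in their given order) into n_breaks + 1 contiguous
    segments, minimizing the total squared deviation from each segment mean.
    Returns the sorted start index of every segment after the first."""
    n = len(values)
    n_segments = n_breaks + 1
    if n_breaks < 0 or n_segments > n:
        raise ValueError("need 0 <= n_breaks < len(values)")

    # prefix sums give the cost of any segment in constant time
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def segment_cost(i, j):
        # sum of squared deviations from the mean of values[i:j]
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: lowest cost of splitting values[:j] into k segments
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    # back[k][j]: start index of the last segment in that best split
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i

    # walk the back-pointers from the end to recover the break points
    breaks = []
    j = n
    for k in range(n_segments, 1, -1):
        j = back[k][j]
        breaks.append(j)
    return sorted(breaks)


list_of_scores = [100, 101, 102, 101, 43, 42, 42, 41, 20, 21, 22, 23]
breaks = find_breakpoints(list_of_scores, n_breaks=2)
bounds = [0] + breaks + [len(list_of_scores)]
segments = [list_of_scores[a:b] for a, b in zip(bounds, bounds[1:])]
print(breaks)    # [4, 8]
print(segments)  # [[100, 101, 102, 101], [43, 42, 42, 41], [20, 21, 22, 23]]
```

This sketch needs K up front and runs in roughly O(K·n²) time, so it only covers the first of the two modes described above. If the scores do not arrive grouped, sorting them first would be one way to adapt it.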
