
Below is a scatter plot I constructed from two numpy arrays.

[Figure: Scatter Plot Example]

What I’d like to add to this plot is a running median of y over a range of x. I’ve photoshopped in an example:

[Figure: Modified Scatter Plot]

Specifically, I need the median of the data points in bins 1 unit wide along the x axis, between two x values (the range will vary from plot to plot, but I can adjust it manually). I’d appreciate any tips that point me in the right direction.

4 Answers


  1. I’ve written something like this in C#. I don’t do Python so here is the pseudocode:

    • create a List to hold the y values from which the median will be derived
    • sort the scatter plot points by x value
    • cycle through the sorted points in order of increasing x
    • for each point, insert its y value into the median list at the position that keeps the list sorted, i.e. so that the values above and below it are > and < it respectively. See “Inserting values into specific locations in a list in Python”.
    • after each y value is added, the median is the list value at the current middle index, i.e. List(List.Length/2)

    Hope it helps!
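
    The steps above can be sketched in Python roughly as follows. This is only an illustration of the pseudocode, not the answerer's own code: the name expanding_median is hypothetical, and bisect.insort stands in for the "insert so the list stays sorted" step. Note that it gives the median of all points seen so far as x increases, not a per-bin median.

```python
from bisect import insort

def expanding_median(x, y):
    """Median of y over all points seen so far, scanning by increasing x."""
    # Indices of the points, ordered by their x value.
    order = sorted(range(len(x)), key=lambda i: x[i])
    seen = []      # y values seen so far, always kept sorted
    medians = []
    for i in order:
        insort(seen, y[i])                     # sorted insert
        medians.append(seen[len(seen) // 2])   # middle index, as in the pseudocode
    return [x[i] for i in order], medians

xs, meds = expanding_median([3, 1, 2], [30, 10, 20])
```

    For an even number of points this takes the upper of the two middle values, as the pseudocode does, rather than averaging them.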

  2. You can create a function based on numpy.median() that will calculate the median value given the intervals:

    import numpy as np
    
    def medians(x, y, intervals):
        out = []
        for xmin, xmax in intervals:
            mask = (x >= xmin) & (x < xmax)
            out.append(np.median(y[mask]))
        return np.array(out)
    

    Then use this function for the desired intervals:

    import matplotlib.pyplot as plt
    
    # x and y are the numpy arrays used for the scatter plot
    intervals = ((18, 19), (19, 20), (20, 21), (21, 22))
    centers = [(xmin + xmax) / 2. for xmin, xmax in intervals]
    
    plt.plot(centers, medians(x, y, intervals))
    plt.show()
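
    Since the question asks for 1-unit bins between two values, the intervals could also be generated instead of typed out. The sketch below assumes integer limits; unit_intervals is a hypothetical helper, and medians is restated with a guard so that an empty bin gives NaN rather than a warning.

```python
import numpy as np

def unit_intervals(lo, hi):
    """1-unit-wide (xmin, xmax) pairs covering [lo, hi); lo and hi are integers."""
    edges = np.arange(lo, hi + 1)          # lo, lo+1, ..., hi
    return list(zip(edges[:-1], edges[1:]))

def medians(x, y, intervals):
    out = []
    for xmin, xmax in intervals:
        mask = (x >= xmin) & (x < xmax)    # half-open bin [xmin, xmax)
        # An empty bin has no median; record NaN explicitly.
        out.append(np.median(y[mask]) if mask.any() else np.nan)
    return np.array(out)

# Made-up sample data, for illustration only.
x = np.array([18.2, 18.7, 19.5, 21.1])
y = np.array([1.0, 3.0, 10.0, 7.0])
intervals = unit_intervals(18, 22)
result = medians(x, y, intervals)
```

    Matplotlib leaves a gap in the line wherever a value is NaN, so empty bins simply show up as breaks in the curve.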
    
  3. I would use np.digitize to do the bin sorting for you. This way you can easily apply any function and set the range you are interested in.

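    import numpy as np
    import matplotlib.pyplot as plt
    
    N = 2000
    total_bins = 10
    
    # Sample data
    X = np.random.random(size=N) * 10
    Y = X**2 + np.random.random(size=N) * X * 10
    
    # Bin edges spanning the data; delta is the width of one bin
    bins = np.linspace(X.min(), X.max(), total_bins)
    delta = bins[1] - bins[0]
    # idx == k means bins[k-1] <= X < bins[k]; k = 0 would hold points below
    # bins[0], so it is empty here and its median is NaN (not drawn)
    idx = np.digitize(X, bins)
    running_median = [np.median(Y[idx == k]) for k in range(total_bins)]
    
    plt.scatter(X, Y, color='k', alpha=.2, s=2)
    # bins - delta/2 places each median at the centre of its bin
    plt.plot(bins - delta / 2, running_median, 'r--', lw=4, alpha=.8)
    plt.axis('tight')
    plt.show()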
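
    For the 1-unit bins asked about in the question, the same np.digitize idea could be wrapped in a small helper with explicit limits. This is a sketch under assumptions: binned_median, lo, hi and width are hypothetical names, points outside [lo, hi) are ignored, and empty bins are reported as NaN.

```python
import numpy as np

def binned_median(x, y, lo, hi, width=1.0):
    """Median of y in bins of the given width between lo and hi."""
    n = int(round((hi - lo) / width))
    edges = lo + width * np.arange(n + 1)      # n + 1 edges define n bins
    centers = 0.5 * (edges[:-1] + edges[1:])
    # idx == k means edges[k-1] <= x < edges[k]; 0 and n + 1 are out of range.
    idx = np.digitize(x, edges)
    med = np.full(n, np.nan)
    for k in range(1, n + 1):
        in_bin = y[idx == k]
        if in_bin.size:                        # skip empty bins, leaving NaN
            med[k - 1] = np.median(in_bin)
    return centers, med

# Made-up sample data, for illustration only.
x = np.array([0.2, 0.8, 1.5, 3.1, 9.0])
y = np.array([1.0, 3.0, 10.0, 7.0, 99.0])
centers, med = binned_median(x, y, lo=0, hi=4)
```

    The returned centers and med can be passed straight to plt.plot in place of bins - delta/2 and running_median.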
    


    As an example of the versatility of the method, let’s add error bars given by the standard deviation of each bin:

    running_std = [Y[idx == k].std() for k in range(total_bins)]
    # fmt='none' draws only the error bars, without markers or a connecting line
    plt.errorbar(bins - delta / 2, running_median,
                 running_std, fmt='none')
    


  4. This problem can also be handled efficiently with pandas (the Python Data Analysis Library), which has built-in methods for binning and aggregating data.

    Consider this:

    (Kudos and +1 to @Hooked for his example from which I borrowed the X and Y data)

    import pandas as pd
    
    # X, Y and bins are taken from the np.digitize example
    df = pd.DataFrame({'X': X, 'Y': Y})    # build a dataframe from the data
    
    data_cut = pd.cut(df.X, bins)          # assign each point to a bin
    # group the data by bin; observed=False keeps empty bins in the result
    grp = df.groupby(by=data_cut, observed=False)
    
    ret = grp.median()                     # median of X and Y within each bin
    
    # plotting
    
    plt.scatter(df.X, df.Y, color='k', alpha=.2, s=2)
    plt.plot(ret.X, ret.Y, 'r--', lw=4, alpha=.8)
    plt.show()
    

    Remark: here the x values of the red curve are the median x of the points in each bin; the bin midpoints could be used instead.
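
    A variant that plots against the bin midpoints might look like the sketch below. It is an illustration under assumptions, not part of the original answer: binned_median_pandas is a hypothetical name, include_lowest=True is added so a point sitting exactly on the first edge is not dropped, and the sample data is made up.

```python
import numpy as np
import pandas as pd

def binned_median_pandas(x, y, edges):
    """Return bin midpoints and the median of y in each bin (NaN if empty)."""
    edges = np.asarray(edges, dtype=float)
    df = pd.DataFrame({"X": x, "Y": y})
    # Bins are right-closed, (a, b]; include_lowest closes the first one on the left.
    cut = pd.cut(df["X"], edges, include_lowest=True)
    # observed=False keeps one row per bin, in edge order, even for empty bins.
    med = df.groupby(cut, observed=False)["Y"].median().to_numpy()
    mids = 0.5 * (edges[:-1] + edges[1:])
    return mids, med

mids, med = binned_median_pandas([0.5, 0.6, 1.5, 3.2], [1.0, 3.0, 5.0, 7.0],
                                 edges=[0, 1, 2, 3, 4])
```

    Because empty bins are kept, mids and med always have the same length and can be passed directly to plt.plot.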

