I’m working on analyzing some data. I have a way of doing this in Excel, but it’s slow and involves too much manual work. I’d like to find a more efficient way to get what I’m looking for.

Here’s the scenario:
I have a DB table (several, actually, but let’s focus on a single one for now) with many rows and many columns. Think of it as transactional data; we can call it Table0. It looks like the sample below.

Table0 has differences in columns 0, 2, 3 and 5, and identical data in columns 1 and 4. I need to process this table and return only the columns with differences: columns 0, 2, 3 and 5.

I’m looking for a solution in either Python or SQL (Postgres) that produces the sample output table below. It doesn’t seem like a complex problem, but I don’t have the time to get a custom solution running properly.

Are there any well-known methods of manipulating my data like this?

Table0
        C0   C1   C2   C3   C4   C5
    R0  aaa  ax   ay   aq   123  555
    R1  aab  ax   ay   aq   123  555
    R2  aac  ax   ay   aw   123  557
    R3  aad  ax   ax   aw   123  555
    R4  aae  ax   ay   aw   123  559
    R5  aaf  ax   ay   ae   123  555


Output
        C0   C2   C3   C5
    R0  aaa  ay   aq   555
    R1  aab  ay   aq   555
    R2  aac  ay   aw   557
    R3  aad  ax   aw   555
    R4  aae  ay   aw   559
    R5  aaf  ay   ae   555

2 Answers


  1. Chosen as BEST ANSWER

    Thanks to @ouroboros1 for the solution, full code to follow!

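    import csv
    import os

    import pandas as pd


    def main():
        # Read every CSV file in the "data" directory into a list of rows.
        dataFileList = {}
        for csvFile in os.listdir("data"):
            # os.path.join avoids a hard-coded backslash in the path;
            # newline='' is what the csv module expects for file objects.
            with open(os.path.join("data", csvFile), mode='r', newline='') as curFile:
                dataFileList[csvFile] = list(csv.reader(curFile))

        # Build one DataFrame per file: the first row is the header, the rest is data.
        pandaList = []
        for file in dataFileList:
            pandaList.append((file, pd.DataFrame(dataFileList[file][1:], columns=dataFileList[file][0])))

        for filename, dataFrame in pandaList:
            # Keep only the columns whose number of distinct values is not 1.
            result = dataFrame.loc[:, dataFrame.nunique().ne(1)]
            # Written to the current working directory under the same file name.
            result.to_csv(filename, index=False)


    if __name__ == "__main__":
        main()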
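
For anyone who wants to try the column filter without a `data` directory, here is a minimal sketch that lets pandas parse the CSV itself instead of going through `csv.reader`. It is not the code from the answer above: the helper name `drop_constant_columns` and the in-memory `csv_text` are made up for illustration, and `dtype=str` is an assumption chosen so that every cell stays text, as it would with `csv.reader`.

```python
import io

import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns whose count of distinct values is not 1."""
    return df.loc[:, df.nunique().ne(1)]


# Hypothetical stand-in for one CSV file; the rows are the sample Table0.
csv_text = """C0,C1,C2,C3,C4,C5
aaa,ax,ay,aq,123,555
aab,ax,ay,aq,123,555
aac,ax,ay,aw,123,557
aad,ax,ax,aw,123,555
aae,ax,ay,aw,123,559
aaf,ax,ay,ae,123,555
"""

# dtype=str keeps every cell as text instead of letting pandas infer numbers.
table0 = pd.read_csv(io.StringIO(csv_text), dtype=str)
result = drop_constant_columns(table0)
print(list(result.columns))  # ['C0', 'C2', 'C3', 'C5']
```

With real files, `io.StringIO(csv_text)` would be replaced by the file path.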
    

  2. Using pandas:

    df.loc[:, df.nunique().ne(1)]
    
         C0  C2  C3   C5
    R0  aaa  ay  aq  555
    R1  aab  ay  aq  555
    R2  aac  ay  aw  557
    R3  aad  ax  aw  555
    R4  aae  ay  aw  559
    R5  aaf  ay  ae  555
    

    The intermediate step, the number of distinct values per column:

    df.nunique()
    
    C0    6
    C1    1 # -> `False` with .ne(1)
    C2    2
    C3    3
    C4    1 # -> `False` with .ne(1)
    C5    3
    dtype: int64
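
One caveat worth checking if the real table can contain missing values (the sample in the question has none): by default `nunique()` ignores them, so a column holding one value plus some gaps still counts as 1 and gets dropped, while an entirely empty column counts as 0 and is kept. The sketch below uses a small made-up frame to show the difference with `dropna=False`; which behaviour is wanted depends on the data.

```python
import pandas as pd

# Hypothetical frame, not from the question: None marks a missing value.
df = pd.DataFrame(
    {
        "A": ["x", "x", "x"],      # truly constant
        "B": ["x", None, "x"],     # one value plus a missing entry
        "C": [None, None, None],   # entirely missing
    }
)

default_counts = df.nunique()              # missing ignored: A=1, B=1, C=0
strict_counts = df.nunique(dropna=False)   # missing counted: A=1, B=2, C=1

kept_default = list(df.loc[:, default_counts.ne(1)].columns)
kept_strict = list(df.loc[:, strict_counts.ne(1)].columns)
print(kept_default)  # ['C']
print(kept_strict)   # ['B']
```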
    