Postgresql - Finding the difference in rows of a table, and returning only the columns where values are different

MKUltra
October 10, 2024
82 views
0 votes
2 Answers

I’m working on analyzing some data. I have a way of doing this in excel, but it’s slow and too much manual work. I’d like to find a more effective way to find what I’m looking for.

Here’s the scenario:
I have a DB table (multiple, but let’s just focus on a single one for now) that has many rows and many columns. Think of this as transactional data and we can call it Table0. It looks like the sample below.

Table0 has differences in columns 0,2,3,5 and has identical data in columns 1,4. I need to process this table, and only return the columns with differences: columns 0,2,3,5.

I’m looking for a solution that will work with either Python or SQL (postgres) that can provide the sample output table below. It doesn’t seem like a complex issue, but I don’t have the luxury of time to get a custom solution running properly.

Are there any well-known methods of manipulating my data like this?

Table0
        C0   C1   C2   C3   C4   C5
    R0  aaa  ax   ay   aq   123  555
    R1  aab  ax   ay   aq   123  555
    R2  aac  ax   ay   aw   123  557
    R3  aad  ax   ax   aw   123  555
    R4  aae  ax   ay   aw   123  559
    R5  aaf  ax   ay   ae   123  555


Output
        C0   C2   C3   C5
    R0  aaa  ay   aq   555
    R1  aab  ay   aq   555
    R2  aac  ay   aw   557
    R3  aad  ax   aw   555
    R4  aae  ay   aw   559
    R5  aaf  ay   ae   555

Answers

Chosen as BEST ANSWER

Thanks to @ouroboros1 for the solution, full code to follow!

def main():

    csvFileList=os.listdir("data")

    dataFileList={}

    for csvFile in csvFileList:

        with open(f"data\{csvFile}", mode='r', newline='n') as curFile:
            curFileData = csv.reader(curFile)
            currFileList=[]
            for row in curFileData:
                currFileList.append(row)
            dataFileList[csvFile]= currFileList
    pandaList=[]

    for file in dataFileList:
        pandaList.append((file, pd.DataFrame(dataFileList[file][1:], columns=dataFileList[file][0])))

    for df in pandaList:

        filename = df[0]
        dataFrame=df[1]
        result = dataFrame.loc[:, dataFrame.nunique().ne(1)]

        result.to_csv(filename, index=False)

(Edit)

- ouroboros1
- October 10, 2024 at 9:32 pm
- 0 votes
0
Using pandas:
- Check where df.nunique is not equal to 1 using Series.ne and select with df.loc:
```
df.loc[:, df.nunique().ne(1)]

     C0  C2  C3   C5
R0  aaa  ay  aq  555
R1  aab  ay  aq  555
R2  aac  ay  aw  557
R3  aad  ax  aw  555
R4  aae  ay  aw  559
R5  aaf  ay  ae  555
```
The intermediate:
```
df.nunique()

C0    6
C1    1 # -> `False` with .ne(1)
C2    2
C3    3
C4    1 # -> `False` with .ne(1)
C5    3
dtype: int64
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Finding the difference in rows of a table, and returning only the columns where values are different

Answers