I’m working on analyzing some data. I have a way of doing this in Excel, but it’s slow and involves too much manual work. I’d like to find a more efficient way to get what I’m looking for.
Here’s the scenario:
I have a DB table (multiple, but let’s just focus on a single one for now) that has many rows and many columns. Think of this as transactional data and we can call it Table0. It looks like the sample below.
In Table0, the values vary across rows in columns 0, 2, 3, and 5, and are identical in every row of columns 1 and 4. I need to process this table and return only the columns that vary: columns 0, 2, 3, and 5.
I’m looking for a solution in either Python or SQL (PostgreSQL) that produces the sample output table below. It doesn’t seem like a complex problem, but I don’t have the time to build and debug a custom solution.
Are there any well-known methods of manipulating my data like this?
Table0
    C0   C1  C2  C3  C4   C5
R0  aaa  ax  ay  aq  123  555
R1  aab  ax  ay  aq  123  555
R2  aac  ax  ay  aw  123  557
R3  aad  ax  ax  aw  123  555
R4  aae  ax  ay  aw  123  559
R5  aaf  ax  ay  ae  123  555
Output
    C0   C2  C3  C5
R0  aaa  ay  aq  555
R1  aab  ay  aq  555
R2  aac  ay  aw  557
R3  aad  ax  aw  555
R4  aae  ay  aw  559
R5  aaf  ay  ae  555
2 Answers
Thanks to @ouroboros1 for the solution; full code to follow!
Using pandas: find the columns where df.nunique is not equal to 1 using Series.ne, and select them with df.loc.
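A minimal sketch of that approach, assuming the table has already been loaded into a pandas DataFrame. Here the frame is rebuilt by hand from the Table0 sample in the question, and the names df, varies, and result are only illustrative:

```python
import pandas as pd

# Sample data copied from Table0 in the question (an assumption about dtypes:
# C4 and C5 are treated as integers, the rest as strings).
df = pd.DataFrame(
    {
        "C0": ["aaa", "aab", "aac", "aad", "aae", "aaf"],
        "C1": ["ax"] * 6,
        "C2": ["ay", "ay", "ay", "ax", "ay", "ay"],
        "C3": ["aq", "aq", "aw", "aw", "aw", "ae"],
        "C4": [123] * 6,
        "C5": [555, 555, 557, 555, 559, 555],
    },
    index=["R0", "R1", "R2", "R3", "R4", "R5"],
)

# Count distinct values per column, then flag the columns whose count is not 1.
varies = df.nunique().ne(1)

# Keep every row, but only the columns flagged True.
result = df.loc[:, varies]

print(result)
```

One caveat: DataFrame.nunique ignores missing values by default, so a column holding a single value plus some NaN entries counts as constant and is dropped. Passing dropna=False to nunique treats NaN as a distinct value if that matters for your data.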
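The intermediate result, df.nunique().ne(1), is a boolean Series indexed by column name that is True for every column with more than one distinct value.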