I need to find unique records from mysql table / dataframe.
As per the business logic, if the time difference is less than 5 seconds (mobile number and code should be the same) then that record is considered as a single record.
Sample table:
create table test(dt bigint, mobile bigint, code int);
insert into test values (231019074114, 7819838466, 20);
insert into test values (231019074151, 7819838466, 20);
insert into test values (231019074154, 7819838466, 20);
insert into test values (231019074159, 8819838466, 20);
insert into test values (231019074159, 7819838466, 99);
select mobile, code, max(dt) from test as t group by mobile, code order by mobile, code
Results:
mobile code max(dt)
7819838466 20 231019074154
7819838466 99 231019074159
8819838466 20 231019074159
In this case, the difference is just 3 seconds (231019074151 and 231019074154) for the same mobile number and code combo. Therefore one of the records is eliminated. The display value can be anything beteen 51 to 54 and therefore even 231019074153 is acceptable.
Expected results:
mobile code max(dt)
7819838466 20 231019074114
7819838466 20 231019074153
7819838466 99 231019074159
8819838466 20 231019074159
Update:
pandas solution seems incorrect:
This code returns only 4 records but one record is missing
from io import StringIO
import pandas as pd
audit_trail = StringIO('''
dt|mobile|code
231019074114|7819838466|20
231019074151|7819838466|20
231019074152|7819838466|20
231019074153|7819838466|20
231019074154|7819838466|20
231019074155|7819838466|20
231019074159|8819838466|20
231019074231|7819838466|99
231019074259|7819838466|99
''')
df = pd.read_csv(audit_trail, sep="|")
N = 5
df = df.sort_values(['mobile','code','dt'])
diff = df.groupby(['mobile','code'])['dt'].diff().clip(lower=N)
df[diff.isna() | diff.eq(N) & ~diff.duplicated()]
Expected:
dt mobile code
0 231019074114 7819838466 20
2 231019074152 7819838466 20
7 231019074231 7819838466 99
8 231019074259 7819838466 99
6 231019074159 8819838466 20
2
Answers
If there is multiple values with differencies less like 5 and need only one row:
I think this does what you want if I’ve understood correctly, but it won’t be very efficient for use on large tables.
Output:
This will group together all records which are not more than 4 seconds apart. This might not be what you want if you have say three records, at dt1, dt1+4, dt1+8, they will display as only one instance. The question isn’t clear about how such a scenario should be handled though so my solution may be acceptable.