skip to Main Content

I need to find unique records from mysql table / dataframe.

As per the business logic, if the time difference is less than 5 seconds (mobile number and code should be the same) then that record is considered as a single record.

Sample table:

create table test(dt bigint, mobile bigint, code int);

insert into test values (231019074114, 7819838466, 20);
insert into test values (231019074151, 7819838466, 20);
insert into test values (231019074154, 7819838466, 20);
insert into test values (231019074159, 8819838466, 20);
insert into test values (231019074159, 7819838466, 99);

select mobile, code, max(dt) from test as t group by mobile, code order by mobile, code

Results:

mobile  code    max(dt)
7819838466  20  231019074154
7819838466  99  231019074159
8819838466  20  231019074159

In this case, the difference is just 3 seconds (231019074151 and 231019074154) for the same mobile number and code combo. Therefore one of the records is eliminated. The display value can be anything beteen 51 to 54 and therefore even 231019074153 is acceptable.

Expected results:

mobile  code    max(dt)
7819838466  20  231019074114
7819838466  20  231019074153
7819838466  99  231019074159
8819838466  20  231019074159

Update:

pandas solution seems incorrect:

This code returns only 4 records but one record is missing

from io import StringIO
import pandas as pd

audit_trail = StringIO('''
dt|mobile|code
231019074114|7819838466|20
231019074151|7819838466|20
231019074152|7819838466|20
231019074153|7819838466|20
231019074154|7819838466|20
231019074155|7819838466|20
231019074159|8819838466|20
231019074231|7819838466|99
231019074259|7819838466|99
''')

df = pd.read_csv(audit_trail, sep="|")

N = 5
df = df.sort_values(['mobile','code','dt'])
diff = df.groupby(['mobile','code'])['dt'].diff().clip(lower=N)
df[diff.isna() | diff.eq(N) & ~diff.duplicated()]

Expected:

             dt      mobile  code
0  231019074114  7819838466    20
2  231019074152  7819838466    20
7  231019074231  7819838466    99
8  231019074259  7819838466    99
6  231019074159  8819838466    20

2

Answers


  1. If there is multiple values with differencies less like 5 and need only one row:

    N = 5
    df = df.sort_values(['mobile','code','dt'])
    diff = df.groupby(['mobile','code'])['dt'].diff().clip(lower=N)
    print (diff)
    0     NaN
    1    37.0
    2     5.0
    3     5.0
    4     5.0
    5     5.0
    7     NaN
    8    28.0
    6     NaN
    Name: dt, dtype: float64
    
    out = df[diff.isna() | diff.eq(N) & ~diff.duplicated()]
    print (out)
                 dt      mobile  code
    0  231019074114  7819838466    20
    2  231019074152  7819838466    20
    7  231019074131  7819838466    99
    6  231019074159  8819838466    20
    
    Login or Signup to reply.
  2. I think this does what you want if I’ve understood correctly, but it won’t be very efficient for use on large tables.

    select distinct t1.mobile, t1.code, t1.dt from test as t1 
    where not exists(
      select t2.mobile from test as t2
      where t1.mobile = t2.mobile
      and t1.code = t2.code
      and t2.dt between t1.dt + 1 and t1.dt + 4
    )
    order by mobile, code;
    

    Output:

    +------------+------+--------------+
    | mobile     | code | dt           |
    +------------+------+--------------+
    | 7819838466 |   20 | 231019074114 |
    | 7819838466 |   20 | 231019074154 |
    | 7819838466 |   99 | 231019074159 |
    | 8819838466 |   20 | 231019074159 |
    +------------+------+--------------+
    

    This will group together all records which are not more than 4 seconds apart. This might not be what you want if you have say three records, at dt1, dt1+4, dt1+8, they will display as only one instance. The question isn’t clear about how such a scenario should be handled though so my solution may be acceptable.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search