Javascript - Remove Duplicates from Very Large Array, Keeping Oldest Timestamp

DanCue
August 14, 2024
93 views
1 vote
2 Answers

I have a large dataset (400K+ rows by 4 columns). I’m currently grabbing the data into an array using the following (thanks to tanaike):

function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
  return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
    const last = start - 1 + limit;
    const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
    const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
    start += limit;
    return temp;
  });
}

function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
  Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
    const v = values.splice(0, limit);
    Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
    start += limit;
  });
}

The first function grabs data 100K rows at a time and places it into an array. The second function pastes data 100K rows at a time. Works great. I then do some work on it via loops.

I now find myself needing to remove duplicates. Here is what my data looks like:

ID	Content	Completion Date	Pathway
1	abc	01/01/2024	Apple
1	def	01/01/2024	Apple
1	ghi	01/01/2024	Apple
1	def	01/11/2024	Apple
1	abc	01/01/2023	Apple
1	abc	01/01/2024	Apple

I would like to remove those rows that are duplicates based on ID, Content and Pathway…keeping the oldest date in the Completion Date column. It should look like this:

ID	Content	Completion Date	Pathway
1	def	01/01/2024	Apple
1	ghi	01/01/2024	Apple
1	abc	01/01/2023	Apple

How would you approach this?

Paste data into the sheet, sort and THEN use .removeDuplicates()?
Remove duplicates within the array with loop and then paste into the sheet?

You should know:

I already have the data in an array because I am looping
through it to remove certain rows that have specific data in the
Content column.
With having so many rows I’m looking for the fastest solution possible.
I have no problem trying to figure this out on my own if you can simply answer my question above, but a solution that provides a script would be appreciated, especially if the best method is using a loop. I can manage with simple datasets but one this size presents a challenge for me.

Update:

Using Method 1 above of pasting, sorting and .removingDuplicates took ~30 seconds on about 400K rows. That’s actually not bad but I’m curious if there’s another way.

Answers

- PatrickdC
- August 13, 2024 at 5:37 pm
- 0 votes
0
Use Another Script to Reduce Your Data

You may add another script which will filter out your current data based on the oldest date. You may achieve this by using the following script which uses map(), filter(), and sort functions.

Script
```
function redeuceData() {
  var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var data = ss.getDataRange().getDisplayValues();
  var uniqueContent = [...new Set(data.map(x=>x[1]))];
  var out = uniqueContent.map(x=>data.filter(y=>y[1]==x).sort()).map(z=>z[0])
  ss.clearContents();
  ss.getRange(1,1,out.length,out[0].length).setValues(out);
}
```
The [...new Set(data.map(x=>x[1]))] creates a new set of data with unique values for the column content. This set will become the basis of grouping all rows according to the value of content. After grouping the rows together, the sort function is applied so that the oldest date will be positioned on to index 0 of the groupings after which is reduced with the help of map which only gets all rows with index 0.

Output:

Using this sample data:

Running the script will result in:

References:
- map()
- filter()
- sort()
Update:

If you wish to reduce your data based on 3 columns, you may use the following script which expands the filter based on ID, Content, and Pathway:
```
function redeuceData2() {
  var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var data = ss.getDataRange().getDisplayValues();
  var uniqueContent = [...new Set(data.map(x => [x[0],x[1],x[3]].join("|")))];
  var out = uniqueContent.map(x => data.filter(y => [y[0],y[1],y[3]].join("|") == x).sort()).map(z => z[0]);
  ss.clearContents();
  ss.getRange(1, 1, out.length, out[0].length).setValues(out);
}
```
Output:

Using another sample data with different IDs and Pathways:

Running the updated script should result in:
Login or Signup to reply.

- Tanaike
- August 14, 2024 at 7:32 am
- 0 votes
0
From I already have the data in an array because I am looping through it to remove certain rows that have specific data in the Content column., if you have already had the values retrieved by your showing function getValues_ from the Spreadsheet. And when the retrieved values have 400,000 rows, how about the following sample script?

By the way, from your showing table, I guessed that your date format might be MM/dd/yyyy. Please be careful about this.

Sample script:
```
// This is from your showing script.
const values = getValues_({,,,});


const res = [...values.reduce((m, r) => {
  const [a, b, c, d] = r;
  const k = `${a}${b}${d}`;
  const v = m.get(k);
  if (v) {
    if (new Date(v[2]).getTime() > new Date(c).getTime()) {
      m.delete(k);
      m.set(k, r);
    }
  } else {
    m.set(k, r);
  }
  return m;
}, new Map()).values()];

console.log(res);
```
When this script is run with the following values (your showing table),
```
const values = [
  [1, "abc", "01/01/2024", "Apple"],
  [1, "def", "01/01/2024", "Apple"],
  [1, "ghi", "01/01/2024", "Apple"],
  [1, "def", "01/11/2024", "Apple"],
  [1, "abc", "01/01/2023", "Apple"],
  [1, "abc", "01/01/2024", "Apple"],
];
```
the following result is obtained. It seems that this result is the same with your expected result. Only the rows with the oldest value of column "C" as a key of columns "A", "B", and "D" are left.
```
[
  [1,"def","01/01/2024","Apple"],
  [1,"ghi","01/01/2024","Apple"],
  [1,"abc","01/01/2023","Apple"]
]
```
References:
- reduce()
- Map
Login or Signup to reply.