I have a large dataset (400K+ rows by 4 columns). I’m currently grabbing the data into an array using the following (thanks to tanaike):
function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
const last = start - 1 + limit;
const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
start += limit;
return temp;
});
}
function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
const v = values.splice(0, limit);
Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
start += limit;
});
}
The first function grabs data 100K rows at a time and places it into an array. The second function pastes data 100K rows at a time. Works great. I then do some work on it via loops.
I now find myself needing to remove duplicates. Here is what my data looks like:
ID | Content | Completion Date | Pathway |
---|---|---|---|
1 | abc | 01/01/2024 | Apple |
1 | def | 01/01/2024 | Apple |
1 | ghi | 01/01/2024 | Apple |
1 | def | 01/11/2024 | Apple |
1 | abc | 01/01/2023 | Apple |
1 | abc | 01/01/2024 | Apple |
I would like to remove those rows that are duplicates based on ID, Content and Pathway…keeping the oldest date in the Completion Date column. It should look like this:
ID | Content | Completion Date | Pathway |
---|---|---|---|
1 | def | 01/01/2024 | Apple |
1 | ghi | 01/01/2024 | Apple |
1 | abc | 01/01/2023 | Apple |
How would you approach this?
- Paste data into the sheet, sort and THEN use
.removeDuplicates()
? - Remove duplicates within the array with loop and then paste into the sheet?
You should know:
- I already have the data in an array because I am looping
through it to remove certain rows that have specific data in the
Content column. - With having so many rows I’m looking for the fastest solution possible.
- I have no problem trying to figure this out on my own if you can simply answer my question above, but a solution that provides a script would be appreciated, especially if the best method is using a loop. I can manage with simple datasets but one this size presents a challenge for me.
Update:
- Using Method 1 above of pasting, sorting and
.removingDuplicates
took ~30 seconds on about 400K rows. That’s actually not bad but I’m curious if there’s another way.
2
Answers
Use Another Script to Reduce Your Data
You may add another script which will filter out your current data based on the oldest date. You may achieve this by using the following script which uses
map()
,filter()
, andsort
functions.Script
The
[...new Set(data.map(x=>x[1]))]
creates a new set of data with unique values for the columncontent
. This set will become the basis of grouping all rows according to the value of content. After grouping the rows together, thesort
function is applied so that the oldest date will be positioned on toindex 0
of the groupings after which is reduced with the help ofmap
which only gets all rows withindex 0
.Output:
Using this sample data:
Running the script will result in:
References:
Update:
If you wish to reduce your data based on 3 columns, you may use the following script which expands the filter based on
ID
,Content
, andPathway
:Output:
Using another sample data with different
IDs
andPathways
:Running the updated script should result in:
From
I already have the data in an array because I am looping through it to remove certain rows that have specific data in the Content column.
, if you have already had the values retrieved by your showing functiongetValues_
from the Spreadsheet. And when the retrieved values have 400,000 rows, how about the following sample script?By the way, from your showing table, I guessed that your date format might be
MM/dd/yyyy
. Please be careful about this.Sample script:
When this script is run with the following
values
(your showing table),the following result is obtained. It seems that this result is the same with your expected result. Only the rows with the oldest value of column "C" as a key of columns "A", "B", and "D" are left.
References: