skip to Main Content

I have a large dataset (400K+ rows by 4 columns). I’m currently grabbing the data into an array using the following (thanks to tanaike):

function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
  return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
    const last = start - 1 + limit;
    const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
    const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
    start += limit;
    return temp;
  });
}

function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
  Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
    const v = values.splice(0, limit);
    Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
    start += limit;
  });
}

The first function grabs data 100K rows at a time and places it into an array. The second function pastes data 100K rows at a time. Works great. I then do some work on it via loops.

I now find myself needing to remove duplicates. Here is what my data looks like:

ID Content Completion Date Pathway
1 abc 01/01/2024 Apple
1 def 01/01/2024 Apple
1 ghi 01/01/2024 Apple
1 def 01/11/2024 Apple
1 abc 01/01/2023 Apple
1 abc 01/01/2024 Apple

I would like to remove those rows that are duplicates based on ID, Content and Pathway…keeping the oldest date in the Completion Date column. It should look like this:

ID Content Completion Date Pathway
1 def 01/01/2024 Apple
1 ghi 01/01/2024 Apple
1 abc 01/01/2023 Apple

How would you approach this?

  • Paste data into the sheet, sort and THEN use .removeDuplicates()?
  • Remove duplicates within the array with loop and then paste into the sheet?

You should know:

  • I already have the data in an array because I am looping
    through it to remove certain rows that have specific data in the
    Content column.
  • With having so many rows I’m looking for the fastest solution possible.
  • I have no problem trying to figure this out on my own if you can simply answer my question above, but a solution that provides a script would be appreciated, especially if the best method is using a loop. I can manage with simple datasets but one this size presents a challenge for me.

Update:

  • Using Method 1 above of pasting, sorting and .removingDuplicates took ~30 seconds on about 400K rows. That’s actually not bad but I’m curious if there’s another way.

2

Answers


  1. Use Another Script to Reduce Your Data

    You may add another script which will filter out your current data based on the oldest date. You may achieve this by using the following script which uses map(), filter(), and sort functions.

    Script

    function redeuceData() {
      var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
      var data = ss.getDataRange().getDisplayValues();
      var uniqueContent = [...new Set(data.map(x=>x[1]))];
      var out = uniqueContent.map(x=>data.filter(y=>y[1]==x).sort()).map(z=>z[0])
      ss.clearContents();
      ss.getRange(1,1,out.length,out[0].length).setValues(out);
    }
    

    The [...new Set(data.map(x=>x[1]))] creates a new set of data with unique values for the column content. This set will become the basis of grouping all rows according to the value of content. After grouping the rows together, the sort function is applied so that the oldest date will be positioned on to index 0 of the groupings after which is reduced with the help of map which only gets all rows with index 0.

    Output:

    Using this sample data:

    Sample data

    Running the script will result in:

    result

    References:

    Update:

    If you wish to reduce your data based on 3 columns, you may use the following script which expands the filter based on ID, Content, and Pathway:

    function redeuceData2() {
      var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
      var data = ss.getDataRange().getDisplayValues();
      var uniqueContent = [...new Set(data.map(x => [x[0],x[1],x[3]].join("|")))];
      var out = uniqueContent.map(x => data.filter(y => [y[0],y[1],y[3]].join("|") == x).sort()).map(z => z[0]);
      ss.clearContents();
      ss.getRange(1, 1, out.length, out[0].length).setValues(out);
    }
    

    Output:

    Using another sample data with different IDs and Pathways:

    sample data 2

    Running the updated script should result in:

    Output 2

    Login or Signup to reply.
  2. From I already have the data in an array because I am looping through it to remove certain rows that have specific data in the Content column., if you have already had the values retrieved by your showing function getValues_ from the Spreadsheet. And when the retrieved values have 400,000 rows, how about the following sample script?

    By the way, from your showing table, I guessed that your date format might be MM/dd/yyyy. Please be careful about this.

    Sample script:

    // This is from your showing script.
    const values = getValues_({,,,});
    
    
    const res = [...values.reduce((m, r) => {
      const [a, b, c, d] = r;
      const k = `${a}${b}${d}`;
      const v = m.get(k);
      if (v) {
        if (new Date(v[2]).getTime() > new Date(c).getTime()) {
          m.delete(k);
          m.set(k, r);
        }
      } else {
        m.set(k, r);
      }
      return m;
    }, new Map()).values()];
    
    console.log(res);
    

    When this script is run with the following values (your showing table),

    const values = [
      [1, "abc", "01/01/2024", "Apple"],
      [1, "def", "01/01/2024", "Apple"],
      [1, "ghi", "01/01/2024", "Apple"],
      [1, "def", "01/11/2024", "Apple"],
      [1, "abc", "01/01/2023", "Apple"],
      [1, "abc", "01/01/2024", "Apple"],
    ];
    

    the following result is obtained. It seems that this result is the same with your expected result. Only the rows with the oldest value of column "C" as a key of columns "A", "B", and "D" are left.

    [
      [1,"def","01/01/2024","Apple"],
      [1,"ghi","01/01/2024","Apple"],
      [1,"abc","01/01/2023","Apple"]
    ]
    

    References:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search