
I have a list of objects and I want to filter it by removing the objects whose IDs appear in a separate list, i.e.:

List<Vehicle> vehicles;  // All vehicles (260 items)
List<int> vehiclesIds;   // IDs to remove from the original list (fewer than 260)

Now, if I use RemoveAll() for the removal, it takes around 25 to 30 seconds.

vehicles.RemoveAll(e => vehiclesIds.Contains(e.Id));

But if I use a foreach loop as follows, it takes only 8 to 20 milliseconds.

foreach (var vehiclesId in vehiclesIds)
{
    vehicles.RemoveAll(e => vehiclesId == e.Id);
}

I ran this code multiple times and, measuring with a Stopwatch, found that List.Contains() is the reason for the slow processing.
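
(Roughly how I measured it; this is a sketch of the timing code, not the exact harness:)

var sw = System.Diagnostics.Stopwatch.StartNew();
vehicles.RemoveAll(e => vehiclesIds.Contains(e.Id));
sw.Stop();
Console.WriteLine($"RemoveAll took {sw.ElapsedMilliseconds} ms");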

2 Answers


  1. The logic between these two versions is different.

    The first version loops over the vehicles first, looking for each vehicle's ID in the ID list. Contains performs a linear scan and uses the default equality comparer, which might be slow because the comparison goes through an interface.

    You can improve this version by using a HashSet<int> to find matching IDs. This should be faster (even though it also uses Contains) because a HashSet is backed by a hash table, making each lookup O(1) on average instead of a linear scan.

    List<Vehicle> vehicles;   // All vehicles (260 items)
    HashSet<int> vehiclesIds; // IDs to remove from the original list (fewer than 260)
    
    vehicles.RemoveAll(e => vehiclesIds.Contains(e.Id));
    

    The second version instead loops over the smaller list of IDs first. You can improve this version by using a Dictionary<int, Vehicle> keyed by ID for the vehicles:

    Dictionary<int, Vehicle> vehicles; // All vehicles keyed by ID (260 items)
    List<int> vehiclesIds;             // IDs to remove from the original list (fewer than 260)
    
    foreach (var vehiclesId in vehiclesIds)
    {
        vehicles.Remove(vehiclesId);
    }
    

    I would imagine this last version to be the fastest, since Dictionary.Remove is an O(1) operation.
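
    If your vehicles start out in a List<Vehicle>, you first need to build the dictionary. A minimal sketch, assuming IDs are unique and using System.Linq; vehicleMap is just an illustrative name:

    // Build a lookup keyed by Id; assumes each Id is unique (ToDictionary throws on duplicates)
    Dictionary<int, Vehicle> vehicleMap = vehicles.ToDictionary(v => v.Id);

    foreach (var vehiclesId in vehiclesIds)
    {
        vehicleMap.Remove(vehiclesId); // O(1) hash removal per ID
    }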

  2. TLDR: The answer to your question "Why List.Contains() is too slow?" is: It isn’t. There must be something wrong elsewhere that you haven’t shown us.

    Longer answer:

    The first version should be faster because it walks and compacts the list only once, rather than once per ID. (The fact that it isn’t faster points to some problem elsewhere.)

    However, the first version does call vehiclesIds.Contains(e.Id) once for each vehicle, and each call is a linear scan of the ID list, so the whole operation is O(vehicles × IDs).

    You could improve that by creating a HashSet<int> from the vehicle IDs and then using that for the lookup.
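
    Applied to the code in the question, that might look like this (idSet is just an illustrative name):

    var idSet = new HashSet<int>(vehiclesIds);      // one-time O(n) build
    vehicles.RemoveAll(e => idSet.Contains(e.Id));  // O(1) average lookup per vehicle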

    I tried the three approaches (your two plus the HashSet one) in a benchmark:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    
    namespace Demo;
    
    static class Program
    {
        public static void Main()
        {
            var summary = BenchmarkRunner.Run<UnderTest>();
        }
    }
    
    sealed class Vehicle
    {
        public int Id { get; set; }
    }
    
    public class UnderTest
    {
        public UnderTest()
        {
            var rng = new Random(34655);
    
            for (int i = 0; i < NUM_IDS; ++i)
                _vehicleIds.Add(rng.Next(0, MAX_ID));
    
            _idsHash = _vehicleIds.ToHashSet();
        }
    
    [IterationSetup]
    public void IterationSetup()
    {
        // Rebuild the vehicle list before every iteration, since each benchmark mutates it
        Console.WriteLine("IterationSetup");
        createVehicleList();
    }
    
        [Benchmark]
        public void RemoveItemsUsingRemoveAll()
        {
            _vehicles.RemoveAll(e => _vehicleIds.Contains(e.Id));
        }
    
        [Benchmark]
        public void RemoveItemsUsingLoop()
        {
            foreach (int vehiclesId in _vehicleIds)
            {
                _vehicles.RemoveAll(e => vehiclesId == e.Id);
            }
        }
    
        [Benchmark]
        public void RemoveItemsUsingHashSet()
        {
            _vehicles.RemoveAll(e => _idsHash.Contains(e.Id));
        }
    
        void createVehicleList()
        {
            _vehicles.Clear();
            var rng = new Random(78712);
    
            for (int i = 0; i < NUM_VEHICLES; ++i)
            {
                _vehicles.Add(new Vehicle {Id = rng.Next(MAX_ID)});
            }
        }
    
    const int NUM_VEHICLES = 200_000;  // size of the list being filtered
    const int MAX_ID       = 800_000;  // IDs are drawn from [0, MAX_ID)
    const int NUM_IDS      =   1_000;  // how many IDs to remove
    
        readonly List<int>     _vehicleIds = new ();
        readonly List<Vehicle> _vehicles   = new ();
        readonly HashSet<int>  _idsHash;
    }
    

    The results I obtained on my PC were as follows:

    |                     Method |       Mean |      Error |     StdDev |     Median |
    |--------------------------- |-----------:|-----------:|-----------:|-----------:|
    |  RemoveItemsUsingRemoveAll |  68.360 ms |  1.4030 ms |  4.0926 ms |  67.828 ms |
    |       RemoveItemsUsingLoop | 625.148 ms | 19.4181 ms | 56.3354 ms | 607.409 ms |
    |    RemoveItemsUsingHashSet |   5.293 ms |  0.2643 ms |  0.7626 ms |   5.323 ms |    
    

    As you can see, your first version is significantly faster than your second (as I expected), but using a HashSet is faster still.

    The conclusion that we must draw is that the slowdown you’re seeing is due to code that you haven’t shown us.
