
I’m creating a custom DataSet and I’m under some constraints:

  • I want the user to specify the type of the data which they want to store.
  • I want to reduce type-casting because I think it will be VERY expensive.
  • I will use the data VERY frequently in my application.

I don’t know what type of data will be stored in the DataSet, so my initial idea was to make it a List of objects, but I suspect that the frequent use of the data and the need to type-cast will be very expensive.

The basic idea is this:

using System.Collections.Generic;
using System.Linq;

class DataSet : IDataSet
{
    private Dictionary<string, List<Object>> _data;

    /// <summary>
    /// Constructs the data set given the user-specified labels.
    /// </summary>
    /// <param name="labels">
    /// The labels of each column in the data set.
    /// </param>
    public DataSet(List<string> labels)
    {
        _data = new Dictionary<string, List<object>>();
        foreach (string label in labels)
        {
            _data.Add(label, new List<object>());
        }
    }

    #region IDataSet Members

    public List<string> DataLabels
    {
        get { return _data.Keys.ToList(); }
    }

    public int Count
    {
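        // All columns are assumed to have the same length; Keys cannot be indexed, so read the first column.
        get { return _data.Values.First().Count; }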
    }

    public List<object> GetValues(string label)
    {
        return _data[label];
    }

    public object GetValue(string label, int index)
    {
        return _data[label][index];
    }

    public void InsertValue(string label, object value)
    {
        _data[label].Insert(0, value);
    }

    public void AddValue(string label, object value)
    {
        _data[label].Add(value);
    }

    #endregion
}

A concrete example where the DataSet will be used is storing data read from a CSV file whose first row contains the labels. When the data is loaded from the CSV file, I’d like to specify the type of each column rather than casting to object. The data could contain columns such as dates, numbers, strings, etc. Here is what it could look like:

"Date","Song","Rating","AvgRating","User"
"02/03/2010","Code Monkey",4.6,4.1,"joe"
"05/27/2009","Code Monkey",1.2,4.5,"jill"

The data will be used in a Machine Learning/Artificial Intelligence algorithm, so it is essential that I make the reading of data very fast. I want to eliminate type-casting as much as possible, since I can’t afford to cast from ‘object’ to whatever data type is needed on every read.

I’ve seen applications that let the user pick the data type for each item in the CSV file, so I’m trying to build a similar solution where a different type can be specified for each column. I want a generic solution that returns a List<DateTime> (for a DateTime column) or a List<double> (for a column of doubles) instead of a List<object>.

Is there any way to achieve this? Or is my approach wrong, and is there a better way to solve this problem?

2 Answers


  1. I would suggest trying what you have now. Maybe the performance will be good enough. If not, and only then, you could think about optimizing further.
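    One way to check whether the cast matters is to time it. The following is only a rough sketch, not a rigorous benchmark: the class name CastBenchmark and the column size are made up for illustration, and results will vary with the runtime and JIT warm-up.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    static class CastBenchmark
    {
        // Sums a boxed column; every read unboxes object -> double.
        public static double SumBoxed(List<object> values)
        {
            double sum = 0;
            foreach (object v in values) sum += (double)v;
            return sum;
        }

        // Sums a typed column; no cast is needed on read.
        public static double SumTyped(List<double> values)
        {
            double sum = 0;
            foreach (double v in values) sum += v;
            return sum;
        }

        public static void Main()
        {
            const int n = 1000000; // hypothetical column size
            var boxed = new List<object>(n);
            var typed = new List<double>(n);
            for (int i = 0; i < n; i++)
            {
                boxed.Add((double)i); // boxes each value
                typed.Add(i);
            }

            var sw = Stopwatch.StartNew();
            double a = SumBoxed(boxed);
            long boxedMs = sw.ElapsedMilliseconds;

            sw.Restart();
            double b = SumTyped(typed);
            long typedMs = sw.ElapsedMilliseconds;

            Console.WriteLine("boxed: {0} ms, typed: {1} ms, same sum: {2}",
                boxedMs, typedMs, a == b);
        }
    }
    ```

    If the two timings are close for your real data sizes, the List<object> design is probably good enough.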

    You could also store each field as a variant object like this:

    struct Variant
    {
        // Only the member matching the column's type is meaningful.
        public string StringValue;
        public DateTime DateTimeValue;
        public bool BoolValue;
        // ... etc. ...
    }
    

    Then you would only need to read the appropriate member of the struct, but this may add just as much overhead: every value carries the memory for all of the fields, and each read needs an if statement to pick the right one.
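    To make that trade-off concrete, here is a minimal sketch of a variant that records which member is in use. The VariantKind enum, the TaggedVariant name, and the factory methods are hypothetical additions for illustration, not part of the struct above.

    ```csharp
    using System;
    using System.Globalization;

    // Hypothetical tag recording which member of the variant is in use.
    enum VariantKind { String, DateTime, Bool }

    struct TaggedVariant
    {
        public VariantKind Kind;
        public string StringValue;
        public DateTime DateTimeValue;
        public bool BoolValue;

        public static TaggedVariant FromString(string value)
        {
            return new TaggedVariant { Kind = VariantKind.String, StringValue = value };
        }

        public static TaggedVariant FromDateTime(DateTime value)
        {
            return new TaggedVariant { Kind = VariantKind.DateTime, DateTimeValue = value };
        }

        // The branch on Kind is the per-read cost this approach pays instead of a cast.
        public string Describe()
        {
            switch (Kind)
            {
                case VariantKind.String:
                    return StringValue;
                case VariantKind.DateTime:
                    return DateTimeValue.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
                case VariantKind.Bool:
                    return BoolValue ? "true" : "false";
                default:
                    throw new InvalidOperationException("Unknown variant kind");
            }
        }
    }
    ```

    Every TaggedVariant still reserves space for all of its fields, so a column of doubles stored this way would use more memory than a List<double>.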

  2. Bear in mind that DataSets also store rows, columns, etc. as objects. A typed DataSet is type-safe only because the cast is done for you inside the typed dataset.

    I think it really depends on what has to happen with the data read from the CSV file, but to eliminate casting without knowing in advance which types you will require, I can only think of dynamically creating the type that holds the data through Reflection.Emit.

    As Jeff says, though, the casting may not kill your app.
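    If one cast per column lookup is acceptable (rather than no cast at all), a simpler alternative to Reflection.Emit is a generic column behind a non-generic base class. This is a different technique from the one suggested above, sketched under that assumption; the names Column, Column<T>, and TypedDataSet are made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Non-generic base so columns of different types can share one dictionary.
    abstract class Column
    {
        public abstract Type ValueType { get; }
        public abstract int Count { get; }
    }

    sealed class Column<T> : Column
    {
        private readonly List<T> _values = new List<T>();

        public override Type ValueType { get { return typeof(T); } }
        public override int Count { get { return _values.Count; } }
        public List<T> Values { get { return _values; } }
    }

    class TypedDataSet
    {
        private readonly Dictionary<string, Column> _columns =
            new Dictionary<string, Column>();

        public void AddColumn<T>(string label)
        {
            _columns.Add(label, new Column<T>());
        }

        // One cast per column lookup; throws InvalidCastException if T
        // does not match the type the column was created with.
        public List<T> GetValues<T>(string label)
        {
            return ((Column<T>)_columns[label]).Values;
        }
    }

    static class TypedDataSetDemo
    {
        public static void Main()
        {
            var data = new TypedDataSet();
            data.AddColumn<DateTime>("Date");
            data.AddColumn<double>("Rating");

            data.GetValues<DateTime>("Date").Add(new DateTime(2010, 2, 3));
            data.GetValues<double>("Rating").Add(4.6);

            // Reads inside the loop work on a List<double>; no per-value cast or unboxing.
            List<double> ratings = data.GetValues<double>("Rating");
            foreach (double r in ratings) Console.WriteLine(r);
        }
    }
    ```

    The caller fetches the typed list once and then iterates it directly, so the cast cost no longer grows with the number of values read.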
