According to the docs, the maximum number of interactions considered by a model during training is 500M. What if my interactions dataset has more than 500M records? How does Amazon select which 500M interactions to train the model on? Does it use the latest 500M interactions?
Question posted in Amazon Web Services
The official Amazon Web Services documentation can be found here.
2 Answers
As far as I know, Amazon has not explicitly disclosed how it handles datasets larger than the maximum size during training. However, I can tell you some generally accepted strategies in machine learning for dealing with this kind of situation:
Random Sampling: One approach could be to randomly select 500 million interactions from your dataset. This would provide a broad, if not comprehensive, sample of the data.
Temporal Sampling: Another approach could be to select the most recent 500 million interactions, under the assumption that more recent data is more relevant to current and future predictions.
Stratified Sampling: You could also stratify the data, ensuring that the sample of 500 million interactions is representative of the various categories or types of interactions in the dataset.
However, it’s crucial to consider that these strategies may introduce bias or exclude potentially useful information.
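To make the three strategies concrete, here is a minimal Python sketch of each one applied to a small in-memory list of interactions. It is only an illustration of the general techniques, not a description of what any AWS service does internally. The function names, the `cap` parameter, the toy data, and the use of an `EVENT_TYPE` field as the stratification key are all assumptions made for this example.

```python
import random
from collections import defaultdict


def random_sample(rows, cap, seed=0):
    """Uniformly sample up to `cap` rows without replacement."""
    if len(rows) <= cap:
        return list(rows)
    return random.Random(seed).sample(rows, cap)


def temporal_sample(rows, cap, ts_key="TIMESTAMP"):
    """Keep the `cap` rows with the largest (most recent) timestamps."""
    return sorted(rows, key=lambda r: r[ts_key], reverse=True)[:cap]


def stratified_sample(rows, cap, strata_key="EVENT_TYPE", seed=0):
    """Sample each stratum in proportion to its share of the full dataset."""
    if len(rows) <= cap:
        return list(rows)
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)
    total = len(rows)
    sample = []
    for members in groups.values():
        # Floor division: the result can be slightly smaller than `cap`.
        quota = (cap * len(members)) // total
        sample.extend(rng.sample(members, quota))
    return sample


# Toy data (hypothetical): 1,000 interactions, one in four is a purchase.
rows = [
    {
        "USER_ID": f"u{i % 7}",
        "ITEM_ID": f"i{i % 13}",
        "TIMESTAMP": 1_700_000_000 + i,
        "EVENT_TYPE": "purchase" if i % 4 == 0 else "click",
    }
    for i in range(1000)
]

recent = temporal_sample(rows, 100)
balanced = stratified_sample(rows, 100)
```

Note how the trade-offs mentioned above show up here: `temporal_sample` drops all older history, while `random_sample` and `stratified_sample` keep old and new interactions alike but thin them out.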
If you are dealing with a specific tool, product, or cloud service, such as Amazon Personalize, which I believe you might be referring to, it would be best to consult the specific product documentation or directly contact the service provider for a precise answer.
The most recent 500M interactions are used for training based on the TIMESTAMP column in the interactions dataset. This limit is also adjustable.
https://docs.aws.amazon.com/personalize/latest/dg/limits.html
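If you want a rough idea of how much of your history would fall inside the limit, you can estimate the cutoff yourself before importing. The sketch below is a hypothetical local helper, not part of any AWS SDK: it streams an interactions CSV, counts the rows, and returns the oldest timestamp among the `limit` most recent rows. It assumes the file has a header row with a `TIMESTAMP` column holding integer Unix epoch values, and it says nothing about how the service breaks ties between rows that share the cutoff timestamp.

```python
import csv
import heapq
import io


def cutoff_timestamp(csv_file, limit, ts_column="TIMESTAMP"):
    """Return (total_rows, cutoff).

    `cutoff` is the oldest timestamp among the `limit` most recent rows,
    or None when the file has at most `limit` rows (nothing would be cut).
    """
    reader = csv.DictReader(csv_file)
    heap = []  # min-heap of the `limit` largest timestamps seen so far
    total = 0
    for row in reader:
        total += 1
        ts = int(row[ts_column])
        if len(heap) < limit:
            heapq.heappush(heap, ts)
        elif ts > heap[0]:
            heapq.heapreplace(heap, ts)
    if total <= limit:
        return total, None
    return total, heap[0]


# Tiny example with a limit of 3 instead of 500M.
sample_csv = io.StringIO(
    "USER_ID,ITEM_ID,TIMESTAMP\n"
    "u1,i1,100\n"
    "u2,i2,500\n"
    "u1,i3,300\n"
    "u3,i1,200\n"
    "u2,i4,400\n"
)
total_rows, cutoff = cutoff_timestamp(sample_csv, limit=3)
```

In this toy file, the rows with timestamps 100 and 200 are older than the cutoff and would be left out. At real scale the heap holds `limit` integers in memory, so for hundreds of millions of rows a query engine or a sorted export is likely a better fit than this script.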