I have a Raspberry Pi 3B+ with 1 GB of RAM on which I am running a Telegram bot.
This bot uses a database I store in csv format, which contains about 100k rows and four columns:
- First two are for searching
- Third is a result
Those use about 20-30 MB of RAM, which is acceptable.
- The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible to manage on the RPi. This column is also used for searching, but I only need it sometimes.
I started by loading the df with read_csv at the start of the script and letting the script poll, but as the db grows I have realized this is too much for the RPi.
What do you think is the best way to handle this? Thanks!
2 Answers
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column.
Example:
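A minimal sketch, assuming the four columns are named search_a, search_b, result and heavy (the real names and suitable dtypes depend on your data):

```python
import pandas as pd

# Column names and dtypes are placeholders -- adapt them to your CSV.
df = pd.read_csv(
    "database.csv",
    dtype={
        "search_a": "category",  # first search column
        "search_b": "category",  # second search column
        "result": "string",      # result column
        "heavy": "category",     # the large, rarely needed column
    },
)

df.info(memory_usage="deep")  # check memory usage after loading
```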
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
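For illustration, using the same placeholder column names as above:

```python
import pandas as pd

df = pd.read_csv("database.csv")
df.info(memory_usage="deep")   # memory usage before the conversion

# Convert each column with astype -- names and dtypes are placeholders.
df["search_a"] = df["search_a"].astype("category")
df["search_b"] = df["search_b"].astype("category")
df["result"] = df["result"].astype("string")
df["heavy"] = df["heavy"].astype("category")

df.info(memory_usage="deep")   # memory usage after the conversion
```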
Kudos to Medallion Data Science and his YouTube video Efficient Pandas Dataframes in Python – Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets – Use efficient datatypes.
Not sure if this is a good idea in this case but it might be worth trying.
The Dask package was designed to allow pandas-like data analysis on dataframes that are too big to fit in memory (RAM), among other things. It does this by loading only chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed with machines like the Raspberry Pi in mind (I'm not even sure there is a distribution for it).
The good thing is that Dask slides fairly seamlessly into your pandas script, so it might not be too much effort to try it out.
Here is a simple example I made before:
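Something along these lines (a minimal sketch rather than the original snippet; the file name and column names are placeholders):

```python
import dask.dataframe as dd

# Dask reads the CSV lazily, splitting it into partitions that are only
# loaded into memory when a computation actually needs them.
ddf = dd.read_csv("database.csv")

# Filtering looks just like pandas, but nothing is loaded yet...
hits = ddf[(ddf["search_a"] == "foo") & (ddf["search_b"] == "bar")]

# ...until .compute() is called, which returns a regular pandas DataFrame.
result = hits["result"].compute()
print(result)
```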
If you try it, let me know if it works; I’m also interested.