I'm very new to DynamoDB.
I want to upload data (file.csv) using boto3.
The code below works fine, but whenever I do a bulk upload, the existing data gets deleted and the items from the CSV file are inserted. What I want to achieve: if the table is empty, insert the CSV data; if the table is not empty, append the CSV data to the DynamoDB table without deleting the existing data. Please guide me on how to achieve this.
import boto3
import pandas as pd
import json

# Read the CSV and convert each row into a dict of column -> value
AdvgCountries_json = json.loads(
    pd.read_csv('data.csv').to_json(orient='records')
)
lst_Dics = [{'item': AdvgCountries_json, 'table': 'xxxxxxx'}]
#print(AdvgCountries_json)

# Connect to DynamoDB
dynamodb = boto3.resource('dynamodb')

def insertDynamoItem(tablename, item_lst):
    dynamoTable = dynamodb.Table(tablename)
    for record in item_lst:
        dynamoTable.put_item(Item=record)
    print('Success')

# Upload content to DynamoDB
for element in lst_Dics:
    insertDynamoItem(element['table'], element['item'])
2 Answers
First, I'll answer your question directly (with an approach I wouldn't go with myself, but which helps explain the service), and then I'll propose a better approach given the limited information.
Using the boto3 DynamoDB scan operation with Limit set to 1, you can quickly find out whether the table is empty. There is also the option of using boto3's describe_table() operation, which includes the table's item count, but that value is only updated roughly every six hours, so it may not be accurate, depending on how often you want to run such a check.
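For instance, a minimal sketch of that check (the table name is a placeholder):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('xxxxxxx')

# A scan with Limit=1 reads at most one item, so the check stays cheap
response = table.scan(Limit=1)
table_is_empty = response['Count'] == 0

# item_count comes from describe_table and is only refreshed
# roughly every six hours, so treat it as an approximation
approximate_count = table.item_count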
Your actual issue is that the data isn't being deleted; it's being overwritten.
When you create a table, in addition to the table name, you must specify the primary key of the table. The primary key uniquely identifies each item in the table, so that no two items can have the same key.
So, when you write your CSV data, what happens is that a record in the CSV file has the same partition key as an item that already exists in the table. In that case, the DynamoDB item with that partition key gets overwritten with the new data.
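To see the overwrite in action, here is a small sketch, assuming a hypothetical table whose partition key is 'country':

import boto3

table = boto3.resource('dynamodb').Table('xxxxxxx')

# The first write creates the item
table.put_item(Item={'country': 'DE', 'population': 83000000})

# A second put_item with the same partition key replaces the entire item,
# so the 'population' attribute from the first write is gone afterwards
table.put_item(Item={'country': 'DE', 'capital': 'Berlin'})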
Yes, you could add a sort key, which would give you a composite primary key: the partition key stays the same, and the sort key is an additional value that makes each item unique. That way you could have multiple items with equal partition keys, differentiated by the sort key, and your data wouldn't get overwritten (as long as you don't write the same composite primary key value again).
But, to add the sort key (if one doesn’t exist on the table creation), you’d have to recreate the table because AWS doesn’t allow you to add the sort key after the table’s creation. That may not be possible in your case, so here’s an alternate approach that shouldn’t affect you much if you’re not dealing with things at a great scale.
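If recreating the table is an option, a composite-key definition could look like this sketch (table and attribute names are hypothetical):

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='xxxxxxx',
    KeySchema=[
        {'AttributeName': 'country', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'city', 'KeyType': 'RANGE'},     # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'country', 'AttributeType': 'S'},
        {'AttributeName': 'city', 'AttributeType': 'S'},
    ],
    BillingMode='PAY_PER_REQUEST',
)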
Use boto3's get_item() function and check whether an item with that key already exists. If any data is returned, the item exists, and you need to decide whether to keep the previous data or write yours. You can even use the batch_get_item() function, apply some logic, and then do a batch_write_item() operation for the records you want to add or overwrite in your table.
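A rough sketch of that check, reusing the AdvgCountries_json list from your question and assuming a hypothetical 'country' attribute as the table's only key:

import boto3

table = boto3.resource('dynamodb').Table('xxxxxxx')

def item_exists(table, key):
    # get_item returns an 'Item' entry only when the key is present
    return 'Item' in table.get_item(Key=key)

for record in AdvgCountries_json:
    if item_exists(table, {'country': record['country']}):
        continue  # keep the existing item; change this branch to overwrite instead
    table.put_item(Item=record)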
The sort key approach would be my first choice, if possible. Extra get_item calls introduce unnecessary costs, both in time and money. A sort key lets you keep multiple rows for the same partition key, differentiated by the sort key.
Hope this helps!
A simple and practical way to import a CSV file into DynamoDB is to use AWS Wrangler (a.k.a. AWS SDK for pandas).
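A minimal sketch, assuming the awswrangler package is installed (file and table names are placeholders; note that rows sharing a primary key with existing items will still overwrite them):

import awswrangler as wr

# Writes every CSV row as an item in the existing DynamoDB table
wr.dynamodb.put_csv(path='data.csv', table_name='xxxxxxx')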