I want to download the daily NAV (Net Asset Value) from the AMFI website for all schemes and store it in MongoDB. With my current code it takes almost 5 days to download and push all the data into the database, because I am restructuring the data along the way. I was hoping someone could help me optimize the code so this can be done faster.
I am aware that the slow part of my code is pushing the NAV data for each date into the database one by one. I want to group it and push it to the DB in batches, but I suspect I would need a better laptop for that, as storing the data in an array takes a lot of memory.
Please find my code below:
#https://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx?&frmdt=14-Aug-2023&todt=16-Aug-2023
import requests
from pytz import utc
from datetime import datetime
import pymongo # Import the pymongo library for MongoDB operations
# Initialize MongoDB client and database
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/") # Replace with your MongoDB connection string
mydb = mongo_client["M_F"] # Replace with your database name
mycollection = mydb["MyNAV"] # Replace with your collection name
def convert_date_to_utc_datetime(date_string):
    date_format = "%d-%b-%Y"
    date_object = datetime.strptime(date_string, date_format)
    return date_object.replace(tzinfo=utc)
from datetime import datetime, timedelta
def split_date_range(start_date_str, end_date_str, max_duration=90):
    # Convert input date strings to datetime objects
    start_date = datetime.strptime(start_date_str, "%d-%b-%Y")
    end_date = datetime.strptime(end_date_str, "%d-%b-%Y")
    date_ranges = []
    current_date = start_date
    while current_date <= end_date:
        # Calculate the end of the current sub-range
        sub_range_end = current_date + timedelta(days=max_duration - 1)
        # Make sure the sub-range end is not greater than the end_date
        if sub_range_end > end_date:
            sub_range_end = end_date
        # Append the current sub-range as a tuple to the list
        date_ranges.append((current_date, sub_range_end))
        # Move the current_date to the day after the sub-range end
        current_date = sub_range_end + timedelta(days=1)
    return date_ranges
def nav_data(start, end):
    """Put the dates in DD-Mmm-YYYY string format."""
    url = f"https://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx?&frmdt={start}&todt={end}"
    response = requests.session().get(url)
    print("Got the data from connection")
    data = response.text.split("\r\n")
    Structure = ""
    Category = ""
    Sub_Category = ""
    amc = ""
    code = int()
    name = str()
    nav = float()
    date = ""
    inv_src = ""
    dg = ""
    i = 0
    j = 1
    for lines in data[1:]:
        split = lines.split(";")
        if j == len(data) - 1:
            break
        if split[0] == "":
            # To check the Scheme [Structure, Category, Sub-Category]
            if data[j] == data[j+1]:
                sch_cat = data[j-1].split("(")
                sch_cat[-1] = sch_cat[-1][:-2].strip()
                sch_cat = [i.strip() for i in sch_cat]
                if "-" in sch_cat[1]:
                    sch_sub_cat = sch_cat[1].split("-")
                    sch_sub_cat = [i.strip() for i in sch_sub_cat]
                    sch_cat.pop(-1)
                    sch_cat = sch_cat + sch_sub_cat
                else:
                    sch_sub_cat = ["", sch_cat[1]]
                    sch_cat.pop(-1)
                    sch_cat = sch_cat + sch_sub_cat
                Structure = sch_cat[0]
                Category = sch_cat[1]
                Sub_Category = sch_cat[2]
                #print(sch_cat)
            # to check the AMC name
            elif "Mutual Fund" in data[j+1]:
                amc = data[j+1]
        elif len(split) > 1:
            code = int(split[0].strip())
            name = str(split[1].strip())
            if "growth" in name.lower():
                dg = "Growth"
            elif "idcw" in name.lower() or "dividend" in name.lower():
                dg = "IDCW"
            else:
                dg = ""
            if "direct" in name.lower():
                inv_src = "Direct"
            elif "regular" in name.lower():
                inv_src = "Regular"
            else:
                inv_src = ""
            try:
                nav = float(split[4].strip())
            except ValueError:  # some rows carry "N.A." instead of a number
                nav = split[4].strip()
            date = convert_date_to_utc_datetime(split[7].strip())
            print(type(date), date)
            existing_data = mycollection.find_one({"meta.Code": code})
            if existing_data:
                # If data with the code already exists in MongoDB, update it
                mycollection.update_one({"_id": existing_data["_id"]},
                                        {"$push": {"data": {"date": date, "nav": nav}}})
                print("Another one bites the dust")
            else:
                new_record = {
                    "meta": {
                        "Structure": Structure,
                        "Category": Category,
                        "Sub-Category": Sub_Category,
                        "AMC": amc,
                        "Code": code,
                        "Name": name,
                        "Source": inv_src,
                        "Option": dg
                    },
                    "data": [{"date": date, "nav": nav}]
                }
                mycollection.insert_one(new_record)
                print("Data data data")
        j = j + 1
    return
start_date_str = "04-Apr-2023"
end_date_str = "31-Aug-2023"
max_duration = 90
date_ranges = split_date_range(start_date_str, end_date_str, max_duration)
for start, end in date_ranges:
    print(f"Start Date: {start.strftime('%d-%b-%Y')}, End Date: {end.strftime('%d-%b-%Y')}")
    nav_data(start.strftime('%d-%b-%Y'), end.strftime('%d-%b-%Y'))
    input("press any key to confirm")
2 Answers
I can recommend two things.
1. Use a session object for your requests. Every time you make a GET request, the requests module creates a new connection, which does take time.
2. Use bulk inserts for MongoDB. You said it takes a lot of space to store the data in an array, but have you tested it? It shouldn't use much memory if it's just a dict with strings in it.
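A minimal sketch of the first suggestion, assuming the same AMFI endpoint as the question; build_url and fetch_nav are illustrative names, and the point is that the Session is created once and reused across date windows:

```python
import requests

# One shared session: requests keeps the underlying TCP connection
# alive and reuses it, instead of opening a new one per request.
session = requests.Session()

def build_url(start, end):
    # Same AMFI endpoint as in the question (dates as DD-Mmm-YYYY strings)
    return ("https://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx"
            f"?&frmdt={start}&todt={end}")

def fetch_nav(start, end):
    # Reuses the module-level session on every call
    return session.get(build_url(start, end)).text
```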
It seems like you're dealing with performance issues while downloading and storing daily NAV data from the AMFI website into MongoDB. The primary bottleneck appears to be pushing each NAV record into the database one at a time. To optimize your code and make it more efficient, you can consider the following suggestions:
Batch Insertion:
Instead of inserting data one by one, consider using MongoDB’s bulk write operations. This can significantly improve the insertion speed. You can create an array of documents and insert them in bulk. Here’s a simplified example using PyMongo in Python:
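A sketch of such batching under the question's document shape; `batched` and `store_nav` are hypothetical helper names, and the pymongo import is deferred so the pure batching helper stands on its own:

```python
def batched(rows, size=1000):
    """Yield lists of at most `size` rows at a time."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

def store_nav(collection, rows):
    # rows: (code, date, nav) tuples as produced by the parser (assumed shape)
    from pymongo import UpdateOne  # deferred so batched() is usable standalone
    for chunk in batched(rows):
        ops = [
            UpdateOne(
                {"meta.Code": code},
                {"$push": {"data": {"date": date, "nav": nav}}},
                upsert=True,  # create the scheme document if it is missing
            )
            for code, date, nav in chunk
        ]
        # One round-trip per 1000 records instead of one per record
        collection.bulk_write(ops, ordered=False)
```

`ordered=False` lets the server apply the batch without stopping at the first error, which is usually what you want for idempotent upserts.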
Indexing:
Ensure that you have appropriate indexes on your MongoDB collection. Indexing can significantly speed up read and write operations. Identify the fields that you frequently query or use for sorting, and create indexes accordingly.
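For instance, given the question's per-record lookups on meta.Code, an index on that field can be created once at startup (a sketch; ensure_indexes is an illustrative name):

```python
def ensure_indexes(collection):
    # Every record triggers a find/update on meta.Code, so an index there
    # turns each of those collection scans into an index seek.
    # (1 = ascending; unique=True also guards against duplicate scheme docs.)
    collection.create_index([("meta.Code", 1)], unique=True)
```

MongoDB treats index creation as idempotent, so calling this on every run is safe.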
Parallel Processing:
Consider using parallel processing to download and insert data concurrently. You can use Python's concurrent.futures module for this. Break down the task into smaller chunks and process them simultaneously.
Optimize Data Processing:
Review your data processing logic to identify any inefficient operations. Make sure you're using the most efficient algorithms and data structures for your specific use case.
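The parallel-processing suggestion above can be sketched with only the standard library; process_range here is a stand-in for one unit of work such as the question's nav_data:

```python
from concurrent.futures import ThreadPoolExecutor

def process_range(start, end):
    # Stand-in for one unit of work: download one date window,
    # parse it, and store it (e.g. the question's nav_data).
    return f"{start}..{end}"

date_ranges = [("01-Jan-2023", "31-Mar-2023"), ("01-Apr-2023", "29-Jun-2023")]

# Threads fit here because the work is I/O-bound (HTTP + database writes);
# map() returns results in the same order as the input ranges.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda rng: process_range(*rng), date_ranges))
```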
Use MongoDB Aggregation Framework:
If you need to transform or aggregate data before inserting it into MongoDB, consider using the MongoDB Aggregation Framework. This allows you to perform complex operations directly within the database.
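As a sketch, assuming the question's document shape ({"meta": {...}, "data": [{"date", "nav"}]}), a pipeline that computes the most recent NAV per scheme server-side might look like:

```python
# Aggregation pipeline: unwind the per-scheme NAV array, sort by date,
# and keep the last (latest) point per scheme code.
latest_nav_pipeline = [
    {"$unwind": "$data"},
    {"$sort": {"data.date": 1}},
    {"$group": {
        "_id": "$meta.Code",
        "name": {"$last": "$meta.Name"},
        "latest_date": {"$last": "$data.date"},
        "latest_nav": {"$last": "$data.nav"},
    }},
]
# collection.aggregate(latest_nav_pipeline) would run it inside MongoDB
```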
Implementing these optimizations should help improve the speed of your code. Adjustments may be needed based on the specific details of your application and data.
Disclaimer: I wanted to try ChatGPT and see how accurate it is.