
I want to download the daily NAV (Net Asset Value) for all schemes from the AMFI website and store it in MongoDB. With my current code it takes too long, almost 5 days, to download and push all the data into the database, because I am also restructuring the data as I go. I was hoping someone could help me optimize the code so this can be done faster.

I am aware that the part of my code taking up the most time is pushing the NAV record for each date into the database one by one. I want to group the records and push them into the DB in batches, but I suspect I would need a better laptop for that, since holding all the data in an array takes a lot of memory.

Please find my code below

#https://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx?&frmdt=14-Aug-2023&todt=16-Aug-2023

import requests
from pytz import utc
from datetime import datetime, timedelta
import pymongo  # Import the pymongo library for MongoDB operations



# Initialize MongoDB client and database
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")  # Replace with your MongoDB connection string
mydb = mongo_client["M_F"]  # Replace with your database name
mycollection = mydb["MyNAV"]  # Replace with your collection name


def convert_date_to_utc_datetime(date_string):
    date_format = "%d-%b-%Y"
    date_object = datetime.strptime(date_string, date_format)
    return date_object.replace(tzinfo=utc)

def split_date_range(start_date_str, end_date_str, max_duration=90):
    # Convert input date strings to datetime objects
    start_date = datetime.strptime(start_date_str, "%d-%b-%Y")
    end_date = datetime.strptime(end_date_str, "%d-%b-%Y")

    date_ranges = []

    current_date = start_date
    while current_date <= end_date:
        # Calculate the end of the current sub-range
        sub_range_end = current_date + timedelta(days=max_duration - 1)
        
        # Make sure the sub-range end is not greater than the end_date
        if sub_range_end > end_date:
            sub_range_end = end_date

        # Append the current sub-range as a tuple to the list
        date_ranges.append((current_date, sub_range_end))

        # Move the current_date to the day after the sub-range end
        current_date = sub_range_end + timedelta(days=1)

    return date_ranges

def nav_data(start,end):
    """Put the date in DD-Mmm-YYYY that too in a string format"""
    url = f"https://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx?&frmdt={start}&todt={end}"
    response = requests.session().get(url)
    print("Got the data form connection")
    data = response.text.split("rn")
    Structure = ""
    Category = ""
    Sub_Category = ""
    amc = ""
    code = int()
    name = str()
    nav = float()
    date = ""
    inv_src = ""
    dg = ""
    i = 0
    j = 1
    
    for lines in data[1:]:
        split = lines.split(";")
        if j == len(data)-1:
            break
        if split[0] == "":
            # To check the Scheme [Structure, Category, Sub-Category]
            if data[j] == data[j+1]:
                sch_cat = data[j-1].split("(")
                sch_cat[-1]=sch_cat[-1][:-2].strip()
                sch_cat = [i.strip() for i in sch_cat]
                if "-" in sch_cat[1]:
                    sch_sub_cat = sch_cat[1].split("-")
                    sch_sub_cat = [i.strip() for i in sch_sub_cat]
                    sch_cat.pop(-1)
                    sch_cat = sch_cat+sch_sub_cat
                else:
                    sch_sub_cat = ["",sch_cat[1]]
                    sch_cat.pop(-1)
                    sch_cat = sch_cat+sch_sub_cat
                Structure = sch_cat[0]
                Category = sch_cat[1]
                Sub_Category = sch_cat[2]
                #print(sch_cat)
            # to check the AMC name
            elif "Mutual Fund" in data[j+1]:
                amc = data[j+1]
        elif len(split)>1:
            code = int(split[0].strip())
            name = str(split[1].strip())
            if "growth" in name.lower():
                dg = "Growth"
            elif "idcw" or "dividend" in name.lower():
                dg = "IDCW"
            else:
                dg = ""

            if "direct" in name.lower():
                inv_src = "Direct"
            elif "regular" in name.lower():
                inv_src = "Regular"
            else:
                inv_src = ""

            try:
                nav = float(split[4].strip())
            except ValueError:
                # NAV is sometimes non-numeric in the report; keep the raw string in that case
                nav = split[4].strip()
           
            date = convert_date_to_utc_datetime(split[7].strip())
            print(type(date),date)
            existing_data = mycollection.find_one({"meta.Code": code})
            if existing_data:
                # If a document for this scheme code already exists in MongoDB, append the new NAV entry
                mycollection.update_one(
                    {"_id": existing_data["_id"]},
                    {"$push": {"data": {"date": date, "nav": nav}}})
                print("Another one bites the dust")
            else:
                new_record = {
                    "meta": {
                        "Structure": Structure,
                        "Category": Category, 
                        "Sub-Category": Sub_Category,
                        "AMC": amc, 
                        "Code": code, 
                        "Name": name,
                        "Source": inv_src,
                        "Option" : dg
                    },
                    "data": [{"date":date, "nav": nav }]
                }
                mycollection.insert_one(new_record)
                print("Data data data")
        j = j+1

    return

start_date_str = "04-Apr-2023"
end_date_str = "31-Aug-2023"
max_duration = 90

date_ranges = split_date_range(start_date_str, end_date_str, max_duration)
for start, end in date_ranges:
    print(f"Start Date: {start.strftime('%d-%b-%Y')}, End Date: {end.strftime('%d-%b-%Y')}")
    nav_data(start.strftime('%d-%b-%Y'),end.strftime('%d-%b-%Y'))
input("press any key to confirm")

2 Answers


  1. I can recommend two things.

    • Use a single session object for your requests and reuse it. Without one, every GET request opens a new connection, which takes time; a Session keeps connections alive and reuses them.

      def nav_data(start,end, req):
          url = f"https://..."
          response = req.get(url)
          ...
      
      
      with requests.Session() as req:
          for start, end in date_ranges:
              print(f"Start Date: {start.strftime('%d-%b-%Y')}, End Date: {end.strftime('%d-%b-%Y')}")
              nav_data(start.strftime('%d-%b-%Y'),end.strftime('%d-%b-%Y'), req)
      
    • Use bulk writes for MongoDB. You said it takes a lot of space to store the data in an array, but have you tested it? It shouldn't use much memory if it's just a list of dicts with strings in it (see the sketch below).
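
      As a rough sketch of what that batching could look like (assuming the same meta.Code key and document shape as in the question; the rows list and its field names are placeholders for whatever your parser produces), the per-row find_one/update_one pair can be replaced by one bulk_write of upserts:

      from pymongo import MongoClient, UpdateOne

      client = MongoClient("mongodb://localhost:27017/")
      collection = client["M_F"]["MyNAV"]

      # rows: dicts parsed from one downloaded report, each holding the
      # scheme code, scheme name, date and nav value (placeholder data)
      rows = [...]

      ops = []
      for row in rows:
          ops.append(UpdateOne(
              {"meta.Code": row["code"]},                      # match the scheme document
              {
                  "$push": {"data": {"date": row["date"], "nav": row["nav"]}},
                  "$setOnInsert": {"meta.Name": row["name"]},  # only written when the scheme is first created
              },
              upsert=True,                                     # create the scheme document if it is missing
          ))

      if ops:
          result = collection.bulk_write(ops, ordered=False)   # send the whole batch in one call
          print(result.bulk_api_result)

      With ordered=False the server is free to apply the operations in any order, and a single failed operation does not abort the rest of the batch.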

  2. It seems like you're dealing with performance issues while downloading and storing daily NAV data from the AMFI website into MongoDB. The primary bottleneck appears to be pushing each NAV record for each date into the database one by one. To optimize your code and make it more efficient, consider the following suggestions:

    1. Batch Insertion:
      Instead of inserting data one by one, consider using MongoDB’s bulk write operations. This can significantly improve the insertion speed. You can create an array of documents and insert them in bulk. Here’s a simplified example using PyMongo in Python:

      from pymongo import MongoClient
      
      # Connect to MongoDB
      client = MongoClient('your_mongodb_uri')
      db = client['your_database']
      collection = db['your_collection']
      
      # Your existing code to fetch NAV data
      nav_data = [...]  # Your NAV data as a list of dictionaries
      
      # Batch insert into MongoDB
      collection.insert_many(nav_data)
      
    2. Indexing:
      Ensure that you have appropriate indexes on your MongoDB collection. Indexing can significantly speed up the lookup you do before every write. Identify the fields that you frequently query or sort on (in your case, meta.Code is looked up for every row) and create an index on them.

      # Example: create an index on the field used to look up schemes
      import pymongo
      collection.create_index([('meta.Code', pymongo.ASCENDING)])
      
    3. Parallel Processing:
      Consider using parallel processing to download and insert data concurrently. You can use Python’s concurrent.futures module for this. Break down the task into smaller chunks and process them simultaneously.

      from concurrent.futures import ThreadPoolExecutor
      
      def process_chunk(chunk):
          # Your code to fetch and insert data for a chunk
          ...
      
      # Split nav_data into chunks
      chunks = [...]
      
      # Process chunks in parallel
      with ThreadPoolExecutor() as executor:
          executor.map(process_chunk, chunks)
      
    4. Optimize Data Processing:
      Review your data processing logic for inefficient operations. In your current code every NAV row costs a find_one plus an update_one, i.e. two database round trips per row; grouping the rows by scheme code in memory first and writing each group once removes most of that overhead (a sketch follows at the end of this list).

    5. Use MongoDB Aggregation Framework:
      If you need to transform or aggregate data before inserting it into MongoDB, consider using the MongoDB Aggregation Framework. This allows you to perform complex operations directly within the database.

      # Example: Use MongoDB Aggregation Framework for grouping by date
      pipeline = [
          {'$group': {'_id': '$date', 'avg_nav': {'$avg': '$nav'}}},
          {'$out': 'aggregated_collection'}
      ]
      collection.aggregate(pipeline)
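
    As a concrete illustration of point 4, here is a rough sketch (the rows argument and its field names are placeholders for whatever your parsing loop produces) of grouping one report's rows by scheme code in memory, so that each scheme gets a single write per downloaded report instead of one write per date:

      from collections import defaultdict
      from pymongo import MongoClient, UpdateOne

      collection = MongoClient("mongodb://localhost:27017/")["M_F"]["MyNAV"]

      def store_report(rows):
          """rows: iterable of dicts like {"code": ..., "date": ..., "nav": ...}."""
          # Group all NAV entries of the report by scheme code
          grouped = defaultdict(list)
          for row in rows:
              grouped[row["code"]].append({"date": row["date"], "nav": row["nav"]})

          # One upsert per scheme, appending all of its dates in a single $push
          ops = [
              UpdateOne(
                  {"meta.Code": code},
                  {"$push": {"data": {"$each": entries}}},
                  upsert=True,
              )
              for code, entries in grouped.items()
          ]
          if ops:
              collection.bulk_write(ops, ordered=False)

    Combined with an index on meta.Code, this turns thousands of single-document round trips per report into a handful of bulk calls.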
      

    Implementing these optimizations should help improve the speed of your code. Adjustments may be needed based on the specific details of your application and data.

    Disclaimer: I wanted to test how accurate ChatGPT is.
