I am making a data pipeline using Python and am pulling a 3 GB table from our RDS instance using psycopg2. I'm currently testing this locally, and whenever I run it, my RAM monitor says my Python instance is taking up 27 GB, which shouldn't be possible since I only have 8 GB of RAM.

I'm not sure if this is relevant, but the pipeline takes the table and uses asyncio to asynchronously send POST requests to the destination.

import os

import pandas as pd
import psycopg2
from psycopg2 import sql


def query_database(
    query: str,
    query_args: tuple = None,
    dbname: str = os.environ.get('DBNAME'),
    user: str = os.environ.get('DBUSER'),
    password: str = os.environ.get('DBPASSWORD'),
    host: str = os.environ.get('LOCAL_BIND_ADDRESS'),
    port: int = int(os.environ.get('LOCAL_BIND_PORT'))
):
    with psycopg2.connect(
            dbname=dbname,
            user=user,
            password=password,
            host=host,
            port=port
        ) as conn:
        with conn.cursor() as cur:
            # Execute a query
            args = [query, query_args] if query_args else [query]
            cur.execute(*args)
            # Check if query returned rows and get them
            if cur.description:
                records = cur.fetchall()
                columns = [desc[0] for desc in cur.description]
                return pd.DataFrame(data=records, columns=columns)


query = sql.SQL("""SELECT * FROM table""")
df = query_database(query)

2 Answers


  1. The 27 GB is probably correct. What you're seeing is virtual memory, which can be backed by swap space on disk. When physical memory fills up, the OS swaps pages of virtual memory out to disk, which is what keeps the system from crashing.

    The high memory usage is probably due to the fact that the database stores the data in a compact format, while it is now being materialized as Python objects, which make no attempt to save space. Additionally, psycopg2 is known to leak memory on big queries like yours. There could also be other memory leaks in your program, such as unnecessarily copying the data.
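
    As a rough illustration of that per-object overhead (a small example; sizes are approximate and taken from a 64-bit CPython build):

    import sys

    # A value that fits in 4 or 8 bytes inside Postgres becomes a full Python
    # object with header overhead once it is fetched into a row tuple.
    print(sys.getsizeof(12345))        # ~28 bytes for a small int
    print(sys.getsizeof('some text'))  # ~58 bytes for a short ASCII string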

    I would recommend reading less data if possible, by using more selective SQL queries. Otherwise, you could read the result in smaller chunks by using fetchmany(size) instead of fetchall(), as in the sketch below.
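
    A minimal sketch of the chunked approach (the chunk size is arbitrary, and the connection settings and imports reuse the ones from the question):

    def stream_in_chunks(query, chunk_size=10_000):
        with psycopg2.connect(
                dbname=os.environ.get('DBNAME'),
                user=os.environ.get('DBUSER'),
                password=os.environ.get('DBPASSWORD'),
                host=os.environ.get('LOCAL_BIND_ADDRESS'),
                port=int(os.environ.get('LOCAL_BIND_PORT'))
        ) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                while True:
                    # fetchmany returns at most chunk_size rows per call,
                    # so you never build one giant DataFrame from the full result
                    rows = cur.fetchmany(chunk_size)
                    if not rows:
                        break
                    yield rows

    Each chunk can then be turned into a small DataFrame or posted to the destination before the next one is processed.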

  2. That seems about normal. First, libpq reads all the data into C data structures up front, and then psycopg2 copies all of that data into Python data structures, again all up front. Neither of these is the height of efficiency, especially not the Python part.

    You could "stream" the data by using a psycopg2 "named" (server-side) cursor and then using the cursor as an iterator rather than calling fetchall(), for example along the lines of the sketch below.
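
    A rough sketch of how a named cursor might look here (the cursor name and itersize value are arbitrary; connection settings and imports reuse the ones from the question):

    def stream_table(query):
        with psycopg2.connect(
                dbname=os.environ.get('DBNAME'),
                user=os.environ.get('DBUSER'),
                password=os.environ.get('DBPASSWORD'),
                host=os.environ.get('LOCAL_BIND_ADDRESS'),
                port=int(os.environ.get('LOCAL_BIND_PORT'))
        ) as conn:
            # Passing a name creates a server-side cursor: rows stay on the
            # server and are pulled over in batches of itersize while iterating.
            with conn.cursor(name='big_table_cursor') as cur:
                cur.itersize = 5000
                cur.execute(query)
                for row in cur:
                    yield row

    Only itersize rows are held on the client at any moment, so the asyncio POST step can consume the generator row by row instead of loading the whole table first.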
