skip to Main Content

I am trying to run the following query on an SQL table of about 2.66 billion rows, keep in mind I am using Python psycopg (but I can use something else if it would make my life easier) to eventually convert those results into a csv file:

import csv
from psycopg2 import sql
year = 2013
stmt=sql.SQLI(f"""SELECT * FROM skill TABLESAMPLE SYSTEM (1) WHERE skillclusterfamily='Science and Research' AND DATE_PART('year',jobdate)={year}""")
cur.execute(stmt)
res=cur.fetchall()
print(res)

Keep in mind that 1% is about 26.6 million rows alone, and this one singular query will take about 2 hours on my system, that is simply too long. I don’t have access to an NVIDIA GPU however all I wanted was to just query this data and put it in a CSV file so I can use it elsewhere. My ultimate goal is to write a csv file with data from years 2007-2021 using a for loop, and store that in one csv file using a randomly sampled data from the table as 26.6 million rows should be enough hence why I am only doing 1%. Is there anyway to make this kind of query faster? I cannot install extensions.

I tried systems and Bernoulli sampling, neither were any faster to be honest.

2

Answers


  1. I think, for your query to run faster, don’t use (SELECT *), just select what you need.Secondly you should count how many rows in the table first then divide your data and use multi-threading concept to reach your data faster.Thirdly create index and partitioning on columns in your condition query, in your case those are "skillclusterfamily, year".
    Example: I have 2 billion rows in my table, I would use 5 threads to read data from that table, thread 1 responsible for reading rows from 0 to 400.000.000, thread 2 responsible for reading rows from 400.000.001 to 800.000.000… and so on.

    Login or Signup to reply.
  2. Using TABLESAMPLE makes the query slow. Instead, you should create an index that supports the WHERE condition and use LIMIT to limit the number of result rows.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search