I’m trying to query a table using pandas.read_sql_query
, where I want to match multiple columns to python lists passed in as param
arguments. Running into various psycopg2
errors when trying to accomplish this.
Ideally, I would provide a reproducible example, but unfortunately, that’s not possible here due to the SQL connection requirement. If there is some way to provide a reproducible example, please let me know and I will edit the code below. Assume that the entries of col1
are strings and those of col2
are numeric values.
Note that I’m trying to ensure that each row of col1
and col2
matches the corresponding combination of list1
and list2
, so it would not be possible to do separate where
clauses for each, i.e., where col1 = any(%(list1)s) and col2 = any(%(list2)s)
.
First, I tried passing the lists as separate parameters and then combining them into an array within the SQL query:
import pandas as pd
list1 = ['a', 'b', 'c']
list2 = [1,2,3]
pd.read_sql_query(
"""
select * from table
where (col1, col2) = any(array[(%(list1)s, %(list2)s)])
""",
con = conn,
params = {'list1': list1, 'list2':list2}
)
When I try this, I get Datatype Mismatch: cannot compare dissimilar columns of type text and text[] at column 1
.
Also tried the following variant, where I passed a list of lists into param
:
pd.read_sql_query(
"""
select * from table
where (col1, col2) = any(%(arr)s)
""",
con = conn,
params = {'arr': [[x,y] for x,y in zip(list1,list2)]}
)
Here, I got DataError: (psycopg2.errors.InvalidTextRepresentation) in valid input syntax for integer: "a"
.
Tried a few other minor variants of the above, but every attempt threw some kind of error. So, what’s the syntax needed in order to accomplish this?
EDIT:
Including a reproducible example:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
n = 100
engine = create_engine(
"postgresql+psycopg2://postgres:password@localhost:5432/database"
)
np.random.seed(2022)
df = pd.DataFrame(
{
"col1": np.random.choice(list1, n, replace=True),
"col2": np.random.choice(list2, n, replace=True),
}
)
# write table to database
df.to_sql("toy_table", engine, if_exists="replace", index=False)
# query with where any
df_query = pd.read_sql_query(
"""
select * from toy_table
where col1 = any(%(list1)s) and col2=any(%(list2)s)
""",
con=engine,
params={"list1": list1, "list2": list2},
)
# expected output
rows = [(x, y) for x, y in zip(list1, list2)]
df_expected = df.loc[df.apply(lambda x: tuple(x.values) in rows, axis=1)]
# Throws assertion error
assert df_expected.equals(df_query)
2
Answers
Found a solution that works for any number of input lists:
To make a comparison of exact pairs you could convert your array to a dictionary then to JSON, taking advantage of PostgreSQL JSON functions and operators, like this:
then query request should be
Or a more general approach for n columns
the query
Tested with python 3.10.6