
I am connecting to an on-prem SQL Server database using the code below, and it works as expected. I have wrapped the query I want to use in a view and call the view to read the data into the DataFrame. How do I explicitly specify a SELECT query in the spark.read options instead of reading a full table or wrapping my SELECT in a view? e.g. option("sqlcommand", "select col1, col2 from table1;")

df = (spark.read 
        .format("jdbc") 
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") 
        .option("url", "jdbc:sqlserver://our connection string")     
        .option("dbtable", "dbo.databrickstest") 
        .option("user",user) 
        .option("password", pwd) 
        .option("encrypt", "true") 
        .option("trustServerCertificate", "true") 
        .load()
)

2 Answers


  1. You should use the query option instead of the dbtable option. Use .option("query", "SELECT col1, col2 FROM table1") to specify your SQL query. Here is the sample code for your reference:

    jdbcHostname = "<serverName>"     
    jdbcPort = 1433    
    jdbcDatabase = "<dbName>"     
    jdbcUsername = "<userName>"             
    jdbcPassword = "<password>"            
    jdbcDriver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"     
     
    
    jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};databaseName={jdbcDatabase}"
    df = (spark.read
            .format("jdbc")
            .option("driver", jdbcDriver)
            .option("url", jdbcUrl)
            .option("query", "select EmployeeID, FirstName from Employees")
            .option("user", jdbcUsername)
            .option("password", jdbcPassword)
            .load()
    )
    df.show()
    

    It will run the query successfully and give the output below:

    +----------+---------+
    |EmployeeID|FirstName|
    +----------+---------+
    |         1|     John|
    |         2|     Jane|
    |         3|     Mike|
    |         4|    Emily|
    +----------+---------+
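    A related trick: the value of the dbtable option is substituted into the FROM clause of the SQL that Spark generates, so a parenthesized subquery with an alias also works as an alternative to the query option. A minimal sketch (all connection details below are placeholders):

    ```python
    # Sketch: instead of option("query", ...), a derived table can be
    # passed through dbtable. Spark inlines this value into the FROM
    # clause of the generated SQL, so the alias ("AS t") is required.
    subquery = "(SELECT col1, col2 FROM table1) AS t"

    # Placeholder connection details -- replace with your own.
    reader_options = {
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": "jdbc:sqlserver://<host>:1433;databaseName=<dbName>",
        "dbtable": subquery,  # used instead of the "query" option
        "user": "<userName>",
        "password": "<password>",
    }

    # df = spark.read.format("jdbc").options(**reader_options).load()
    ```

    Note that query and dbtable are mutually exclusive; pick one or the other.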
  2. You are reading the entire table when you use:

    .option("dbtable", "dbo.databrickstest")
    

    To avoid this, replace that line with:

    .option("query", query_you_want)
    

    Complete example:

    query = """
        SELECT
            columns
        FROM table
        WHERE condition
    """
    df = (spark.read
        .format("jdbc")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("url", "jdbc:sqlserver://our connection string")
        .option("query", query)
        .option("user", user)
        .option("password", pwd)
        .option("encrypt", "true")
        .option("trustServerCertificate", "true")
        .load()
    )
    

    Reference from the docs: https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option
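    One caveat from that options page: query cannot be combined with dbtable, and it also cannot be used together with partitionColumn. If you need a partitioned (parallel) read of a select, put the select into dbtable as a derived table instead. A sketch, with illustrative placeholder bounds:

    ```python
    # Sketch: query + partitionColumn is rejected by Spark, so for a
    # parallel read the select goes into dbtable as a derived table.
    # The bounds and partition count below are placeholders.
    partitioned_options = {
        "dbtable": "(SELECT EmployeeID, FirstName FROM Employees) AS emp",
        "partitionColumn": "EmployeeID",  # must be numeric, date, or timestamp
        "lowerBound": "1",
        "upperBound": "1000",
        "numPartitions": "4",
    }

    # df = (spark.read.format("jdbc")
    #         .option("url", jdbcUrl)
    #         .options(**partitioned_options)
    #         .option("user", jdbcUsername)
    #         .option("password", jdbcPassword)
    #         .load())
    ```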

    Thanks for sharing your question, and let me know if you need more help!
