
I have a CSV file which contains the table names of a SQL Server instance running on Ubuntu, as shown below.

Tables
TableName1
TableName2
TableName3
.
.
.

I want to read this CSV file, fetch the data of each table, and store it as ${table_name}.csv.

How can I achieve this using Pentaho? I tried one approach, but I want to know if there are any built-in methods or a more efficient way of doing it. I'm new to Pentaho, so any advice is appreciated.

These are the details of the job I already tried.
[job screenshot]

  • The first Set Variables step initializes a variable for the loop.
  • The CSV reader job entry uses a bash script to read the CSV file, get the total number of lines, and store them as variables in a config.properties file:
#!/bin/bash

# CSV file path
csv_file="/home/ubuntuv2204/taskDir/tables.csv"
property_file="/home/ubuntuv2204/taskDir/dwconfig.properties"
# Get the total number of lines in the CSV file (header included)
total_rows=$(($(wc -l < "$csv_file")))

# Read line ${NEW_LOOP} of the CSV file and take the first column as the table name.
# ${NEW_LOOP} is the Kettle loop variable, substituted by the Shell job entry before this script runs.
table_name=$(sed '${NEW_LOOP}q;d' "$csv_file" | cut -d ',' -f 1)

# Check if the table_name is not empty
if [ -n "$table_name" ]; then
    # Print the table name
    echo "Table Name: $table_name"
else
    echo "Table name is empty or CSV file is not formatted correctly."
fi

# Write the variables that the downstream "set dw" transformation reads
echo "table_name=$table_name" > "$property_file"
echo "loop_break=$total_rows" >> "$property_file"
  • The next step is the loop transformation, which increments the loop value on every pass.
  • The set dw transformation reads the config.properties file and sets variables for table_name and the total number of lines.
  • rw_ktr has a Table Input step that reads the table and writes it with a text file output step (see the query sketch after this list).
  • A Simple evaluation step checks whether the loop value equals the total number of lines in the CSV; if it does, the job ends.
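
For reference, the Table Input query in rw_ktr can be as simple as the sketch below (assuming the step's "Replace variables in script?" option is checked so that ${table_name} is expanded; this is illustrative, not the exact query from my job):

-- Sketch of a Table Input query driven by the table_name variable set from dwconfig.properties
SELECT *
FROM ${table_name};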

This works as per my requirement; however, I don't think it is very good and I would like a more efficient solution.

2 Answers


  1. It is definitely possible to achieve what you want, but I would have to know the exact steps to give you a 100% answer. What I can do is point you to most of them.

    Use variables and the "Execute for every input row" option on the job execution. You can use variables almost anywhere inside the transformations, even to change database connections, instances, users and passwords. You would use those variables in a loop job, kind of like "for each database connection, do this". I have solved this exact case with this approach.
    If you have trouble with the "Execute for every input row" part, I have this link as well to help.

  2. I am assuming you are trying to use this to make some kind of export of your database?

    Have a job call a single transformation. In that transformation, use a Data Grid step which maintains all table names (this can be replaced with an SQL input which retrieves all table names directly from the database if you want to process a full database; see the sketch below). Then, inside the same transformation, add a Transformation Executor with the table name as an input variable (for example ${table}). Set the Transformation Executor to "for each row" so it loops through all the tables.
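
    As a sketch of that SQL-input alternative (the question targets SQL Server, so the catalog query below is an assumption, not part of the tested setup), a Table Input step feeding the Transformation Executor could list the user tables like this:

    -- Sketch: list all user tables in SQL Server to drive the "for each row" loop
    SELECT TABLE_SCHEMA, TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE';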

    Inside the transformation run by the Transformation Executor, retrieve the columns of the table from the database's system catalog (how to get them depends on the DB system used). Use the column names to retrieve the data, for example by building a SELECT statement with an input-variable field that is created by concatenating all columns of the table with a separator in between.

    (for example in PostgreSQL: "select ? as result from ${TABLE}", where you beforehand prepare ? = "name||','||age||','||gender" in the previous step)
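
    To illustrate how that "?" expression could be prepared in the previous step, a catalog query along these lines would build it (a PostgreSQL sketch; ${TABLE} is the same input variable as above):

    -- Sketch (PostgreSQL): build "name||','||age||','||gender" for ${TABLE} from the catalog
    SELECT string_agg(column_name::text, '||'',''||' ORDER BY ordinal_position) AS concat_expr
    FROM information_schema.columns
    WHERE table_name = '${TABLE}';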

    Afterwards, write the data, which is now in a single column "result", to the file ${table}.csv.

    This should loop over all tables, retrieve the column names of each table, load the data from the table, write it to the table-specific file, and then move on to the next table.

    I tried it on a local system, and it runs at about 20k records per second for a 10-column table (mid-range hardware).

    An alternative (and in my view better) solution requires a bit more knowledge of Pentaho. It would mean writing the whole process as a metadata injection into a generalized transformation that loads a database table and stores it as a CSV. But that requires more explanation than can quickly be given on Stack Overflow. More info here: Pentaho documentation – metadata injection
