skip to Main Content

I have a large number of txt files with a single column of data. There are no headers in the files.

The data are an email address followed by a : and then a string of varchar, which sometimes includes :s.

My goal is to convert the following

[email protected]:v@rch:r$tR:ng
[email protected]::multipleTypes
[email protected]:&ofTxtGoAfT3rThe:

To a tsv with headers.

column1           column2
[email protected]   v@rch:r$tR:ng
[email protected]   :multipleTypes
[email protected]   &ofTxtGoAfT3rThe:

These files will then be uploaded into a postgres database.

Any insight/advice is greatly appreciated.

4

Answers


  1. sed can do that. Without /g, its substitution replaces the first occurrence on each line:

    sed -e 's/:/t/' *.txt > file.tsv
    
    Login or Signup to reply.
  2. In postgres you can split the text at the appropriate position and export the hole with COPY to an TSV

    this would look like

    CREATE TABLE mytab(mytext text)
    
    INSERT INTO mytab VALUES('[email protected]:v@rch:r$tR:ng'),('[email protected]::multipleTypes'),('[email protected]:&ofTxtGoAfT3rThe:')
    
    SELECT 'column1', 'column2'
    UNION ALL
    SELECT  substr(mytext,1, strpos(mytext, ':') -1) col1,
    substr(mytext,strpos(mytext, ':')+1) col2 FROM mytab
    
    ?column?        | ?column?         
    :-------------- | :----------------
    column1         | column2          
    [email protected] | v@rch:r$tR:ng    
    [email protected] | :multipleTypes   
    [email protected] | &ofTxtGoAfT3rThe:
    
    COPY (SELECT 'column1', 'column2'
    UNION ALL
    SELECT  substr(mytext,1, strpos(mytext, ':') -1) col1,
    substr(mytext,strpos(mytext, ':')+1) col2 FROM mytab) TO '/tmp/ind.tsv'  CSV  HEADER DELIMITER E't';
    
    SELECT  substr('[email protected]::multipleTypes',1, strpos('[email protected]::multipleTypes', ':') -1) col1,
    substr('[email protected]::multipleTypes',strpos('[email protected]::multipleTypes', ':')+1) col2
    
    col1            | col2          
    :-------------- | :-------------
    [email protected] | :multipleTypes
    
    SELECT  substr('[email protected]:&ofTxtGoAfT3rThe:',1, strpos('[email protected]:&ofTxtGoAfT3rThe:', ':') -1) col1,
    substr('[email protected]:&ofTxtGoAfT3rThe:',strpos('[email protected]:&ofTxtGoAfT3rThe:', ':')+1) col2
    
    col1            | col2             
    :-------------- | :----------------
    [email protected] | &ofTxtGoAfT3rThe:
    

    db<>fiddle here

    Login or Signup to reply.
  3. @choroba is correct with Sed. This can also be done in python.

    x = '[email protected]::multipleTypes'
    
    x = x.split(':', 1)
    
    print(x)
    

    This splits the text at the first :

    Login or Signup to reply.
  4. Using bash parameter expansion for splitting the lines and printf to format the output:

    printf "%-24s%sn" "column1" "column2" ; while read -r line ; do printf "%-24s%sn" "${line%%:*}" "${line#*:}" ; done <your_data.txt
    
    column1                 column2
    [email protected]         v@rch:r$tR:ng
    [email protected]         :multipleTypes
    [email protected]         &ofTxtGoAfT3rThe:
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search