skip to Main Content

enter image description here

I’m trying to extract just the sequence ID from the the file in Linux server.
To give you few examples TRINITY_DN0_c0_g1_i1.p1 and TRINITY_DN0_c0_g1_i3.p1 and sequence IDs. The sequence ID lengths are not same but they all start with TRINITY and ends with .p1.

I tried using awk '{print$1}' filename.cdhit > seq_id.fasta but instead I got this
enter image description here

I just want the ID but it would also extract non-interest information as well (the lengthy alphabetic protein seq).
I have attempted to create a python script in hopes to just extract the IDs:

import re

file_path = '/var2/user/de_novo/data/transdecoder_dir/Trinity.fasta.transdecoder.pep.cdhi
t'
new_file_path = '/var2/user/de_novo/data/transdecoder_dir/seqID.fasta'

with open(file_path, 'rt') as file:
    for myline in file:
        if ".p1" in file:
            with open(new_file_path, 'w') as new_file:
                new_file.write()
        else:
            print('No match found.')

Tried creating python script, running linux command
But comes out as not match found. Not sure where I went wrong.
Would appreciate with any help, thank you.

2

Answers


  1. This should do the trick:

    awk '{print$1}' filename.cdhit | grep TRINITY | cut -c2-
    

    As for python, check out https://biopython.org/. It’s a collection of bioinformatics-related goodies.

    OR

    Here a ‘.cdhit’-file reader:
    https://pypi.org/project/cdhit-reader/

    Login or Signup to reply.
  2. you can combine that awk | grep | cut into :

    mawk 'sub(".",_, $(NF = /TRINITY/))'
    
    1. using regex to set NF = 1/0, which is same as print $1 | grep

    2. when it’s false, NF is set to 0, so whole line is blanked out, thus sub() will fail and return a 0 (# instances replaced), thus skipping ineligible lines

    3. the sub() does same thing as cut

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search