I’m trying to extract just the sequence ID from the the file in Linux server.
To give you few examples TRINITY_DN0_c0_g1_i1.p1
and TRINITY_DN0_c0_g1_i3.p1
and sequence IDs. The sequence ID lengths are not same but they all start with TRINITY
and ends with .p1
.
I tried using awk '{print$1}' filename.cdhit > seq_id.fasta
but instead I got this
I just want the ID but it would also extract non-interest information as well (the lengthy alphabetic protein seq).
I have attempted to create a python script in hopes to just extract the IDs:
import re
file_path = '/var2/user/de_novo/data/transdecoder_dir/Trinity.fasta.transdecoder.pep.cdhi
t'
new_file_path = '/var2/user/de_novo/data/transdecoder_dir/seqID.fasta'
with open(file_path, 'rt') as file:
for myline in file:
if ".p1" in file:
with open(new_file_path, 'w') as new_file:
new_file.write()
else:
print('No match found.')
Tried creating python script, running linux command
But comes out as not match found. Not sure where I went wrong.
Would appreciate with any help, thank you.
2
Answers
This should do the trick:
As for python, check out https://biopython.org/. It’s a collection of bioinformatics-related goodies.
OR
Here a ‘.cdhit’-file reader:
https://pypi.org/project/cdhit-reader/
you can combine that
awk | grep | cut
into :using
regex
to setNF = 1/0
, which is same asprint $1 | grep
when it’s false,
NF
is set to0
, so whole line is blanked out, thussub()
will fail and return a0
(# instances replaced), thus skipping ineligible linesthe
sub()
does same thing ascut