I have a FASTA file that looks like this:
>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC
I’d like to extract only the number that comes after the last delimiter in each header and append it to the end of that header, so that the output looks like this:
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
I’ve never used Linux before, and so far I’ve only been able to find this command, which keeps just the text that comes after the last delimiter: sed -E 's/.*_//' filename.fasta
Can anyone suggest which other commands I should look into to get my desired output?
3 Answers
Using sed:
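The command that belonged to this answer is not shown here, so the following is only a sketch of one way to do it with sed, not necessarily what was posted. It assumes every header ends in _g followed by digits, as in the samples, and that sed accepts -E (GNU and BSD sed both do). The variable names input and result are just for illustration.

```shell
# Sample input copied from the question.
input='>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC'

# On lines that start with ">" and end in _g<digits>, capture the digits as \1
# and rewrite the line as the whole match (&) followed by "_" and the digits.
# Sequence lines do not match the pattern and pass through unchanged.
result=$(printf '%s\n' "$input" | sed -E 's/^>.*_g([0-9]+)$/&_\1/')
printf '%s\n' "$result"
```

To run it on a real file instead of the inline sample, the same expression can be given the file name from the question: sed -E 's/^>.*_g([0-9]+)$/&_\1/' filename.fasta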
1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any version of awk.
2nd solution: Using GNU awk’s match function with a regex and the values of its capturing groups, please try the following.
3rd solution: Assuming the lines that start with > always have the _g separator before the number, we can also simply try the following awk code.
4th solution: If a perl one-liner is acceptable, you could simply use Perl’s capturing groups (which are populated when the regex actually matches).
You may try this sed command, which looks for > at line start and, if there is a match, matches 1+ digits at the end of the line and replaces them with a number_number substring:
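The command itself is not shown in the text above, so here is a minimal sketch that follows the description: restrict the substitution to lines starting with >, match the trailing digits, and replace them with the same digits twice, joined by an underscore. It assumes a sed that supports -E; the variable names are illustrative only.

```shell
# Sample input copied from the question.
input='>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC'

# /^>/ limits the substitution to header lines.
# [0-9]+$ matches the run of digits at the end of the line, and & in the
# replacement stands for that matched text, so &_& yields number_number.
result=$(printf '%s\n' "$input" | sed -E '/^>/ s/[0-9]+$/&_&/')
printf '%s\n' "$result"
```

Unlike a pattern that spells out _g, this version only looks at the trailing digits, so it does not depend on what precedes the number in the header.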