I have a FASTA file that looks like this:

>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC

I’d like to extract only the number that comes after the last delimiter and append it to the end of each header, so that the output looks like this:

>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC

I’ve never used Linux before, and so far I’ve only found this command, which keeps just the text after the last delimiter: sed -E 's/.*_//' filename.fasta. Can anyone suggest which additional commands I should look into to get my desired output?

3 Answers


  1. Using sed

    $ sed -E 's/.*_.([0-9]+)/&_\1/' input_file
    >sequence_1_g1_1
    ATTTCGGATAA
    >sequence_2_g1_1
    AGGCTCTAGGA
    >sequence_2_g2_2
    TGTTCTGAAAT
    >sequence_2_g3_3
    CACCTCGGAGT
    >sequence_3_new_g1_1
    GCGGATAAAGC
    
  2. 1st solution: With your shown samples, please try the following awk code, which appends the last character of each header line. Written and tested in GNU awk; it should work in any awk version.

    awk '/^>/{$0=$0 "_" substr($0,length($0))} 1' Input_file
    

    2nd solution: Using GNU awk's match function with a regex and a capturing group, please try the following.

    awk 'match($0,/^>.*([0-9]+)$/,arr){$0=$0"_"arr[1]} 1'  Input_file
    

    3rd solution: Assuming the header lines (those starting with >) always contain _g as a separator, we can also simply try the following awk code.

    awk -F'_g' '/^>/{$0=$0"_"$2} 1'  Input_file
    

    4th solution: If a perl one-liner is acceptable, you could simply use perl's capturing groups (which are set when the regex matches).

    perl -pe 's/(^>.*)([0-9]+$)/\1\2_\2/'  Input_file
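
    A caveat on the 1st, 2nd and 4th solutions: they pick up a single trailing digit, which is enough for the shown samples but would turn a header ending in g12 into g12_2. If your real headers can end in multi-digit numbers, a minimal sketch in portable awk could look like the following. The function name append_trailing_number is only illustrative, and it assumes every header ends in the number you want to repeat.

    ```shell
    # Hypothetical helper: reads FASTA on stdin, writes the result to stdout.
    append_trailing_number() {
        awk '
            # Header lines only: match() locates the run of digits at the end
            # of the line and sets RSTART to the position where that run begins.
            /^>/ && match($0, /[0-9]+$/) {
                $0 = $0 "_" substr($0, RSTART)
            }
            # Print every line, modified or not.
            { print }
        '
    }

    # Example call with a two-digit suffix.
    printf '>sequence_2_g12\nACGT\n' | append_trailing_number
    ```

    To run it on a file, use append_trailing_number < Input_file. Headers that do not end in a digit are printed unchanged.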
    
  3. You may try this sed command, which looks for > at the start of a line and, on such lines, matches one or more digits at the end and replaces them with number_number:

    sed -E '/^>/s/[0-9]+$/&_&/' file
    
    >sequence_1_g1_1
    ATTTCGGATAA
    >sequence_2_g1_1
    AGGCTCTAGGA
    >sequence_2_g2_2
    TGTTCTGAAAT
    >sequence_2_g3_3
    CACCTCGGAGT
    >sequence_3_new_g1_1
    GCGGATAAAGC
    