Extract text after last delimiter and attach at end of line [Linux/Ubuntu]

Jen
September 14, 2022

I have a FASTA file that looks like this:

>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC

I’d like to extract only the number that comes after the last delimiter and append it to the end of each header, so that the output looks like this:

>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC

I’ve never used Linux before, and so far the only command I’ve found is sed -E 's/.*_//' filename.fasta, which keeps just the text that comes after the last delimiter. Can anyone suggest which additional commands I should look into to get my desired output?

Answers

- HatLess
- September 14, 2022 at 12:44 pm
- 0 votes
Using sed
```
$ sed -E 's/.*_.([0-9]+)/&_\1/' input_file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
```
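To try this without the original file, here is a minimal sketch that recreates part of the sample input from the question in a temporary directory and applies the same expression. The file name `sample.fasta` and the variable names are assumptions made for illustration only.

```shell
# Recreate a few records from the question (hypothetical file name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fasta" <<'EOF'
>sequence_1_g1
ATTTCGGATAA
>sequence_2_g2
TGTTCTGAAAT
>sequence_3_new_g1
GCGGATAAAGC
EOF

# .*_       greedily consumes everything up to the last underscore
# .         matches the single character after it (the "g")
# ([0-9]+)  captures the digits that follow
# In the replacement, & is the whole match and \1 is the captured digits.
result=$(sed -E 's/.*_.([0-9]+)/&_\1/' "$tmpdir/sample.fasta")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

Sequence lines contain no underscore, so the pattern does not match them and they pass through unchanged.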

- RavinderSingh13
- September 14, 2022 at 12:45 pm
- 0 votes
1st solution: With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk version. It appends the last character of each header line.
```
awk '/^>/{$0=$0 "_" substr($0,length($0))} 1' Input_file
```
2nd solution: Using GNU awk's match function with a regex and the value of its capturing group, please try the following (the three-argument form of match is a GNU awk extension).
```
awk 'match($0,/^>.*([0-9]+)$/,arr){$0=$0"_"arr[1]} 1'  Input_file
```
3rd solution: Assuming the header lines (those starting with >) always contain _g right before the number, we can simply use _g as the field separator in awk.
```
awk -F'_g' '/^>/{$0=$0"_"$2} 1'  Input_file
```
4th solution: If a perl one-liner is acceptable, you can use perl's capturing groups (which are populated when the regex matches).
```
perl -pe 's/(^>.*)([0-9]+$)/$1${2}_$2/'  Input_file
```
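One caveat, as far as I can tell: the sample headers only carry single-digit group numbers, and the 1st, 2nd and 4th solutions append only the last digit (the greedy `.*` leaves a single digit for the capturing group). If headers such as `>sequence_2_g12` can occur and the full number is wanted, which is an assumption beyond the question, a minimal sketch using the two-argument `match` with `RSTART` and `RLENGTH` could look like this. The function name `append_group_number` is hypothetical.

```shell
# Hypothetical helper: append the trailing number of each FASTA header
# to that header. Reads the given files, or stdin when none are given.
append_group_number() {
    # On header lines, match() locates the run of digits at the end of
    # the line; RSTART and RLENGTH give its position and length.
    awk '/^>/ && match($0, /[0-9]+$/) { $0 = $0 "_" substr($0, RSTART, RLENGTH) } 1' "$@"
}

# Usage with a made-up two-digit header:
printf '%s\n' '>sequence_2_g12' 'AGGCTCTAGGA' | append_group_number
```

Headers that do not end in a digit are left untouched, as are the sequence lines.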

- anubhava
- September 14, 2022 at 12:48 pm
- 0 votes
You may try this sed command. It selects lines that start with >, matches the one or more digits at the end of each such line, and replaces them with number_number (& in the replacement stands for the matched text):
```
sed -E '/^>/s/[0-9]+$/&_&/' file

>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
```