
I am not very experienced with coding and have been struggling with this problem. I need to extract a substring from a .txt file, but there is no clear pattern that would let me use awk or cut. I need to extract the value of AF in each line in the picture below (circled in blue). However, the number of characters in this string varies from line to line, and so does its position in the line. I tried using grep, but it only returns "AF=", not the number that follows. I also thought about using re.findall in Python, but the Python environment I have on Ubuntu isn't letting me use it. [enter image description here][1]

I would greatly appreciate any guidance, thank you!!!
[1]: https://i.stack.imgur.com/gvkQJ.png

text file:
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282787 282788 AVGPOST=1.0000;LDAF=0.0009;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;THETA=0.0011;RSQ=1.0000;ER

Desired output:
AF=numbervalue
(for each line)

3 Answers


  1. Since the example text was provided as an image rather than as text, here is my own example text (generated by randomly tapping the keyboard):

    AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    

    What I noticed is that it is like a table: the fields are separated by semicolons (;) and each value is defined as KEY=VALUE.

    To get just the AF field, you can use grep with this pattern: AF=[0-9.]+

    Explanation: [0-9.] matches any one of the characters 0123456789. and + matches one or more occurrences of it.

    Here is example terminal output:

    $ cat /tmp/a
    AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
    
    $ grep -o -E 'AF=[0-9.]+' /tmp/a
    AF=32435.42235
    AF=32.43542235
    AF=32435.42235
    
    

    Now if you want only the numbers (without the AF= prefix), you can pipe the output to another grep command, like this:

    $ grep -o -E 'AF=[0-9.]+' /tmp/a | grep -o -E '[0-9.]+'
    32435.42235
    32.43542235
    32435.42235
    
    

    Explanation of the grep flags: -E enables extended regular expressions, and -o outputs only the matched text instead of the whole line.
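
    One caveat when running this on the data in the question: the pattern AF=[0-9.]+ is not anchored to the start of a field, so it also matches the tail of keys such as ASN_AF= and LDAF=. The sketch below shows one possible workaround, which only accepts AF= at the start of the line or right after a semicolon or whitespace. The function name af_values and the variable sample are made up for illustration.

    ```shell
    # One line copied from the question's text file.
    sample='19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +'

    # Unanchored pattern: also reports the values of ASN_AF and LDAF.
    printf '%s\n' "$sample" | grep -o -E 'AF=[0-9.]+'

    # Hypothetical helper: keep AF= only when it starts the line or follows
    # a semicolon or whitespace, then strip everything but the trailing number.
    af_values() {
        grep -o -E '(^|[;[:space:]])AF=[0-9.]+' "$@" | grep -o -E '[0-9.]+$'
    }

    # Anchored pattern: reports only the value of the AF key itself.
    printf '%s\n' "$sample" | af_values
    ```

    On this line the first command prints three matches (from AF, ASN_AF and LDAF), while af_values prints only 0.0005. The helper also accepts file names as arguments, for example af_values /tmp/a.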

  2. You can use grep to match everything from AF= up to but not including the first semicolon:

    grep -o 'AF=[^;]*'
    

    To guard against spurious matches when AF= appears elsewhere in a line, the following will match only when AF= begins on a word boundary:

    grep -o '\bAF=[^;]*'
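
    Note that [^;]* stops only at a semicolon, so if AF= happens to be the last key in its column, the match runs on into the columns that follow. Here is a small sketch of a variant that also stops at whitespace, assuming a grep that supports -o and \b (GNU grep does). The function name extract_af and the sample line are made up for illustration.

    ```shell
    # Hypothetical helper: print each AF=value that starts on a word boundary,
    # ending the value at the first semicolon or whitespace character.
    extract_af() {
        grep -o '\bAF=[^;[:space:]]*' "$@"
    }

    # Made-up line in which AF= is the last key before the trailing columns.
    printf '%s\n' '19 100 101 AC=3;EUR_AF=0.0040;AF=0.0014 . +' | extract_af
    ```

    For this line the helper prints AF=0.0014, whereas a pattern ending in [^;]* would print AF=0.0014 . + instead.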
    
  3. grep is probably the best way to do it, but here is an awk version:

    echo "test;AF=342435.34234;yes=3434" | awk -F'AF=' '{split($2,a,";");print FS a[1]}'
    AF=342435.34234
    

    It finds the AF= tag, then takes the rest of the text up to the next ;
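
    Be aware that -F'AF=' also splits on the AF= inside LDAF=, ASN_AF= and similar keys, so on lines from the question where LDAF= comes before AF=, the one-liner above prints the LDAF value. A possible alternative, sketched here, splits each line into fields on semicolons, spaces and tabs and prints the first field that begins with AF=. The function name af_field and the variable line are made up for illustration.

    ```shell
    # Hypothetical helper: split on runs of ';', space or tab, then print the
    # first field whose key is exactly AF (that is, the field starts with "AF=").
    af_field() {
        awk -F'[; \t]+' '{ for (i = 1; i <= NF; i++) if ($i ~ /^AF=/) { print $i; break } }' "$@"
    }

    # One line copied from the question's text file (LDAF= comes before AF=).
    line='19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +'

    # Original one-liner: splits inside LDAF= and reports its value as AF.
    printf '%s\n' "$line" | awk -F'AF=' '{split($2,a,";");print FS a[1]}'

    # Field-based version: reports the AF key itself.
    printf '%s\n' "$line" | af_field
    ```

    On this line the original one-liner prints AF=0.0013 (the LDAF value), while af_field prints AF=0.0014.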
