I am not very advanced in coding and have been struggling with this problem. I need to extract a substring from a .txt file, but there is no clear pattern that would let me use the awk or cut commands. I need to extract the value of AF in each line in the picture below (circled in blue); however, the number of characters in this string varies from line to line, and so does its position in the line. I tried using grep, but it only returns "AF=", not the number that follows. I also thought about using re.findall in Python, but the Python environment that I have in Ubuntu isn't letting me use it.

[enter image description here][1]
I would greatly appreciate any guidance, thank you!!!
[1]: https://i.stack.imgur.com/gvkQJ.png
text file:
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=1.0000;ERATE=0.0003;LDAF=0.0005;RSQ=1.0000;SNPSOURCE=EXOME;THETA=0.0007;VT=SNP . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 281467 281468 LDAF=0.0013;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;AVGPOST=0.9998;THETA=0.0056;ERATE=0.0003;RSQ=0.9244;AC=3;AF=0.0014;EUR_AF=0.0040 . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282264 282265 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;AVGPOST=0.9997;ERATE=0.0003;LDAF=0.0005;RSQ=0.8040;SNPSOURCE=EXOME;THETA=0.0045;VT=SNP . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282752 282753 ERATE=0.0005;SNPSOURCE=LOWCOV,EXOME;AN=2184;RSQ=0.9960;LDAF=0.3734;AC=815;VT=SNP;AA=.;THETA=0.0059;AVGPOST=0.9973;AF=0.37;ASN_AF=0.15;AMR_AF=0.42;AFR_AF=0.43;EUR_AF=0.48 . +
19 282787 282788 AVGPOST=1.0000;LDAF=0.0009;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=.;THETA=0.0011;RSQ=1.0000;ER
Desired output:
AF=numbervalue
(for each line)
3 Answers
Since the example text was provided as an image rather than as text, I tested with my own example text (generated by randomly tapping the keyboard).
What I noticed is that the data is laid out like a table: the fields are separated by semicolons (;), and each value is defined as KEY=VALUE.
To get just the AF field, you can use grep with this pattern:
AF=[0-9.]+
Explanation: `[0-9.]` matches any one of the characters 0123456789. and `+` matches one or more occurrences of it.

Here is example terminal output:
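The original terminal output did not survive in this copy of the answer. As a minimal sketch of what the command might look like, assuming GNU grep (as shipped with Ubuntu), the snippet below runs the pattern over one sample line. The variable `sample_line` and the function name `extract_af` are hypothetical names introduced only for illustration, and the line is a shortened version of the data in the question.

```shell
# Hypothetical sample line, shortened from the data in the question.
sample_line='19 281438 281439 AA=.;AC=1;AF=0.0005;AN=2184;ASN_AF=0.0017;LDAF=0.0005;VT=SNP . +'

# -E enables extended regular expressions (needed for the unescaped +),
# -o prints only the matched text, one match per output line.
extract_af() {
    grep -Eo 'AF=[0-9.]+'
}

printf '%s\n' "$sample_line" | extract_af
```

On this sample the command prints three lines (AF=0.0005, AF=0.0017, AF=0.0005), because the unanchored pattern also matches the tail of ASN_AF= and LDAF=. With real data of this shape the pattern therefore needs some extra anchoring before the output can be trusted.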
Now, if you want only the numbers (without the `AF=` prefix), you can pipe the output to another grep command that keeps only the numeric part.

Grep flag explanation: `-E` enables extended regular expressions, and `-o` outputs only the match instead of the whole line.

You can use grep to match everything from `AF=` up to but not including the first semicolon:
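The command for this step is missing from this copy, so the following is only a sketch of how it could be written, assuming GNU grep. The names `simple_line` and `af_until_semicolon` are hypothetical, and the sample line is deliberately simplified so that no other key ends in AF.

```shell
# Hypothetical, simplified sample line: AF is the only key ending in "AF".
simple_line='19 281467 281468 AA=.;AC=3;AF=0.0014;AN=2184;VT=SNP . +'

# [^;]* consumes every character that is not a semicolon,
# so the match stops right before the next ';'.
af_until_semicolon() {
    grep -o 'AF=[^;]*'
}

printf '%s\n' "$simple_line" | af_until_semicolon
```

For this line the output is AF=0.0014. Note that `[^;]*` only stops at a semicolon, so a field at the very end of the semicolon-separated column would also pick up the trailing text after it.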
To guard against spurious matches when AF= appears elsewhere in a line, the following will match only when AF= begins on a word boundary:
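A possible form of that command, offered as a sketch rather than the answerer's exact text, uses `\b`, which GNU grep supports as an extension for "word boundary". Because letters, digits and the underscore all count as word characters, `\bAF=` does not match inside ASN_AF=, EUR_AF= or LDAF=. The names `mixed_line` and `af_on_boundary` are hypothetical.

```shell
# Hypothetical sample line containing several keys that end in "AF".
mixed_line='19 281467 281468 LDAF=0.0013;AN=2184;AC=3;AF=0.0014;EUR_AF=0.0040 . +'

# \b requires a word boundary before "AF", so the "AF" inside
# LDAF= and EUR_AF= (preceded by a letter or underscore) is skipped.
af_on_boundary() {
    grep -o '\bAF=[^;]*'
}

printf '%s\n' "$mixed_line" | af_on_boundary
```

On this sample only AF=0.0014 is printed, which matches the desired output in the question.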
Grep should be the best way to do it, but here is an `awk` alternative. It finds the `AF=` tag, then takes the rest of the text until the next `;`.
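The awk script itself is not preserved here, so the following is one way the described idea could be sketched, not the answerer's original code. It uses the standard awk `match()` function with `RSTART` and `RLENGTH`. As an added assumption, it requires `AF=` to be preceded by a semicolon or a space, so keys such as ASN_AF= are skipped, and it stops at the next semicolon or space. The names `awk_line` and `extract_af_awk` are hypothetical.

```shell
# Hypothetical sample line, shortened from the data in the question.
awk_line='19 282752 282753 ERATE=0.0005;LDAF=0.3734;AC=815;AF=0.37;ASN_AF=0.15;EUR_AF=0.48 . +'

# match() sets RSTART/RLENGTH to the position and length of the match.
# The match includes the leading separator, so substr() skips one character.
extract_af_awk() {
    awk 'match($0, /[; ]AF=[^; ]*/) { print substr($0, RSTART + 1, RLENGTH - 1) }'
}

printf '%s\n' "$awk_line" | extract_af_awk
```

For this line the function prints AF=0.37. Lines without a standalone AF= field produce no output, and an AF= at the very start of a line would not be found by this particular sketch.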