I work in SEO, and sometimes I have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac I have two lists: one provided for consideration (unfiltered.txt) and another that lists the domains I've already analyzed (used.txt). The new one, provided for consideration (unfiltered.txt), looks like this:
site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc
The list of domains to be used as a filter, that is, the ones to be eliminated (used.txt), looks like this:
site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc
Is there a way to use my OS X terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that only partially solves the problem: besides the entries from used.txt, it also eliminates entries that merely contain one of them. That makes the filter too broad, so it also eliminates domains that I still need.
For example, if unfiltered.txt contains a domain named fogland.org.uk, it is automatically eliminated when used.txt contains a domain named gland.org.uk.
The files are pretty big (close to 100k lines). I have a pretty good configuration (SSD, 7th-gen i7, 16 GB RAM), but I can't let it run for hours just for this operation.
… hope it makes sense.
TIA
4 Answers
You can use comm. I haven't got a Mac here to check, but I expect it will be installed by default. Note that both files must be sorted. Then try:
comm -2 -3 unfiltered.txt used.txt
Check the man page for further details.
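Since comm needs sorted input, here is a minimal sketch of the whole sequence. The scratch directory, the tiny sample lists, and the names unfiltered.sorted, used.sorted and remaining.txt are my own placeholders, not part of the answer above; with the real files you would skip the setup lines.

```shell
# Scratch directory with small stand-ins for the real lists (assumed sample data).
cd "$(mktemp -d "${TMPDIR:-/tmp}/domains.XXXXXX")"
printf '%s\n' site1.com fogland.org.uk gland.org.uk site4.org > unfiltered.txt
printf '%s\n' site4.org gland.org.uk site7.de > used.txt

# comm requires sorted input; LC_ALL=C makes sort and comm agree on the ordering.
LC_ALL=C sort -u unfiltered.txt > unfiltered.sorted
LC_ALL=C sort -u used.txt > used.sorted

# -2 hides lines found only in used.sorted, -3 hides lines found in both,
# so only the lines unique to unfiltered.sorted remain.
LC_ALL=C comm -2 -3 unfiltered.sorted used.sorted > remaining.txt
cat remaining.txt
```

comm compares whole lines, so fogland.org.uk survives even though gland.org.uk is in the used list. The result comes out in sorted order, not in the original order of unfiltered.txt.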
You can do that with awk. You pass both files to awk. While parsing the first file, where the current record number across all files (NR) is the same as the record number in the current file (FNR), you make a note of each domain you have seen. Then, when parsing the second file, you print only the records that you have not seen in the first file.
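The command itself is not shown here, so what follows is only a sketch that matches the description, not necessarily what the answerer typed. The array name seen, the output file remaining.txt and the sample data are assumptions.

```shell
# Scratch directory with small stand-ins for the real lists (assumed sample data).
cd "$(mktemp -d "${TMPDIR:-/tmp}/domains.XXXXXX")"
printf '%s\n' site1.com fogland.org.uk gland.org.uk site4.org > unfiltered.txt
printf '%s\n' site4.org gland.org.uk site7.de > used.txt

# NR == FNR holds only while awk reads the first file (used.txt):
#   store the whole line as a key of the array "seen", then skip to the next record.
# For the second file (unfiltered.txt), print every line that is not a key of "seen".
awk 'NR == FNR { seen[$0]; next } !($0 in seen)' used.txt unfiltered.txt > remaining.txt
cat remaining.txt
```

Because the lookup is on the whole line, gland.org.uk in used.txt does not remove fogland.org.uk. Neither file has to be sorted, and the original order of unfiltered.txt is kept.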
awk is included and delivered as part of macOS, so there is no need to install anything.
I have always used […] to do this. When "expunge.txt" is too large, you can do it in stages, cutting it into manageable chunks and filtering one after another:
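The tool named in this answer did not survive in the text, so the following is a guess at the idea rather than the answerer's actual commands. It assumes grep with -v -x -F -f and split for the chunking; the chunk size, the chunk. prefix and the file remaining.txt are placeholders.

```shell
# Scratch directory with small stand-ins for the real lists (assumed sample data).
cd "$(mktemp -d "${TMPDIR:-/tmp}/domains.XXXXXX")"
printf '%s\n' site1.com fogland.org.uk gland.org.uk site4.org > unfiltered.txt
printf '%s\n' site4.org gland.org.uk site7.de site8.it > expunge.txt

# Cut the exclusion list into chunks of 2 lines each (a few thousand for real data).
split -l 2 expunge.txt chunk.

# Filter with one chunk after another; each pass works on the previous result.
cp unfiltered.txt remaining.txt
for chunk in chunk.*; do
    # -v invert the match, -x match whole lines only,
    # -F treat patterns as fixed strings, -f read the patterns from a file.
    # grep exits non-zero when nothing is left, hence the "|| true".
    grep -v -x -F -f "$chunk" remaining.txt > remaining.tmp || true
    mv remaining.tmp remaining.txt
done
cat remaining.txt
```

The -x flag is what keeps fogland.org.uk when gland.org.uk is on the exclusion list, and -F stops the dots in domain names from acting as regex wildcards.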
You could even do this in a pipe:
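Under the same assumption (grep -v -x -F -f, which this text does not confirm), a piped version might look like the sketch below. The chunk files chunk.aa and chunk.ab are hypothetical pieces of the exclusion list.

```shell
# Scratch directory with small stand-ins for the real lists (assumed sample data).
cd "$(mktemp -d "${TMPDIR:-/tmp}/domains.XXXXXX")"
printf '%s\n' site1.com fogland.org.uk gland.org.uk site4.org > unfiltered.txt
printf '%s\n' site4.org gland.org.uk > chunk.aa
printf '%s\n' site7.de site8.it > chunk.ab

# Each grep drops the exact-line matches of one chunk and passes the rest along.
grep -v -x -F -f chunk.aa unfiltered.txt | grep -v -x -F -f chunk.ab > remaining.txt
cat remaining.txt
```

This avoids the intermediate files, at the cost of one grep process per chunk.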
You can use comm and process substitution to do everything in one line.
P.S. Tested on my Mac running OS X 10.11.6 (El Capitan).
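The one-liner is missing from this answer, so here is a sketch of what comm with process substitution could look like. It is my reconstruction under assumptions, not the tested command. Process substitution is a bash/zsh feature, so the sketch calls bash explicitly; in an interactive bash or zsh session you would type only the part inside the quotes.

```shell
# Scratch directory with small stand-ins for the real lists (assumed sample data).
cd "$(mktemp -d "${TMPDIR:-/tmp}/domains.XXXXXX")"
printf '%s\n' site1.com fogland.org.uk gland.org.uk site4.org > unfiltered.txt
printf '%s\n' site4.org gland.org.uk site7.de > used.txt

# <(sort file) feeds comm a sorted copy of each list without temporary files.
# -23 suppresses lines unique to used.txt and lines common to both lists.
# LC_ALL=C is inherited by sort and comm so they use the same ordering.
LC_ALL=C bash -c 'comm -23 <(sort unfiltered.txt) <(sort used.txt)' > remaining.txt
cat remaining.txt
```

As with plain comm, the output is sorted and only exact whole-line matches are removed.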