
I work in SEO and sometimes I have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac, I have 2 lists: one provided for consideration – unfiltered.txt – and another listing the domains I’ve already analyzed – used.txt. The new list under consideration (unfiltered.txt) looks like this:

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc

The list of domains to be used as a filter, i.e. the ones to eliminate (used.txt), looks like this:

site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc

Is there a way to use my OS X Terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that only partially solves the problem: besides the exact entries from used.txt, it also eliminates domains that merely contain those entries as substrings. That gives me a broader filter and eliminates domains that I still need.

For example, if my unfiltered.txt contains a domain named fogland.org.uk it will be automatically eliminated if in my used.txt file I have a domain named gland.org.uk.

The files are pretty big (close to 100k lines). I have a pretty good configuration – SSD, 7th-gen i7, 16GB RAM – but I’d rather not have this operation run for hours.

… hope it makes sense.

TIA

4 Answers


  1. You can use comm. I haven’t got a Mac here to check, but I expect it is installed by default. Note that both files must be sorted. Then try:

    comm -2 -3 unfiltered.txt used.txt

    Check the man page for further details.
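A minimal sketch of the sort-then-comm workflow, using the filenames from the question with tiny made-up sample data:

```shell
# Small stand-in files for the question's real lists.
printf '%s\n' site1.com site2.com domain3.net british.co.uk > unfiltered.txt
printf '%s\n' site4.org british.co.uk site5.me > used.txt

# comm requires sorted input, so sort each list to a temporary copy first.
sort unfiltered.txt -o unfiltered.sorted.txt
sort used.txt -o used.sorted.txt

# -2 suppresses lines unique to the second file, -3 suppresses lines
# common to both, leaving only the lines unique to unfiltered.sorted.txt.
comm -2 -3 unfiltered.sorted.txt used.sorted.txt > filtered.txt
cat filtered.txt
```

Note that the result comes back in sorted order, not in the original order of unfiltered.txt.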

  2. You can do that with awk, passing both files to it. While parsing the first file (where FNR, the record number in the current file, equals NR, the record number across all files), you make a note of each domain you have seen. Then, while parsing the second file, you only print records that were not seen in the first file:

    awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt 
    

    Sample Output for your input data

    site1.com
    site2.com
    domain3.net
    british.co.uk
    england.org.uk
    auckland.co.nz
    

    awk is included and delivered as part of macOS – no need to install anything.
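One point worth stressing: awk compares entire lines, not substrings, so it avoids the over-filtering described in the question. A quick sketch with made-up data:

```shell
# gland.org.uk is in used.txt; fogland.org.uk merely contains it.
printf '%s\n' fogland.org.uk site1.com gland.org.uk > unfiltered.txt
printf '%s\n' gland.org.uk > used.txt

# Whole-line comparison: only the exact entry gland.org.uk is removed,
# while fogland.org.uk survives.
awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt
```

This also preserves the original order of unfiltered.txt, since no sorting is required.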

  3. I have always used

    grep -v -x -F -f expunge.txt filewith.txt > filewithout.txt
    

    to do this. Here -F treats the entries as fixed strings rather than regular expressions, and -x matches whole lines only, so gland.org.uk cannot eliminate fogland.org.uk. When “expunge.txt” is too large, you can do it in stages, cutting it into manageable chunks and filtering with one chunk after another:

    cp filewith.txt original.txt
    
    and loop as required:
        grep -v -x -F -f chunkNNN.txt filewith.txt > filewithout.txt
        mv filewithout.txt filewith.txt
    

    You could even do this in a pipe, letting only the first grep read the file and the rest filter standard input:

     grep -v -x -F -f chunk01.txt original.txt |
     grep -v -x -F -f chunk02.txt |
     grep -v -x -F -f chunk03.txt > purged.txt
    
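The chunked staging described above can be scripted with split; a hypothetical sketch (chunk size and filenames are made up, and the sample data is tiny):

```shell
# Tiny stand-in data; the real lists would be ~100k lines.
printf '%s\n' a.com b.com c.com d.com > filewith.txt
printf '%s\n' b.com d.com > expunge.txt

# Split the filter list into chunks (1 line each here; something like
# -l 20000 would suit real data). Produces chunk.aa, chunk.ab, ...
split -l 1 expunge.txt chunk.

# Filter with each chunk in turn, replacing the working file each pass.
for c in chunk.*; do
    grep -v -x -F -f "$c" filewith.txt > filewithout.txt
    mv filewithout.txt filewith.txt
done
cat filewith.txt
```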
  4. You can use comm and process substitution to do everything in one line. With -23, comm keeps only the lines unique to its first file, so pass unfiltered.txt first:

    comm -23 <(sort unfiltered.txt) <(sort used.txt) > unfiltered_new.txt
    

    P.S. tested on my Mac running OSX 10.11.6 (El Capitan)
