Mac OS terminal solution to remove lines found in one text file from another text file - SEO

designarti
December 27, 2016
197 views
3 votes
4 Answers

I work in SEO and sometimes have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac I have two lists: one provided for consideration (unfiltered.txt) and another listing the domains I've already analyzed (used.txt). The new list, unfiltered.txt, looks like this:

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc

The list of domains to be used as a filter, i.e. the ones to eliminate (used.txt), looks like this:

site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc

Is there a way to use my OS X terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that only partially solves the problem: besides the exact entries in used.txt, it also eliminates any domain that contains one of those entries as a substring. That makes the filter too broad and removes domains I still need.

For example, if unfiltered.txt contains fogland.org.uk, it is eliminated automatically when used.txt contains gland.org.uk.

The files are pretty big (close to 100k lines). I have a pretty good configuration (SSD, i7 7th gen, 16GB RAM), but I'd rather not let it run for hours just for this operation.

… hope it makes sense.

TIA

Tags: macos terminal

Answers

- user133831
- December 27, 2016 at 1:40 pm
- 0 votes
You can use comm. I haven't got a Mac here to check, but I expect it is installed by default. Note that both files must be sorted. Then try:

comm -2 -3 unfiltered.txt used.txt

Check the man page for further details.
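Since comm needs sorted input, the whole thing can be sketched as below. This is only an illustration under assumptions: the sample domains, the scratch directory and the names unfiltered.sorted.txt, used.sorted.txt and filtered.txt are made up for the example, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Scratch directory with hypothetical sample data (not the asker's real files).
d=$(mktemp -d)
printf '%s\n' site1.com fogland.org.uk site5.me british.co.uk > "$d/unfiltered.txt"
printf '%s\n' site5.me gland.org.uk site4.org > "$d/used.txt"

# comm requires sorted input; LC_ALL=C keeps sort and comm on the same byte ordering.
LC_ALL=C sort -u "$d/unfiltered.txt" > "$d/unfiltered.sorted.txt"
LC_ALL=C sort -u "$d/used.txt" > "$d/used.sorted.txt"

# -2 drops lines found only in used.txt, -3 drops lines common to both files,
# so only the lines found in unfiltered.txt alone are left.
LC_ALL=C comm -2 -3 "$d/unfiltered.sorted.txt" "$d/used.sorted.txt" > "$d/filtered.txt"

cat "$d/filtered.txt"
```

comm compares whole lines, so fogland.org.uk survives even though gland.org.uk is in the filter list. The output comes out in sorted order, not in the original order of unfiltered.txt.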


- MarkSetchell
- December 29, 2016 at 9:00 pm
- 0 votes
You can do that with awk. You pass both files to awk. While it parses the first file (used.txt), the record number across all files (NR) is the same as the record number in the current file (FNR), and you make a note of each domain you have seen. Then, when it parses the second file (unfiltered.txt), you print only the records you have not seen in the first file:
```
awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt 
```
Sample output for your input data:
```
site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
```
awk is included and delivered as part of macOS – no need to install anything.
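A small sketch to check two properties of this approach: the lookup is an exact whole-line match, and the original order of unfiltered.txt is preserved without sorting. The sample domains, the scratch directory and the output name filtered.txt are assumptions made for the illustration.

```shell
# Scratch directory with hypothetical sample data.
d=$(mktemp -d)
printf '%s\n' site1.com fogland.org.uk site5.me british.co.uk > "$d/unfiltered.txt"
printf '%s\n' site5.me gland.org.uk > "$d/used.txt"

# First file (FNR==NR): remember every line. Second file: print lines never seen.
awk 'FNR==NR{seen[$0]++;next} !seen[$0]' "$d/used.txt" "$d/unfiltered.txt" > "$d/filtered.txt"

cat "$d/filtered.txt"
```

With this data the result is site1.com, fogland.org.uk and british.co.uk, in that order: site5.me is removed, while fogland.org.uk stays because the array key is the whole line, not a substring.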

- LSerni
- December 29, 2016 at 9:04 pm
- 0 votes
I have always used
```
grep -v -x -F -f expunge.txt filewith.txt > filewithout.txt
```
to do this. Here -F treats each line of expunge.txt as a fixed string and -x matches whole lines only, so gland.org.uk does not also remove fogland.org.uk. When “expunge.txt” is too large, you can do it in stages, cutting it into manageable chunks and filtering one after another:
```
cp filewith.txt original.txt

# repeat for each chunk (chunkNNN.txt):
grep -v -x -F -f chunkNNN.txt filewith.txt > filewithout.txt
mv filewithout.txt filewith.txt
```
You could even do this in a pipe, where only the first grep reads the file and the others read from the previous stage:
```
grep -v -x -F -f chunk01.txt original.txt |
grep -v -x -F -f chunk02.txt |
grep -v -x -F -f chunk03.txt > purged.txt
```
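The staged filtering can be sketched as a loop over chunks produced by split. This is an illustration under assumptions: the sample data, the chunk size of 2 lines, the chunk. prefix and the work.txt name are made up, and -x is included so that only whole lines match.

```shell
# Scratch directory with hypothetical sample data.
d=$(mktemp -d)
printf '%s\n' site1.com fogland.org.uk site5.me british.co.uk site7.de > "$d/filewith.txt"
printf '%s\n' site5.me gland.org.uk site7.de site4.org > "$d/expunge.txt"

# Cut the filter list into chunks of 2 lines each (a real run would use far larger chunks).
split -l 2 "$d/expunge.txt" "$d/chunk."

# Filter a working copy with one chunk after another.
cp "$d/filewith.txt" "$d/work.txt"
for chunk in "$d"/chunk.*; do
    # -F fixed strings, -x whole-line matches only, -v keep the non-matching lines.
    grep -v -x -F -f "$chunk" "$d/work.txt" > "$d/work.tmp"
    mv "$d/work.tmp" "$d/work.txt"
done

cat "$d/work.txt"
```

Each pass only has to hold one chunk of patterns in memory, and the original filewith.txt is left untouched because the loop works on a copy.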

- Mauro
- December 31, 2016 at 1:43 pm
- 0 votes
You can use comm and process substitution to do everything in one line. unfiltered.txt goes first, so that -23 keeps only the lines unique to it:
```
comm -23 <(sort unfiltered.txt) <(sort used.txt) > used_new.txt
```
P.S. tested on my Mac running OSX 10.11.6 (El Capitan)
