I know that this has been asked before. But I have spent hours trying to get this to work.
I have a directory structure like:
- datalake
--- datasets
----- foo
------- 00001.json
------- 00002.json
------- latest.json
----- bar
------- 00001.json
------- latest.json
My include path looks like:
s3:<bucket_name>/datalake/datasets/
I want to exclude everything that is not latest.json.
I have tried everything under the sun.
**0*
**/0**
*/0*
*0*
**0**
and many others.
Without fail, my crawler catalogs every .json.
I am checking the results of my crawl with Athena.
Am I seriously getting the exclude pattern wrong? Or am I somehow thinking about this entire thing the wrong way and my pattern is irrelevant?
2 Answers
For me, the answer ended up being related to the fact that I was using Athena to look at the updated catalog. According to this:
https://docs.aws.amazon.com/athena/latest/ug/troubleshooting-athena.html#troubleshooting-athena-data-file-issues
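Athena does not recognize the exclude patterns configured on a Glue crawler. It reads every file under the table's S3 location, so the excluded files still show up in query results even when the crawler pattern is correct.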
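To check the patterns from the question independently of Athena, here is a rough sketch of a glob matcher. It is not Glue's actual implementation. It assumes the commonly documented rules (`*` stays within one folder level, `**` crosses folder boundaries, `?` matches one character) and that patterns are evaluated against the key relative to the include path. The function names are made up for illustration.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset into a regex (assumed semantics, no brackets or braces)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # '*' stays inside one path component
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # '?' is exactly one non-'/' character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def is_excluded(relative_key: str, pattern: str) -> bool:
    """True if the whole relative key matches the exclude pattern."""
    return glob_to_regex(pattern).fullmatch(relative_key) is not None

if __name__ == "__main__":
    keys = ["foo/00001.json", "foo/00002.json", "foo/latest.json",
            "bar/00001.json", "bar/latest.json"]
    for pattern in ["**0*", "**/0**", "*/0*", "*0*", "**0**"]:
        excluded = [k for k in keys if is_excluded(k, pattern)]
        print(pattern, "->", excluded)
```

Under these assumed rules, every pattern from the question except `*0*` matches the numbered files, and none of them match latest.json. That is consistent with the patterns being fine and Athena being the reason the files still appear.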
UPDATED
It seems that using the exclude pattern
**/*[!latest.json]
will not work, as mentioned by OP, because a bracket expression matches only a single character. Instead, try
s3:<bucket_name>/datalake/datasets/**latest.json
as your overall include path.
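The single-character point can be seen with Python's `fnmatch`, used here only as a stand-in for Glue's matcher (an assumption; the two are not identical, but both treat `[!...]` as "one character not in this set"). The helper name is hypothetical.

```python
from fnmatch import fnmatchcase

# '[!latest.json]' is a character class: one character that is NOT any of
# l, a, t, e, s, '.', j, o, n. It does not mean "not the string latest.json".
PATTERN = "*[!latest.json]"

def matches(name: str) -> bool:
    """True if the file name matches the bracket pattern."""
    return fnmatchcase(name, PATTERN)

if __name__ == "__main__":
    for name in ["00001.json", "latest.json", "data.csv"]:
        print(name, matches(name))
```

Both 00001.json and latest.json end in `n`, which is in the bracket set, so neither matches. A name such as data.csv does match, because its last character is outside the set. The pattern therefore tests only the final character, not the whole file name.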