
I know this has been asked before, but I have spent hours trying to get this to work.

I have a directory structure like:

- datalake
--- datasets
----- foo
------- 00001.json
------- 00002.json
------- latest.json
----- bar
------- 00001.json
------- latest.json

My include path looks like


I want to exclude everything that is not a latest.json file.

I have tried everything under the sun.


and many others.

Without fail, my crawler catalogs every .json.

I am checking the results of my crawl with Athena.

Am I seriously getting the exclude pattern wrong? Or am I somehow thinking about this entire thing the wrong way and my pattern is irrelevant?



  1. Chosen as BEST ANSWER

    For me, the answer turned out to be that I was using Athena to inspect the updated catalog. According to this:

    Athena does not respect the exclude patterns configured on a Glue crawler.


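    The exclude pattern **/*[!latest.json] will not work, as the OP found, because a bracket expression matches a single character. [!latest.json] means "one character that is not l, a, t, e, s, ., j, o, or n", not "any name other than latest.json". Every .json file name ends in n, so the pattern never matches one and nothing gets excluded. Instead: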
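    The single-character behaviour can be checked locally. The sketch below uses Python's fnmatch module as a stand-in for the crawler's glob matcher. That is an assumption: fnmatch is not what Glue runs, and its * also crosses / boundaries, but it treats [!...] the same way, as a negated set matching exactly one character. The is_excluded helper and the sample keys are hypothetical, modelled on the directory tree in the question.

```python
from fnmatch import fnmatchcase

# The exclude pattern from the question. The bracket expression
# [!latest.json] matches ONE character that is not in the set
# {l, a, t, e, s, ., j, o, n}; it is not a "not this filename" test.
EXCLUDE_PATTERN = "**/*[!latest.json]"


def is_excluded(key: str) -> bool:
    """Return True if the key (relative to the include path) matches the pattern.

    Hypothetical helper: fnmatch only approximates the crawler's glob
    rules, but the bracket-expression semantics are the same.
    """
    return fnmatchcase(key, EXCLUDE_PATTERN)


if __name__ == "__main__":
    for key in ("foo/00001.json", "foo/latest.json", "bar/data.csv"):
        print(f"{key}: excluded={is_excluded(key)}")
```

    Both .json keys come back as not excluded, because their last character (n) is in the negated set. Only a name ending in some other character, such as data.csv, matches the pattern.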

    1. Remove your exclude patterns
    2. Use s3://<bucket_name>/datalake/datasets/**latest.json as your overall include path.