
I know that this has been asked before. But I have spent hours trying to get this to work.

I have a directory structure like:

- datalake
--- datasets
----- foo
------- 00001.json
------- 00002.json
------- latest.json
----- bar
------- 00001.json
------- latest.json

My include path looks like:

s3:<bucket_name>/datalake/datasets/

I want to exclude everything that is not a latest.json file.

I have tried everything under the sun.

**0*
**/0**
*/0*
*0*
**0**

and many others.

Without fail, my crawler catalogs every .json.

I am checking the results of my crawl with Athena.

Am I seriously getting the exclude pattern wrong? Or am I somehow thinking about this entire thing the wrong way and my pattern is irrelevant?

2 Answers


  1. Chosen as BEST ANSWER

    For me, the answer ended up being related to the fact that I was using Athena to look at the updated catalog. According to this:

    https://docs.aws.amazon.com/athena/latest/ug/troubleshooting-athena.html#troubleshooting-athena-data-file-issues

    Athena does not recognize the exclude patterns configured on a Glue crawler, so it still queries every file under the table's location.
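The fix suggested on that page is to keep the files you do not want queried in a different location. Here is a rough Python sketch of that idea, assuming the key layout from the question. The `datalake/latest/` destination prefix and both function names are invented for illustration, and `copy_latest` expects a boto3 S3 client supplied by the caller:

```python
def plan_latest_copies(keys, src_prefix="datalake/datasets/",
                       dst_prefix="datalake/latest/"):
    """Pair each <dataset>/latest.json key with a key under dst_prefix."""
    plan = []
    for key in keys:
        if not key.startswith(src_prefix):
            continue
        rel = key[len(src_prefix):]  # e.g. "foo/latest.json"
        if rel.endswith("/latest.json"):
            plan.append((key, dst_prefix + rel))
    return plan


def copy_latest(s3_client, bucket, plan):
    """Copy each planned object within the bucket (not run in this sketch)."""
    for src_key, dst_key in plan:
        s3_client.copy_object(
            Bucket=bucket,
            Key=dst_key,
            CopySource={"Bucket": bucket, "Key": src_key},
        )


# Keys taken from the directory listing in the question.
keys = [
    "datalake/datasets/foo/00001.json",
    "datalake/datasets/foo/00002.json",
    "datalake/datasets/foo/latest.json",
    "datalake/datasets/bar/00001.json",
    "datalake/datasets/bar/latest.json",
]
plan = plan_latest_copies(keys)
for src, dst in plan:
    print(src, "->", dst)
```

Pointing the crawler (and therefore the Athena table) at the hypothetical `datalake/latest/` prefix would then expose only the latest.json files, with no exclude pattern involved.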


  2. UPDATED

    It seems that the exclude pattern **/*[!latest.json] will not work, as the OP mentioned, because a bracket expression such as [!latest.json] matches only a single character, not the whole file name.

    1. Remove your exclude patterns
    2. Use s3:<bucket_name>/datalake/datasets/**latest.json as your overall include path
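The single-character behaviour of the bracket expression can be checked locally. The sketch below uses Python's `fnmatch` as a stand-in; this is an assumption, since Glue's glob rules are not identical (for example, `fnmatch` lets `*` cross `/`), but `[!...]` is a one-character negated set in both. The `notes.csv` key is made up for contrast:

```python
from fnmatch import fnmatchcase

# The exclude pattern discussed above.
pattern = "**/*[!latest.json]"

# Keys relative to the include path; notes.csv is a hypothetical extra file.
keys = ["foo/00001.json", "foo/latest.json", "foo/notes.csv"]

# [!latest.json] means "one character that is not l, a, t, e, s, ., j, o or n",
# so the pattern only tests the LAST character of each key.
results = {key: fnmatchcase(key, pattern) for key in keys}

for key, matched in results.items():
    print(f"{key}: {'excluded' if matched else 'kept'}")
```

Both .json keys end in `n`, which is inside the negated set, so neither one matches and the pattern cannot tell 00001.json from latest.json.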