skip to Main Content

I’m creating a table in Athena based on a list of CSV files in an S3 bucket. The files in the buckets are placed in folders like this:

$ aws s3 ls s3://bucket-name/ --recursive
2023-01-23 16:05:01      25601 logs2023/01/23/23/analytics_Log-1-2023-01-23-23-59-59-6dc5bd4c-f00f-4f34-9292-7bfa9ec33c55
2023-01-23 16:10:03      18182 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-05-01-aa2cb565-05c8-43e2-a203-96324f66a5a7
2023-01-23 16:15:05      20350 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-10-03-87b03989-c059-4fca-8e8b-909e787db889
2023-01-23 16:20:09      25187 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-15-06-6d9b39fb-c05f-4416-9b17-415f48e63591
2023-01-23 16:25:18      20590 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-20-16-3939a0fe-8cfb-4168-bc8e-e71d2122add5

This is the format for the folder structure:

logs{year}/{month}/{day}/{hour}/<filename>

I would like to use Athena’s partition projection and this is how I’m creating my table:

CREATE EXTERNAL TABLE analytics.logs (
  id string, 
  ...
  type tinyint)
PARTITIONED BY (
    year bigint COMMENT '',
    month string COMMENT '',
    day string COMMENT '')
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket-name/'
TBLPROPERTIES (
  'classification'='csv', 
  'partition.day.values'='01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 
  'partition.day.type'='enum', 
  'partition.enable'='true', 
  'partition.month.values'='01,02,03,04,05,06,07,08,09,10,11,12', 
  'partition.month.type'='enum', 
  'partition.year.range'='2022,2100', 
  'partition.year.type'='integer', 
  'storage.location.template'='s3://bucket-name/logs${year}/${month}/${day}/')

As you can see, I’m trying to partition the data using year, month, and day. Even though there’s also an hour folder, I’m not interested in that. This command executes just fine and it creates the table too. But when I query the table:

SELECT * FROM analytics.logs LIMIT 10;

It returns empty. But if I create the same table without the PARTITIONED part, I can see the records. Can someone please help me understand what I’m doing wrong?

[UPDATE]

I simplified the folder structure to see if it works. It does not.

$ aws s3 ls s3://bucket-name/test --recursive
2023-01-24 07:03:30          0 test/
2023-01-24 07:03:59          0 test/2022/
2023-01-24 07:11:06      13889 test/2022/Log-1-2022-12-01-00-00-11-255f8d74-5417-42a0-8c09-97282a626903
2023-01-24 07:11:05       8208 test/2022/Log-1-2022-12-01-00-05-15-c34eda24-36d8-484c-b7b6-4861c297d857

CREATE EXTERNAL TABLE `log_2`(
  `id` string, 
  ...
  `type` tinyint)
PARTITIONED BY ( 
  `year` bigint COMMENT '')
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket-name/test'
TBLPROPERTIES (
  'classification'='csv', 
  'partition.enable'='true', 
  'partition.year.range'='2021,2023', 
  'partition.year.type'='integer', 
  'storage.location.template'='s3://bucket-name/test/${year}/')

And still the following query returns nothing:

SELECT * FROM "analytics"."log_2" where year = 2022 limit 10;

2

Answers


  1. Chosen as BEST ANSWER

    For anyone else who might make my mistake, the problem is that I was (incorrectly) using partition in the TBLPROPERTIES section. While it should have been projection.

    To provide you with a working example:

    CREATE EXTERNAL TABLE `log_2`(
      id string, 
      ...
      type tinyint)
    PARTITIONED BY ( 
      `year` bigint COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://bucket-name/test'
    TBLPROPERTIES (
      'classification'='csv', 
      'projection.enable'='true', 
      'projection.year.range'='2021,2023', 
      'projection.year.type'='integer', 
      'storage.location.template'='s3://bucket-name/test/${year}/')
    

  2. You have a mismatch in data types. Partition by year is bigint and partition projection is integer. Make both integers.

    "projection.enabled" = "true",
    "projection.datehour.type" = "date",
    "projection.datehour.format" = "yyyy/MM/dd/HH",
    "projection.datehour.range" = "2021/01/01/00,NOW",
    "projection.datehour.interval" = "1",
    "projection.datehour.interval.unit" = "HOURS",
    

    Change the partition word to projection.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search