
Getting an error when trying to unload or count data from Amazon Keyspaces using DSBulk.

Error:

Operation COUNT_20221021-192729-813222 failed: Token metadata not present.

Command line:

$ dsbulk count/unload -k my_best_storage -t book_awards -f ./dsbulk_keyspaces.conf

Config:

datastax-java-driver {
  basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142" ]

  advanced.auth-provider {
    class = PlainTextAuthProvider
    username = "aw.keyspaces-at-XXX"
    password = "XXXX"
  }

  basic.load-balancing-policy {
    local-datacenter = "us-east-2"
  }

  basic.request {
    consistency = LOCAL_QUORUM
    default-idempotence = true
  }

  advanced {
    request {
      log-warnings = true
    }

    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./cassandra_truststore.jks"
      truststore-password = "XXX"
      hostname-validation = false
    }

    metadata {
      token-map.enabled = false
    }
  }
}

dsbulk load – the load operation works fine…

2 Answers


  1. I suspect the problem here is that your cluster is using the proprietary com.amazonaws.cassandra.DefaultPartitioner partitioner, which most open-source tools and drivers don’t recognise.

    The DataStax Bulk Loader (DSBulk) tool uses the Cassandra Java driver under the hood to connect to Cassandra clusters. The Java driver uses the partitioner to determine which nodes own which token ranges. Only the following Cassandra partitioners are supported:

    • Murmur3Partitioner
    • RandomPartitioner
    • ByteOrderedPartitioner

    Since the Java driver doesn’t know about DefaultPartitioner, it doesn’t have a map of token range owners (token metadata) and so can’t determine how to "split" the Cassandra ring to query the nodes.

    As you already figured out, this doesn’t affect the load command because it simply sends writes to coordinators and lets the coordinators figure out how the data is partitioned. But for the unload and count commands, which require reads, the Java driver can’t determine which coordinators to pick for sub-range queries when the partitioner is unsupported.
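
    To see what the driver reports, here is a minimal sketch using the Java driver 4.x (the same driver DSBulk embeds) that reads the partitioner from system.local; the class name is illustrative, and it assumes contact points, auth and SSL come from your driver config file:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    // Illustrative sketch: print the partitioner the cluster reports.
    // Contact points, auth and SSL are assumed to come from application.conf.
    public class CheckPartitioner {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                Row row = session.execute("SELECT partitioner FROM system.local").one();
                System.out.println("Partitioner: " + row.getString("partitioner"));
                // com.amazonaws.cassandra.DefaultPartitioner here means the driver
                // cannot build a token map, so token range reads will fail.
            }
        }
    }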

    As a workaround, you could try disabling token-awareness with:

    $ dsbulk count [...]
      --driver.advanced.metadata.token-map.enabled false
    

    but I don’t have an AWS Keyspaces cluster to test against, and I’m doubtful it will work. In any case, you’re welcome to try.

    There is an outstanding DSBulk feature request to provide the ability to completely disable token-awareness (internal ticket ID DAT-622) but it is unassigned at the time of writing so I’m not in a position to provide any expectation on when it will be prioritised. Cheers!

  2. Amazon Keyspaces now supports multiple partitioners, including Murmur3Partitioner; the partitioner update itself is shown in the sketch below. You will also want to set token-map.enabled back to true:

    metadata {
        token-map.enabled = true
    }
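
    The partitioner change itself is a CQL update to system.local (this is the mechanism described in the Amazon Keyspaces documentation); here is a minimal sketch via the Java driver, with an illustrative class name:

    import com.datastax.oss.driver.api.core.CqlSession;

    // Illustrative sketch: switch the account/region to Murmur3Partitioner.
    // The UPDATE statement is the mechanism documented by Amazon Keyspaces;
    // contact points, auth and SSL are assumed to come from application.conf.
    public class SetPartitioner {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                session.execute(
                    "UPDATE system.local SET partitioner = "
                        + "'org.apache.cassandra.dht.Murmur3Partitioner' "
                        + "WHERE key = 'local'");
                // Read it back to confirm the change took effect.
                System.out.println(session
                    .execute("SELECT partitioner FROM system.local").one()
                    .getString("partitioner"));
            }
        }
    }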
    

    Additionally, if you are using VPC endpoints, you will need the following IAM permissions so that the driver can see the available peers.

    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Sid":"ListVPCEndpoints",
             "Effect":"Allow",
             "Action":[
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeVpcEndpoints"
             ],
             "Resource":"*"
          }
       ]
    }
    

    I would also recommend increasing the connection pool size for the data load process.

    advanced.connection.pool.local.size = 3
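
    If you are driving the Java driver directly rather than through DSBulk, the same setting can be applied programmatically; a minimal sketch using the driver’s standard programmatic config loader:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
    import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

    // Illustrative sketch: set the local connection pool size in code
    // (java-driver 4.x) instead of in the HOCON config file.
    public class PooledSession {
        public static void main(String[] args) {
            DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 3)
                .build();
            try (CqlSession session =
                    CqlSession.builder().withConfigLoader(loader).build()) {
                // ... run the workload ...
            }
        }
    }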
    

    Finally, I would recommend using AWS Glue instead of DSBulk. DSBulk is a single-process tool and will not scale for larger data loads. Additionally, learning Glue will be helpful in managing other aspects of the data lifecycle. See my example on how to unload/export data using AWS Glue.
