I get an error when trying to unload or count data from AWS Keyspaces using dsbulk.
Error:
Operation COUNT_20221021-192729-813222 failed: Token metadata not present.
Command line:
$ dsbulk count/unload -k my_best_storage -t book_awards -f ./dsbulk_keyspaces.conf
Config:
datastax-java-driver {
  basic.contact-points = ["cassandra.us-east-2.amazonaws.com:9142"]
  advanced.auth-provider {
    class = PlainTextAuthProvider
    username = "aw.keyspaces-at-XXX"
    password = "XXXX"
  }
  basic.load-balancing-policy {
    local-datacenter = "us-east-2"
  }
  basic.request {
    consistency = LOCAL_QUORUM
    default-idempotence = true
  }
  advanced {
    request {
      log-warnings = true
    }
    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./cassandra_truststore.jks"
      truststore-password = "XXX"
      hostname-validation = false
    }
    metadata {
      token-map.enabled = false
    }
  }
}
The dsbulk load operation works fine.
2 Answers
I suspect the problem here is that your cluster is using the proprietary com.amazonaws.cassandra.DefaultPartitioner partitioner, which most open-source tools and drivers don’t recognise.
The DataStax Bulk Loader (DSBulk) tool uses the Cassandra Java driver under the hood to connect to Cassandra clusters. The Java driver uses the partitioner to determine which nodes own which token ranges. Only the following Cassandra partitioners are supported:
Murmur3Partitioner
RandomPartitioner
ByteOrderedPartitioner
Since the Java driver doesn’t know about DefaultPartitioner, it doesn’t have a map of token range owners (token metadata) and so can’t determine how to "split" the Cassandra ring to query the nodes.
As you already figured out, this doesn’t affect the load command because it simply sends writes to coordinators and lets the coordinators figure out how the data is partitioned. But for the unload and count commands, which require reads, the Java driver can’t determine which coordinators to pick for sub-range queries with an unsupported partitioner.
As a workaround, you could try disabling token-awareness, but I don’t have an AWS Keyspaces cluster I could test with and I’m doubtful it will work. In any case, you’re welcome to try.
There is an outstanding DSBulk feature request to provide the ability to completely disable token-awareness (internal ticket ID DAT-622) but it is unassigned at the time of writing so I’m not in a position to provide any expectation on when it will be prioritised. Cheers!
Amazon Keyspaces now supports multiple partitioners, including Murmur3Partitioner. See the following to update your partitioner. You will also want to set token-map.enabled to true.
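For illustration, the token-map setting from the question's config would change roughly as sketched below. This shows only the relevant fragment and assumes the partitioner has already been switched to Murmur3Partitioner on the Keyspaces side; the rest of dsbulk_keyspaces.conf stays as it is.

```hocon
datastax-java-driver {
  advanced.metadata {
    # Re-enable the token map so the driver can compute token ranges
    # for unload and count (the question's config had this set to false).
    token-map.enabled = true
  }
}
```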
Additionally, if you are using VPC Endpoints you will need the following permissions to make sure that you will see available peers.
I would also recommend increasing the connection pool size for the data load process.
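A minimal sketch of what that could look like in the same dsbulk_keyspaces.conf, using the Java driver 4.x option advanced.connection.pool.local.size. The value 3 is an illustrative assumption, not a tested recommendation, so tune it against your own workload.

```hocon
datastax-java-driver {
  advanced.connection.pool {
    # Connections opened per node in the local datacenter.
    # The driver default is 1; the value 3 here is only an example.
    local.size = 3
  }
}
```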
Finally, I would recommend using AWS Glue instead of DSBulk. DSBulk is a single-process tool and will not scale for larger data loads. Additionally, learning Glue will be helpful in managing other aspects of the data lifecycle. See my example on how to unload/export data using AWS Glue.