Cassandra – Upgrade from Version 3.11 to 4.0

In my last blog post ‘Setup a Public Cassandra Cluster with Docker’ I described how to set up a Cassandra cluster with Docker in a public network. The important part of that post was how to secure the inter-node and client-node communication in such a scenario. In this blog post I will cover some details about migrating from version 3.11 to version 4.0.

General Upgrade from 3.x to 4.0

In general it is quite simple to upgrade a Cassandra node from version 3.x to 4.0, because version 4.0 can read the table files written by version 3. At a minimum, you need to change your docker run command so that it points to a 4.0 image:

docker run --name cassandra -d \
        -e CASSANDRA_BROADCAST_ADDRESS=<YOUR-PUBLIC-IP> \
        -e CASSANDRA_SEEDS=<COMMA SEPARATED IP LIST OF EXISTING NODES> \
        -p 7000:7000 \
        -p 9042:9042 \
        -v ~/cassandra.yaml:/etc/cassandra/cassandra.yaml \
        -v ~/cqlshrc:/root/.cassandra/cqlshrc \
        -v ~/security:/security \
        -v /var/lib/cassandra:/var/lib/cassandra \
        --restart always \
        cassandra:4.0.6

The cassandra.yaml File

Before you can start the new Cassandra node, you need to update the cassandra.yaml file.

First, I recommend starting a local Cassandra 4.0 Docker container and copying the original cassandra.yaml file from the running container. This is necessary because a lot of parameters and settings have changed from version 3.x to 4.0.

Now you can tweak the cassandra.yaml file. In parallel you can check your current cluster configuration from a running node with docker:

docker exec -it cassandra cat /etc/cassandra/cassandra.yaml

First, take care of the following parameters, which should be set to match your previous configuration:

  • cluster_name
  • num_tokens
  • authenticator
  • seed_provider
  • listen_address (usually commented out)
  • broadcast_address
  • broadcast_rpc_address
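
To avoid missing one of these settings, you can compare the old and the new file automatically. The following Python sketch is only an illustration under a few assumptions: the function names are made up for this example, the parser only understands simple unindented 'key: value' lines (it is not a real YAML parser), and the nested seed_provider block still has to be compared by hand.

```python
import re

# Top-level scalar settings from the list above. seed_provider is a nested
# block, so this simple line-based sketch does not cover it.
CARRY_OVER_KEYS = [
    "cluster_name",
    "num_tokens",
    "authenticator",
    "listen_address",
    "broadcast_address",
    "broadcast_rpc_address",
]

# Matches uncommented, unindented lines such as "num_tokens: 256".
_LINE = re.compile(r"^([A-Za-z_]+):\s*(.*?)\s*$")


def top_level_scalars(yaml_text):
    """Return {key: value} for uncommented, unindented 'key: value' lines."""
    settings = {}
    for line in yaml_text.splitlines():
        match = _LINE.match(line)
        if match and match.group(2):
            # Drop a trailing "# comment" and surrounding quotes.
            value = re.split(r"\s+#", match.group(2))[0].strip("'\"")
            settings[match.group(1)] = value
    return settings


def settings_to_review(old_yaml, new_yaml, keys=CARRY_OVER_KEYS):
    """List (key, old_value, new_value) for every key where the files disagree."""
    old, new = top_level_scalars(old_yaml), top_level_scalars(new_yaml)
    return [(k, old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)]


# Made-up file contents, just to show the call.
old_yaml = "cluster_name: 'Prod Cluster'\nnum_tokens: 256\n"
new_yaml = "cluster_name: 'Test Cluster'\nnum_tokens: 256\n"
print(settings_to_review(old_yaml, new_yaml))
# [('cluster_name', 'Prod Cluster', 'Test Cluster')]
```

In practice you would read the two texts from the old cassandra.yaml of your running node and from the fresh 4.0 file. A value of None in the result means the key is missing or commented out in that file.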

If you use the server_encryption_options as explained in my last post, you need to take care of the following sections:

...
server_encryption_options:
    #internode_encryption: none
    internode_encryption: all
    enable_legacy_ssl_storage_port: true
    keystore: /security/cassandra.keystore
    keystore_password: mypassword
    truststore: /security/cassandra.truststore
    truststore_password: mypassword

# enable or disable client/server encryption.
client_encryption_options:
    enabled: true
    optional: false
    keystore: /security/cassandra.keystore
    keystore_password: mypassword
    require_client_auth: false
....
# enable password authentication!
authenticator: PasswordAuthenticator
...

The important change is the new parameter ‘enable_legacy_ssl_storage_port’, which needs to be set to ‘true’ during the migration.

Expose Port 7000

Since version 4.0, port 7001 is deprecated. This port was used in older versions for encrypted inter-node communication. Port 7000 now handles both encrypted and unencrypted communication, so it is sufficient to expose port 7000 for inter-node communication.

But as long as your cluster contains nodes running version 3.11, you need to set the new parameter ‘enable_legacy_ssl_storage_port’ to ‘true’. As the comment in the snippet below explains, this makes your 4.0 node also open an encrypted listening socket on the legacy ssl_storage_port (7001 by default), so that it can still communicate with the older nodes.

    # When set to true, encrypted and unencrypted connections are allowed on the storage_port
    # This should _only be true_ while in unencrypted or transitional operation
    # optional defaults to true if internode_encryption is none
    # optional: true
    # If enabled, will open up an encrypted listening socket on ssl_storage_port. Should only be used
    # during upgrade to 4.0; otherwise, set to false.
    enable_legacy_ssl_storage_port: true

Note: The parameter ‘enable_legacy_ssl_storage_port’ is only needed as long as your cluster has nodes running version 3.x, which is typically only during the migration phase. After that you can set it to ‘false’ again.
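
The rule from this note can be written down as a small check. The following Python sketch is only an illustration and assumes encrypted inter-node communication as configured above. The function names and the list of version strings are assumptions made for this example (the post does not prescribe how you collect the versions of your nodes); the port numbers are the defaults 7000 and 7001 discussed in this section.

```python
STORAGE_PORT = 7000      # inter-node port used by 4.0 (encrypted and unencrypted)
SSL_STORAGE_PORT = 7001  # legacy encrypted inter-node port used by 3.x


def needs_legacy_ssl_port(node_versions):
    """Return True while at least one node still runs a release older than 4.0."""
    return any(int(version.split(".")[0]) < 4 for version in node_versions)


def internode_ports(node_versions):
    """Inter-node ports that are in use for the given mix of node versions."""
    if needs_legacy_ssl_port(node_versions):
        return [STORAGE_PORT, SSL_STORAGE_PORT]
    return [STORAGE_PORT]


# Hypothetical cluster in the middle of the migration.
print(needs_legacy_ssl_port(["3.11.14", "4.0.6"]))  # True
# Hypothetical cluster after the migration is finished.
print(internode_ports(["4.0.6", "4.0.6"]))          # [7000]
```

As soon as needs_legacy_ssl_port returns False for all your nodes, the parameter can be switched back to ‘false’.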

Once you have completed the settings, you can start the node again, now running version 4.0.6.

Java – DataStax Driver

If you have a Java client using the DataStax Java Driver to connect to your Cassandra cluster, make sure that you use the latest driver version:

<!-- DataStax Java Driver -->
<dependency>
	<groupId>com.datastax.cassandra</groupId>
	<artifactId>cassandra-driver-core</artifactId>
	<!-- for cassandra 4.0 use 3.11.3 or later -->
	<version>3.11.3</version>
	<scope>compile</scope>
</dependency>

Firewall

If you are running a firewall as explained in my last post, you need to take care of the new port settings. Once all nodes run version 4.0, port 7001 is no longer needed.
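
To verify from another machine that the firewall lets the inter-node port through, a plain TCP connection test is usually enough. Here is a minimal Python sketch; the function name port_is_open is made up for this example, and the check only tells you that something accepts TCP connections on that port, not that it is Cassandra or that the TLS handshake works.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, ...
        return False


if __name__ == "__main__":
    # Replace the placeholder with the public IP of one of your nodes.
    host = "<YOUR-PUBLIC-IP>"
    for port in (7000, 9042):
        state = "open" if port_is_open(host, port) else "closed"
        print(f"{host}:{port} is {state}")
```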

Cassandra and Docker-Swarm

Running an Apache Cassandra cluster with Docker-Swarm is quite easy using the official Docker image. Docker-Swarm allows you to set up several Docker worker nodes running on different hardware or virtual servers. Take a look at my example docker-compose.yml file:

version: "3.2"

networks:
  cluster_net:
    external:
      name: cassandra-net  
  
services:  

  ################################################################
  # The Cassandra cluster
  #   - cassandra-node1
  ################################################################        
  cassandra-001:
    image: cassandra:3.11
    environment:
      CASSANDRA_BROADCAST_ADDRESS: "cassandra-001"
    deploy:
      restart_policy:
        condition: on-failure
        max_attempts: 3
        window: 120s
      placement:
        constraints:
          - node.hostname == node-001
    volumes:
        - /mnt/cassandra:/var/lib/cassandra 
    networks:
      - cluster_net

  ################################################################
  # The Cassandra cluster
  #   - cassandra-node2
  ################################################################        
  cassandra-002:
    image: cassandra:3.11
    environment:
      CASSANDRA_BROADCAST_ADDRESS: "cassandra-002"
      CASSANDRA_SEEDS: "cassandra-001"
    deploy:
      restart_policy:
        condition: on-failure
        max_attempts: 3
        window: 120s
      placement:
        constraints:
          - node.hostname == node-002
    volumes:
        - /mnt/cassandra:/var/lib/cassandra 
    networks:
      - cluster_net

I am running each Cassandra service on a specific host within my Docker-Swarm. We cannot use the built-in scaling feature of Docker-Swarm because we need to define a separate data volume for each service. See the section ‘volumes’.

The other important part is the two environment variables ‘CASSANDRA_BROADCAST_ADDRESS’ and ‘CASSANDRA_SEEDS’.

‘CASSANDRA_BROADCAST_ADDRESS’ defines the address under which each Cassandra node is known within the Cassandra cluster. This name matches the service name. As both services run in the same network ‘cluster_net’, the two Cassandra nodes find each other via the service name.

The second environment variable ‘CASSANDRA_SEEDS’ defines the seed node, which needs to be defined for the second service only. This is necessary even though a Cassandra cluster is ‘master-less’.
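
Because every node needs its own service entry, the entries become repetitive when you add more nodes. The following Python sketch generates them from a node number. It is only an illustration: the function name cassandra_service and the naming pattern (cassandra-00N on host node-00N, first node as the only seed) are assumptions derived from my example file above.

```python
import json


def cassandra_service(index, seed="cassandra-001", image="cassandra:3.11"):
    """Build the (name, definition) pair for one Cassandra service."""
    name = f"cassandra-{index:03d}"
    # The broadcast address is the service name, so nodes find each other
    # via the service name inside the shared network.
    environment = {"CASSANDRA_BROADCAST_ADDRESS": name}
    if name != seed:
        # Every node except the seed itself needs to know the seed node.
        environment["CASSANDRA_SEEDS"] = seed
    definition = {
        "image": image,
        "environment": environment,
        "deploy": {
            "restart_policy": {
                "condition": "on-failure",
                "max_attempts": 3,
                "window": "120s",
            },
            # Pin the service to one host so it always finds its data volume.
            "placement": {"constraints": [f"node.hostname == node-{index:03d}"]},
        },
        "volumes": ["/mnt/cassandra:/var/lib/cassandra"],
        "networks": ["cluster_net"],
    }
    return name, definition


services = dict(cassandra_service(i) for i in (1, 2, 3))
print(json.dumps(services["cassandra-002"]["environment"]))
# {"CASSANDRA_BROADCAST_ADDRESS": "cassandra-002", "CASSANDRA_SEEDS": "cassandra-001"}
```

The resulting dictionary has the same shape as the ‘services’ section of the docker-compose.yml file, so it can be written out with any YAML library.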

That’s it!

Manage Big Data With Apache Cassandra

In this article, I will share my experience with Cassandra and how you can manage big data in an effective way. Apache Cassandra is a high-performance, extremely scalable, fault-tolerant (i.e., no single point of failure), distributed non-relational database solution. But Cassandra differs from SQL and RDBMS in some important aspects. If, like me, you come from the world of SQL databases, it’s hard to understand Cassandra’s data concept. It took me several weeks to do so. So let’s see what the difference is.