Friday, 23 December 2016

Replacing a dead Cassandra node and resolving Host ID issues

This post describes how to replace a dead Cassandra node. I ran into this situation recently and had to work through a few issues along the way. Here I go through each of them and the resolution that worked in each case.

1. First, the normal replace procedure. Try this first. It works cleanly only in ideal situations, but it is still the right starting point.

a. Check the status of the nodes using "nodetool status". Any node that is down appears with the status "DN".

e.g.:

Assuming that you have 6 nodes, here is how it looks. The details are masked, and the host IDs and IPs are renamed.

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  1.2.3.4   7.57 GiB   256          50.2%             2ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.5    7.4 GiB    256          50.1%             9ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.6   7.82 GiB   256          51.3%             1ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.7    7.62 GiB   256          48.9%             2f71f53-9dbdaf-4r324-8697-f80b9351e7  --
DN  1.2.3.8   6.91 GiB   256          47.8%             38abaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.9    7.92 GiB   256          51.7%             7cddaar-9dbdaf-4r324-8697-f80b9351e7  --

In the above status, the node with IP 1.2.3.8 is down.
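
If you want to pick out the down nodes without reading the table by eye, a small filter over the nodetool status output is enough. This is only a sketch: list_down_nodes is a name made up for illustration, and it assumes the status code is the first column and the address the second, as in the listing above.

```shell
# Hypothetical helper: reads "nodetool status" output on stdin and prints
# the address of every node whose first column is DN (Down/Normal).
list_down_nodes() {
    awk '$1 == "DN" { print $2 }'
}

# Typical use on a live node:
#   nodetool status | list_down_nodes
```

Matching on the first column only keeps the filter independent of how the Load and Owns columns are formatted.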

Assuming we have to replace it with a new machine, here are the steps.

1) Install Cassandra on the new node, but do not start it.

2) Make sure the seed list and the rest of the configuration are correct in the new installation.

3) Start Cassandra with the following command (assuming your Cassandra installation directory is /usr/lib/cassandra).

/usr/lib/cassandra/bin/cassandra -Dcassandra.replace_address_first_boot=1.2.3.8
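
The replace option only takes effect on a node's first boot, so it can help to confirm that the data directory is still empty before running the command above. A minimal sketch, assuming a POSIX-like shell: data_dir_is_empty is a hypothetical helper, and the path in the usage comment is only an example, so substitute the directory configured in your cassandra.yaml.

```shell
# Hypothetical helper: succeeds (exit status 0) when DIR does not exist
# or exists but contains no entries, hidden files included.
data_dir_is_empty() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        return 0
    fi
    [ -z "$(ls -A "$dir")" ]
}

# Typical use before the first start (example path, adjust to your setup):
#   data_dir_is_empty /usr/lib/cassandra/data || echo "data directory is not empty"
```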

Now, the story begins.

case 1: If it starts without any problems, you are lucky. Go to all the nodes and run repair on each one. That should be it.

case 2: If you get a warning saying that it is unsafe to replace and that you should use cassandra.allow_unsafe_replace,

then: /usr/lib/cassandra/bin/cassandra -Dcassandra.replace_address_first_boot=1.2.3.8 -Dcassandra.allow_unsafe_replace=true

If it starts after that, you can still consider yourself lucky. Go ahead with repair on each node and you are done.

case 3: If it fails with the error
java.lang.RuntimeException: Host ID collision between active endpoint
it means that the cluster information received from the seeds still has the dead machine in its gossip or system information. If you get into this situation, proceed as follows.

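i) From any live node, remove the dead node by its Host ID (removenode expects the Host ID shown in nodetool status, not the IP address):

nodetool removenode 38abaar-9dbdaf-4r324-8697-f80b9351e7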

Then run the cassandra command with the replace option again, as in case 1 or case 2.
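
nodetool removenode identifies the node by its Host ID, so the ID for 1.2.3.8 has to be read from nodetool status first. A minimal sketch of that lookup: host_id_for_address is a hypothetical helper, and it assumes the Host ID is the second-to-last column, as in the listing earlier in this post.

```shell
# Hypothetical helper: reads "nodetool status" output on stdin and prints
# the Host ID of the node whose address (second column) equals $1.
host_id_for_address() {
    awk -v addr="$1" '$2 == addr { print $(NF-1) }'
}

# Typical use on a live node:
#   nodetool removenode "$(nodetool status | host_id_for_address 1.2.3.8)"
```

Counting the column from the end of the line avoids depending on how many fields the Load column splits into.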

ii) If it still fails,

go to the Cassandra data folder. Its location is configured in cassandra.yaml (data_file_directories). By default it is <cassandra_installation_directory>/data.

Check the system directory inside the data directory. It holds the system information collected from all the machines. Once it has been created with the old machine's details, you end up in this situation.

Run the following command on the new/fresh node, the one that is replacing the dead node.

P.S. Do not run this command on any existing machine. If misused, it will destroy the complete cluster information.

rm -r <data_directory>/system/*

What it means: this removes all the system table data from the new Cassandra node.
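
Because this delete is so destructive on the wrong machine, one option is to wrap it in a function that refuses to run without an explicit confirmation. This is a sketch under my own assumptions, not part of Cassandra: clear_system_tables and the confirmation word "new-node" are made up for illustration, find is assumed to support -mindepth and -maxdepth (GNU and BSD find do), and Cassandra is assumed to be stopped on the node.

```shell
# Hypothetical guard around the delete: removes everything inside
# DATA_DIR/system, but only when the second argument is "new-node".
clear_system_tables() {
    data_dir="$1"
    confirm="$2"
    if [ "$confirm" != "new-node" ]; then
        echo "refusing: pass 'new-node' to confirm this is the replacement node" >&2
        return 1
    fi
    if [ -z "$data_dir" ] || [ ! -d "$data_dir/system" ]; then
        echo "refusing: '$data_dir/system' is not a directory" >&2
        return 1
    fi
    # Delete the contents but keep the system directory itself.
    find "$data_dir/system" -mindepth 1 -maxdepth 1 -exec rm -r -- {} +
}

# Typical use on the new node only:
#   clear_system_tables <data_directory> new-node
```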

Now run the start command with the replace_address_first_boot option as above. If you encounter case 2, add allow_unsafe_replace=true as well.

The node should now join the cluster without any issues. When you check nodetool status, you should see the new node in place of the old one, with the same host ID as the old machine, like below.

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  1.2.3.4   7.57 GiB   256          50.2%             2ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.5    7.4 GiB    256          50.1%             9ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.6   7.82 GiB   256          51.3%             1ssbaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.7    7.62 GiB   256          48.9%             2f71f53-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.10   6.91 GiB   256          47.8%             38abaar-9dbdaf-4r324-8697-f80b9351e7  --
UN  1.2.3.9    7.92 GiB   256          51.7%             7cddaar-9dbdaf-4r324-8697-f80b9351e7  --


Run repair on all the machines, and you are ready to go.
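
Running repair on every machine by hand gets tedious, so here is a minimal sketch of a loop that does it one node at a time. It assumes you can reach each node over ssh and that nodetool is on the PATH there. repair_each_node and the REPAIR_RUNNER variable are hypothetical names introduced for this example.

```shell
# Hypothetical helper: runs "nodetool repair" on each host passed as an
# argument, sequentially, and stops at the first failure.
# REPAIR_RUNNER defaults to ssh; set it to "echo" for a dry run that only
# prints what would be executed.
repair_each_node() {
    runner="${REPAIR_RUNNER:-ssh}"
    for host in "$@"; do
        echo "repairing $host"
        $runner "$host" nodetool repair || return 1
    done
}

# Typical use:
#   repair_each_node 1.2.3.4 1.2.3.5 1.2.3.6 1.2.3.7 1.2.3.10 1.2.3.9
```

Running the nodes one after another keeps the repair load on the cluster predictable, at the cost of a longer total run time.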

--------------
Thank You.



