Lagom Cassandra defaults could lead to DoS
When starting a Lagom service, it will use the service locator or a static list of contact-points to connect to the Cassandra cluster. The cluster will eventually respond with a longer list of hosts, and given that list the Cassandra driver is capable of maintaining a pool of connections per host in the Cassandra cluster.
akka-persistence-cassandra is tuned to use `QUORUM` consistency level and replication factor 1 for the journal, and `ONE` consistency level for the snapshot-store.
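For orientation, here is a minimal sketch of those defaults using the akka-persistence-cassandra 0.x key names (an illustration only; verify them against the reference.conf of the version you actually run):

```
# Defaults described above, expressed with APC 0.x key names.
cassandra-journal {
  write-consistency  = "QUORUM"
  read-consistency   = "QUORUM"
  replication-factor = 1   # single replica: losing the node owning a partition breaks QUORUM there
}

cassandra-snapshot-store {
  write-consistency = "ONE"
  read-consistency  = "ONE" # losing that same node also makes its snapshots unreadable
}
```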
Given a Cassandra cluster, killing targeted nodes causes Lagom to experience two sorts of unavailability.

Scenario 1: killing a particular node in the cluster forbids the creation of entities completely, because Akka’s shard coordinator can’t restore its state:
[error] a.c.s.PersistentShardCoordinator - Persistence failure when replaying events for persistenceId [/sharding/HelloEntityCoordinator]. Last known sequence number [0]
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (1 required but only 0 alive)
...
Which particular node triggers this is unclear, but I suspect it’s the first host in the list of contact-points.
Scenario 2: killing one other node forbids creating 1/n-th of the entities (n = number of nodes), because akka-persistence-cassandra can’t read the snapshots to determine whether those entities already existed:
[error] c.l.l.i.j.p.PersistentEntityActor - Persistence failure when replaying events for persistenceId [HelloEntitybob50]. Last known sequence number [0]
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:476) ~[guava-19.0.jar:na]
When the set of dead nodes doesn’t affect the shard coordinator, only the key partitioning determines which entities can’t be created. In scenario 1, no entity can be created and some can be restored. In scenario 2, some entities can be created and some can be restored.
I had to use a 5-node cluster (`ccm create test -v 3.0.2 -n 5 -s`) to gain fine-grained control over which nodes I was killing. I kept my service locator pointing to `node2` of ccm’s cluster and I’ve tried killing `node1`, `node2` and `node5`.
I’ve used `hello` from lagom-java.g8 and invoked `curl http://localhost:9000/api/hello/bob50` with an incremental count (`bob1`, `bob2`, …) for a reasonable control. I also tested the impact of a node loss on entity passivation/restoration by creating several entities, killing one node and trying to restore all those entities once the passivation timeout had completed.
Top GitHub Comments
To address the actions suggested in https://github.com/lagom/lagom/issues/730#issuecomment-299761500:

About:

> While we work on a way to provide dev-specific vs prod-specific settings, and while finding a way to remove the defaults provided by APC, I think the defaults provided by Lagom should be production-specific with an attempt to relax them on Test and Dev.
From the docs, I think we should use the following defaults.
NOTE: the parameters `consistency-level`, `replication-strategy` and `replication-factor` are quite interconnected.

**journal**
The keyspace should have a `replication-factor` of 3. Currently APC provides a default of 1, and read and write consistency levels already default to `QUORUM`.

- In production, raise the `replication-factor` to 3.
- In Dev and Test, the `consistency-level` should be downgraded to `ONE` (no changes in `replication-factor` are required). Otherwise we could let the user overwrite the settings in `application.conf` so that a single-node external Cassandra can be used during dev mode.
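As an illustration only, a production-leaning journal configuration along these lines might look roughly like this (APC 0.x key names, which should be verified against the reference.conf of the version in use; the replication settings only take effect when APC creates the keyspace):

```
# Hypothetical production-leaning defaults for the journal keyspace
# (a sketch, not the configuration Lagom currently ships).
cassandra-journal {
  replication-strategy = "SimpleStrategy"
  replication-factor   = 3          # instead of APC's default of 1
  write-consistency    = "QUORUM"   # unchanged
  read-consistency     = "QUORUM"   # unchanged
}

# Possible Dev/Test relaxation against a single-node Cassandra,
# set by the user in application.conf:
# cassandra-journal.write-consistency = "ONE"
# cassandra-journal.read-consistency  = "ONE"
```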
**read-side**

The read-side must maintain the offset store and potentially some user-defined tables that are not directly derived from the journal, therefore it contains data and info of its own. The keyspace should have a `replication-factor` of 3. Currently Lagom provides a default of 1, and both read and write consistency levels already default to `QUORUM`.

Same actions as for the journal:

- In production, raise the `replication-factor` to 3.
- In Dev and Test, the `consistency-level` should be downgraded to `ONE` (no changes in `replication-factor` are required).
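A similar sketch for the read-side keyspace, assuming the settings live under `lagom.persistence.read-side.cassandra` as in Lagom 1.3 (the exact path and key names should be double-checked against Lagom’s reference.conf):

```
# Read-side keyspace sketch (path and key names assumed; verify them).
lagom.persistence.read-side.cassandra {
  replication-factor = 3          # instead of Lagom's default of 1
  read-consistency   = "QUORUM"   # already the default
  write-consistency  = "QUORUM"   # already the default
}
```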
**snapshot**

Snapshots can be rebuilt from the journal at any time, so we can use really relaxed consistency levels (write: `ONE`/`ANY`, read: `ONE`/`LOCAL_ONE`).

No action required, but if we want to get fancy I think we could downgrade the consistency level of the write operation from `ONE` (current) to `ANY` to decrease latency (“Provides low latency and a guarantee that a write never fails. Delivers the lowest consistency and highest availability.”). This is probably safe since, if a snapshot write fails, the snapshot can still be rebuilt from the journal. Similarly, we could downgrade the read operation from `ONE` to `LOCAL_ONE` to speed up the read and just get something from the local datacenter (even if it’s not the latest snapshot), assuming that reading an obsolete snapshot via `LOCAL_ONE` and then evolving it with the last events is faster than a round trip to another datacenter.

PS: also useful are the docs on updating the replication strategy of a live keyspace.
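If we did go for the fancier option, the snapshot-store part could look roughly like this (APC 0.x key names; purely an illustration of the trade-off described above, not a recommendation):

```
# Optional snapshot-store tuning discussed above (illustrative only).
cassandra-snapshot-store {
  write-consistency = "ANY"        # lowest latency; a lost snapshot can be rebuilt from the journal
  read-consistency  = "LOCAL_ONE"  # accept a possibly stale local snapshot plus event replay
}
```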
We can include the check in both `CassandraPersistenceComponents` and `CassandraPersistenceModule` to verify that.

I created https://github.com/ignasi35/lagom-testbed-high-replication-factor to ease these tests. `sbt runAll` (despite the increased replication-factor in journal operations) works correctly, which is a bit scary.

> Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)
I’m in favour of not providing a default value for the replication factor and requiring users to set the value for the Test, Dev and Prod environments separately. We’ve seen some users work in Dev with an external Cassandra database, so we shouldn’t assume the Dev environment is a single-node environment.
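If no default were shipped, each environment would have to state its value explicitly. A sketch of what that could look like (file names and the read-side path are illustrative; any environment-specific configuration mechanism, e.g. `-Dconfig.resource`, would do):

```
# dev.conf (single-node Cassandra)
cassandra-journal.replication-factor                      = 1
lagom.persistence.read-side.cassandra.replication-factor  = 1

# prod.conf (three or more Cassandra nodes)
cassandra-journal.replication-factor                      = 3
lagom.persistence.read-side.cassandra.replication-factor  = 3
```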
Note to self: reviewing these settings may also require reviewing the consistency levels. At the moment we use:

- `QUORUM` in the APC-inherited `cassandra-journal`
- `QUORUM` in `cassandra-query-journal` (used on Lagom’s read-side)
- `ONE` in the APC-inherited `cassandra-snapshot-store`

These values for `consistency-level` look OK. In case we document the `replication-factor` setups, or in case we require users to tune them, we must also mention which consistency levels we’re using on each keyspace and why.