Apache Cassandra Logo

Apache Cassandra

An eventually consistent database
that scales with ease

by Sabine Maennel
at PyData Zürich, 25.1.2017

Overview

"The problem is that Cassandra’s data model is different enough from that of a traditional database to readily cause confusion"

from Cassandra by example, rackspace.com

  • "map of maps"
  • "a map of maps of maps"
  • "containers that hold collections of column objects"
  • "columns ... as 3-tuples"


Some characteristics

History

  1. Facebook invented Cassandra
  2. Influenced by Google's Bigtable and Amazon's Dynamo
  3. Today it is driven by the Apache Software Foundation as an open source project
Apache logo

How I met Cassandra

coming from an RDBMS background ...

                RDBMS          Cassandra
Query language  SQL            CQL
Container       Database       Keyspace
Table           Table          Table
Fields          Column         Column
Primary key     Primary Key    Primary Key
Operations      SELECT, CRUD   SELECT, CRUD

But Cassandra is different ...

Cassandra is a multilevel map rather than a structure

How to think of Cassandra

  1. Cassandra usually runs on several servers that form a ring: a "cluster"
  2. there is no manager node
  3. the nodes talk to each other via a gossip protocol
Cassandra cluster
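
To get a feeling for how a cluster without a manager node can still share information, here is a toy gossip simulation. It is only a sketch: the function name `gossip_rounds` and the push-pull exchange are assumptions made for illustration, and Cassandra's real gossip carries much more state (heartbeats, versions, load).

```python
import random


def gossip_rounds(node_count, seed=0, max_rounds=1000):
    """Count the rounds until every node has heard about every other node."""
    rng = random.Random(seed)
    # Each node starts out knowing only itself.
    known = [{i} for i in range(node_count)]
    for round_no in range(1, max_rounds + 1):
        for node in range(node_count):
            # Pick a random peer; there is no central coordinator.
            peer = rng.choice([p for p in range(node_count) if p != node])
            # Push-pull exchange: both sides end up with the union.
            merged = known[node] | known[peer]
            known[node] = set(merged)
            known[peer] = set(merged)
        if all(len(k) == node_count for k in known):
            return round_no
    return None  # did not converge within max_rounds


rounds = gossip_rounds(node_count=6)
print("all 6 nodes know each other after", rounds, "round(s)")
```

Every node only ever talks to one peer at a time, yet the membership information spreads through the whole ring.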

A table is distributed

  1. Partitions of the table are mapped to different servers in the ring
  2. the mapping is done by a hashing algorithm
  3. each row is replicated; the number of copies is the replication factor
Cassandra table
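
The three points above can be sketched in a few lines of Python. This is an illustration under assumptions, not Cassandra's implementation: the `Ring` class and the node names are made up, MD5 stands in for the real partitioner (Cassandra's default is Murmur3), and each node owns exactly one token. Replicas are simply the next nodes clockwise on the ring, which is roughly what SimpleStrategy does.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to an integer token (MD5 here, only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor):
        self.replication_factor = replication_factor
        # Each node owns one token; sorting the tokens lays the nodes out on a ring.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """Return the nodes that store the partition for this key."""
        size = len(self.entries)
        # First node clockwise from the key's token, wrapping around at the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % size
        count = min(self.replication_factor, size)
        return [self.entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
for user in ["mary", "tom", "susan"]:
    print(user, "->", ring.replicas(user))
```

The same key always lands on the same three nodes, and no lookup table or manager node is needed to find them.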

Why replicate data?

Now we understand why a table is distributed


Imagine a query in this distributed system

  1. it does not work!
  2. but some rows are closer than others ...

partitions are clusters of rows

the primary key has two parts: the partition key and the clustering columns


rows contain maps rather than columns
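
A "map of maps" view of a table can be sketched like this. The helper names `insert` and `select` are hypothetical and the integer tweet ids stand in for real timeuuids. The outer key plays the role of the partition key, the inner key that of the clustering column.

```python
from collections import defaultdict

# table: partition key -> {clustering key -> {column name -> value}}
userline = defaultdict(dict)


def insert(table, partition_key, clustering_key, columns):
    """Upsert: create the row if it is missing, then set the given columns."""
    row = table[partition_key].setdefault(clustering_key, {})
    row.update(columns)


def select(table, partition_key):
    """Return the rows of one partition, ordered by clustering key."""
    partition = table.get(partition_key, {})
    return [(key, partition[key]) for key in sorted(partition)]


insert(userline, "tom", 2, {"body": "second tweet"})
insert(userline, "tom", 1, {"body": "first tweet"})
insert(userline, "mary", 1, {"body": "hello"})

for tweetid, row in select(userline, "tom"):
    print(tweetid, row["body"])
```

Reading everything for one partition key is cheap because those rows live together; asking for rows across all partitions is the expensive case.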

Columns in Cassandra

Columns are maps

  1. they consist of key-value pairs
  2. they come with a timestamp
  3. they may even expire
Cassandra column
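
A minimal sketch of such a column value, assuming a simplified "last write wins" rule. The `Cell` class and the abstract time ticks are made up for illustration; the real storage engine is more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: str
    timestamp: int              # write time in abstract ticks
    ttl: Optional[int] = None   # lifetime in ticks; None means "never expires"

    def is_live(self, now: int) -> bool:
        return self.ttl is None or now < self.timestamp + self.ttl


def write(row: dict, name: str, cell: Cell) -> None:
    """Keep the cell with the newest timestamp (last write wins)."""
    current = row.get(name)
    if current is None or cell.timestamp >= current.timestamp:
        row[name] = cell


def read(row: dict, name: str, now: int):
    """Return the value, or None if the cell is missing or expired."""
    cell = row.get(name)
    if cell is None or not cell.is_live(now):
        return None
    return cell.value


row = {}
write(row, "password", Cell("secret", timestamp=100))
write(row, "password", Cell("stale", timestamp=50))         # older write loses
write(row, "session", Cell("abc", timestamp=100, ttl=30))   # live until tick 130

print(read(row, "password", now=120))  # secret
print(read(row, "session", now=120))   # abc
print(read(row, "session", now=130))   # None
```

The timestamp is what lets replicas agree on a winner later, even if writes arrive in a different order on different nodes.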

remember ...

What does eventually consistent mean?

-> look at how Cassandra reads and writes

Cassandra update

Cassandra is consistent if ...

Cassandra update
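
A common rule of thumb (stated here as an assumption, it is not spelled out on the slide) is that a read sees the latest write whenever the replicas read plus the replicas written exceed the replication factor. The sketch below checks that rule by brute force; all function names are hypothetical.

```python
from itertools import combinations


def quorum(replication_factor):
    """Majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_overlap(replication_factor, write_replicas, read_replicas):
    """Rule of thumb: R + W > RF means read and write sets must overlap."""
    return read_replicas + write_replicas > replication_factor


def always_intersect(replication_factor, write_replicas, read_replicas):
    """Brute force: does every possible write set share a node with every read set?"""
    nodes = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, write_replicas)
        for read in combinations(nodes, read_replicas)
    )


rf = 3
print("QUORUM for RF=3:", quorum(rf))                  # 2
print(replicas_overlap(rf, quorum(rf), quorum(rf)))    # True
print(replicas_overlap(rf, 1, 1))                      # False
```

With one replica written and one replica read, a read can hit a node that has not seen the write yet. The data converges later, which is the "eventually" in eventually consistent.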

Let's look at an example: Twissandra

  • Twissandra (Java)
  • Run >cassandra to start Cassandra
    Run >cqlsh in a different terminal to start the CQL shell
    
            cqlsh> CREATE KEYSPACE IF NOT EXISTS twissandra 
                   WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
            
    
            cqlsh> DESCRIBE twissandra;
            CREATE KEYSPACE twissandra WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;
            

    Now we have to use that keyspace

    
            cqlsh> USE twissandra; 
            

    We are ready to create the first table:

    
            
            cqlsh> CREATE TABLE users (
                   username text PRIMARY KEY,
                   password text);
            

    Look at your table

    
            cqlsh> DESCRIBE users; 
            CREATE TABLE twissandra.users (
                username text PRIMARY KEY,
                password text
            ) WITH bloom_filter_fp_chance = 0.01
                AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
                AND comment = ''
                AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
                AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
                AND crc_check_chance = 1.0
                AND dclocal_read_repair_chance = 0.1
                AND default_time_to_live = 0
                AND gc_grace_seconds = 864000
                AND max_index_interval = 2048
                AND memtable_flush_period_in_ms = 0
                AND min_index_interval = 128
                AND read_repair_chance = 0.0
                AND speculative_retry = '99PERCENTILE';        
            

    There are a lot of defaults in place ...

    Data modelling in Cassandra: thinking in queries

    %% Example diagram
    graph LR
        l(Access) -- username --> a(user)
        a -- username --> B(following)
        a -- username --> A(followers)
        a -- username --> D(Tweets<br/>-Timeline)
        a -- username --> C(Tweets<br/>-Userline)
        style l fill:#f9f,stroke:#333,stroke-width:4px;
        classDef db fill:lightblue,stroke:#333,stroke-width:1px;
        class a,A,B,C,D db

    Following and Followers

            
            cqlsh> 
                -- "username" follows "followed"
                CREATE TABLE following (
                    username text,
                    followed text,
                    PRIMARY KEY(username, followed)
                );
                -- "username" is followed by "following"
                CREATE TABLE followers (
                    username  text,
                    following text,
                    PRIMARY KEY(username, following)
                );
            

    Tweets and Userline

            
                CREATE TABLE tweets (
                    tweetid  uuid PRIMARY KEY,
                    username text,
                    body     text
                );
                CREATE TABLE userline (
                    tweetid  timeuuid,
                    username text,
                    body     text,
                    PRIMARY KEY(username, tweetid)
                );
            

    Timeline

            
                CREATE TABLE timeline (
                    username  text,
                    tweetid   timeuuid,
                    posted_by text,
                    body      text,
                    PRIMARY KEY(username, tweetid)
                );
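
    "Thinking in queries" means the same tweet is written several times, once per table that has to answer a query. The sketch below mimics that with plain Python dicts instead of a real cluster. It is an assumption about how an application could fill these tables, not code from Twissandra, and the helper names are made up.

```python
from collections import defaultdict
import uuid

followers = defaultdict(set)   # username -> users who follow them
tweets = {}                    # tweetid -> (username, body)
userline = defaultdict(list)   # username -> tweets written by that user
timeline = defaultdict(list)   # username -> tweets by the users they follow


def follow(username, followed):
    """Record that "username" follows "followed"."""
    followers[followed].add(username)


def post_tweet(username, body):
    """Write one tweet into every table that needs it (fan-out on write)."""
    tweetid = uuid.uuid1()  # time-based UUID, like a CQL timeuuid
    tweets[tweetid] = (username, body)
    userline[username].append((tweetid, body))
    for follower in followers[username]:
        timeline[follower].append((tweetid, username, body))
    return tweetid


for fan in ["mary", "susan", "john"]:
    follow(fan, "tom")
tid = post_tweet("tom", "hello pydata")
print(sorted(timeline))  # ['john', 'mary', 'susan']
```

    Writes get more expensive, but every read is a lookup by username in exactly one table.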
            

    Live Demo on a virtual machine

    Twissandra sample data

    %% Example diagram
    graph LR
        a(Mary) --> b(Tom)
        c(Susan) --> b
        d(John) --> b
        b --> d
        b --> e(Angelina)
        b --> f(Sally)
        style b fill:orange,stroke-width:1px;
        class d fill:bluegreen,stroke-width:1px;
        classDef follow fill:lightblue,stroke-width:1px;
        classDef following fill:lightgreen,stroke-width:1px;
        class a,c follow
        class e,f following

    Useful links and some companies that use Cassandra