AWS Kinesis
AWS Kinesis
- Kinesis is a managed alternative to Apache Kafka
- It is a big data stream tool, which allows to stream application logs, metrics, IoT data, click streams, etc.
- Compatible with many streaming frameworks (Spark, NiFi, etc.)
- Data is automatically replicated to 3 AZ
- Kineses offers 3 types of products:
- Kinesis Streams: low latency streaming ingest at scale
- Kinesis Analytics: perform real-time analytics on streams using SQL
- Kinesis Firehose: load streams into S3, Redshift, ElasticSearch
Kinesis Streams Overview
- Streams are divided in ordered Shards/Partitions
- For higher throughput we can increase the size of the shards
- Data retention is 1 day by default, can go up to 7 days
- Kinesis Streams provides the ability to reprocess/replay the data
- Multiple applications can consume the same stream, this enables real-time processing with scale of throughput
- Kinesis is not a database, once the data is inserted, it can not be deleted
Kinesis Stream Shards
- One stream is made of many different shards
- 1MB/s or 1000 messages at write PER SHARD
- 2MB/s read PER SHARD
- Billing is done per shard provisioned, we can have a many shards as we want as long as we accept the cost
- Ability to batch the messages per calls
- The number of shards can evolve over time (reshard/merge)
- Records are ordered per shard!
Kinesis Streams API - Put Records
- Data must be sent form the PutRecords API to a partition key
- Data with the same key goes to the same partition (helps with ordering for a specific key)
- Messages sent get a sequence number
- Partition key must be highly distributed in order to avoid hot partitions
- In order to reduce costs, we can use batching with PutRecords API
- It the limits are reached, we get a ProvisionedThroughputException
Exceptions
- ProvisionedThroughputException Exceptions:
- Happens when the data value exceeds the limit exposed by the shard
- In order to avoid the, we have to make sure we don’t have hot partitions
- Solutions:
- Retry with back-off
- Increase shards (scaling)
- Ensure the partition key is good
Consumers
- Consumers can use CLI or SDK, or the Kinesis Client Library (in Java, Node, Python, Ruby, .Net)
- Kinesis Client Library (KCL) uses DynamoDB to checkpoint offsets
- KCL uses DynamoDB to track other workers and share work amongst shards
Security
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Possibility to encrypt/decrypt data client side
- VPC Endpoints available for Kinesis to be access within VPCs
Kinesis Data Firehose
- Fully managed service, no administration required, provides automatic scaling, it is basically serverless
- Used for load data into Redshift, S3, ElasticSearch and Splunk
- It is Near Real Time: 60 seconds latency minimum for non full batches or minimum 32 MB of data at a time
- Supports many data formats, conversions, transformation and compression
- Pay for the amount of data going through Firehose
Kinesis Data Streams vs Firehose
- Streams:
- Requires to write custom code (producer/consumer)
- Real time (~200 ms)
- Must manage scaling (shard splitting / merging)
- Can store data into stream, data can be stored from 1 to 7 days
- Data can be read by multiple consumers
- Firehose:
- Fully managed, sends data to S3, Redshift, Splunk, ElasticSearch
- Serverless, data transformation can be done with Lambda
- Near real time
- Scales automatically
- It provides no data storage
Kinesis Data Analytics
- Can take data from Kinesis Data Streams and Kinesis Firehose and perform some queries on it
- It can perform real-time analytics using SQL
- Kinesis Data Analytics properties:
- Automatically scales
- Managed: no servers to provision
- Continuous: analytics are done in real time
- Pricing: pay per consumption rate
- It can create streams out of real-time queries
Data Ordering with Kinesis
- Data with the same partition key goes to the same shard
- Data is ordered per shard
SQS vs SNS vs Kinesis
- SQS:
- Consumers pull data
- Data is deleted after being consumed
- Can have many consumers as we want
- No need to provision throughput
- No ordering guarantee in case of standard queues
- Capability to delay individual messages
- SNS:
- Pub/Sub: publish data to many subscribers
- We can have up to 10 million subscribers per topic
- Data is not persisted (it is lost if not delivered)
- Up to 10k topics per account
- No need to provision throughput
- Integrates with SQS for fan-out architecture
- Kinesis Data Streams:
- Consumers “pull data”
- We can have as many consumers as we want
- Possibility to replay data
- Recommended for real-time big data analytics and ETL
- Ordering happens at the shard level
- Data expires after X days
- Must provision throughput