Data Analytics
Amazon EMR
- EMR = Elastic MapReduce
- EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
- The cluster can be made of hundreds of EC2 instances
- Supports Apache Spark, HBase, Presto, Flink
- EMR takes care of provisioning and configuration
- Provides auto-scaling and integrates with Spot Instances
- Use cases: data processing, machine learning, web indexing, big data
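
Example (sketch): the points above about provisioning, auto-scaling and Spot Instances can be pictured as a cluster request. The snippet below only builds the keyword arguments that boto3's EMR client method run_job_flow accepts; it does not call AWS. The cluster name, release label, instance type and node counts are illustrative assumptions, and the helper build_emr_request is hypothetical.

```python
# Sketch only: builds a request dict for boto3's emr.run_job_flow.
# Nothing is sent to AWS; all concrete values are illustrative assumptions.

def build_emr_request(name, core_count, task_count, instance_type="m5.xlarge"):
    """Return keyword arguments for run_job_flow (hypothetical helper)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",          # assumed release label
        "Applications": [{"Name": "Spark"}],   # EMR installs and configures this
        "Instances": {
            "InstanceGroups": [
                # One primary node coordinates the cluster.
                {"Name": "primary", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": 1},
                # Core nodes run tasks and store HDFS data.
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": core_count},
                # Task nodes only run tasks, so Spot capacity fits well here.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": instance_type,
                 "InstanceCount": task_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Managed scaling bounds for core + task nodes (assumed limits).
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": core_count,
                "MaximumCapacityUnits": core_count + task_count,
            }
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("analytics-demo", core_count=2, task_count=4)
groups = request["Instances"]["InstanceGroups"]
total_nodes = sum(g["InstanceCount"] for g in groups)
spot_groups = [g["Name"] for g in groups if g["Market"] == "SPOT"]
print(total_nodes, spot_groups)

# To actually launch (requires boto3, credentials and the default EMR roles):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the primary and core nodes On-Demand and putting only task nodes on Spot is one common layout, since losing a task node does not lose HDFS data.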
AWS Glue
- Fully-managed ETL (Extract, Transform and Load) service
- We can automate all the time-consuming steps of data preparation for analytics
- Serverless, pay as you go, fully managed; uses Apache Spark behind the scenes
- Crawls data sources and identifies data formats (schema inference)
- Generates the Apache Spark ETL code automatically, which we can then customize
- Sources: Aurora, RDS, Redshift, S3
- Sinks: S3, Redshift, etc.
- Glue Data Catalog: stores the metadata (definition and schema) of the source tables
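
Example (sketch): a toy illustration of what "crawl, infer the schema, store it in the catalog" means. This is not Glue's actual crawler logic, only a minimal stand-in in plain Python. The functions infer_type, merge_types, infer_schema and to_catalog_table, the sample records and the S3 path are all hypothetical; the output shape is loosely modeled on a Data Catalog table definition (name, location, list of columns with types).

```python
# Toy schema inference: NOT Glue's real algorithm, just the general idea.

def infer_type(value):
    """Map a Python value to a catalog-style type name."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def merge_types(current, observed):
    """Reconcile two types seen for the same column."""
    if current == observed:
        return current
    if {current, observed} == {"bigint", "double"}:
        return "double"              # widen integers to floating point
    return "string"                  # fall back to the most general type


def infer_schema(records):
    """Scan every record and build a column -> type mapping."""
    schema = {}
    for record in records:
        for column, value in record.items():
            if value is None:        # missing values carry no type information
                continue
            observed = infer_type(value)
            schema[column] = merge_types(schema.get(column, observed), observed)
    return schema


def to_catalog_table(name, location, schema):
    """Package the schema as metadata, similar in spirit to a catalog entry."""
    return {
        "Name": name,
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": c, "Type": t} for c, t in schema.items()],
        },
    }


records = [
    {"order_id": 1, "amount": 10, "country": "DE"},
    {"order_id": 2, "amount": 12.5, "country": "FR", "express": True},
]
schema = infer_schema(records)
table = to_catalog_table("orders", "s3://example-bucket/orders/", schema)
print(schema)
```

The catalog holds only this metadata; the data itself stays in the source (here, the assumed S3 location).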