Hypothesis

8 Matching Annotations

Mar 2020
developer.sh

Modern Data Lakes Overview - Developer.sh

1. pyxelr 02 Mar 2020
  
  in Public
  
  Here is a high level comparison of the tools we reviewed above:
  
  Comparison of Delta Lake, Apache Iceberg and Apache Hive:
  
  DataEngineering ApacheHadoop
2. pyxelr 02 Mar 2020
  
  in Public
  
  To address Hadoop’s complications and scaling challenges, the industry is now moving towards a disaggregated architecture, with storage and analytics layers loosely coupled through REST APIs.
  
  How the industry is addressing Hadoop's shortcomings
  
  DataEngineering ApacheHadoop
3. pyxelr 02 Mar 2020
  
  in Public
  
  Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
  
  Apache Hive offers:
  
  Streaming ingest of data - allowing readers to get a consistent view of data and avoiding too many files
  
  Slowly changing dimensions - table dimensions that change slowly over time
  
  Data restatement - supported via INSERT, UPDATE, and DELETE
  
  Bulk updates with SQL MERGE
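
The MERGE semantics mentioned above can be sketched in plain Python. This is only an illustration of the matched/not-matched logic, not Hive's implementation or its SQL syntax; the `merge` function, the `deleted` flag, and the sample rows are all hypothetical.

```python
def merge(target, source, key="id"):
    """Apply MERGE-style semantics to two lists of row dicts.

    Hypothetical helper: rows in `source` update matching rows in `target`,
    are inserted when there is no match, and remove the matching row when
    they carry a truthy "deleted" flag.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        k = row[key]
        if row.get("deleted"):
            merged.pop(k, None)      # matched and flagged: delete
        elif k in merged:
            merged[k].update(row)    # matched: update in place
        else:
            merged[k] = dict(row)    # not matched: insert
    return sorted(merged.values(), key=lambda r: r[key])


target = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
source = [
    {"id": 2, "city": "Milan"},   # updates row 2
    {"id": 3, "city": "Lima"},    # inserts row 3
    {"id": 1, "deleted": True},   # deletes row 1
]
result = merge(target, source)
print(result)
```

A single pass over the source handles updates, inserts, and deletes together, which is what makes MERGE convenient for bulk updates.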
  
  DataEngineering ApacheHadoop
4. pyxelr 02 Mar 2020
  
  in Public
  
  Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™. It is developed by Databricks, the company founded by Spark's original creators. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs.
  
  Delta Lake offers:
  
  ACID transactions on Spark
  
  Scalable metadata handling
  
  Streaming and batch unification
  
  Schema enforcement
  
  Time travel
  
  Upserts and deletes
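
To make schema enforcement and time travel more concrete, here is a toy in-memory model in Python. It is a sketch under loud assumptions: `VersionedTable` and its methods are invented for illustration and are not the Delta Lake API. It only mirrors the idea that each commit publishes a new immutable version and that older versions stay readable.

```python
class VersionedTable:
    """Toy table: every commit creates a new immutable snapshot (hypothetical)."""

    def __init__(self, schema):
        self.schema = schema          # column name -> expected Python type
        self._snapshots = [()]        # version 0 is the empty table

    def commit(self, rows):
        """Validate all rows first, then publish them as one new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"column {col!r} expects {typ.__name__}")
        # Nothing is published unless every row passed validation.
        self._snapshots.append(self._snapshots[-1] + tuple(rows))
        return len(self._snapshots) - 1

    def latest_version(self):
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one for time travel."""
        if version is None:
            version = self.latest_version()
        return list(self._snapshots[version])


table = VersionedTable({"id": int, "name": str})
v1 = table.commit([{"id": 1, "name": "a"}])
v2 = table.commit([{"id": 2, "name": "b"}])

try:
    table.commit([{"id": "3", "name": "c"}])   # wrong type for "id"
    rejected = False
except TypeError:
    rejected = True

print(table.read())            # both committed rows
print(table.read(version=v1))  # the table as of the first commit
```

The rejected commit leaves the latest version untouched, which is the all-or-nothing behaviour that the ACID bullet refers to.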
  
  DataEngineering ApacheHadoop
5. pyxelr 02 Mar 2020
  
  in Public
  
  Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg is focused on avoiding unpleasant surprises, helping schemas evolve, and avoiding inadvertent data deletion.
  
  Apache Iceberg offers:
  
  Schema evolution (add, drop, update, rename)
  
  Hidden partitioning
  
  Partition layout evolution
  
  Time travel (reproducible queries)
  
  Version rollback (resetting tables)
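
Hidden partitioning is easier to picture with a small example. In the sketch below, the table itself derives the partition value from a timestamp column, so writers and readers only ever deal with `ts`. Everything here (`HiddenPartitionTable`, `append`, `scan`, the day transform) is a hypothetical Python model, not Iceberg's API.

```python
from collections import defaultdict
from datetime import date, datetime


class HiddenPartitionTable:
    """Toy table that partitions rows by a transform of the "ts" column."""

    def __init__(self, transform):
        # The transform must be monotonic in ts for range pruning to be valid.
        self._transform = transform
        self._partitions = defaultdict(list)

    def append(self, row):
        # Writers supply only ts; the partition value is derived here.
        self._partitions[self._transform(row["ts"])].append(row)

    def scan(self, start, end):
        """Return rows with start <= ts < end and the partitions that were read."""
        lo, hi = self._transform(start), self._transform(end)
        touched = sorted(p for p in self._partitions if lo <= p <= hi)
        rows = [
            r for p in touched for r in self._partitions[p]
            if start <= r["ts"] < end
        ]
        return rows, touched


events = HiddenPartitionTable(transform=lambda ts: ts.date())
events.append({"ts": datetime(2020, 3, 1, 9, 0), "event": "login"})
events.append({"ts": datetime(2020, 3, 1, 23, 30), "event": "logout"})
events.append({"ts": datetime(2020, 3, 2, 8, 15), "event": "login"})

# The query filters on ts only; the table prunes partitions on its own.
rows, touched = events.scan(datetime(2020, 3, 2), datetime(2020, 3, 3))
print(rows, touched)
```

Because the partition column never appears in queries, the transform could later be changed (for example from day to hour) without rewriting the readers, which is the intuition behind partition layout evolution.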
  
  DataEngineering ApacheHadoop
6. pyxelr 02 Mar 2020
  
  in Public
  
  In this disaggregated model, users can choose Spark for batch analytics workloads and Presto for SQL-heavy workloads, with both Spark and Presto using the same backend storage platform.
  
  The disaggregated model allows a more flexible choice of tools
  
  DataEngineering ApacheHadoop
7. pyxelr 02 Mar 2020
  
  in Public
  
  These projects sit between the storage and analytical platforms and offer strong ACID guarantees to the end user while dealing with the object storage platforms in a native manner.
  
  Solutions for the disaggregated model:
  
  Delta Lake
  
  Apache Iceberg
  
  Apache Hive
  
  DataEngineering ApacheHadoop
8. pyxelr 02 Mar 2020
  
  in Public
  
  rise of Hadoop as the de facto Big Data platform and its subsequent downfall. Initially, HDFS served as the storage layer, and Hive as the analytics layer. When pushed really hard, Hadoop was able to go up to a few hundred TBs, allowed SQL-like querying on semi-structured data, and was fast enough for its time.
  
  Hadoop's HDFS and Hive were not prepared for even larger data sets
  
  DataEngineering ApacheHadoop

Tags

DataEngineering

ApacheHadoop

Annotators

pyxelr

URL

developer.sh/posts/delta-lake-and-iceberg
