8 Matching Annotations
  1. Mar 2020
    1. Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.

      Apache Hive offers:

      • Streaming ingest of data - allowing readers to get a consistent view of data and avoiding too many files
      • Slowly changing dimensions - the dimensions of a table change slowly over time
      • Data restatement - supported via INSERT, UPDATE, and DELETE
      • Bulk updates with SQL MERGE
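
      The last bullet is easier to picture with a toy model. The sketch below is plain Python, not Hive code: it only imitates the "when matched then update, when not matched then insert" behaviour of a SQL MERGE. The function name `merge_upsert` and the sample rows are hypothetical and chosen for illustration.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def merge_upsert(target: Dict[int, Row], source: Dict[int, Row]) -> Dict[int, Row]:
    """Toy model of MERGE keyed on an integer id (an assumption of this sketch).

    Matched keys are updated column by column; unmatched keys are inserted.
    The target is copied, so the caller's table is left untouched.
    """
    merged = {key: dict(row) for key, row in target.items()}
    for key, row in source.items():
        if key in merged:
            merged[key].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            merged[key] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return merged


# Hypothetical sample data: one changed row (id 2) and one new row (id 3).
customers = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Linus", "city": "Helsinki"},
}
changes = {
    2: {"city": "Portland"},
    3: {"name": "Grace", "city": "New York"},
}
result = merge_upsert(customers, changes)
```

      A real MERGE runs as a single transaction over files in distributed storage; the sketch only captures the row-level outcome.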
    2. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™. It was developed by Databricks, the company founded by the original creators of Spark. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs.

      Delta Lake offers:

      • ACID transactions on Spark
      • Scalable metadata handling
      • Streaming and batch unification
      • Schema enforcement
      • Time travel
      • Upserts and deletes
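
      Time travel and upserts/deletes fit together naturally once every commit produces a new table version. The following is a minimal conceptual sketch in plain Python, not the Delta Lake API: the class `VersionedTable` and its methods are hypothetical, and it stores a full snapshot per version for simplicity, whereas a real transaction log would record incremental changes.

```python
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


class VersionedTable:
    """Toy table where each commit creates a new, immutable version."""

    def __init__(self) -> None:
        # Version 0 is the empty table.
        self._snapshots: List[Dict[int, Row]] = [{}]

    @property
    def latest_version(self) -> int:
        return len(self._snapshots) - 1

    def commit(
        self,
        upserts: Optional[Dict[int, Row]] = None,
        deletes: Optional[Iterable[int]] = None,
    ) -> int:
        """Apply deletes, then upserts, as one new version; return its number."""
        new_state = {k: dict(v) for k, v in self._snapshots[-1].items()}
        for key in deletes or []:
            new_state.pop(key, None)
        for key, row in (upserts or {}).items():
            new_state[key] = dict(row)
        self._snapshots.append(new_state)
        return self.latest_version

    def read(self, version: Optional[int] = None) -> Dict[int, Row]:
        """Read the latest version, or an older one ("time travel")."""
        if version is None:
            version = self.latest_version
        if not 0 <= version <= self.latest_version:
            raise ValueError(f"unknown version {version}")
        return {k: dict(v) for k, v in self._snapshots[version].items()}


table = VersionedTable()
v1 = table.commit(upserts={1: {"status": "new"}})
v2 = table.commit(upserts={1: {"status": "shipped"}, 2: {"status": "new"}})
v3 = table.commit(deletes=[1])

as_of_v1 = table.read(version=v1)   # state before the update and the delete
current = table.read()              # state after all three commits
```

      Because old versions are never modified, a reader pinned to `v1` keeps seeing a consistent view while later commits land.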
    3. Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg focuses on avoiding unpleasant surprises, helping schemas evolve, and preventing inadvertent data deletion.

      Apache Iceberg offers:

      • Schema evolution (add, drop, update, rename)
      • Hidden partitioning
      • Partition layout evolution
      • Time travel (reproducible queries)
      • Version rollback (resetting tables)
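
      Hidden partitioning means the table derives partition values from a regular column, so queries filter on that column and never mention a partition column. The sketch below illustrates the idea in plain Python and is not Iceberg code: `HiddenPartitionTable`, `day_transform`, and the `event_ts` column are hypothetical names, and a day-granularity transform is assumed.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Any, Dict, List

Row = Dict[str, Any]


def day_transform(ts: datetime) -> date:
    """Partition transform: map a timestamp to its calendar day."""
    return ts.date()


class HiddenPartitionTable:
    """Toy table partitioned by day(event_ts) without exposing the partition."""

    def __init__(self) -> None:
        self._partitions: Dict[date, List[Row]] = defaultdict(list)
        self.last_partitions_read = 0

    def append(self, row: Row) -> None:
        # The writer never supplies a partition value; it is derived here.
        self._partitions[day_transform(row["event_ts"])].append(row)

    def scan(self, start: datetime, end: datetime) -> List[Row]:
        """Return rows with start <= event_ts < end, skipping other days."""
        self.last_partitions_read = 0
        matches: List[Row] = []
        for day, rows in self._partitions.items():
            if day < start.date() or day > end.date():
                continue  # partition pruned using only the column predicate
            self.last_partitions_read += 1
            matches.extend(r for r in rows if start <= r["event_ts"] < end)
        return matches


events = HiddenPartitionTable()
events.append({"id": 1, "event_ts": datetime(2020, 3, 1, 10, 0)})
events.append({"id": 2, "event_ts": datetime(2020, 3, 2, 9, 0)})
events.append({"id": 3, "event_ts": datetime(2020, 3, 3, 12, 0)})

hits = events.scan(datetime(2020, 3, 2, 0, 0), datetime(2020, 3, 2, 23, 59))
```

      Since the transform lives in table metadata rather than in queries, the layout could later change (say, from daily to hourly) without rewriting the queries, which is the intuition behind partition layout evolution.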
    4. rise of Hadoop as the de facto Big Data platform and its subsequent downfall. Initially, HDFS served as the storage layer and Hive as the analytics layer. When pushed really hard, Hadoop could scale to a few hundred TBs, allowed SQL-like querying on semi-structured data, and was fast enough for its time.

      Hadoop's HDFS and Hive proved unprepared for even larger data sets.