- Mar 2020
Alternatives to Kubernetes range from Docker Compose on a single machine, to Heroku and similar systems, to something like Snakemake for computational pipelines.
Other alternatives to Kubernetes:
- Docker Compose on a single machine
- Heroku and similar systems
- Snakemake for computational pipelines
If what you care about is downtime, your first thought shouldn't be "how do I reduce deployment downtime from 1 second to 1ms"; it should be "how can I ensure database schema changes don't prevent rollback if I screw something up."
Caring about downtime
The features Kubernetes provides for reliability (health checks, rolling deploys) can be implemented much more simply, or are already built in in many cases. For example, nginx can do health checks on worker processes, and you can use docker-autoheal or something similar to automatically restart those processes.
Kubernetes' health checks can be replaced with nginx on worker processes + docker-autoheal to automatically restart those processes
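The poll-and-restart loop such tools implement is tiny. Below is a toy Python sketch of the idea (`FakeWorker` and `autoheal` are invented names; the real docker-autoheal watches Docker health-check status, not Python objects):

```python
# Toy sketch of the poll-and-restart loop behind tools like docker-autoheal.
class FakeWorker:
    """Stand-in for a container that is unhealthy until restarted twice."""

    def __init__(self):
        self.restarts = 0

    def healthy(self):
        return self.restarts >= 2

    def restart(self):
        self.restarts += 1

def autoheal(worker, polls=10):
    """Poll the health check; restart the worker whenever it fails."""
    for _ in range(polls):
        if not worker.healthy():
            worker.restart()
    return worker.restarts
```

Here `autoheal(FakeWorker())` performs two restarts before the health check passes; the whole "reliability feature" is a polling loop plus a restart action.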
Scaling for many web applications is typically bottlenecked by the database, not the web workers.
Kubernetes might be useful if you need to scale a lot. But let's consider some alternatives:
- cloud VMs with up to 416 vCPUs and 8 TiB RAM
- scale many web apps with Heroku
Distributed applications are really hard to write correctly. Really. The more moving parts, the more these problems come into play. Distributed applications are hard to debug. You need whole new categories of instrumentation and logging to get understanding that isn't quite as good as what you'd get from the logs of a monolithic application.
Microservices remain a hard nut to crack.
They are fine as an organisational scaling technique: when you have 500 developers working on one live website, microservices let teams work independently. For example, each team of 5 developers can be given one microservice.
You need to spin up a complete K8s system just to test anything, via a VM or nested Docker containers.
You need a complete K8s cluster to run your code, or you can use Telepresence to code locally against a remote Kubernetes cluster.
“Kubernetes is a large system with significant operational complexity. The assessment team found configuration and deployment of Kubernetes to be non-trivial, with certain components having confusing default settings, missing operational controls, and implicitly defined security controls.”
Deployment of Kubernetes is non-trivial
Before you can run a single application, you need the following highly simplified architecture.
Before running the simplest Kubernetes app, you need at least this architecture:
The Kubernetes codebase has significant room for improvement. The codebase is large and complex, with large sections of code containing minimal documentation and numerous dependencies, including systems external to Kubernetes.
As of March 2020, the Kubernetes codebase has more than 580,000 lines of Go code
Kubernetes has plenty of moving parts—concepts, subsystems, processes, machines, code—and that means plenty of problems.
Kubernetes might not be the best solution for a smaller team
Here is a high level comparison of the tools we reviewed above:
Comparison of Delta Lake, Apache Iceberg and Apache Hive:
To address Hadoop's complications and scaling challenges, the industry is now moving towards a disaggregated architecture, with the storage and analytics layers very loosely coupled using REST APIs.
Approaches used to address Hadoop's shortcomings
Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
Apache Hive offers:
- Streaming ingest of data - allowing readers to get a consistent view of data and avoiding too many files
- Slow changing dimensions - dimensions of table change slowly over time
- Data restatement - supported via INSERT, UPDATE, and DELETE
- Bulk updates with SQL MERGE
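The bulk-update features above boil down to update-or-insert semantics. As a hedged sketch using only Python's stdlib: SQLite has no MERGE statement, but its ON CONFLICT upsert expresses the same logic (the table and data here are illustrative, not Hive's):

```python
import sqlite3

# Upsert semantics in stdlib SQLite, standing in for Hive's SQL MERGE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'London')")

# A slowly-changing-dimension update arrives: customer 1 moved, 2 is new.
conn.executemany(
    "INSERT INTO dim_customer (id, city) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET city = excluded.city",
    [(1, "Paris"), (2, "Berlin")],
)
rows = conn.execute("SELECT id, city FROM dim_customer ORDER BY id").fetchall()
```

After the upsert, `rows` holds the updated customer 1 and the newly inserted customer 2, which is exactly what a MERGE with matched-update / not-matched-insert branches would produce.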
Delta Lake is an open-source platform that brings ACID transactions to Apache Spark™. It is developed by Databricks, the company founded by the creators of Spark. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs.
Delta Lake offers:
- ACID transactions on Spark
- Scalable metadata handling
- Streaming and batch unification
- Schema enforcement
- Time travel
- Upserts and deletes
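The features above all fall out of one idea: the table is an ordered commit log. A conceptual pure-Python sketch (NOT the real Delta Lake API, which stores Parquet data files plus a `_delta_log` of JSON commits; `TinyDeltaLog` is an invented name):

```python
import json

class TinyDeltaLog:
    """Conceptual sketch of Delta Lake's core idea: a table is an ordered
    log of committed actions, which gives atomic appends and "time travel"
    to any earlier version."""

    def __init__(self):
        self._log = []  # one serialized entry per committed version

    def commit(self, rows):
        # Appending the serialized batch is the commit point:
        # readers see the whole batch or none of it (atomicity).
        self._log.append(json.dumps(rows))

    def snapshot(self, version=None):
        """Reconstruct the table as of a given version (time travel)."""
        if version is None:
            version = len(self._log) - 1
        table = []
        for entry in self._log[: version + 1]:
            table.extend(json.loads(entry))
        return table

table = TinyDeltaLog()
table.commit([{"id": 1, "v": "a"}])
table.commit([{"id": 2, "v": "b"}])
```

`table.snapshot(0)` returns only the first batch while `table.snapshot()` returns the current state; real time-travel queries (`VERSION AS OF` in Delta's SQL) replay the commit log the same way.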
Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg focuses on avoiding unpleasant surprises, helping schemas evolve, and avoiding inadvertent data deletion.
Apache Iceberg offers:
- Schema evolution (add, drop, update, rename)
- Hidden partitioning
- Partition layout evolution
- Time travel (reproducible queries)
- Version rollback (resetting tables)
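Hidden partitioning is the least obvious feature in the list, so here is a toy Python sketch of the idea (illustrative names only, not Iceberg's API): the table owns a partition transform, here `day(ts)`, so users filter on the raw timestamp and never name a partition column.

```python
from datetime import datetime

# The table's partition transform: derives a partition value from a column.
def day_transform(ts):
    return ts.strftime("%Y-%m-%d")

rows = [
    {"ts": datetime(2020, 3, 1, 9), "event": "login"},
    {"ts": datetime(2020, 3, 1, 17), "event": "logout"},
    {"ts": datetime(2020, 3, 2, 8), "event": "login"},
]

# Writers bucket rows by the derived partition value.
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["ts"]), []).append(row)

def scan(ts_from, ts_to):
    """Prune partitions via the transform, then filter rows exactly."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [r for key, part in sorted(partitions.items()) if lo <= key <= hi
            for r in part if ts_from <= r["ts"] <= ts_to]
```

A query like `scan(datetime(2020, 3, 1), datetime(2020, 3, 1, 23, 59))` only touches the 2020-03-01 partition; changing `day_transform` to an hour- or month-based transform is what Iceberg's partition layout evolution allows without rewriting queries.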
In this disaggregated model, users can choose Spark for batch analytics workloads and Presto for SQL-heavy workloads, with both Spark and Presto using the same backend storage platform.
Disaggregated model allows more flexible choice of tools
These projects sit between the storage and analytical platforms and offer strong ACID guarantees to the end user while dealing with the object storage platforms in a native manner.
Solutions to the disaggregated models:
- Delta Lake
- Apache Iceberg
- Apache Hive
The rise of Hadoop as the de facto Big Data platform, and its subsequent downfall: initially, HDFS served as the storage layer and Hive as the analytics layer. When pushed really hard, Hadoop was able to go up to a few hundreds of TBs, allowed SQL-like querying on semi-structured data, and was fast enough for its time.
Hadoop's HDFS and Hive proved unprepared for even larger data sets
The disaggregated model means the storage system sees data as a collection of objects or files. But end users are not interested in the physical arrangement of data; they want a more logical view of their data.
Files vs. tables problem of disaggregated models
ACID stands for:
- Atomicity: an operation either succeeds completely or fails; it does not leave partial data
- Consistency: once an application performs an operation, the results of that operation are visible to it in every subsequent operation
- Isolation: an incomplete operation by one user does not cause unexpected side effects for other users
- Durability: once an operation is complete, it will be preserved even in the face of machine or system failure
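Atomicity and durability can be illustrated at the file level with the classic write-then-rename pattern, a generic Python sketch of the idea (not any particular platform's implementation):

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON so readers see either the old file or the complete new
    one, never partial data: write a temp file, fsync it (durability),
    then atomically rename it into place (atomicity)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())   # durability: force bytes to disk
        os.replace(tmp, path)      # atomicity: rename is all-or-nothing
    except BaseException:
        os.unlink(tmp)
        raise

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "state.json")
atomic_write_json(target, {"version": 1})
atomic_write_json(target, {"version": 2})
with open(target) as f:
    state = json.load(f)
```

A crash mid-write leaves only an orphaned temp file; the visible `state.json` is always a complete version, which is the property the table formats above generalize to whole datasets.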
Currently this may be possible using the version management of the object store, but as we saw earlier that operates at a lower layer of physical detail, which may not be useful at the higher, logical level.
Change management issue of disaggregated models
Traditionally, Data Warehouse tools were used to drive business intelligence from data. The industry then recognized that Data Warehouses limit the potential of intelligence by enforcing schema-on-write. It was clear that all the dimensions of a data set being collected could not be thought of at the time of data collection.
Data Warehouses were later replaced with Data Lakes to cope with the volume of big data
As explained above, users are no longer willing to consider inefficiencies of underlying platforms. For example, data lakes are now also expected to be ACID compliant, so that the end user doesn’t have the additional overhead of ensuring data related guarantees.
SQL Interface issue of disaggregated models
Commonly used storage platforms are object stores like AWS S3, Azure Blob Storage, GCS, Ceph, and MinIO, among others, while analytics platforms range from simple Python & R based notebooks to TensorFlow, Spark, Presto, Splunk, Vertica, and others.
Commonly used storage platforms:
- AWS S3
- Azure Blob Storage
- GCS
- Ceph
- MinIO
Commonly used analytics platforms:
- Python & R based notebooks
- TensorFlow
- Spark
- Presto
- Splunk
- Vertica
Data Lakes are optimized for unstructured and semi-structured data, can scale to petabytes easily, and allow better integration of a wide range of tools to help businesses get the most out of their data.
Data Lake definition / what it offers us:
- support for unstructured and semi-structured data
- scalability to PetaBytes and higher
- SQL like interface to interact with the stored data
- ability to connect various analytics tools as seamlessly as possible
- modern data lakes are generally a combination of decoupled storage and analytics tools
'Directed' means that the edges of the graph only move in one direction, where future edges are dependent on previous ones.
Meaning of "directed" in Directed Acyclic Graph
Several cryptocurrencies use DAGs rather than blockchain data structures in order to process and validate transactions.
DAG vs Blockchain:
- DAG transactions are linked to each other rather than grouped into blocks
- DAG transactions can be processed simultaneously with others
- DAGs lessen the bottleneck on transaction throughput; in a blockchain, throughput is limited by the number of transactions that fit in a single block
A DAG is a graph data structure that uses topological ordering, meaning that the graph flows in only one direction and never goes in circles.
Simple definition of Directed Acyclic Graph (DAG)
'Acyclic' means that it is impossible to start at one point of the graph and come back to it by following the edges.
Meaning of "acyclic" in Directed Acyclic Graph
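Both properties can be checked mechanically. Python's stdlib `graphlib` (3.9+) produces a topological order for a DAG and raises `CycleError` the moment an edge loops back:

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# "Directed" and "acyclic" made concrete: each node maps to the set of
# nodes it depends on, and a topological order exists only for a DAG.
dag = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
order = list(TopologicalSorter(dag).static_order())  # "a" first, "d" last

# An edge pointing back (a depends on d) creates a cycle: no longer a DAG,
# so ordering fails.
cyclic = {"a": {"d"}, "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
```

The existence of that one-way ordering is what lets DAG-based ledgers confirm transactions by referencing earlier ones without ever revisiting them.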