3 Matching Annotations
- Sep 2018
-
blog.acolyer.org
-
Smoke uses read- and write-efficient index structures based on row ids to capture lineage information. 1-N relationships (between input and output tuples) are represented as inverted indexes. The index’s ith entry corresponds to the ith output group, and points to a row id array containing the ids of all input records that belong to the group.
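
A minimal sketch of the inverted-index idea described above, not Smoke's actual implementation: a group-by aggregation that records, for the ith output group, the row ids of the inputs that fed it. The function name `group_by_sum_with_lineage`, the sample rows, and the use of Python lists in place of row id arrays are assumptions made for illustration.

```python
def group_by_sum_with_lineage(rows, key, value):
    """Hypothetical helper: SUM(value) GROUP BY key, plus a lineage index."""
    group_slot = {}   # group key -> position of that group's output row
    output = []       # output rows, one per group
    backward = []     # backward[i] = row ids of the inputs in the ith output group
    for rid, row in enumerate(rows):
        k = row[key]
        if k not in group_slot:
            # First time this group is seen: allocate its output row and rid array.
            group_slot[k] = len(output)
            output.append({key: k, "sum": 0})
            backward.append([])
        i = group_slot[k]
        output[i]["sum"] += row[value]
        backward[i].append(rid)   # capture lineage while the group is updated
    return output, backward


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 5},
    {"region": "east", "sales": 7},
    {"region": "west", "sales": 1},
]
output, backward = group_by_sum_with_lineage(rows, "region", "sales")
```

Here `backward[0]` holds the row ids behind the first output group, so a lineage lookup for one group is a single index access rather than a rescan of the input.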
-
We have the usual space/time trade-offs to consider. We can slow down the base query in order to capture lineage information during query execution (and store that information somewhere). This speeds up answering lineage queries later on. Or we can keep base queries fast and lazily materialize lineage information later when lineage queries are asked (making them slower).
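
The trade-off can be illustrated with a toy aggregation. This is only a sketch under assumed names (`total_by_key_eager`, `total_by_key`, `backward_lineage_lazy` are all hypothetical): the eager variant does extra work for every input row during the base query, while the lazy variant leaves the base query untouched and rescans the input when a lineage query arrives.

```python
def total_by_key_eager(rows):
    """Base query with eager capture: extra bookkeeping per input row."""
    totals, lineage = {}, {}
    for rid, (k, v) in enumerate(rows):
        totals[k] = totals.get(k, 0) + v
        lineage.setdefault(k, []).append(rid)   # additional write during execution
    return totals, lineage


def total_by_key(rows):
    """Base query with no capture: just the aggregation."""
    totals = {}
    for k, v in rows:
        totals[k] = totals.get(k, 0) + v
    return totals


def backward_lineage_lazy(rows, k):
    """Lazy lineage: rescan the whole input only when a lineage query is asked."""
    return [rid for rid, (rk, _) in enumerate(rows) if rk == k]


rows = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]
totals_eager, lineage = total_by_key_eager(rows)
totals_plain = total_by_key(rows)
```

Both paths give the same answer for a group such as `"east"`; they differ in when the cost is paid. The eager lookup `lineage["east"]` is a dictionary access, whereas `backward_lineage_lazy(rows, "east")` scans every input row.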
-
Data lineage connects the input and output data items of a computation. Given a set of output records, a backward lineage query selects a subset of the output records and asks, “which input records contributed to these results?” A forward lineage query selects a subset of the input records and asks, “which output records depend on these inputs?” Lineage-enabled systems capture record-level relationships throughout a workflow and support lineage queries.
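
The two query types can be sketched over a plain list of (input row id, output row id) pairs. The pair representation, the sample data, and the function names `backward_lineage` and `forward_lineage` are assumptions for illustration; a real lineage-enabled system would use indexes rather than scanning pairs.

```python
# Hypothetical lineage for one operator: (input row id, output row id) pairs.
lineage_pairs = [(0, 0), (2, 0), (1, 1), (3, 1)]


def backward_lineage(pairs, selected_outputs):
    """Which input records contributed to the selected output records?"""
    selected = set(selected_outputs)
    return sorted({i for i, o in pairs if o in selected})


def forward_lineage(pairs, selected_inputs):
    """Which output records depend on the selected input records?"""
    selected = set(selected_inputs)
    return sorted({o for i, o in pairs if i in selected})


inputs_behind_output_0 = backward_lineage(lineage_pairs, [0])
outputs_touched_by_1_and_2 = forward_lineage(lineage_pairs, [1, 2])
```

With this data, output 0 traces back to inputs 0 and 2, and inputs 1 and 2 together reach both outputs.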