utilizing a mix of third-party providers (OpenAI, Anthropic, etc.) and in-house trained LLMs
Some of the LLMs are trained in-house. This is a HUGE moat
Delta Lake for transactional data management, and ClickHouse for high-performance, real-time analytics and log storage.
Delta Lake is used to train the AI on transactional data
The platform is built on AWS and uses Kubernetes and Ray for container orchestration and distributed computing.
Basically uses Ray (Anyscale is the managed version, I believe) to manage the GPUs, which are hosted on AWS
prioritizes real-time clickstream behavioral data over traditional vector-only or keyword-based search engines.
Using real-time clickstream data is the core component that differentiates it from other systems
Matillion provides a Low-Code/No-Code interface that allows data engineers to build these pipelines
Think we do want data engineers for the actual data piece of AI enablement
Matillion ensures the AI doesn't ingest irrelevant or sensitive information that could lead to "hallucinations" (confident but false AI responses).
This is important, but feels more like an "add-on" than an actual core workflow
triggers an API call (to OpenAI or Bedrock) to convert text into Vectors (numerical representations of meaning).
Not super proprietary
breaks long documents into smaller segments to fit the "context window" (the data limit an AI can process at once).
Feel like you need actual access to historical AI data in order to determine what is most important - don't think they have the data moat to do this
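To make the chunk-then-embed step concrete, here is a minimal sketch assuming the OpenAI Python SDK; the chunk size, overlap, and model name are illustrative defaults rather than Matillion's actual settings, and the Bedrock path is omitted.

```python
# Minimal sketch of the chunk-then-embed step described above.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# chunk size and overlap are illustrative, not Matillion's actual defaults.
from openai import OpenAI

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a long document into overlapping character windows."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Call an embedding model to turn each chunk into a vector."""
    client = OpenAI()
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]

document = "..."  # a long source document
vectors = embed_chunks(chunk_text(document))
```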
Redacts Personally Identifiable Information (PII) and standardizes formats.
These are somewhat standard transformations that a lot of data warehouses are beginning to do themselves, and you also have to compete with more enterprise governance tools like Informatica, etc.
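A minimal illustration of what this kind of redact-and-standardize transformation looks like; the regex patterns and date formats are simplistic stand-ins, not a production PII detector or Matillion's implementation.

```python
# Regex-based PII masking plus date normalization, as a toy example of the
# redact-and-standardize step. Patterns are deliberately simplistic.
import re
from datetime import datetime

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

def standardize_date(raw: str) -> str:
    """Normalize a handful of common date formats to YYYY-MM-DD."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unparseable values untouched

print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
print(standardize_date("03/07/2024"))
```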
Matillion's model pushes processing to the data warehouse. Performing LLM (Large Language Model) inference or complex data cleaning inside a warehouse can be prohibitively expensive.
This is similar to dbt - Matillion just does the compute in-warehouse but that doesn't seem like something teams would want to do
Data prep for AI is increasingly handled by Data Scientists using Python-based tools (like LangChain or LlamaIndex)
Data prep for AI itself is already being won by key players
If Matillion’s integrations are slower or less flexible than niche, AI-native competitors, they become a bottleneck rather than an enabler.
Vector work requires highly complex algorithms that benefit from data flywheels. Matillion doesn't have this flywheel, and it will be very hard to build one
Matillion’s core strength is ELT (Extract, Load, Transform), which traditionally relies on batch processing;
In the AI era, you need up-to-date, real-time data rather than historical batch loads. Without a streaming broker, you aren't really in the game
This counteracts the commoditization of pre-built connectors by allowing users to rapidly create their own high-quality integrations for niche or proprietary APIs without waiting for a vendor's roadmap.
Matillion isn't built for AI Agents to actually use APIs - it is built for data engineers to do so. Fundamentally different things
It differentiates Matillion from pure-SaaS competitors who require data to pass through their own servers.
This is somewhat of a moat, but I think all competitors will catch up fairly quickly
It provides "no-code" components for vector store integration (e.g., Pinecone), prompt engineering, and RAG (Retrieval-Augmented Generation) workflows.
Matillion is now focused on chunking, OCR, embeddings, etc., for vector databases
Maia uses natural language to build, test, and document entire end-to-end pipelines.
They have essentially turned into an "AI data engineer" application, which is much riskier
fails to dominate the AI/Agentic data engineering space, it risks becoming just one of dozens of identical "IPaaS"
Dominating this new AI/agentic space is key to helping "prepare" AI-ready data
dbt (Data Build Tool) has become the industry standard for SQL transformations, offering a code-based, version-controlled alternative to Matillion’s visual UI.
dbt is already starting to dominate the AI data transformation market
By using an open standard (Iceberg, Delta Lake, Hudi), a company can switch their compute engine (e.g., moving from Databricks to Snowflake) without re-engineering their entire ETL pipeline or re-extracting data.
This takes a huge piece away from the ETL market
security, data lineage, and deployment pipelines around LangChain often exceeds the licensing cost of Dataiku, which provides these "out of the box."
Dataiku's secret sauce is the LLM Mesh connectors and the integration effect that comes with them. Since everyone uses the same platform, Dataiku can enforce the same data security levels for everyone
provides IT-centralized controls (audit logs, versioning, and access permissions) that are native to its platform
These controls are what could make an enterprise team choose Dataiku, because they need more granular tracking
abstracts model APIs to provide cost tracking, PII (Personally Identifiable Information) filtering, and model-agnosticism
Monitoring SOTA model APIs is key. I think you can use open-source models with Dataiku, but it is a little harder
Dataiku manages the infrastructure and data layer (ETL, storage, model monitoring, and security).
Dataiku is more focused on the RAG pipeline, choosing data, etc., as well as connecting to the tools themselves
converting live, streaming data into vectors in real-time—and immediately performing a join against its current working memory.
Oftentimes data will just be streamed directly into Estuary, but you need real-time data to aggregate somewhat historical events into the agentic database.
The GPU handles real-time, instant decisions - TigerGraph looks historically at the different lookups, etc., to deliver data the agent might need.
it supports ~85% of Postgres features compared to CockroachDB's ~54%.
Yuga supports Postgres extensions, which makes the ecosystem extremely robust
highly compatible with existing Postgres code, engineering teams can migrate with minimal rewriting of the application layer.
This is probably a big decision factor when actually going to market
offers more flexibility in how data is replicated (synchronously or asynchronously).
Yuga is a little more useful here - Cockroach is just super stable
though it makes range-based queries more resource-intensive.
Yugabyte just scatters everything, which makes range queries a little harder but keeps writes fast
CockroachDB focuses on ease of geo-distribution and strict data consistency, whereas YugabyteDB prioritizes deep PostgreSQL feature compatibility and high-performance throughput by reusing the original Postgres query engine.
Cockroach is more about being insanely resilient - Yuga is about having more functionality while still being distributed
an update in Postgres and its subsequent replication to ClickHouse are not atomic, leading to potential "partial" data states during failures.
This is the main piece
often require manual intervention or complex mapping logic in Clickpipes to prevent the pipeline from breaking.
It's not as simple as just using CDC - there are a ton of downstream issues, like these schema issues, that make it harder
Latency between a write in Postgres and its visibility in ClickHouse means analytical queries may reflect stale data compared to the live transactional state.
This delay is a big reason why having operational and OLAP data together is useful
Users can write simple scripts or "rules" to label data in bulk based on metadata or specific visual triggers
Leads to a flywheel effect of organizations really "living" in this type of software
small, specialized models trained on a client's specific subset of data to automate repetitive labeling tasks (e.g., identifying a specific type of medical lesion or a specific car part).
This leads to a super specific data moat
copy-on-write forks
This fork just points to the original data, which an AI agent can use to test features, etc.
Policy-driven automation for compression, retention, and re-ordering
This organizing for speed is again a highly optimized piece of the data layer
compressed tuple filtering
Standard compression which is helpful
keeping dashboards current without re-scanning historical columnar data.
This helps speed things up a ton
vectorized query engine
This is the actual motor that processes multiple batches of data at once. This is what allows for much faster queries vs. a slower compute engine
automatic tiering of cold data to Amazon S3 (object storage) while maintaining full SQL access.
This tiering enables much more efficient storage, which helps teams run faster because there is less hot data to query
TigerData’s pgai extension automates this within the database, allowing the database to "vectorize" its own data as it is inserted.
This means you don't have to send data out, reducing ETL. This is massive.
TigerData combines pgvectorscale (vector search) with pg_textsearch (BM25 algorithm).
Combining these is HUGE. Allows you to have dual-search and reduce hallucinations + have one source of truth.
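To show why combining the two retrieval paths works, here is a hedged sketch using reciprocal rank fusion, a common way to merge a vector ranking with a BM25 ranking; it is not necessarily how pgvectorscale and pg_textsearch combine results internally.

```python
# Generic reciprocal-rank-fusion merge of a vector ranking and a BM25 ranking.
# Illustrates the value of dual search; not TigerData's internal method.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_7", "doc_2", "doc_9"]   # nearest-neighbor results
bm25_hits = ["doc_2", "doc_4", "doc_7"]     # keyword/BM25 results
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```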
SBQ improves upon standard binary quantization by using statistical methods to retain higher accuracy (99% recall) while reducing the data footprint by up to 30x.
Another huge piece: compression algorithms that minimize the data stored
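To make the footprint math concrete, here is standard binary quantization in numpy (one bit per float32 dimension, roughly 32x smaller); SBQ layers statistical corrections on top of this, which are not reproduced here.

```python
# Standard binary quantization: each float32 dimension (4 bytes) collapses to
# one bit, so a 1536-dim vector shrinks from ~6 KB to ~192 bytes (~32x).
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Quantize float vectors to packed bits: 1 where value > 0, else 0."""
    bits = (vectors > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

vecs = np.random.randn(1000, 1536).astype(np.float32)
packed = binary_quantize(vecs)
print(vecs.nbytes / packed.nbytes)  # ~32x reduction in stored bytes
```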
StreamingDiskANN allows the index to reside on SSDs (Disk) with minimal performance loss, significantly reducing hardware costs for large datasets.
Helps scale pgvector a ton compared to the standard setup
compaction (merging small files), clustering (optimizing data layout), and schema evolution.
These are real managed services that Onehouse provides - it can optimize the OTFs in the background
automates the complex "last mile" of Change Data Capture (CDC
It's more like a transformation tool for CDC
the Binary protocol (for speed) or HTTP (for compatibility/cloud).
Doesn't mirror data - just converts the data into text describing what it is. It's much faster than ClickHouse CDC
It automatically converts Postgres SQL syntax and functions (like date_part or percentile_cont) into their equivalent ClickHouse function
It's basically acting like a translator for ClickHouse data
This utilizes ClickHouse’s columnar speed rather than pulling raw data into Postgres for processing.
Could allow for faster Postgres processing, although there are bottlenecks in the data federation and it isn't as good as
PostgreSQL SQL/MED (SQL Management of External Data) standard
Essentially Postgres is federating data from ClickHouse
materialized views (pre-computed summaries) that track changes in the underlying data
Materialized views are how real-time results get served so quickly
Dual-Engine Storage (Hypercore):
Hypercore is the dual storage engine that allows both OLTP and OLAP data to be stored
which automatically partition data into time-based "chunks."
Hypertables are the chunking / partitioning of the data
is 24x faster than standard SQL for these operations.
Another argument is just the speed of this infrastructure in general
distributed locks in Redis
This is just saying which worker is working on which trace at a time so that they don't overlap on the same trace.
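A minimal sketch of that per-trace lock idea using Redis SET NX with a TTL; key names, TTLs, and the ownership check are illustrative, not Braintrust's actual scheme.

```python
# Per-trace lock with Redis SET NX + TTL so only one worker claims a trace.
import redis

r = redis.Redis(host="localhost", port=6379)

def try_claim_trace(trace_id: str, worker_id: str, ttl_seconds: int = 30) -> bool:
    """Atomically claim a trace; returns False if another worker holds it."""
    return bool(r.set(f"lock:trace:{trace_id}", worker_id, nx=True, ex=ttl_seconds))

def release_trace(trace_id: str, worker_id: str) -> None:
    """Release the lock only if this worker still appears to own it."""
    key = f"lock:trace:{trace_id}"
    if r.get(key) == worker_id.encode():
        r.delete(key)

if try_claim_trace("trace-123", "worker-a"):
    try:
        pass  # process the trace
    finally:
        release_trace("trace-123", "worker-a")
```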
Trace data in cheap Object Storage (like S3) while keeping metadata in Postgres
Metadata is what is used for organizational info, security, scores, context, user IDs, etc. The trace itself is the exact prompt sent to the AI and the word-for-word response, the intermediate thoughts, tool calls, how long RAG retrieval took, etc.
By eliminating the "waiting for logs to load" phase
Braintrust combines hot data in the WAL and cold storage relatively seamlessly
It merges real-time data from a custom WAL with historical data from S3 in a single pass
This is very similar to what Estuary is doing, but I guess custom-built for very large AI traces
a "failed" trace from production, convert it into a test case with one click, and run it against new code to ensure the bug doesn't reappear.
This is that flywheel moat: integrating errors, converting the better answer into a golden dataset, and helping non-technical users interact with technical ones
This allows teams to filter by specific variables, such as "show me every time the agent used the wrong API key during a refund process."
From a use-case perspective, being able to see this level of detail is extremely helpful
Brainstore captures every intermediate "thought," tool call, and sub-task within an agent's execution path.
Captures the behind-the-scenes tooling - these integrations / APIs could be a moat as well
Attention scales quadratically (O(n2)) with the number of tokens.
In a typical architecture, the pixel goes through Self-Attention (pixels talking to pixels) and then immediately through Cross-Attention (pixels talking to text). It’s a "chain" where:
Pixel-to-Pixel: Ensures Coherence (it looks like a real object).
Pixel-to-Text: Ensures Adherence (it looks like what you asked for).
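The quadratic claim in equation form: the score matrix Q K^T has one row and one column per token, so compute and memory grow with n^2.

```latex
% Scaled dot-product attention; the n x n score matrix is the O(n^2) term.
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d_k},
\qquad Q K^{\top} \in \mathbb{R}^{n \times n}.
\]
```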
Latent Diffusion (processing compressed images) to save on hardware costs.
Latent diffusion is different from reusing somewhat stale pieces of images. It is about compressing the image down so it can later be decoded back to a full-sized image.
Pixel models generally cannot use this trick because the entire image state changes at every denoising step
You can't just look at historical KV caches because they won't be relevant anymore once the image has changed
Rely heavily on Cross-Attention, which maps features from one modality (text prompt) to another (image latent space)
The idea is that the pixels being generated by the model act as the query. The text prompt stays the same, and the pixels ask the text which parts should be red - then the model decides and asks again.
The tile is loaded directly into SRAM, along with the text prompt
Attention must account for spatial proximity (up, down, left, right) rather than just "before" or "after."
The attention output is still vectors. The model calculates a score for how relevant each piece of the prompt is to a given tile, then multiplies the actual values by those scores and sums them up.
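A hedged numpy sketch of that score-then-weighted-sum mechanic: pixel/tile features act as the queries and the prompt-token embeddings supply keys and values. Shapes and dimensions are made up for illustration.

```python
# Cross-attention: pixel queries score against text keys, then take a
# softmax-weighted sum of the text values.
import numpy as np

def cross_attention(pixel_q: np.ndarray, text_k: np.ndarray, text_v: np.ndarray) -> np.ndarray:
    """pixel_q: (n_pixels, d), text_k/text_v: (n_tokens, d) -> (n_pixels, d)."""
    d = pixel_q.shape[-1]
    scores = pixel_q @ text_k.T / np.sqrt(d)          # relevance of each token to each pixel
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ text_v                           # score-weighted sum of token values

pixels = np.random.randn(64, 128)   # 64 pixel/tile features
tokens = np.random.randn(12, 128)   # 12 prompt-token embeddings
print(cross_attention(pixels, tokens, tokens).shape)  # (64, 128)
```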
maintain global structural coherence.
In each step of asking the text prompt what to do with its pixels, every tile asks: "What do the other tiles look like, and how do we all fit the text prompt?"
2D grids of pixel patches
These patches are what get broken down and sent to GPUs - they also need bi-directional attention to align with both the text prompt and what the other image patches are doing.
simultaneously.
"Simultaneously" here is key - you aren't determining one token at a time like with text. Every attention mechanism is happening at the same time.
As an open-source tool with a massive community, it provides a "common language" for data transformations
This is the main thing - it's open source and got a ton of adoption, plus network effects of people knowing how to use it, similar to Kafka
rather than using external processing engines.
Basically Spark / Flink / Estuary, etc. are the actual processing engines. But dbt just uses the one already in the data warehouse, so it doesn't go through a double cycle
combines SQL with Jinja (a Python-based templating engine) to enable logic
Basically just a templating language that compiles down to SQL, making it easy for devs to use
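To make the "SQL + Jinja" idea concrete, here is plain jinja2 rendering a parameterized SQL template in Python; this is not dbt itself, which adds macros like ref() and dependency ordering on top.

```python
# Rendering a parameterized SQL template with jinja2 (the templating engine
# dbt builds on). The table and column names are illustrative.
from jinja2 import Template

sql_template = Template("""
select
    order_id,
    {% for col in amount_columns %}
    sum({{ col }}) as total_{{ col }}{{ "," if not loop.last else "" }}
    {% endfor %}
from raw_orders
where order_date >= '{{ start_date }}'
group by order_id
""")

print(sql_template.render(amount_columns=["net_amount", "tax_amount"],
                          start_date="2024-01-01"))
```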
dbt ensures that parent tables are processed before child tables, preventing "missing data" errors.
This is an important part of the "transformation" of tables in terms of joining them together. The keys are what matter most to keep things sane throughout the dataset
build the physical objects in the warehouse.
It appears to be all warehouse-based - it uses that compute. This is probably the issue with using dbt to perform in-flight transformations
without hitting "memory walls."
With tiling, you calculate the relationship for one small tile and write down that answer; while you're doing this, the next tile is being pulled from HBM, so by the time the current tile is done the data you need for the next one is already loaded (sketched in the code below).
both the attention vectors and the weights needed are "tiled" in,
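A hedged numpy sketch of the tiling idea: attention is computed one K/V block at a time with a running softmax, so the full n x n score matrix never has to sit in fast memory. Real kernels also overlap the block loads with compute, which is not modeled here.

```python
# Blocked attention with an online softmax (running max and denominator),
# processing K/V one tile at a time. Simplified; no prefetch overlap shown.
import numpy as np

def tiled_attention(q, k, v, block: int = 128):
    """q: (n, d), k/v: (m, d). Processes k/v in blocks of `block` rows."""
    n, d = q.shape
    out = np.zeros((n, d))
    running_max = np.full((n, 1), -np.inf)
    denom = np.zeros((n, 1))
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T / np.sqrt(d)                        # (n, block)
        new_max = np.maximum(running_max, scores.max(axis=1, keepdims=True))
        scale = np.exp(running_max - new_max)                 # rescale previous partial sums
        p = np.exp(scores - new_max)
        denom = denom * scale + p.sum(axis=1, keepdims=True)
        out = out * scale + p @ vb
        running_max = new_max
    return out / denom

q, k, v = (np.random.randn(256, 64) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), reference))  # True
```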
uses proprietary networking software to manage how data moves between GPUs
Together is still using NVLink - it somehow re-architects the execution and scheduling to be faster
system runs a smaller, faster "draft" model alongside the main model to predict upcoming text
This is the same thing Fireworks is doing
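A hedged sketch of the greedy version of this draft-and-verify loop; the two "models" are stand-in callables, and a real system would verify all draft tokens with one batched forward pass of the target model rather than calling it per token.

```python
# Greedy speculative decoding: the draft model proposes a few tokens, the
# target model checks them, and the longest agreeing prefix is kept.
from typing import Callable, List

def speculative_step(prompt: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     lookahead: int = 4) -> List[int]:
    # 1. Draft model guesses `lookahead` tokens cheaply.
    guesses, context = [], list(prompt)
    for _ in range(lookahead):
        token = draft_next(context)
        guesses.append(token)
        context.append(token)

    # 2. Target model verifies each guess; keep the agreeing prefix and append
    #    the target's own token at the first disagreement (or at the end).
    accepted = list(prompt)
    for guess in guesses:
        correct = target_next(accepted)
        accepted.append(correct)
        if correct != guess:
            break
    return accepted

# Toy usage: both models just emit last token + 1, so every guess is accepted.
grow = lambda ctx: ctx[-1] + 1
print(speculative_step([1, 2, 3], draft_next=grow, target_next=grow))
```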
fine-tuned for specific chips like the NVIDIA H100 to execute math operations with fewer wasted cycles.
Together is getting down to the low-level kernel math, similar to what Fireworks is doing
implement specialized algorithms that reduce the memory read/write requirements of the Attention Mechanism
While the core algorithm is open, Together AI maintains a proprietary layer called the Together Kernel Collection (TKC) for their own cloud customers
Stale Activation Reuse: It leverages temporal redundancy (the fact that data doesn't change drastically between steps) by using "stale" results from the previous step to start the current step immediately.
This is huge - similar to how LLMs use draft models to predict the next word, PipeFusion uses stale results from the previous step to start the next load even if the tile isn't fully finished. I imagine this would get better with more data?
The system partitions an image into non-overlapping patches
Basically the tiling of the images themselves isn't the hard part. What is hard is determining how to handle a number of patches that is not equal to the number of GPUs. The most important part is how these tiles talk to each other
Executive Summary: Fal PipeFusion Optimization
PipeFusion's job is to decide how to chop an image into "patches" so they can be sent to different GPUs.
This ensures that when a request arrives, the engine does not have to pull multi-gigabyte files over the public internet.
This inference engine isn't the multi-modal specific thing
compiled caches to load pre-compiled CUDA kernels directly into memory.
Don't really understand this
to control exactly how data moves through the processor.
By controlling how data moves through the processor, you are managing both which KV cache entries and weights are kept in SRAM, as well as the actual GPU threads doing the calculations and how much work each thread should take on.
"incremental loading" (only moving new/changed data)
CDC is often associated with real-time streaming, but batch processes also use it to determine which pieces of data have changed and should be moved.
Providers manage "cursors" or "bookmarks" to ensure idempotency (the ability to run a process multiple times with the same result).
Basically they just provide a guaranteed way to load in data
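A minimal sketch of the cursor/bookmark idea: persist a high-water mark so re-running the job only fetches new or changed rows and stays idempotent. The column names and the fetch function are hypothetical.

```python
# Cursor-based incremental loading: store the last-seen updated_at and only
# fetch rows newer than it on the next run. fetch_rows_since is a stand-in.
import json
from pathlib import Path

STATE_FILE = Path("sync_state.json")

def load_cursor(default: str = "1970-01-01T00:00:00") -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return default

def save_cursor(cursor: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_updated_at": cursor}))

def upsert(row: dict) -> None:
    print("upserting", row["id"])             # stand-in for a MERGE/UPSERT keyed on primary key

def incremental_sync(fetch_rows_since) -> None:
    """fetch_rows_since(cursor) -> list of dicts with an 'updated_at' field."""
    cursor = load_cursor()
    rows = fetch_rows_since(cursor)            # e.g. SELECT ... WHERE updated_at > :cursor
    for row in rows:
        upsert(row)                            # idempotent: re-running re-applies the same rows
    if rows:
        save_cursor(max(r["updated_at"] for r in rows))
```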
Providers automate this "Schema Evolution" (modifying the target table to match the source).
This schema evolution is probably pretty tough as well
not because of file formats
Loading data isn't hard because of different file formats - oh lol
Specialized providers manage "throttling" (slowing down requests) and "retries" (re-attempting failed calls) so the connection doesn't break when limits are hit.
Managing these connectors is key
Streaming tools maintain an internal State Store
This is HUGE. It allows you to not re-read the source data, which is way faster
it stays active in memory, processing individual Kafka messages the millisecond they arrive.
For streaming, the architecture has to keep the query alive at all times - not just run it once
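A toy sketch of that always-on, stateful pattern: a consumer loop keeps running per-key aggregates in memory and updates them as each message arrives, instead of re-scanning the source per run. A plain iterable stands in for a Kafka consumer.

```python
# Stateful streaming aggregation: a state store of running totals updated
# one event at a time by an always-on consumer loop.
from collections import defaultdict

def run_streaming_aggregation(messages):
    """messages: iterable of {'user_id': ..., 'amount': ...} events."""
    state = defaultdict(float)                 # the state store: running totals per user
    for event in messages:                     # stays "active", handling one event at a time
        state[event["user_id"]] += event["amount"]
        yield event["user_id"], state[event["user_id"]]

events = [
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u2", "amount": 5.0},
    {"user_id": "u1", "amount": 7.5},
]
for user, running_total in run_streaming_aggregation(events):
    print(user, running_total)   # u1 20.0, u2 5.0, u1 27.5
```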
compiles SQL into a command, sends it to a data warehouse (e.g., Snowflake), and waits for the warehouse to process the entire dataset (or a defined partition) as a fixed block.
The way dbt is built (warehouse-centric) seems like it would take much longer compared to other stream processing frameworks
real-time processing turns thousands of GPS pings into a single "Estimated Time of Arrival" (ETA)
Good real-time example of needing to do a ton of operations on the streamed data to produce this
machine-readable formats like Protobuf or Avro into human-readable or database-friendly formats like JSON or SQL.
These machine-readable formats include metadata within the data itself to make it easy for a person or BI tool to understand
filtered to only emit a record if the temperature exceeds a specific threshold.
You can have certain filters - the processing layer is what actually sees the individual records
Transformation involves "joining" a fast-moving stream (e.g., a credit card swipe) with a static database (e.g., the user’s credit limit) to create a complete record.
This is key, and similar to doing joins in a normal ELT process into a database
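A minimal sketch of that stream-to-static enrichment join; the in-memory dict stands in for the reference database, and the field names are made up.

```python
# Enrichment join: each fast-moving event is joined against static reference
# data keyed on the same id before being emitted downstream.
credit_limits = {"card_123": 5000, "card_456": 1200}   # static reference data

def enrich(swipe_events):
    for swipe in swipe_events:
        limit = credit_limits.get(swipe["card_id"])
        yield {**swipe, "credit_limit": limit, "over_limit": swipe["amount"] > (limit or 0)}

swipes = [{"card_id": "card_123", "amount": 250.0},
          {"card_id": "card_456", "amount": 1500.0}]
for record in enrich(swipes):
    print(record)
```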
Modern Machine Learning models require highly structured, "clean" data to function;
This is key
Unifies data silos so that Marketing, Finance, and Sales
This unification is super important in the age of AI
connecting a CRM (Customer Relationship Management) record to a Web Analytics ID to see a full customer journey.
This linking / joining is probably the most complex part of data transformation
converting thousands of individual daily transactions into a single "Monthly Total Revenue" figure.
Perform functions on the data, like in Excel, to compute aggregate values
converting all dates to YYYY-MM-DD or all currencies to USD).
Makes sense in terms of tools that align the structure of data - kind of like Excel, just at insane scale
removing duplicate customer records or filling in missing values (e.g., converting "NULL" to "0").
This makes sense
It manages KV cache reuse (storing previous conversation context to speed up new responses)
Baseten isn't managing the specific contents of the KV cache - just its persistence and routing, so it stays warm. It routes user requests to a GPU that already holds the relevant caches. It's KV cache orchestration, not deciding exactly what sits in the cache on any one GPU.
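A hedged sketch of what cache-affinity routing can look like (not Baseten's actual router): requests for the same conversation hash to the same replica, so the replica that already holds that conversation's KV cache keeps serving it.

```python
# Cache-affinity routing: hash the conversation id to pick a GPU replica, so
# repeat requests land where the warm KV cache already lives.
import hashlib

REPLICAS = ["gpu-replica-0", "gpu-replica-1", "gpu-replica-2"]

def route(conversation_id: str) -> str:
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

print(route("conv-abc"))   # always the same replica for this conversation
print(route("conv-xyz"))
```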
parallelized byte-range downloads and image streaming
This is a different technical approach to loading weights vs. Modal - it breaks the full model weights into small pieces and downloads them in parallel before loading them
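A hedged sketch of parallelized byte-range downloads: split a large weight file into ranges and fetch them concurrently with HTTP Range headers. The URL and sizes are placeholders; a real system streams shards toward GPU memory rather than reassembling bytes in RAM.

```python
# Parallel byte-range download of a large file using HTTP Range headers.
from concurrent.futures import ThreadPoolExecutor
import requests

def download_ranges(url: str, total_size: int, chunk_size: int = 64 * 1024 * 1024) -> bytes:
    ranges = [(start, min(start + chunk_size, total_size) - 1)
              for start in range(0, total_size, chunk_size)]

    def fetch(byte_range):
        start, end = byte_range
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
        resp.raise_for_status()
        return resp.content

    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(fetch, ranges))   # fetched in parallel, reassembled in order
    return b"".join(parts)

# Example (placeholder URL):
# weights = download_ranges("https://example.com/model.safetensors", total_size=10 * 2**30)
```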
number of active GPU replicas (duplicate model instances)
Essentially it just spins up more GPU replicas when traffic requires it, using an autoscaler - seems similar to Modal.
It also makes the system fail-proof since there are multiple GPUs.
using kernel fusion
Kernel fusion itself isn't proprietary - Baseten manages open-source kernels