112 Matching Annotations
  1. Last 7 days
    1. Delta Lake for transactional data management, and ClickHouse for high-performance, real-time analytics and log storage.

      Delta Lake is used to train the AI on transactional data

    2. The platform is built on AWS and uses Kubernetes and Ray for container orchestration and distributed computing.

      Basically uses Ray (I think Anyscale is the managed version) to manage the GPUs, which are hosted on AWS

    3. prioritizes real-time clickstream behavioral data over traditional vector-only or keyword-based search engines.

      Using real-time clickstream data is the core component that allows it to differentiate vs other systems

    1. Matillion provides a Low-Code/No-Code interface that allows data engineers to build these pipelines

      Think we do want data engineers for the actual data piece of AI enablement

    2. Matillion ensures the AI doesn't ingest irrelevant or sensitive information that could lead to "hallucinations" (confident but false AI responses).

      This is important, but feels more like an "add-on" than an actual core workflow

    3. breaks long documents into smaller segments to fit the "context window" (the data limit an AI can process at once).

      Feel like you need actual access to historical AI data in order to determine what is most important - don't think they have the data moat to do this
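
      A minimal sketch of what such a chunking step does, assuming a simple fixed-size-with-overlap strategy (the sizes and function name are illustrative, not Matillion's actual component):

      ```python
      # Split a long document into overlapping segments so that no chunk
      # exceeds the model's context window (sizes here are arbitrary).
      def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
          chunks = []
          step = chunk_size - overlap
          for start in range(0, len(text), step):
              chunks.append(text[start:start + chunk_size])
          return chunks

      # A 2,500-character document becomes three overlapping ~1,000-character chunks.
      chunks = chunk_document("some very long document " * 100)
      ```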

    4. redacts Personally Identifiable Information (PII), and standardizes formats.

      These are fairly standard transformations that a lot of data warehouses are beginning to do themselves, and you have to compete with more enterprise governance tools like Informatica, etc.

    1. Matillion’s model pushes processing to the data warehouse. Performing LLM (Large Language Model) inference or complex data cleaning inside a warehouse can be prohibitively expensive

      This is similar to dbt - Matillion just does the compute in-warehouse, but that doesn't seem like something teams would want to do for expensive LLM inference

    2. Data prep for AI is increasingly handled by Data Scientists using Python-based tools (like LangChain or LlamaIndex)

      Data prep for AI itself is already being won by key players

    3. If Matillion’s integrations are slower or less flexible than niche, AI-native competitors, they become a bottleneck rather than an enabler.

      Vector workflows require highly complex algorithms that benefit from data flywheels. Matillion doesn't have this flywheel, and it will be very hard to build one

    4. Matillion’s core strength is ELT (Extract, Load, Transform), which traditionally relies on batch processing;

      In today's AI era, you need up-to-date, real-time info rather than historical batch loading. Without a streaming broker, you aren't really playing in the game

    1. This counteracts the commoditization of pre-built connectors by allowing users to rapidly create their own high-quality integrations for niche or proprietary APIs without waiting for a vendor's roadmap.

      Matillion isn't built for AI Agents to actually use APIs - it is built for data engineers to do so. Fundamentally different things

    2. It differentiates Matillion from pure-SaaS competitors who require data to pass through their own servers.

      This is somewhat of a moat, but think all competitors will catch up fairly quickly

    3. It provides "no-code" components for vector store integration (e.g., Pinecone), prompt engineering, and RAG (Retrieval-Augmented Generation) workflows.

      Matillion is now focused on chunking, OCR, embeddings etc for vector databases

    4. Maia uses natural language to build, test, and document entire end-to-end pipelines.

      Have essentially turned into an "AI data engineer application" which is much riskier

    5. fails to dominate the AI/Agentic data engineering space, it risks becoming just one of dozens of identical "IPaaS"

      Dominating this new AI / agentic space is key in terms of helping to "prepare" AI ready data

    6. dbt (Data Build Tool) has become the industry standard for SQL transformations, offering a code-based, version-controlled alternative to Matillion’s visual UI.

      DBT is already starting to dominate the AI data transformation market

    1. By using an open standard (Iceberg, Delta Lake, Hudi), a company can switch their compute engine (e.g., moving from Databricks to Snowflake) without re-engineering their entire ETL pipeline or re-extracting data.

      This takes a huge piece away from the ETL market

    1. security, data lineage, and deployment pipelines around LangChain often exceeds the licensing cost of Dataiku, which provides these "out of the box."

      Dataiku's secret sauce is the LLM mesh connectors and that integration impact. Since everyone uses the same platform, within Dataiku you can have everyone operate under the same data security levels

    2. provides IT-centralized controls (audit logs, versioning, and access permissions) that are native to its platform

      These specific insights are what could make an enterprise-based team choose to use Dataiku because they need more specific tracking

    3. abstracts model APIs to provide cost tracking, PII (Personally Identifiable Information) filtering, and model-agnosticism

      The monitoring of SOTA model APIs is key. Think you can use open-source models with Dataiku but it is a little harder

    4. Dataiku manages the infrastructure and data layer (ETL, storage, model monitoring, and security).

      Dataiku is more focused on the RAG pipeline, choosing data, etc as well as connecting to the tools themselves

    1. converting live, streaming data into vectors in real-time—and immediately performing a join against its current working memory.

      Oftentimes data will just be streamed directly into Estuary, but you need real-time data to aggregate somewhat historical events into the agentic database.

      The GPU will make real-time, instant decisions - TigerGraph will look historically at the different lookups, etc. to deliver data to the agent that it might need.

    1. highly compatible with existing Postgres code, engineering teams can migrate with minimal rewriting of the application layer.

      This is probably a big decision when actually going to market

    2. CockroachDB focuses on ease of geo-distribution and strict data consistency, whereas YugabyteDB prioritizes deep PostgreSQL feature compatibility and high-performance throughput by reusing the original Postgres query engine.

      Cockroach is more about just being insanely resilient - Yuga is about having more functionality while still being distributed

    1. an update in Postgres and its subsequent replication to ClickHouse are not atomic, leading to potential "partial" data states during failures.

      This is the main piece

    2. often require manual intervention or complex mapping logic in Clickpipes to prevent the pipeline from breaking.

      It's not as simple as just using CDC - there are a ton of downstream issues, like these schema issues, that make it harder

    3. (latency between a write in Postgres and its visibility in ClickHouse), meaning analytical queries may reflect stale data compared to the live transactional state.

      This delay is a big reason why having operational and OLAP data together is useful

    1. Users can write simple scripts or "rules" to label data in bulk based on metadata or specific visual triggers

      Leads to a flywheel effect of organizations really "living" in this type of software

    2. small, specialized models trained on a client's specific subset of data to automate repetitive labeling tasks (e.g., identifying a specific type of medical lesion or a specific car part).

      This leads to a super specific data moat

  2. Jan 2026
    1. vectorized query engine

      This is the actual motor that looks at multiple batches of data at once. This is what allows for much faster queries versus a slower, row-at-a-time compute engine
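
      A rough sketch of the difference, using NumPy arrays as a stand-in for column batches (illustrative only, not ClickHouse's actual engine):

      ```python
      import numpy as np

      prices = np.random.rand(10_000_000)
      quantities = np.random.rand(10_000_000)

      # Row-at-a-time engine: one value per loop iteration, lots of overhead.
      def row_at_a_time() -> float:
          total = 0.0
          for p, q in zip(prices, quantities):
              total += p * q
          return total

      # Vectorized engine: operates on whole batches (columns) at once,
      # the same idea a vectorized query engine applies to column chunks.
      def vectorized() -> float:
          return float(np.dot(prices, quantities))
      ```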

    2. automatic tiering of cold data to Amazon S3 (object storage) while maintaining full SQL access.

      This tiering of data enables much more efficient storage, which helps queries run faster because there is less hot data to scan

    1. TigerData’s pgai extension automates this within the database, allowing the database to "vectorize" its own data as it is inserted.

      This means you don't have to ship data out to a separate embedding pipeline, reducing ETL. This is massive.
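
      For contrast, here is a sketch of the external embedding ETL loop that in-database vectorization removes (the table, column names, connection string, and embed() stub are all hypothetical):

      ```python
      import json
      import psycopg2  # assumption: a standard Postgres driver is available

      def embed(text: str) -> list[float]:
          """Placeholder for a call out to an external embedding model/API."""
          return [0.0] * 768  # hypothetical 768-dim embedding

      # The "classic" pipeline: read new rows out of Postgres, call an
      # embedding service, then write the vectors back in a second pass.
      conn = psycopg2.connect("dbname=app")  # hypothetical DSN
      with conn, conn.cursor() as cur:
          cur.execute("SELECT id, body FROM documents WHERE embedding IS NULL")
          for doc_id, body in cur.fetchall():
              cur.execute(
                  "UPDATE documents SET embedding = %s WHERE id = %s",
                  (json.dumps(embed(body)), doc_id),  # stored as JSON text for the sketch
              )
      ```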

    2. TigerData combines pgvectorscale (vector search) with pg_textsearch (BM25 algorithm).

      Combining these is HUGE. It gives you hybrid (vector + keyword) search, reduces hallucinations, and keeps one source of truth.
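
      One common way to combine the two result lists is reciprocal-rank fusion; this is just an illustrative sketch of the hybrid-scoring idea, not TigerData's implementation:

      ```python
      # Fuse a vector-search ranking and a BM25 ranking: documents that rank
      # well in *both* lists float to the top.
      def fuse(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
          scores: dict[str, float] = {}
          for hits in (vector_hits, keyword_hits):
              for rank, doc_id in enumerate(hits):
                  scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
          return sorted(scores, key=scores.get, reverse=True)

      print(fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # d1 and d3 lead
      ```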

    3. SBQ improves upon standard binary quantization by using statistical methods to retain higher accuracy (99% recall) while reducing the data footprint by up to 30x.

      Another huge win: compression algorithms that minimize the data footprint
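
      For intuition, plain binary quantization (the baseline that SBQ improves on with its statistical corrections) already gives roughly a 32x reduction by keeping one sign bit per float32 dimension:

      ```python
      import numpy as np

      def binary_quantize(vectors: np.ndarray) -> np.ndarray:
          bits = (vectors > 0).astype(np.uint8)   # one bit of information per dimension
          return np.packbits(bits, axis=1)        # pack 8 dimensions into each byte

      vecs = np.random.randn(1000, 768).astype(np.float32)
      packed = binary_quantize(vecs)
      print(vecs.nbytes / packed.nbytes)          # ~32x smaller footprint
      ```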

    4. StreamingDiskANN allows the index to reside on SSDs (Disk) with minimal performance loss, significantly reducing hardware costs for large datasets.

      Helps scale pgvector far beyond what it normally handles

    1. compaction (merging small files), clustering (optimizing data layout), and schema evolution.

      These are real managed services that OneHouse provides, optimizing the open table formats (OTFs) in the background

    1. the Binary protocol (for speed) or HTTP (for compatibility/cloud).

      Doesn't mirror the data - it just converts the data into text describing what it is. Much faster than ClickHouse CDC

    2. It automatically converts Postgres SQL syntax and functions (like date_part or percentile_cont) into their equivalent ClickHouse function

      It's basically acting as a translator for ClickHouse
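
      A toy version of that translation layer as a lookup of rough equivalents (illustrative only; note that ClickHouse's quantile() is approximate, whereas percentile_cont is exact):

      ```python
      import re

      # Postgres expression pattern -> rough ClickHouse equivalent (tiny sample;
      # a real translator handles far more than a lookup table).
      PG_TO_CLICKHOUSE = {
          r"date_part\('month',\s*(\w+)\)": r"toMonth(\1)",
          r"date_part\('year',\s*(\w+)\)": r"toYear(\1)",
          r"percentile_cont\(([\d.]+)\) within group \(order by (\w+)\)": r"quantile(\1)(\2)",
      }

      def translate(sql: str) -> str:
          for pattern, replacement in PG_TO_CLICKHOUSE.items():
              sql = re.sub(pattern, replacement, sql, flags=re.IGNORECASE)
          return sql

      print(translate("SELECT date_part('month', created_at), "
                      "percentile_cont(0.5) WITHIN GROUP (ORDER BY latency) FROM events"))
      # -> SELECT toMonth(created_at), quantile(0.5)(latency) FROM events
      ```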

    3. This utilizes ClickHouse’s columnar speed rather than pulling raw data into Postgres for processing.

      Could allow for faster Postgres processing, although there are bottlenecks in the data federation and it isn't as good as

    1. Trace data in cheap Object Storage (like S3) while keeping metadata in Postgres

      Metadata is used for organizational info, security, scores, context, user IDs, etc. The trace itself is the exact prompt sent to the AI and the word-for-word response, the intermediate thoughts, tool calls, how long RAG retrieval took, etc.
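
      A sketch of that split with a hypothetical bucket, table, and columns: the heavy trace payload goes to S3, the queryable metadata goes to Postgres:

      ```python
      import json
      import uuid

      import boto3      # assumption: AWS SDK available
      import psycopg2   # assumption: Postgres driver available

      def store_trace(trace: dict, metadata: dict) -> str:
          trace_id = str(uuid.uuid4())
          # Full trace (prompt, response, intermediate steps) -> object storage.
          boto3.client("s3").put_object(
              Bucket="ai-traces",                    # hypothetical bucket
              Key=f"traces/{trace_id}.json",
              Body=json.dumps(trace).encode(),
          )
          # Small, queryable metadata row -> Postgres.
          conn = psycopg2.connect("dbname=observability")  # hypothetical DSN
          with conn, conn.cursor() as cur:
              cur.execute(
                  "INSERT INTO trace_meta (trace_id, user_id, score, latency_ms, s3_key) "
                  "VALUES (%s, %s, %s, %s, %s)",
                  (trace_id, metadata["user_id"], metadata["score"],
                   metadata["latency_ms"], f"traces/{trace_id}.json"),
              )
          return trace_id
      ```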

    2. It merges real-time data from a custom WAL with historical data from S3 in a single pass

      This is very similar to what Estuary is doing, but I guess custom-built for very large AI traces

    3. a "failed" trace from production, convert it into a test case with one click, and run it against new code to ensure the bug doesn't reappear.

      This is that flywheel moat: integrate errors, convert the better answer into a golden dataset, and help non-technical users interact with technical ones

    4. This allows teams to filter by specific variables, such as "show me every time the agent used the wrong API key during a refund process."

      From a use-case perspective, being able to see this level of detail is extremely helpful

    5. Brainstore captures every intermediate "thought," tool call, and sub-task within an agent's execution path.

      Captures the behind-the-scenes tooling - these integrations / APIs could be a moat as well

    1. a "failed" trace from production, convert it into a test case with one click, and run it against new code to ensure the bug doesn't reappear.

      This is that flywheel moat of integrating errors + convert the better answer to a golden dataset along with helping non-technical users interact with technical ones

    2. This allows teams to filter by specific variables, such as "show me every time the agent used the wrong API key during a refund process."

      From a use-case perspective, being able to see this level of detail is extremely helpful

    3. Brainstore captures every intermediate "thought," tool call, and sub-task within an agent's execution path.

      Captures the behind the scenes tooling - this integrations / APIs could be a moat as well

    1. Attention scales quadratically (O(n²)) with the number of tokens.

      In a typical architecture, the pixel goes through Self-Attention (pixels talking to pixels) and then immediately through Cross-Attention (pixels talking to text). It’s a "chain" where:

      Pixel-to-Pixel: Ensures Coherence (it looks like a real object).

      Pixel-to-Text: Ensures Adherence (it looks like what you asked for).
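
      A toy NumPy sketch of that self-attention → cross-attention chain (projection matrices and multi-head details omitted); the shapes make the quadratic cost visible:

      ```python
      import numpy as np

      def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
          # Scaled dot-product attention. The score matrix is (len(q), len(k)),
          # which is where the quadratic cost in token count comes from.
          scores = q @ k.T / np.sqrt(q.shape[-1])
          weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights /= weights.sum(axis=-1, keepdims=True)
          return weights @ v

      n_pixels, n_text, d = 256, 16, 64
      pixel_tokens = np.random.randn(n_pixels, d)
      text_tokens = np.random.randn(n_text, d)

      # Self-attention: pixels talk to pixels (coherence); cost ~ n_pixels^2.
      pixel_tokens = attention(pixel_tokens, pixel_tokens, pixel_tokens)
      # Cross-attention: pixels query the text prompt (adherence); cost ~ n_pixels * n_text.
      pixel_tokens = attention(pixel_tokens, text_tokens, text_tokens)
      ```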

    2. Latent Diffusion (processing compressed images) to save on hardware costs.

      Latent diffusion is different from reusing somewhat stale pieces of images. It is about compressing the image down so it can later be decoded back into a full-sized image.

    3. Pixel models generally cannot use this trick because the entire image state changes at every denoising step

      You can't just reuse historical KV caches because they won't be relevant anymore since the image has changed

    4. Rely heavily on Cross-Attention, which maps features from one modality (text prompt) to another (image latent space)

      The idea is that the pixels being generated by the model act as the query. The text prompt stays the same, and the pixels ask the text which part of this should be red - the model decides, and then they ask again.

      The tile is loaded directly into SRAM, along with the text prompt

    5. Attention must account for spatial proximity (up, down, left, right) rather than just "before" or "after."

      The attention is still all output as vectors. The model looks at the prompt, calculates a score for how relevant each piece of the prompt is to a certain tile, and then takes the actual values and multiplies them by those scores.

    6. maintain global structural coherence.

      In each step of asking the text prompt what to do with its pixels, every tile asks: "What do the other tiles look like, and how do we all fit the text prompt?".

    7. 2D grids of pixel patches

      These patches are what gets broken down and sent to GPUs - each patch also needs bi-directional attention to align with both the text prompt and what the other patches are doing.
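
      A small sketch of the patchify step that produces that 2D grid (patch size and image size are arbitrary examples):

      ```python
      import numpy as np

      def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
          # Reshape an (H, W, C) image into a grid of flattened patch "tokens".
          h, w, c = image.shape
          grid = image.reshape(h // patch, patch, w // patch, patch, c)
          return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

      img = np.zeros((512, 512, 3), dtype=np.float32)
      patches = patchify(img)
      print(patches.shape)   # (1024, 768): a 32x32 grid of 16x16x3 patches
      ```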

    1. As an open-source tool with a massive community, it provides a "common language" for data transformations

      This is the main thing - it's just open source and got a ton of adoption + network effects of people knowing how to use it, similar to Kafka

    2. rather than using external processing engines.

      Basically Spark / Flink / Estuary, etc. are the actual processing engines. But dbt just uses the one already in the data warehouse, so it doesn't go through a double cycle

    3. dbt ensures that parent tables are processed before child tables, preventing "missing data" errors.

      This is an important part of the "transformation" of tables in terms of joining them together. The keys are what matter most to keep things sane throughout the data set
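
      A toy model of the dependency ordering dbt derives from ref() calls, using Python's stdlib topological sorter (model names are made up):

      ```python
      from graphlib import TopologicalSorter

      # Each model lists the parents it selects from; the DAG guarantees
      # parents are always built before children.
      deps = {
          "stg_orders": set(),
          "stg_customers": set(),
          "orders_joined": {"stg_orders", "stg_customers"},
          "monthly_revenue": {"orders_joined"},
      }

      print(list(TopologicalSorter(deps).static_order()))
      # e.g. ['stg_orders', 'stg_customers', 'orders_joined', 'monthly_revenue']
      ```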

    4. build the physical objects in the warehouse.

      It appears to be all warehouse-based - use that compute. That is probably the issue in using dbt to perform in-flight transformations

    1. without hitting "memory walls."

      This is basically done with tiling: you calculate the relationship for a small tile and write down that answer, and while you're doing this the next tile is being pulled from HBM, so by the time the current tile is done you already have the attention you need. Both the attention vectors and the weights needed are "tiled" in.
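
      A simplified sketch of that tiled, online-softmax idea for a single query vector (this is the general FlashAttention-style trick, not any vendor's kernel):

      ```python
      import numpy as np

      def tiled_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray, tile: int = 128) -> np.ndarray:
          """Attention for one query, processing K/V one tile at a time so only a
          small block needs to sit in fast memory while the next is fetched."""
          d = q.shape[0]
          running_max = -np.inf        # running max of scores (numerical stability)
          denom = 0.0                  # running softmax denominator
          acc = np.zeros(d)            # running weighted sum of values
          for start in range(0, K.shape[0], tile):
              Kt, Vt = K[start:start + tile], V[start:start + tile]
              scores = Kt @ q / np.sqrt(d)
              new_max = max(running_max, scores.max())
              rescale = np.exp(running_max - new_max)   # fix up earlier partial sums
              weights = np.exp(scores - new_max)
              denom = denom * rescale + weights.sum()
              acc = acc * rescale + weights @ Vt
              running_max = new_max
          return acc / denom

      q = np.random.randn(64)
      K, V = np.random.randn(1024, 64), np.random.randn(1024, 64)
      out = tiled_attention(q, K, V)   # matches full softmax(qK^T/sqrt(d)) @ V
      ```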

    2. uses proprietary networking software to manage how data moves between GPUs

      Together is still using NVLink - it somehow re-architects the execution and scheduling to be faster

    3. fine-tuned for specific chips like the NVIDIA H100 to execute math operations with fewer wasted cycles.

      Together is getting to the low-level Kernel math similar to how Fireworks is

    4. implement specialized algorithms that reduce the memory read/write requirements of the Attention Mechanism

      While the core algorithm is open, Together AI maintains a proprietary layer called the Together Kernel Collection (TKC) for their own cloud customers

    1. Stale Activation Reuse: It leverages temporal redundancy (the fact that data doesn't change drastically between steps) by using "stale" results from the previous step to start the current step immediately.

      This is huge - similar to how LLMs use prior context to predict the next word, PipeFusion uses the stale, historical version of a tile to start the next step's work even if the tile isn't fully refreshed. I imagine this would get better with more data?

    2. he system partitions an image into non-overlapping patches

      Basically the tiling of the images themselves isn't the hard part. What is hard is determining how to handle a number of patches that isn't equal to the number of GPUs. The most important part is how these tiles talk to each other

    1. This ensures that when a request arrives, the engine does not have to pull multi-gigabyte files over the public internet.

      This inference engine isn't the multi-modal specific thing

    1. to control exactly how data moves through the processor.

      By controlling how data moves through the processor, you are managing both what KV cache and which weights are kept in SRAM, as well as the actual threads on the GPU doing calculations and how much work each thread should take.

    1. "incremental loading" (only moving new/changed data)

      CDC is often associated with real-time streaming, but batch processes use it too - even in a batch loading process, it determines which pieces of data have changed and should be moved.

    2. Providers manage "cursors" or "bookmarks" to ensure idempotency (the ability to run a process multiple times with the same result).

      Basically they just provide a guaranteed way to load in data
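
      A sketch of cursor-based incremental loading with a saved bookmark (the state file, fetch_rows API, and upsert stub are hypothetical):

      ```python
      import json
      from pathlib import Path

      STATE_FILE = Path("cursor_state.json")   # hypothetical bookmark store

      def load_cursor() -> str:
          if STATE_FILE.exists():
              return json.loads(STATE_FILE.read_text())["last_updated_at"]
          return "1970-01-01T00:00:00+00:00"

      def save_cursor(value: str) -> None:
          STATE_FILE.write_text(json.dumps({"last_updated_at": value}))

      def upsert(row: dict) -> None:
          pass  # placeholder: MERGE/UPSERT into the destination keyed by primary key

      def incremental_sync(fetch_rows) -> None:
          """fetch_rows(since=...) is a hypothetical source API returning rows with
          an 'updated_at' field; only rows newer than the bookmark are moved, and
          re-running the sync produces the same result (idempotency)."""
          cursor = load_cursor()
          rows = fetch_rows(since=cursor)
          for row in rows:
              upsert(row)
          if rows:
              save_cursor(max(r["updated_at"] for r in rows))
      ```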

    3. Specialized providers manage "throttling" (slowing down requests) and "retries" (re-attempting failed calls) so the connection doesn't break when limits are hit.

      Managing these connectors is key

    1. it stays active in memory, processing individual Kafka messages the millisecond they arrive.

      For streaming, you have to have an architecture that keeps the query running at all times - not just one that executes it once

    2. compiles SQL into a command, sends it to a data warehouse (e.g., Snowflake), and waits for the warehouse to process the entire dataset (or a defined partition) as a fixed block.

      The way dbt is built (warehouse-centric) seems like it would be much slower compared to other stream processing frameworks

    1. real-time processing turns thousands of GPS pings into a single "Estimated Time of Arrival" (ETA)

      Good real-time example of needing to do a ton of operations on the data as it is being streamed

    2. machine-readable formats like Protobuf or Avro into human-readable or database-friendly formats like JSON or SQL.

      These machine readable formats include metadata within the data itself to make it easy for a person or BI tool to understand it

    3. filtered to only emit a record if the temperature exceeds a specific threshold.

      Can have certain filters - the processing layer acts as what actually sees the individual data

    4. Transformation involves "joining" a fast-moving stream (e.g., a credit card swipe) with a static database (e.g., the user’s credit limit) to create a complete record.

      This is key, and similar to doing joining in a normal ELT process into a database
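
      A toy stream processor combining the pieces above: decode each message, filter on a threshold, and join against a static lookup (sample data is made up, and JSON stands in for Avro/Protobuf decoding):

      ```python
      import json

      # Static reference data the stream is joined against.
      CREDIT_LIMITS = {"user_1": 5000, "user_2": 250}

      def process(raw_messages):
          """Decode -> filter -> enrich, one record at a time as it arrives."""
          for raw in raw_messages:
              event = json.loads(raw)                 # decode (stand-in for Avro/Protobuf)
              if event["amount"] <= 100:              # filter: only large swipes
                  continue
              yield {**event, "credit_limit": CREDIT_LIMITS.get(event["user_id"])}

      msgs = [b'{"user_id": "user_1", "amount": 40}',
              b'{"user_id": "user_2", "amount": 900}']
      print(list(process(msgs)))   # only the 900 swipe, enriched with the credit limit
      ```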

    1. connecting a CRM (Customer Relationship Management) record to a Web Analytics ID to see a full customer journey.

      This linking / joining is probably the most complex part of data transformation

    2. converting thousands of individual daily transactions into a single "Monthly Total Revenue" figure.

      Performs functions on the information, like Excel, to compute aggregate values

    3. converting all dates to YYYY-MM-DD or all currencies to USD).

      Makes sense in terms of tools that align the structure of data - kind of like Excel but at insane scale
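
      A small sketch of those last two transformations together: standardize dates to YYYY-MM-DD, then roll daily transactions up into monthly totals (sample data is made up):

      ```python
      from collections import defaultdict
      from datetime import datetime

      raw = [
          {"date": "03/01/2024", "amount": 120.0},
          {"date": "2024-01-15", "amount": 80.0},
          {"date": "02/14/2024", "amount": 55.0},
      ]

      def standardize(date_str: str) -> str:
          # Normalize mixed input formats to YYYY-MM-DD.
          for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
              try:
                  return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
              except ValueError:
                  continue
          raise ValueError(f"unrecognized date format: {date_str}")

      monthly = defaultdict(float)
      for tx in raw:
          monthly[standardize(tx["date"])[:7]] += tx["amount"]   # aggregate by month
      print(dict(monthly))   # {'2024-03': 120.0, '2024-01': 80.0, '2024-02': 55.0}
      ```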

    1. It manages KV cache reuse (storing previous conversation context to speed up new responses)

      Baseten isn't managing the specific things in the KV cache - just the persistence and routing of it, so it stays warm. It routes user requests to a GPU that already has specific caches. It's KV cache orchestration, not determining exactly what sits on one within one GPU.
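
      A toy illustration of cache-aware routing: hash the session so every turn of a conversation lands on the replica that already holds its KV cache (names are illustrative, not Baseten's actual routing logic):

      ```python
      import hashlib

      REPLICAS = ["gpu-replica-0", "gpu-replica-1", "gpu-replica-2"]

      def route(session_id: str) -> str:
          # Same session -> same replica, so its KV cache stays warm there
          # instead of being rebuilt on a different GPU.
          digest = hashlib.sha256(session_id.encode()).hexdigest()
          return REPLICAS[int(digest, 16) % len(REPLICAS)]

      print(route("abc"), route("abc"), route("xyz"))
      ```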

    2. parallelized byte-range downloads and image streaming

      This is a different technical approach to loading weights vs Modal - it breaks the model weights down into small pieces and downloads those pieces in parallel
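
      A sketch of parallelized byte-range downloading with plain HTTP (illustrative; a real system would target S3/CDN endpoints and stream the pieces straight toward GPU memory):

      ```python
      from concurrent.futures import ThreadPoolExecutor

      import requests  # assumption: plain HTTP client

      def download_parallel(url: str, size: int, chunk: int = 64 * 1024 * 1024) -> bytes:
          """Split a large weight file into byte ranges, fetch them concurrently,
          then stitch the parts back together in order."""
          ranges = [(start, min(start + chunk, size) - 1) for start in range(0, size, chunk)]

          def fetch(r):
              resp = requests.get(url, headers={"Range": f"bytes={r[0]}-{r[1]}"})
              resp.raise_for_status()
              return resp.content

          with ThreadPoolExecutor(max_workers=8) as pool:
              parts = list(pool.map(fetch, ranges))
          return b"".join(parts)
      ```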

    3. number of active GPU replicas (duplicate model instances)

      Essentially just distributes out these GPUs when traffic requires it, using an autoscaler - seems similar to Modal.

      Also makes it fail-proof since there are multiple GPUs.