Mikołaj Sędek’s Post

https://lnkd.in/d_P-Un5f

Incremental processing of files from volumes and stream-processed messages means no manual bookkeeping and faster, more efficient compute: your pipeline processes only new data and runs minimal merge updates on existing data, so reprocessing is minimised ⚡️💰

Because of that, if you operate on the Databricks Lakehouse and your ETL logic is based on incrementally added events or snapshots, it makes a lot of sense to use Lakeflow Spark Declarative Pipelines to ingest, transform and finally model data with SCD1 / SCD2 using fully incremental logic (SCD2 additionally supports tracking history on selected columns, which removes duplicates and limits history scope). The framework supports SQL and PySpark APIs, static-frame joins, and incremental processing that can be combined with batch processing in a single pipeline if needed.

Combined with Standard Serverless Compute (very good price/performance ratio), the ability to write to custom schemas and catalogs in Unity Catalog, Workflows scheduling and alerting, Asset Bundles CI/CD deployment and a brand-new web UI editor, Lakeflow Pipelines have truly become an extremely valuable part of the Lakehouse ecosystem.

PS. You can run incremental ingestion and processing in both batch and real-time mode with zero code changes (of course, if you need sub-second refresh you need to switch to appropriate sources like Kinesis or Kafka).

#Databricks #LakeflowPipelines #ETL #IncrementalProcessing #Spark #DataEngineering
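To make the "no manual bookkeeping" point concrete, here is a minimal pure-Python sketch of the checkpointing that the framework does for you behind the scenes: remember which files were already ingested so each run touches only new ones. The function name and checkpoint format are illustrative, not a Databricks API; in Lakeflow this state tracking is handled declaratively with no such code.

```python
import json
import os
import tempfile

# Hypothetical sketch of incremental file ingestion: a checkpoint file records
# which input files were already processed, so each run handles only new ones.
# (Lakeflow / Auto Loader manage this state for you; this is what it replaces.)

def process_new_files(directory: str, checkpoint: str) -> list[str]:
    # Load the set of already-seen file names, if a checkpoint exists.
    seen: set[str] = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            seen = set(json.load(f))

    # Only files not present in the checkpoint are picked up.
    new = sorted(f for f in os.listdir(directory) if f not in seen)

    # ... ingest the `new` files here ...

    # Persist the updated checkpoint so the next run skips these files.
    with open(checkpoint, "w") as f:
        json.dump(sorted(seen | set(new)), f)
    return new

# Demo: second run processes only the file added after the first run.
data_dir = tempfile.mkdtemp()
cp = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
open(os.path.join(data_dir, "a.csv"), "w").close()
first = process_new_files(data_dir, cp)   # picks up a.csv
open(os.path.join(data_dir, "b.csv"), "w").close()
second = process_new_files(data_dir, cp)  # picks up only b.csv
print(first, second)
```

Every line of this is boilerplate you do not write with declarative pipelines, which is exactly the appeal.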
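And for readers new to SCD Type 2, a simplified in-memory sketch of the merge semantics the framework applies incrementally: when a tracked value changes, the current version row is closed and a new open version is appended. The `Row` dataclass and `scd2_merge` function are illustrative assumptions for this post, not Databricks code.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical sketch of SCD Type 2 semantics: each key keeps a history of
# versions, with valid_to=None marking the current (open) version.

@dataclass
class Row:
    key: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means "current version"

def scd2_merge(history: list[Row], key: str, value: str, ts: int) -> list[Row]:
    """Close the current version for `key` if its value changed, open a new one."""
    out: list[Row] = []
    changed = True
    for r in history:
        if r.key == key and r.valid_to is None:
            if r.value == value:
                changed = False                      # no change: keep row open
                out.append(r)
            else:
                out.append(replace(r, valid_to=ts))  # close the old version
        else:
            out.append(r)
    if changed:
        out.append(Row(key, value, valid_from=ts))   # open the new version
    return out

# Demo: a user's city changes, producing a closed row and a new current row.
hist: list[Row] = []
hist = scd2_merge(hist, "user1", "Berlin", ts=1)
hist = scd2_merge(hist, "user1", "Warsaw", ts=5)
for r in hist:
    print(r.key, r.value, r.valid_from, r.valid_to)
```

Doing this as a row-by-row merge over only the newly arrived changes, rather than rebuilding the table, is what "fully incremental logic" buys you.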

