Raiders of the Last Data Silos: Unleashing the Power of Databricks Lakehouse Federation 🚀

Mar 13, 2024 | BlogPosts, Databricks

In the vast expanse of the data universe, organizations are on a constant quest to conquer their data assets.

Much like the intrepid Indiana Jones (other explorers are available), they face the daunting task of navigating the labyrinth of data silos.

Enter our blog’s hero, Databricks Lakehouse Federation, a feature that promises to revolutionize the way organizations handle their siloed data.

The Quest to Break Down Data Silos 🏰

One of the key powers of Databricks Lakehouse Federation is its ability to provide a unified view of all data across the organization. This helps to break down the formidable walls of data silos, enabling seamless access to the treasure trove of data, no matter where it is hidden within the organization.

The Map to Enhanced Data Governance 🗺️ 

Databricks Lakehouse Federation offers a single map to manage permissions and access to data from within Databricks. This centralized approach to data governance ensures that only the worthy have access to the sacred, siloed data. 🔐

The Magic of Query Federation 🧙

Query federation in Databricks Lakehouse Federation allows you to run queries on data that resides in external databases without having to import the data into your Databricks workspace. This is akin to casting a spell that creates a portal to an external data source. 🪄

Here’s an example of how you can create a portal from Databricks to Amazon Redshift, using a JDBC-backed table definition:

%sql
CREATE TABLE my_portal
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:redshift://redshift-cluster-1.abc123.us-west-2.redshift.amazonaws.com:5439/my_db',
  dbtable 'my_table',
  user 'my_user',         -- placeholder: use Databricks secrets or a similar service
  password 'my_password'  -- placeholder: use Databricks secrets or a similar service
)

In this example, my_portal is a portal in Databricks that opens up to my_table in an external Amazon Redshift database. 🚪
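With Unity Catalog, the same portal is usually set up as a connection plus a foreign catalog, rather than one JDBC table definition per table. The following is a minimal sketch, assuming a Unity Catalog-enabled workspace and sufficient privileges to create connections and catalogs. The names redshift_conn, redshift_cat, the secret scope my_scope with its keys, and the public schema are hypothetical placeholders, not values from a real environment.

```sql
-- Hypothetical names throughout: redshift_conn, redshift_cat, my_scope.
-- Step 1: store the connection details once, reading credentials from a secret scope.
CREATE CONNECTION IF NOT EXISTS redshift_conn TYPE redshift
OPTIONS (
  host 'redshift-cluster-1.abc123.us-west-2.redshift.amazonaws.com',
  port '5439',
  user secret('my_scope', 'redshift_user'),
  password secret('my_scope', 'redshift_password')
);

-- Step 2: mirror the external database as a foreign catalog in Unity Catalog.
CREATE FOREIGN CATALOG IF NOT EXISTS redshift_cat
USING CONNECTION redshift_conn
OPTIONS (database 'my_db');

-- Step 3: query the external table in place using a three-level name.
-- Assumes my_table lives in the Redshift schema named public.
SELECT * FROM redshift_cat.public.my_table LIMIT 10;
```

Once the foreign catalog exists, every table in the external database is addressable as catalog.schema.table, so there is no need to declare each table individually.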

The Key to Multiple Data Sources 🔑

Lakehouse Federation holds the key to various database types including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Microsoft SQL Server, Azure Synapse (SQL Data Warehouse), Google BigQuery, and of course Databricks.

This wide range of supported data sources ensures that organizations can unlock their existing data infrastructure.
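As a rough illustration of what several sources look like side by side, the sketch below registers a PostgreSQL database and joins it with a Redshift table in one query. It assumes a foreign catalog named redshift_cat already exists for the Redshift database. The host, secret scope, catalog, table, and column names are all hypothetical.

```sql
-- Hypothetical: a connection and foreign catalog for a PostgreSQL database.
CREATE CONNECTION IF NOT EXISTS postgres_conn TYPE postgresql
OPTIONS (
  host 'postgres.example.com',
  port '5432',
  user secret('my_scope', 'postgres_user'),
  password secret('my_scope', 'postgres_password')
);

CREATE FOREIGN CATALOG IF NOT EXISTS postgres_cat
USING CONNECTION postgres_conn
OPTIONS (database 'crm_db');

-- Join tables that live in two different external systems in a single query.
-- Assumes redshift_cat is an existing foreign catalog for Redshift.
SELECT
  c.customer_id,
  c.customer_name,
  SUM(o.amount) AS total_spend
FROM postgres_cat.public.customers AS c
JOIN redshift_cat.public.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;
```

Connection options differ between source types (Snowflake and BigQuery, for instance, take different options from the host and port pair shown here), so check the options for your source before adapting this.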

The Scroll of Data Lineage and Access Control 📜

Databricks Lakehouse Federation ensures that data access is managed and audited for all federated queries made by the users in your Databricks workspaces. This provides a clear scroll of data lineage and helps maintain compliance with data regulations.
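One way to read that scroll is through the audit log system table. This is a sketch only, assuming system tables are enabled in your account and that you have SELECT access to the system.access schema. The table name redshift_cat.public.my_table is a hypothetical foreign table, and the exact events recorded for federated queries may vary, so treat the filter as a starting point.

```sql
-- Assumes the system.access schema is enabled and readable by you.
-- redshift_cat.public.my_table is a hypothetical foreign table name.
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name = 'getTable'
  AND request_params['full_name_arg'] = 'redshift_cat.public.my_table'
  AND event_date >= date_sub(current_date(), 7)
ORDER BY event_time DESC;
```

The getTable action records requests for the table's metadata through Unity Catalog, which gives a reasonable view of who touched the table over the last seven days.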

You can use Databricks’ GRANT and REVOKE SQL commands to manage access to your federated tables:

%sql
-- Grant SELECT permission to a specific user
GRANT SELECT ON TABLE my_portal TO user1;

-- Revoke SELECT permission from a specific user
REVOKE SELECT ON TABLE my_portal FROM user1;

In these examples, user1 is granted or revoked SELECT permission on my_portal. 🔐
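For tables that sit inside a Unity Catalog foreign catalog, a table-level grant on its own is usually not enough, because the user also needs to be able to use the parent catalog and schema. The sketch below assumes a foreign catalog named redshift_cat with a public schema. Both names and the principal user1@example.com are hypothetical.

```sql
-- Hypothetical principal and object names.
-- A user needs access to the catalog and schema before a table grant takes effect.
GRANT USE CATALOG ON CATALOG redshift_cat TO `user1@example.com`;
GRANT USE SCHEMA ON SCHEMA redshift_cat.public TO `user1@example.com`;
GRANT SELECT ON TABLE redshift_cat.public.my_table TO `user1@example.com`;

-- Review what has been granted on the table.
SHOW GRANTS ON TABLE redshift_cat.public.my_table;

-- Remove the table-level access again.
REVOKE SELECT ON TABLE redshift_cat.public.my_table FROM `user1@example.com`;
```

Granting to a group instead of an individual user follows the same pattern and tends to be easier to maintain.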

The Tool for Ad Hoc Reporting and Proof-of-Concept Work 🛠️

Databricks Lakehouse Federation is also a powerful tool for ad hoc reporting and proof-of-concept work.

You can quickly connect to an external data source, run queries, run AI/ML workloads (MLOps), and generate reports without having to go through a lengthy data import process. 📊
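A typical proof-of-concept loop might look like the sketch below: check how a query will run against the external source, then snapshot the slice you need into a Databricks table if you plan to query it repeatedly. It assumes a foreign catalog named redshift_cat and a writable schema main.sandbox. Those names, along with the orders table and its columns, are hypothetical.

```sql
-- Inspect the plan to see which filters are pushed down to the external source.
EXPLAIN FORMATTED
SELECT order_id, amount
FROM redshift_cat.public.orders
WHERE order_date >= '2024-01-01';

-- Snapshot a slice of the external data into a Databricks table for repeated analysis.
CREATE TABLE IF NOT EXISTS main.sandbox.orders_2024 AS
SELECT order_id, customer_id, amount, order_date
FROM redshift_cat.public.orders
WHERE order_date >= '2024-01-01';
```

The snapshot is a point-in-time copy, so it suits exploration and reporting prototypes rather than anything that needs fresh data on every read.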

Databricks Unity Catalog: The Guide to Lakehouse Federation 📚

Databricks Unity Catalog plays a crucial role in the functioning of Lakehouse Federation. It provides a unified governance solution for data and AI. Lakehouse Federation capabilities in Unity Catalog allow you to discover, query, and govern data across various data platforms from within Databricks without moving or copying the data.

All these operations can be performed within a simplified and unified experience. Unity Catalog’s advanced security features, such as row- and column-level access controls, discovery features like tags, and data lineage, will be available across these external data sources, ensuring consistent governance.
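Discovery works with the same commands you would use on any other catalog. The sketch below assumes a foreign catalog named redshift_cat containing a public schema and a table called my_table, all of which are hypothetical names.

```sql
-- List the connections and catalogs visible to you.
SHOW CONNECTIONS;
SHOW CATALOGS;

-- Browse the external database through its foreign catalog.
SHOW SCHEMAS IN redshift_cat;
SHOW TABLES IN redshift_cat.public;

-- Inspect the columns and metadata of a single foreign table.
DESCRIBE TABLE EXTENDED redshift_cat.public.my_table;
```

What you see in the output depends on your privileges, so two users can run the same commands and get different listings.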

The Treasure at the End of the Journey 🏆

In summary, Databricks Lakehouse Federation is a powerful tool that can help organizations manage their data more effectively, gain faster insights, and improve data governance.

Databricks Lakehouse Federation brings several innovative features that address the challenges faced by data teams in enterprises:

Unified View of Data: It provides a unified view of all data across the organization, helping to break down data silos.

Query Federation: This feature enables users and systems to run queries against multiple data sources without needing to migrate all data to a unified system. This can lead to faster insights as you can query the data in place and avoid complex and time-consuming ETL processing.

Support for Multiple Data Sources: Lakehouse Federation supports connections to various database types including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Microsoft SQL Server, Azure Synapse (SQL Data Warehouse), Google BigQuery, and Databricks.

Data Lineage and Access Control: It ensures that data access is managed and audited for all federated queries made by the users in your Databricks workspaces.

Ad Hoc Reporting and Proof-of-Concept Work: It can be used for ad hoc reporting, proof-of-concept work, the exploratory phase of new ETL pipelines or reports, and supporting workloads during incremental migration.

While Databricks Lakehouse Federation offers many advantages, there are also some potential disadvantages to consider:

Real-Time Data Processing: Lakehouse Federation queries can be slower than queries on data that is stored locally in the lake. Therefore, it may not be a good choice for applications that require real-time data processing.

Complex Data Transformations: It might not be the best fit for scenarios where you need complex data transformations and processing, or need to ingest and transform vast amounts of data.

Read-Only Queries: The queries are read-only, which might limit some use cases.

Throttling of Connections: Connection throttling is governed by the Databricks SQL concurrent query limit.

Costs and Maintenance: Running both a data warehouse and a data lake in tandem on a data platform can carry significant cost and maintenance overhead.

By leveraging Databricks Lakehouse Federation, organizations can unlock the full potential of their data and drive their business forward. It’s the treasure at the end of the journey, the lost wisdom that every data adventurer seeks!

If you would like to engage with Mphasis Datalytyx to help evaluate Databricks Lakehouse Federation within your organization, then please reach out to me! 🎉

This blog is written by Sunny Sharma

Disclaimer: Please note that the opinions above are the author’s own and not necessarily those of my current employer. This blog article is intended to generate discussion and dialogue with the audience. If I have inadvertently hurt your feelings in any way, then I’m sorry.
