Choosing an Iceberg catalog: a beginner’s journey 🧊💥🔧


Icebergs reduce our lock-in, with a whole new world of flexibility

Iceberg tables offer a lot of the great features we know and love in Snowflake and Databricks, with the added benefit of (ostensible) vendor-agnosticism and therefore reduced lock-in. Good news! Every business has an eye on vendor lock-in, so we’re on the right track.

I’m familiar with Snowflake and Databricks as data platforms, and I want to explore using Iceberg tables for some of my data products, perhaps as part of a data mesh. I’m more familiar with Snowflake, so I’m going to start there. Apparently our first task is to pick a catalog — what’s a catalog?

You may think of Iceberg as a format for managing data in a single table, but the Iceberg library needs a way to keep track of those tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog. Catalogs manage a collection of tables that are usually grouped into namespaces. The most important responsibility of a catalog is tracking a table’s current metadata, which is provided by the catalog when you load a table.

Icebergs being born
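To make that concrete, here's a minimal PySpark sketch of what "talking to tables through a catalog" looks like. Everything in it (the catalog name demo, the local warehouse path, the library versions) is a placeholder for illustration; the point is that any Iceberg-compatible catalog plugs into the same spark.sql.catalog.* settings.

```python
from pyspark.sql import SparkSession

# Illustrative only: register an Iceberg catalog called "demo", backed by a
# simple file-based (Hadoop-style) warehouse. Catalog name, path and versions
# are placeholders -- swap in your own catalog configuration.
spark = (
    SparkSession.builder
    .appName("iceberg-catalog-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# The catalog owns namespaces and the table lifecycle...
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales")
spark.sql(
    "CREATE TABLE IF NOT EXISTS demo.sales.orders (id BIGINT, amount DOUBLE) USING iceberg"
)
spark.sql("INSERT INTO demo.sales.orders VALUES (1, 9.99)")

# ...and, crucially, it tracks the table's current metadata file, so loading a
# table through the catalog tells the engine which files to read.
spark.sql("SELECT * FROM demo.sales.orders").show()

spark.sql("DROP TABLE demo.sales.orders")
```

(Renaming is a catalog responsibility too, although the simple file-based catalog used above notably can't do it, which is one of the reasons it isn't recommended for production, as we'll see shortly.)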

OK so a catalog sounds like the database management layer which tells the client which files to read / write for its query — then we can bring our own compute engine to do the work. Great — we can choose to put our catalog in Snowflake or link out to an external one.

(While this article isn’t an Iceberg 101, here’s a nice diagram from Snowflake showing the architecture of a Snowflake-managed Iceberg catalog and table, plus the optional extra of reading the data from outside Snowflake, via a Spark cluster)

“How Iceberg Tables work” by Ron Ortloff and Steve Herbert (https://www.snowflake.com/blog/unifying-iceberg-tables)
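That optional "read from outside Snowflake" arrow translates into Spark configuration along these lines, using Iceberg's SnowflakeCatalog implementation (what Snowflake's docs call the Iceberg catalog SDK). Treat it as a sketch: the catalog name, account URL and credential properties are assumptions on my part, so check the current Snowflake and Iceberg docs for the exact dependencies and settings.

```python
from pyspark.sql import SparkSession

# Sketch: point Spark at Snowflake's Iceberg catalog for read access.
# Account URL, credentials and table names are placeholders; the cluster also
# needs access to the underlying cloud storage (e.g. the S3 bucket).
spark = (
    SparkSession.builder
    .appName("read-snowflake-managed-iceberg")
    .config("spark.jars.packages", ",".join([
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0",
        "net.snowflake:snowflake-jdbc:3.14.5",
    ]))
    .config("spark.sql.catalog.snowcat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.snowcat.catalog-impl",
            "org.apache.iceberg.snowflake.SnowflakeCatalog")
    .config("spark.sql.catalog.snowcat.uri",
            "jdbc:snowflake://<account_identifier>.snowflakecomputing.com")
    .config("spark.sql.catalog.snowcat.jdbc.user", "<user>")          # assumed property name
    .config("spark.sql.catalog.snowcat.jdbc.password", "<password>")  # assumed property name
    .getOrCreate()
)

# Reads work; writes don't -- Snowflake-catalogued Iceberg tables are
# read-only from third-party engines (see the limitation quoted below).
spark.sql("SELECT * FROM snowcat.my_db.my_schema.my_iceberg_table LIMIT 10").show()
```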

Catalog limitations

While researching options for interacting with Iceberg tables, I ran into the issue of cross-compatibility — which engine can write to which Iceberg tables? If I create an Iceberg table in Snowflake, what can Databricks do with it? Which version of the Iceberg spec does each one use? Ironically, it seems there’s a lot more to Iceberg than meets the eye (har har).

Oh that’s a lot of icebergs

The simplest catalog for Iceberg is the Hadoop catalog, which despite its name is basically some files containing metadata on how to read the data files. It’s not recommended for production though, so we can ignore it. In Snowflake that leaves us the options of Snowflake’s own catalog and AWS Glue… but in each case can we create Iceberg tables and write to them from multiple places? If we create the catalog in Snowflake:

Third-party clients cannot append to, delete from, or upsert data to Iceberg tables that use Snowflake as the catalog.

In fact, sharing between regions in Snowflake isn’t supported yet:

Cross-cloud and cross-region sharing of Iceberg Tables is not currently supported. The provider’s external volume, Snowflake account, and consumer’s Snowflake account must all be in the same cloud region.

(Which also means that, as per this KB article, when you create your Iceberg tables, your S3 bucket needs to be in the same region as your Snowflake account.)
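For completeness, creating a Snowflake-catalogued Iceberg table looks something like the sketch below (here driven from Python via the Snowflake connector). The external volume, names and credentials are placeholders you'd have set up already; the quickstart linked at the end walks through the real thing.

```python
import snowflake.connector

# Sketch: create an Iceberg table with Snowflake as the catalog. The external
# volume must already exist and (per the note above) live in the same region
# as the Snowflake account. All identifiers and credentials are placeholders.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="ANALYTICS",
    schema="SALES",
)

conn.cursor().execute("""
    CREATE ICEBERG TABLE orders (id NUMBER, amount FLOAT)
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'my_s3_volume'
      BASE_LOCATION = 'orders/'
""")

# Snowflake can read and write this table; external engines can only read it.
conn.cursor().execute("INSERT INTO orders VALUES (1, 9.99)")
```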

So for a Snowflake catalog it’s writeable from Snowflake but read-only everywhere else. What about Databricks? It takes a similar approach, and supports read-only access to Delta tables when they’re masquerading as Iceberg tables using UniForm.
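On the Databricks side, my understanding is that UniForm is switched on with a couple of table properties when the Delta table is created (or altered), after which other engines can read it as Iceberg but not write to it. A rough sketch from a notebook, with a made-up table name and property names taken from the Databricks docs at the time of writing (so verify them against the current docs):

```python
# Runs in a Databricks notebook, where `spark` is provided by the session.
# The catalog/schema/table names are placeholders; the TBLPROPERTIES keys are
# as documented for UniForm at the time of writing -- check current docs.
spark.sql("""
    CREATE TABLE main.analytics.orders_uniform (id BIGINT, amount DOUBLE)
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Writes still go through Delta on Databricks; other engines read the
# generated Iceberg metadata, read-only.
spark.sql("INSERT INTO main.analytics.orders_uniform VALUES (1, 9.99)")
```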

What about other options? At the time of writing the only other catalog supported by Snowflake is AWS Glue, and some sources (like this blogpost by Deepak Rajak) say that Glue gives us the ability to write from multiple engines, at the expense of quite a lot of setup. In fact we can look at the Iceberg documentation itself and see it gives 6 (six!) different options for a catalog in AWS, depending on your preferences, requirements, and the phase of the moon.
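For a flavour of the Glue option, the Iceberg AWS documentation boils down to Spark configuration roughly like this. The catalog name, bucket and versions are placeholders, and AWS credentials/region are assumed to come from the usual AWS configuration chain rather than being shown here.

```python
from pyspark.sql import SparkSession

# Sketch of the AWS Glue catalog option from the Iceberg docs. Catalog name,
# bucket and versions are placeholders; AWS credentials and region come from
# the standard AWS config chain (environment, profile, instance role, ...).
spark = (
    SparkSession.builder
    .appName("iceberg-glue-demo")
    .config("spark.jars.packages", ",".join([
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0",
        "org.apache.iceberg:iceberg-aws-bundle:1.5.0",
    ]))
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-iceberg-bucket/warehouse")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Because Glue is an open catalog, several engines (Spark, Trino, Athena,
# Snowflake reading it as an external catalog, ...) can point at these tables.
spark.sql("CREATE NAMESPACE IF NOT EXISTS glue.sales")
spark.sql(
    "CREATE TABLE IF NOT EXISTS glue.sales.orders (id BIGINT, amount DOUBLE) USING iceberg"
)
```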

Read-only — does it matter?

The Rock says it doesn’t matter but probably wasn’t talking about Iceberg catalogs (https://www.youtube.com/watch?v=VAag-nlCJQ0)

Read-only is often seen as a limitation, but what are the real use-cases? In most data engineering architectures there's a 1:1 mapping between the data-writing process and the data product, i.e. something loads and transforms data, then makes it available for other people to use. It's unusual for a user or process from another domain to need to alter a data product directly; we would expect them to provide data as an input to the machine rather than altering the outputs themselves.

Further discussion of (and evidence for) this is in the excellent "Data Products for Dummies", under the discussion of composite data products. In the diagram below we would expect 'T' to be created and maintained by the process shown, and to be immutable otherwise, especially in a DataOps paradigm where the tables and views that make up a data product are just a projection of the code and configuration. As an analogy, we wouldn't expect someone at Netflix to alter the running code on server 23af765e94ac; instead they would commit their change to the codebase and the DevOps process would deploy new servers. Similarly, we don't expect another data domain to write to our data product directly; we would strongly prefer them to give us the data we need to incorporate, so we can run it through our whole DataOps process, including any testing and governance we need to apply.

🖐 If you have a use-case where team B needs to directly edit the data product from team A, or more generally where two data engines need to independently edit a table, then please add a comment and let me know!

Acceptance

So in the Snowflake and Databricks world we can write our Iceberg tables from one place and read them from either, and we're OK with that. What can we do to reduce the lock-in? How about making sure we can migrate the catalog in the future?

 

Futurama will always be relevant (https://www.imdb.com/title/tt0584437/)

Dremio have a great blogpost on migrating Iceberg catalogs, which in turn links to the Project Nessie migration tool. They've also written a great resource on Iceberg in general, Apache Iceberg: The Definitive Guide, which includes some example Spark SQL configuration for performing a migration (see Chapter 5). We want to be confident that we can migrate out of Snowflake's catalog if we need to, and while the proof of the pudding will be in the migrating, there seems to be enough tooling and reference material to make such a migration entirely feasible, should it ever be required.
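As a flavour of what one migration step can look like, Iceberg ships Spark procedures such as register_table, which points a new catalog at an existing table's current metadata file (the Nessie tool and the book cover the bulk, automated version of this). A sketch, with the catalog name, table name and metadata path all as placeholders:

```python
# Sketch: register an existing Iceberg table, by its current metadata.json,
# in a different catalog. "target" is assumed to be an Iceberg catalog already
# configured on this Spark session; names and paths are placeholders.
spark.sql("""
    CALL target.system.register_table(
        table => 'sales.orders',
        metadata_file => 's3://my-iceberg-bucket/warehouse/sales/orders/metadata/00012-abc.metadata.json'
    )
""")

# The new catalog now serves the same underlying data files; cleaning up or
# freezing the old catalog entry is a separate, deliberate step.
spark.sql("SELECT * FROM target.sales.orders LIMIT 10").show()
```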

Conclusions

In conclusion I’d be happy to use the in-built Iceberg catalog from either Snowflake or Databricks, especially if I’m already using that platform. There are limitations to each when compared with other catalog providers, but a) the limitations aren’t terrible and b) we can expect both platforms to improve their features as the Delta vs Iceberg battle rumbles on. So:

  • Does the Snowflake-managed catalog let us get started with Iceberg? ✅
  • Does it have an outbound migration path if we need it? ✅
  • Is it easy to get started? ✅✅✅

Next

  1. If you want to have a play with Iceberg in Snowflake, try the “Getting Started with Iceberg Tables” quickstart here
  2. Next up I’m going to investigate Dremio Arctic, so do like and subscribe in case that ever makes it to publication
  3. And in the meantime thanks for reading, and please let me know if you found this useful or anything I might have overlooked!

Thanks to:

"Data Products for Dummies" courtesy of DataOps.live

"Apache Iceberg: The Definitive Guide" courtesy of Dremio

This blog is written by Dan Martyr

Review and feedback from Tom Saunders
