Here at Datalytyx we are very excited by today’s announcement of a strategic partnership between Snowflake and Databricks.
To see Snowflake and Databricks in action, sign up for a 30-minute demo by submitting your details here.
Snowflake, the powerful data warehouse built for the cloud, has been the go-to data warehouse solution for Datalytyx since we became the first EMEA partner of Snowflake 18 months ago. As for the Databricks Unified Analytics Platform, the availability of high-performance, on-demand Spark clusters optimised for the cloud, combined with a collaborative notebook environment, has made it the standout tool for data engineering and, in particular, for our data science practice at Datalytyx.
We were able to get started quickly using the instructions provided by Databricks in their blog post.
So, what makes both of these pieces of software the best at what they do, and what makes them so good together?
Snowflake is an extremely powerful SQL data warehouse built for the cloud, with both a slick web interface and command-line tools. Let’s take a closer look at the key features of Snowflake and why it accelerates our big data solution development at Datalytyx.
Separation of storage and compute
Snowflake’s innovative Multi-Cluster, Shared Data Architecture is an exceptional piece of tech with three layers at its core: storage, compute and services. Each layer is independently scalable and decoupled from the others, enabling customers to scale resources as they are required and take advantage of the elasticity of the cloud. Customers do not have to buy and allocate resources for peak consumption. This immediately solves a perennial issue in customers’ big data solutions: the ability to scale up without the large cost of additional hardware, and without the continual cost of maintaining that hardware once the scale is no longer needed.
Support for multiple workloads
Snowflake is designed to easily support multiple disparate workloads because of the separation of compute and storage. You can easily spin up separate warehouses (virtual compute, in essence) to support ETL, ELT and BI workloads individually, and no compute instance is impacted by the others. There is zero resource contention, as they run in entirely separate clusters. This has enabled our ETL developers and data architects to forget about the traditional issues of concurrency, compute and other heavy lifting, and just get on with their job. Furthermore, the compute on offer is so powerful that we at Datalytyx are now able to work at completely new scales. Billions is the new millions!
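As a rough sketch (warehouse names and sizes here are illustrative, not from any real deployment), separate workloads can each be given their own virtual warehouse in plain SQL:

```sql
-- Illustrative: one warehouse per workload, sized and suspended independently
CREATE WAREHOUSE etl_wh WITH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh  WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;

-- A BI session uses its own compute; ETL jobs running on etl_wh are unaffected
USE WAREHOUSE bi_wh;
```

Because each warehouse is a separate cluster over the same shared storage, resizing or suspending one never interrupts the others.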
Fast, zero-copy cloning
Cloning data quickly has been an ever-present challenge in databases and data warehouses. The Snowflake CLONE command can create a clone of a table, a schema or an entire database almost instantly. Because a clone initially shares the underlying storage of its source, you can create multiple copies of production data without incurring additional storage costs, with no need to maintain separate test/dev data sets. We have seen multi-terabyte databases cloned in a matter of seconds.
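For example (object names here are hypothetical), a full development copy of production is a single statement:

```sql
-- Hypothetical names: the clone shares storage with its source,
-- so no data is physically copied at creation time
CREATE TABLE orders_dev CLONE prod_db.sales.orders;

-- Schemas and entire databases clone the same way
CREATE DATABASE analytics_dev CLONE analytics;
```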
Time travel

Snowflake provides the unique capability to query, clone and restore data in tables, schemas or even entire databases as of a point in time. These actions are executed as extensions to SQL (using the AT, BEFORE and UNDROP clauses) within the data retention period, which is up to 90 days on an Enterprise-level account. Customers can query data as of a point in time, restore a table from the point before it became corrupt, or clone a database from before a recent set of updates was applied, using nothing more than SQL statements. This gives our developers at Datalytyx peace of mind while they hack away at problems, knowing that any accident (which we know can happen all too easily) is one query away from being fixed.
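A few sketches of these SQL extensions (table names and the statement ID are placeholders):

```sql
-- Query a table as it looked one hour ago
SELECT * FROM orders AT(OFFSET => -3600);

-- Clone a table as it stood just before a (placeholder) statement ran
CREATE TABLE orders_restored CLONE orders
  BEFORE(STATEMENT => '<query-id>');

-- Recover a table dropped within the retention period
UNDROP TABLE orders;
```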
Native support for semi-structured data

Snowflake has been architected to handle semi-structured data natively in a unique way. Formats such as JSON, Avro, ORC, Parquet and XML can be loaded into a single field in Snowflake, which stores them internally in an efficient, compressed columnar binary representation for performance and efficiency. Simple query extensions enable these semi-structured formats to be queried in SQL while remaining in their native form. Our team at Datalytyx has found this feature very impressive: the query syntax for semi-structured data is highly intuitive and allows us to parse information out of this kind of data at a speed and scale that we couldn’t before.
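A minimal illustration (table and field names are invented): JSON documents land in a single VARIANT column and are queried with simple path syntax:

```sql
-- Illustrative table: one VARIANT column holds whole JSON documents
CREATE TABLE raw_events (payload VARIANT);

-- Path notation drills into the document; :: casts to SQL types
SELECT payload:user.id::STRING     AS user_id,
       payload:event.type::STRING  AS event_type,
       payload:event.ts::TIMESTAMP AS event_time
FROM   raw_events
WHERE  payload:event.type = 'purchase';
```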
Now let’s take a closer look at the Databricks Unified Analytics Platform. It supports workloads including batch processing, real-time/stream processing, machine learning, deep learning and graph analysis, with an optimised version of Apache Spark from its original creators. It is a cloud-native, managed service that offers 10–40x the performance of open-source Spark. Databricks is built with collaboration, performance, agile development and ecosystem integration in mind. The platform is driven by two key components: on-demand Spark clusters, and a familiar-feeling, feature-rich collaborative notebook environment from which code is executed. Think of it as Jupyter notebooks with an integrated, powerful and reliable engine. Let’s take a closer look at the core features and benefits of the platform.
Collaboration

Collaboration is a major reason to choose Databricks for “unifying” data science and engineering efforts. Databricks provides a platform where data scientists and data engineers can easily share workspaces, manage clusters and run jobs through a single interface. They can also commit their code and artifacts to popular source control tools like GitHub and Bitbucket as part of a continuous integration/continuous delivery (CI/CD) process. Within Databricks, users can spin up clusters, create interactive notebooks and schedule jobs to run those notebooks, then easily share these artifacts with other users through the Databricks portal. This allows users to create and build models together in the same notebook in real time, to re-use data assets, libraries and compute resources across the same cluster, and to re-use and monitor scheduled jobs.

This has been a game changer for our team at Datalytyx. While the open source tools alone do offer scope for collaboration and code sharing, Databricks adds a layer of automation and seamlessness such that code sharing becomes an inherent part of the development process rather than a tricky chore. Given the Databricks platform to deliver their projects, your data engineers and data scientists can focus on building pipelines, and you will see huge accelerations in your teams’ learning and development life cycles.
Simplicity of use and management
Databricks makes it very easy to create a Spark cluster out of the box, sized to the requirements of a particular use case, without requiring DevOps. You can choose to enable the cluster to automatically scale up and down based on the workload, or run a serverless pool to enable concurrency across users. Platforms such as AWS EMR, on the other hand, require you to choose node types, establish credentials, spin up virtual machines, and configure file paths for ingestion and ETL. Databricks has used its deep knowledge of Spark to make the platform more secure and reliable than open-source Spark. It is also hard to accidentally leave a job running in Databricks, unlike on other Spark platforms: fail-safes auto-terminate inactive clusters to save resources.
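As a sketch of how little configuration is involved (field values here are placeholders), a Databricks cluster definition can be a handful of JSON fields, with autoscaling bounds and an auto-termination timeout declared up front:

```json
{
  "cluster_name": "shared-analytics",
  "spark_version": "<runtime-version>",
  "node_type_id": "<instance-type>",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "autotermination_minutes": 60
}
```

The `autoscale` block lets the cluster grow and shrink with the workload, and `autotermination_minutes` is the fail-safe that shuts down an idle cluster.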
Diverse language support
Databricks supports multiple languages for data engineering and data science, such as Python, Scala, R and SQL, so you can use your existing skills to start building. There is built-in support for all our favourite open-source libraries: pandas, ggplot, seaborn, TensorFlow, scikit-learn, XGBoost and more, so your team will never feel far from their data science roots. This has aided our transition to Databricks, as our data scientists have been able to hit the ground running with whichever tools they are most familiar with. Databricks is also optimised for machine learning and deep learning, with the ability to spin up GPU clusters on demand using the Databricks ML Runtime. This means training deep learning models with any of the popular libraries has never been easier.
Snowflake and Databricks Combined
We have seen why each of these tools on its own has been the standout choice for us at Datalytyx: Snowflake as the SQL data warehouse, and the Databricks Unified Analytics Platform as our cloud-optimised Spark engine driving engineering and data science pipelines. But how do they work together, and why are we so excited about the partnership between the two specifically? Drawing on our vast experience in providing cloud solutions for our customers, here are the key benefits of this combination going forward.
Simplified pipelines

Databricks enables data engineers to quickly ingest and prepare data and store the results in Snowflake. Once the data is in Snowflake, users can discover and analyse fresh, trusted data in their data visualisation and BI tools of choice. Databricks also enables users to collaborate to train machine learning models on large data sets in Snowflake and to productionise those models at scale. The connector between Snowflake and Databricks means that users do not need to duplicate data between systems. The result is that data engineers and data scientists can build pipelines more rapidly, with significantly less complexity and cost.
Cloud platform independence
Snowflake and Databricks run seamlessly on multiple clouds, which is important for customers. Both have been established for many years on AWS and have recently expanded to support Microsoft Azure. Support for the two most dominant cloud platforms provides coverage across more than 90% of all organisations on cloud.
Performance

Snowflake and Databricks combined increase the performance of processing and querying data by 1–200x in the majority of situations. Databricks provides a series of performance enhancements on top of regular Apache Spark, including caching, indexing and advanced query optimisations, that significantly accelerate processing time. Snowflake provides automated query optimisation and results caching: there are no indexes to build, no partitions or partition keys to define, and no data to pre-shard for distribution, which removes administration and significantly increases speed. The Databricks connector to Snowflake can automatically push Spark operations down to Snowflake as SQL. Combined, the ability to analyse terabytes of data with virtually zero configuration or tuning is second to none.
Security at the core
Snowflake and Databricks both take a holistic approach to the enterprise security challenge, building all the facets of security (encryption, identity management, role-based access control, data governance and compliance standards) into the core of their platforms. Data is automatically encrypted by default in each platform and is therefore secure both at rest and in flight. To our comfort, security is 100% at the heart of the design of both platforms.
Fast, optimised connectors

Snowflake has prioritised ease of connectivity through its in-house-developed connectors. These allow data engineers and data scientists to use the Snowflake backend from within their notebooks, whether to move data to or from the cluster for processing or model building, or to perform a SQL operation on the database itself as part of a scripted workflow. The packages are highly optimised, so the movement of data is limited only by the processing power of the engines on either side.
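A minimal PySpark sketch of this pattern from a Databricks notebook (connection options, credentials and table names are placeholders; it assumes the Snowflake connector is available on the cluster and that `spark` is the notebook’s session):

```python
# Placeholder options; on Databricks, credentials would normally come
# from a secret scope rather than literals
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Read a Snowflake table into a Spark DataFrame; eligible filters and
# aggregations are pushed down to Snowflake as SQL
df = (spark.read
          .format("snowflake")
          .options(**sf_options)
          .option("dbtable", "ORDERS")
          .load())

# Write derived results straight back, with no duplicated copies of the data
(df.groupBy("REGION").count()
   .write
   .format("snowflake")
   .options(**sf_options)
   .option("dbtable", "ORDERS_BY_REGION")
   .mode("overwrite")
   .save())
```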
Simplicity of use together
Another benefit of the connector is that the code can remain in one place (Databricks notebooks) while each platform handles the operations it excels at: classical SQL-based operations on Snowflake, and more complex ETL, visualisation, analytics and machine learning on Databricks. While Snowflake’s web UI is itself a great SQL development environment, it doesn’t have the richness of features that Databricks offers for the agile development of pipelines, nor the collaborative notebook environment that brings them to life.
Snowflake and Databricks are already leaders in their respective categories of Cloud Data Warehousing and Unified Analytics Platforms. We see these technologies as highly complementary and well positioned to deliver the use cases that will help data engineers and data scientists to deliver innovation more rapidly and with less cost and complexity. This is demonstrated by the number of customers that are already using or considering using both platforms together.
More than that, both platforms are unique and modern in their approach, and give an incredibly smooth and intuitive user experience. We found it very easy to get started with Databricks and Snowflake using the instructions in the following blog post (here) and example notebook (here). The products integrate seamlessly, scale with zero effort, are easy to use and just work! This is significantly different from both the traditional data platforms and, to an extent, even the pure-play native cloud data platforms, which require significant DevOps. Snowflake and Databricks are a hyper-modern match made in data heaven.
In fact, we are so impressed with how well the platforms work together that they now form the core of the managed data services Datalytyx delivers to its clients (Data Engineering as a Service and Data Science as a Service), and are the engine room behind the Datalytyx Data Platform – a fully managed Data Platform as a Service that includes 24×7 support and can be fully operational in less than 24 hours. We can bulk load your data into Snowflake in less than a week and be using Databricks to unearth the insights you want in just as little time.
Schedule time with us to see Snowflake and Databricks in action.