Databricks has launched a project to create an open-source data sharing protocol for securely sharing data across organisations in real time, independent of the platform on which the data resides.
The Delta Sharing initiative, part of Databricks’ open-source Delta Lake project, has already attracted support from a number of data providers, including NASDAQ, S&P and Factset, and leading IT vendors including Amazon Web Services, Microsoft and Google Cloud, according to Databricks.
Databricks is also expanding its technology portfolio with a new machine learning system and the addition of new data pipeline and data governance capabilities to its flagship Databricks Lakehouse Platform, which combines aspects of data warehouse and data lake systems.
Delta Sharing is the latest open-source initiative from Databricks, one of the most closely watched big data startups. Founded by the developers of the Apache Spark analytics engine, Databricks markets the Databricks Lakehouse Platform, its flagship unified data analytics platform.
In February Databricks, founded in 2013, raised US$1 billion in Series G funding, boosting the company’s market valuation to some US$28 billion. Observers are anticipating that the company may go public sometime this year in what could be one of the IT industry’s biggest IPOs.
Databricks takes the position that successful data management and AI initiatives go hand-in-hand.
“The data is the most important piece of your AI strategy,” said Joel Minnick, Databricks marketing vice president, in an interview with CRN USA. “Customers are trying to get value out of their data and that’s driving business initiatives. You can throw all the money in the world at AI. If you don’t have good data, you’re never going to get good results.”
The new Delta Sharing, included within the open-source Delta Lake 1.0 project, establishes a common standard for sharing all data types – structured and unstructured – with an open protocol that can be used in SQL, visual analytics tools, and programming languages such as Python and R, according to Databricks. Large-scale datasets also can be shared in the Apache Parquet and Delta Lake formats in real time without copying.
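To illustrate what consuming a shared table from Python could look like, here is a minimal sketch that assumes the open-source `delta-sharing` Python connector and a profile file supplied by the data provider. The helper names `build_table_url` and `load_shared_table`, the file name `config.share` and the share, schema and table names are hypothetical and chosen only for illustration.

```python
def build_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build a table address of the form <profile-file>#<share>.<schema>.<table>.

    The connector uses this form to identify a shared table. The empty-name
    check below is a simplification added for this sketch.
    """
    for name in (share, schema, table):
        if not name:
            raise ValueError("share, schema and table must be non-empty")
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Read a shared table into a pandas DataFrame.

    Needs the delta-sharing package (pip install delta-sharing) and a valid
    profile file from the provider, so it is imported only when called.
    """
    import delta_sharing

    return delta_sharing.load_as_pandas(
        build_table_url(profile_path, share, schema, table)
    )


if __name__ == "__main__":
    # Hypothetical names; a real provider would supply its own.
    print(build_table_url("config.share", "market_data", "equities", "daily_prices"))
```

Under these assumptions, the recipient needs only the provider's profile file and the table's name; the data is read from the provider's storage instead of being copied into a separate system first.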
Delta Sharing extends the applicability of the data lakehouse architecture because it “enables an open, simple, collaborative approach to data and AI,” both within and between organisations, according to the company.
“The top challenge for data providers today is making their data easily and broadly consumable. Managing dozens of different data delivery solutions to reach all user platforms is untenable. An open, interoperable standard for real-time data sharing will dramatically improve the experience for both data providers and data users,” said Matei Zaharia, chief technologist and Databricks co-founder, in a statement.
“Delta Sharing will standardise how data is securely exchanged between enterprises regardless of which storage or computing platform they use, and we are thrilled to make this innovation open source,” Zaharia said.
Delta Sharing “has a lot of value for anyone working in the data space,” Minnick said. He expects Databricks partners to use the open-source protocol to better serve customers whose initiatives involve both data inside their own organisation and data from other businesses and organisations.
Delta Sharing is the latest of a number of popular open-source projects Databricks has created that span data processing, data engineering, data science and machine learning. Those include the Spark data processing engine created by Databricks’ founders, Delta Lake, MLflow and Koalas – all, like Delta Sharing, donated to the Linux Foundation.
Databricks, which held its Data + AI Summit last week, also debuted the latest generation of its machine learning software, Databricks Machine Learning, built on the Databricks Lakehouse Platform. The software, according to the company, provides engineers with everything they need to build, train, deploy, manage and maintain ML models.
The ML software includes Databricks AutoML to automate the manual data science steps within the machine learning process. The Databricks Feature Store improves the discoverability, reuse and governance of ML model features within an enterprise’s data engineering platform.
Minnick called the new tool “real-world ML. Having that ML platform built on top of the data platform is really powerful. This is for people who need to get [machine learning] models into production.”
Databricks also debuted Delta Live Tables and Unity Catalog, new features within the company’s lakehouse platform that the company said enhance the platform’s data management capabilities.
About 80 percent of Databricks customers are running the company’s software across multiple cloud systems, Minnick said. Unity Catalog’s unified data catalog technology, underpinned by the new Delta Sharing standard, makes it easier to discover and govern an organisation’s data assets in data lakes spread across multiple clouds.
Delta Live Tables simplifies the development and management of reliable data pipelines on Delta Lake. Big data workloads often require huge data ELT (extract, load and transform) pipelines that are difficult to build and maintain, Minnick said. Live Tables “is removing a massive amount of the heavy lifting data engineers need to do to keep data pipelines flowing,” he said.