It's been a wild ride over the past six years as ZDNet gave us the opportunity to chronicle how, in the data world, the bleeding edge has become the norm. In 2016, Big Data was still considered the province of early adopters. Machine learning was confined to a relative handful of Global 2000 organizations, because they were the only ones who could afford to recruit teams from the limited pool of data scientists. The notion that combing through hundreds of terabytes or more of structured and variably structured data would become routine was a pipe dream. When we began our part of Big on Data, Snowflake, which cracked open the door to the elastic cloud data warehouse that could also handle JSON, was barely a couple of years out of stealth.
In a short piece, it's going to be impossible to capture all the highlights of the last few years, but we'll make a valiant attempt.
When we began our stint at ZDNet, we'd already been tracking the data landscape for over 20 years. So it was all too fitting that our very first ZDNet post, on July 6, 2016, looked at the journey of what became one of the decade's biggest success stories. We posed the question, "What should MongoDB be when it grows up?" Yes, we spoke of the trials and tribulations of MongoDB as it pursued what co-founder and then-CTO Eliot Horowitz prophesied: that the document form of data was not only a more natural way of representing data, but would become the default go-to for enterprise systems.
MongoDB got past early performance hurdles with an extensible 2.0 storage engine that overcame a lot of the platform's show-stoppers. Mongo also began a grudging coexistence with the relational world through features like the BI Connector, which allowed it to work with the Tableaus of the world. Yet today, even with relational database veteran Mark Porter at the technical helm, the company is still drinking the same Kool-Aid: that document will become the ultimate end state for core enterprise databases.
We might not agree with Porter, but Mongo's journey revealed a couple of core themes that drove the most successful growth companies. First, don't be afraid to ditch the 1.0 technology before your installed base gets entrenched, but try to keep API compatibility to ease the transition. Second, build a great cloud experience. Today, MongoDB is a public company on track to exceed $1 billion in revenues (not valuation), with more than half of its business coming from the cloud.
We've also seen other hot startups not handle the 2.0 transition as smoothly. InfluxDB, a time series database, was a developer favorite, just like Mongo. But InfluxData, the company behind it, frittered away early momentum because it got to a point where its engineers couldn't say "No." Like Mongo, it embraced a second-generation architecture. Actually, it embraced several of them. Are you starting to see a disconnect here? Unlike MongoDB, InfluxDB's next-generation storage engine and development environments were not compatible with the 1.0 installed base, and surprise, surprise, a lot of customers didn't bother with the transition. While MongoDB is now a billion-dollar public company, InfluxData has drawn barely $120 million in funding to date, and for a company of its modest size, it is saddled with a product portfolio that grew far too complex.
It shouldn't be surprising that the early days of this column were driven by Big Data, a term we used to capitalize because it required unique skills and platforms that weren't terribly easy to set up and use. The emphasis has since shifted to plain "data," thanks not only to the equivalent of Moore's Law for networking and storage, but more importantly, to the operational simplicity and elasticity of the cloud. Start with volume: you can analyze pretty large multi-terabyte data sets on Snowflake. And in the cloud, there are now many paths to tackling the rest of the three V's of big data; Hadoop is no longer the sole path and is now considered a legacy platform. Today, Spark, data lakehouses, federated query, and ad hoc query against data lakes (a.k.a. cloud storage) can readily handle all the V's. But as we stated last year, Hadoop's legacy is not that of a historical footnote, but of a spark (pun intended) that accelerated a virtuous wave of innovation and got enterprises over their fear of data, and lots of it.
Over the past few years, the headlines have pivoted to cloud, AI, and of course, the continuing saga of open source. But peer under the covers, and this shift in spotlight was not away from data, but because of it. Cloud provided economical storage in many forms; AI requires good data, and lots of it; and a large chunk of open source activity has been in databases, integration, and processing frameworks. Data is still there, but we can hardly take it for granted.
The operational simplicity and scale of the cloud control plane rendered the idea of marshalling your own clusters and taming the zoo animals obsolete. Five years ago, we forecast that the majority of new big data workloads would be in the cloud by 2019; in retrospect, our prediction proved too conservative. A couple of years ago, we forecast the emergence of what we termed The Hybrid Default, pointing to legacy enterprise applications as the last frontier for cloud deployment, with the vast majority of them staying on-premises.
That's prompted a wave of hybrid cloud platform introductions, and newer options from AWS, Oracle, and others to accommodate legacy workloads that otherwise don't translate easily to the cloud. For many of those hybrid platforms, data was often the very first service to get bundled in. And we're also now seeing cloud database as a service (DBaaS) providers introduce new custom options to capture many of those same legacy workloads, where customers require more access to and control over the operating system, database configuration, and update cycles than vanilla DBaaS options allow. Those legacy applications, with all their customization and data gravity, are the last frontier for cloud adoption, and most of it will be hybrid.
The data cloud may become a victim of its own success if we don't make it any easier to use. That was a core point in our parting shot in this year's outlook. Organizations adopting cloud database services are likely also consuming related analytics and AI services, and in many cases may be using multiple cloud database platforms. In a managed DBaaS or SaaS service, the cloud provider may handle the housekeeping, but for the most part, the burden of integrating the different services remains on the customer's shoulders. More than a debate between specialized and multimodel or converged databases, it's a question of either bundling related data, integration, analytics, and ML tools end to end, or at least making these services more plug and play. In our Data 2022 outlook, we called on cloud providers to start "making the cloud easier" by relieving the customer of some of this integration work.
One place to start? Unify operational analytics and streaming. We're starting to see it: Azure Synapse is bundling in data pipelines and Spark processing; SAP Data Warehouse Cloud is incorporating data visualization; and AWS, Google, and Teradata are bringing machine learning (ML) inference workloads inside the database. But folks, this is all just a start.
While our prime focus in this space has been on data, it is virtually impossible to separate the consumption and management of data from AI, and more specifically, machine learning (ML). It's several things: using ML to help run databases; using data as the oxygen for training and running ML models; and increasingly, being able to process those models inside the database.
And in many ways, the growing accessibility of ML, especially through AutoML tools that automate or simplify assembling the pieces of a model, or through the embedding of ML into analytics, is reminiscent of the disruption that Tableau brought to the analytics space by making self-service visualization table stakes. But ML will only be as strong as its weakest data link, a point that was driven home when we conducted an in-depth survey of a baker's dozen of chief data and analytics officers a few years back. No matter how much self-service technology you have, it turns out that in many organizations, data engineers will remain a more precious resource than data scientists.
Just as AI/ML has been a key tentpole in the data landscape, open source has enabled a Cambrian explosion of data platforms that, depending on your perspective, is a blessing or a curse. We've seen a lot of cool, modest open source projects that could, from Kafka to Flink, Arrow, Grafana, and GraphQL, take off from practically nowhere.
We've also seen petty family squabbles. When we began this column, the Hadoop open source community saw lots of competing, overlapping projects. The Presto folks didn't learn Hadoop's lesson. The folks at Facebook, where Presto originated, threw hissy fits when its lead developers left to form their own company. The result was a stupid branding war that ended in a Pyrrhic victory: the Facebook folks who had little to do with Presto kept the trademark, but not the key contributors. That fractured the community and kneecapped their own spinoff. Meanwhile, the top five contributors joined Starburst, the company that was exiled from the community and whose valuation has since grown to $3.35 billion.
One of our earliest columns, back in 2016, posed the question of whether open source software had become the default enterprise software business model. Those were innocent days; within a few years, shots were being fired over licensing. The trigger was concern that cloud providers were, as MariaDB CEO Michael Howard put it, strip-mining open source (Howard was referring to AWS). We subsequently ventured the question of whether open core could be the salve for open source's growing pains. In spite of all the catcalls, open core is very much alive in what players like Redis and Apollo GraphQL are doing.
MongoDB fired the first shot with SSPL, followed by Confluent, CockroachDB, Elastic, MariaDB, Redis, and others. Our take is that these players had valid points, but we grew concerned about the sheer variety of quasi-open source licenses du jour that kept popping up.
Open source to this day remains a topic that gets many folks, on both sides of the argument, very defensive. The piece that drew the most flame tweets was our 2018 post on DataStax attempting to reconcile with the Apache Cassandra community, and it's notable today that the company is bending over backwards not to throw its weight around in the community.
So it's not surprising that over the past six years, one of our most popular posts posed the question, "Are open source databases dead?" Our conclusion from the whole experience is that open source has been an incredible incubator of innovation: just ask anybody in the PostgreSQL community. It's also one where no single open source strategy will ever be able to satisfy all of the people all of the time. But maybe this is all academic. Regardless of whether the database provider has a permissive or restrictive open source license, in this era where DBaaS is becoming the preferred mode for new database deployments, it's the cloud experience that counts. And that experience is not something you can license.
As we've noted, the great reckoning ahead is how to deal with all of the data that is landing in our data lakes or being generated by all sorts of polyglot sources, inside and outside the firewall. The connectivity promised by 5G will bring the edge closer than ever. That has in large part fueled the emerging debate over data meshes, data lakehouses, and data fabrics, a discussion that will consume much of the oxygen this year.
It's been a great run at ZDNet, but it's time to move on. Big on Data is moving: Big on Data bro Andrew Brust and I are taking our coverage under a new banner, The Data Pipeline, and we hope you'll join us for the next chapter of the journey.