I know what you’re thinking… Stop it. Just stop. My data belongs in a relational database, and you’re not going to convince me otherwise! I get it. I used to be you. Relational databases are our bread & butter, our wheelhouse, our go-to solution when we need to write, store, and read data. But bear with me. There is a case to be made for leveraging a data lake store, and frankly, I was surprised myself at how it all turned out.
Join me for a quick retrospective of why and how we implemented Azure Data Lake Store for a client.
Call it the benefits of the South Florida sunshine, the downtown West Palm Beach vibe, or the salt air, but I’m feeling positive about data lakes these days. Maybe it’s the water theme working its magic on me. Maybe I’ve finally wrapped my head around Big Data. Either way, we’re excited to have this technology in our toolkit.
We had a situation on our hands. Our client’s database servers were overloaded and unstable. They’d just gone through an acquisition, and data were stored in multiple places. A mission-critical process was suffering for it. They needed consolidated data, and fast. We had applications, old and new, that all needed access to this data. Throw in a few parallel projects, a large team of developers, multiple business objectives, and we were in a world of hurt with no time to spare.
The Innovative Architects team met and laid out our issues and what we needed. The list was simple but challenging:
- We need to spend as little time as possible moving the data
- We need some familiarity when developing – no huge learning curve, please!
- We need something that isn’t going to break the bank
- We all need to be able to work simultaneously
- We need something that scales
Spoiler Alert! We went with Azure Data Lake Store. Now let’s dive into each of the needs on that list.
Moving the Data Quickly and Simply
Hold on to your hat, because we’re about to talk concepts. The huge paradigm shift when thinking about data lakes versus relational databases is schema. When we design a new database, we define the structure of the data up front: we have tables and columns, and data must be shoved into those tables without violating any constraints we’ve created (schema-on-write). With a data lake, the idea is that we store the data as-is and worry about schema when we read the data (schema-on-read). This makes getting the data into the lake so much easier. We weren’t living in a world of Extract-Transform-Load; we were only ingesting and storing. This allowed us to get the data into a common area for use by multiple applications, platforms, and audiences very quickly using Azure Data Factory (ADF) and U-SQL.
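To make schema-on-read concrete, here’s a minimal U-SQL sketch (the file paths and column names are hypothetical, not our client’s). Notice that the structure lives in the EXTRACT statement and is applied at query time; the CSV file itself landed in the lake untouched.

```
// Schema-on-read: the file was dropped into the lake as-is.
// Columns are declared here, at read time, not at load time.
@customers =
    EXTRACT CustomerId int,
            Name       string,
            Region     string
    FROM "/raw/crm/customers.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// Project just what this consumer needs and write it back to the lake.
@florida =
    SELECT CustomerId, Name
    FROM @customers
    WHERE Region == "FL";

OUTPUT @florida
TO "/curated/crm/florida_customers.csv"
USING Outputters.Csv(outputHeader : true);
```

If the source system later adds a column, the file lands in the lake just the same; only the readers that care about the new column need to change their EXTRACT.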
Familiar Languages = Less Learning Curve
Big Data sounds intimidating. Hive and Pig and their numerous variations are a lot to take in. The nice thing about Microsoft’s implementation of Azure Data Lake is U-SQL. It’s a little bit country, a little bit rock ’n’ roll. Wait. Let’s try again… It’s a little bit SQL, a little bit C#. Together, that makes a useful and flexible language that most developers have at least a baseline familiarity with. And since we didn’t have to do data-transform gymnastics to load the data lake, our processes were mostly straight reads and writes – which equals rapid deployment.
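Here’s a hedged illustration of that SQL-plus-C# blend (again, the names and paths are made up): the shape of the query is SQL, while the expressions are plain .NET calls.

```
@orders =
    EXTRACT OrderId   int,
            Customer  string,
            OrderDate string,   // arrives as text; we'll parse it with C#
            Total     decimal
    FROM "/raw/erp/orders.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

// SQL shape, C# expressions: DateTime.Parse and ToUpperInvariant are
// ordinary .NET methods embedded right in the query.
@cleaned =
    SELECT OrderId,
           Customer.ToUpperInvariant() AS CustomerName,
           DateTime.Parse(OrderDate).Year AS OrderYear,
           Total
    FROM @orders;

OUTPUT @cleaned
TO "/curated/erp/orders_cleaned.csv"
USING Outputters.Csv(outputHeader : true);
```

A SQL developer reads the SELECT without blinking; a C# developer reads the expressions without blinking. That overlap is where the short learning curve came from for our team.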
Additionally, we leveraged the fledgling but rapidly improving Data Factory for some scheduled load jobs. We created a process that pushed an entire SQL Server database into the lake on a daily basis using the Copy Data functionality, which is still in Preview mode.
Big Data Without Breaking the Bank
I’ve heard another client say, “Big data means big money.” The perception out there is that Big Data is only meant for the Googles and Amazons of the world. This just isn’t true. Big Data is a misnomer; it should be Alt-Structured Data, or Non-Relational Data, or something. (I’m obviously not a marketing expert.) Big Data makes for a nice phrase, but it makes people think they must have petabytes of data to leverage the technology. We’re here to tell you that you can have Big Data and save money doing it.
The great thing about Azure Data Lake Store is that you’re paying a relatively low cost for storage and a reasonable cost for read/write activity, and you’re leveraging Microsoft’s enormous cloud infrastructure to do it. The pricing details are here. To set up a similar environment on premises would have been cost-prohibitive. Add in the natural redundancy of the Hadoop Distributed File System (HDFS), and you’ve got a cost-effective Big Data solution that ties naturally into the overall Microsoft platform, including a Visual Studio SDK, Power BI integration, and direct portal access to the data.
Parallel Development is a Go!
If you’ve ever been part of a data-centric project, you know the challenges of standing up data stores while developers are trying to develop. The struggle and the refactoring are real. The database architects are trying to nail down schema design while the application architects are screaming that they need data, now. And if you must write complex ETL processes to load your data, well then, you’d better bring brownies for your app guys, because they’re going to be grumpy. With the data lake, we provisioned the store and set up access in hours, not days. From there, we established a few guidelines for how to organize the data, and everyone went to town. Three developers simultaneously loading data into a data store on Day 2 of a project was a beautiful thing.
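As an example of the kind of guideline we mean (the convention below is hypothetical, not our client’s actual layout), agreeing on a simple source/entity/date folder structure up front is what lets several developers load the lake at once without colliding:

```
// Hypothetical convention agreed on Day 1:
//   /raw/{source-system}/{entity}/{yyyy}/{MM}/{dd}/{entity}.csv
// Each developer owns a source system, so parallel loads never collide.
@invoices =
    EXTRACT InvoiceId int,
            Amount    decimal
    FROM "/staging/invoices.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

OUTPUT @invoices
TO "/raw/legacyerp/invoices/2017/06/15/invoices.csv"
USING Outputters.Csv(outputHeader : true);
```

Because the lake has no schema to negotiate, the convention itself was the whole design review – a one-hour conversation instead of weeks of data modeling.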
You Want Scalability? That’s How You Get Scalability!
It is a basic tenet of the Big Data ideal that scalability is built right into the infrastructure. Scaling data across multiple commodity servers is part of the deal. Conceptually, we all got this. In practice, it was just cool to see. We ramped up from zero to many gigabytes quickly, and not once did we have to check in with the infrastructure team to beg for more space. You’ll see it in the pricing model: it’s priced by terabyte ranges, not gigabytes. Add in that there is literally no limit on transaction sizes or file sizes, and you’ve got a long-term solution to myriad data storage challenges. Disclaimer: best not to test that “unlimited transaction size” thing – performance is still a concern when you get really big.
We started with a few basic objectives around consolidated staging data for our data lake, but once everyone wrapped their collective head around the lake’s inherent scalability, we found other uses for it. Legacy data that needs to live somewhere so we can sunset a server? Done. Outside files that had been stored on a file server? New home! Output that one process produced that was needed by another, unconnected application? Yes.
When Should You Consider a Data Lake?
With this client, we’ve learned a lot about how to use a data lake architecture. Here are just a few cases where a data lake might be a good option for your organization:
- You have multiple systems doing the same thing, and you need to put all the data in one place for analysis by your power users or data scientists
- You need a staging area where applications can persist data for use by other processes
- You need a landing/staging area for a traditional relational database, and you want it quickly
- You have streaming data coming in, and you need to put it somewhere
- You have legacy data you don’t want to lose, but you also don’t want to spend a lot of money transforming it
- You have a lot of flat files, Excel spreadsheets, and the like that you’d love to get off the on-prem file server while maintaining accessibility for the processes that need them
The Wrap Up
It is important to temper my enthusiasm for a data lake as a data store solution with this statement: A data lake is not a silver bullet. There is still a time and a place for a relational database. But, if you need to stand up a hyper-flexible data store that gets your developers up and running quickly, a data lake is a great option to consider.