The Problem: Efficiency and optimization are hard – even harder in the cloud.
In a moment of brutal honesty any engineer worth their salt will tell you that performance tuning is more art than science. Engineering and designing systems for optimal performance on hardware was difficult before the cloud. The cloud exacerbates this considerably as the number of options and configurations is exponentially greater. In addition, each of these options comes with a subtle (but crucial) additional variable – cost!
This problem is especially acute in the Big Data realm. Many tools and applications long ago moved beyond the typical boundaries of CPU and RAM, and into the realm of disk I/O or network constraints. Here, the cloud makes things really interesting by attaching a consumption-based price tag to network throughput and IOPS. As if that weren’t enough, new abstraction layers, services, and pricing are introduced on an ongoing basis. Let’s examine the following almost-real-life example.
The Challenge: Optimizing big data for efficiency in the cloud
We leverage ApacheⓇ Spark™ for a number of functions when running data through our pipeline, both before and after executing entity resolution. On hardware-based architectures we typically have a fixed number of dedicated nodes, so we can monitor the usual bottlenecks and tune the dials to get the best performance. The number of machines and their existing configurations are fixed.
To measure and improve performance we baseline the job, run it, make some changes, run it again, lather-rinse-repeat. For better or worse, we have a limited set of dials (variables) to adjust between runs. Additionally, each of these incremental experimentation runs doesn’t cost us anything extra from a hardware standpoint.
We recently transitioned this particular data processing engine to AWS atop Elastic MapReduce. Immediately upon moving a workload to AWS one is confronted with a dizzying array of options for implementation. Which size instances to use? Local storage, or EBS? IOPS provisioned or Magnetic drives? How much network throughput? Are the costs for the larger instances worth it for the extra network throughput? Fewer larger machines, or more smaller machines? Rewrite the job to leverage spot pricing, or maintain fixed cluster sizes? (sidenote: spot pricing is AWESOME, use it).
For this relatively benign and simple test case, we are immediately confronted with 56 different test scenarios we would need to run. This was after limiting our variables to only 7 instance types (of the 39 or so available at this writing), 4 cluster sizes (4, 8, 12, 16), and 2 data storage options (local/ephemeral vs. EBS): 7 × 4 × 2 = 56. The test matrix becomes untenable once you start considering things like IOPS provisioning, magnetic drives vs. SSD, evaluations of network throughput, and putting dollar signs in front of everything.
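To make the arithmetic concrete, here is a minimal Python sketch that enumerates that test matrix. The instance type labels are placeholders (the seven types we tested aren't named here), and `build_scenarios` is a hypothetical helper, not part of any AWS or Spark tooling.

```python
from itertools import product

# Placeholder labels: the seven instance types are not named in this post.
INSTANCE_TYPES = [f"instance-type-{i}" for i in range(1, 8)]
CLUSTER_SIZES = [4, 8, 12, 16]
STORAGE_OPTIONS = ["local-ephemeral", "ebs"]


def build_scenarios(instance_types, cluster_sizes, storage_options):
    """Return one dict per combination of the three variables."""
    return [
        {"instance_type": itype, "nodes": nodes, "storage": storage}
        for itype, nodes, storage in product(
            instance_types, cluster_sizes, storage_options
        )
    ]


scenarios = build_scenarios(INSTANCE_TYPES, CLUSTER_SIZES, STORAGE_OPTIONS)
print(len(scenarios))  # 7 * 4 * 2 = 56
```

Every extra dial multiplies the count: adding a single two-valued variable such as SSD vs. magnetic would double the matrix to 112 runs, each with its own bill.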
It’s truly dizzying and a problem only a select few people truly enjoy spending their time solving. Fewer still are qualified to do so. The position requires an unusual mix of software development (understanding the inner workings and limitations of the application), systems engineering (understanding the inner workings and limitations of the environment) and financial modeling (how do you get the most bang for your buck).
The Solution: Cloud Efficiency Engineer
I call this new job role the “Cloud Efficiency Engineer”. It’s sort of a hybrid system engineer, cloud architect, performance tester, finance geek – “Cloud Unicorn” for short. The job description is pretty simple: work across teams to devise robust models of system performance and cost and help them to execute massive jobs efficiently.
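As a rough illustration of what the simplest such model might look like, the Python sketch below ranks measured runs by total cost and picks the cheapest one that finishes within a deadline. Everything in it is an assumption for illustration: the instance labels, prices, and runtimes are made up, and the cost formula (nodes × hourly price × runtime) ignores storage, data transfer, service fees, and spot pricing.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    instance_type: str       # placeholder label, not a real AWS type
    nodes: int
    hourly_price_usd: float  # per-node price; made-up figure
    runtime_hours: float     # wall-clock time of the job; made-up figure

    @property
    def cost_usd(self) -> float:
        # Simplified model: compute cost only.
        return self.nodes * self.hourly_price_usd * self.runtime_hours


def cheapest_within_deadline(results, deadline_hours):
    """Return the lowest-cost run that meets the deadline, or None."""
    eligible = [r for r in results if r.runtime_hours <= deadline_hours]
    return min(eligible, key=lambda r: r.cost_usd) if eligible else None


results = [
    RunResult("small-type", 16, 0.25, 3.0),
    RunResult("large-type", 4, 1.00, 2.0),
    RunResult("large-type", 8, 1.00, 1.5),
]
best = cheapest_within_deadline(results, deadline_hours=2.5)
print(best.instance_type, best.nodes, best.cost_usd)
```

A real model would be far richer, but even this toy version shows the trade-off at the heart of the role: the fastest configuration and the cheapest one are rarely the same.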
If you’re a recent or up-and-coming college grad with a background in computer science or computer engineering, consider differentiating yourself as a “Cloud Efficiency Engineer” now. Working to create some of these models on your own would go a long way toward landing a job and securing a pretty good salary. If you’re good enough, it wouldn’t be a stretch to convince an employer that you pay for yourself in cost savings!
The same goes for those who have been in the systems engineering game for a while. A wealth of systems tuning know-how coupled with cloud expertise will go a long way toward improving your job prospects over the coming decade. And if you’re already a few steps ahead of me and are doing this work, by all means send me your resume!