Auto Scale EMR: Enables Analysts to Cut Through the Flood of Data

Introduction

Business analysts are increasingly tasked with analyzing large volumes of data and sharing insights that address their unique business challenges. With a data tsunami inundating the modern enterprise, it is imperative to equip analysts across industries with the right tools so they can cut through the flood of data and zero in on the information that drives critical business decisions. For example, retail analysts need to focus on improving profitability by analyzing inventory risk and product allocation strategies, while a bank analyst must home in on the factors affecting credit scoring, loan approvals, fraud, and risk exposure. Each of these scenarios requires processing and summarizing vast amounts of data, and these are precisely the types of analytic workloads that technologies like AWS EMR and user interfaces such as Hue or Zeppelin are designed to handle. This blog focuses on how Core Compete successfully implemented an “Auto Scale EMR” cluster that launches itself when the business analysts come to work, auto-scales based on the query load, and provides a secure governance model.

Operational Model

Most organizations working in the AWS cloud either allow business analysts to launch EMR clusters on demand (self-governed) or keep clusters running 24×7 (dedicated). Both approaches lead to higher costs, lower resource utilization, and a lack of operational governance.

Challenges of a self-governed model:

  • Business analysts will most likely overestimate their resource requirements and launch a larger cluster than they need, which leads to lower resource utilization and higher costs.
  • Business analysts are responsible for shutting down their clusters after use. Diligence varies from user to user, and scenarios like “I forgot to shut down my cluster yesterday, sorry” are common.
  • Monitoring usage across all users, and aggregating total compute demand across time and users, requires another system.

Challenges of a dedicated model:

  • In almost every instance of a dedicated model, the resources are under-utilized, leading to higher costs.

Auto Scale EMR operational model:

“Auto Scale EMR” is a shared EMR cluster that auto-scales based on resource utilization and is scheduled to start up and shut down at predefined, configurable times of day. The cluster is also integrated with Active Directory so that only authorized users can use application interfaces like Hue and Zeppelin. This integration also helps us better understand system usage by user and time of day. The Auto Scale EMR increases the productivity of business analysts while reducing the cost footprint.

Deployment Workflow:

[Figure: the deployment workflow, showing the different AWS services leveraged to automate the deployment.]

Deployment Steps

The key deployment steps are highlighted below.

EMR Launch Script:

The CloudFormation template (CFT) file is saved in an S3 bucket. This template is the input that the Lambda function uses to create the CloudFormation stack. A sample CloudFormation template, which can be customized to your requirements, is available here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-elasticmapreduce-cluster.html

Schedule:

A time-based CloudWatch Events rule triggers the Lambda function, which creates a CloudFormation stack from the template stored in S3.

[Image: snippet of the CloudWatch event rule (emr rules)]
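As a rough illustration of such a rule, the sketch below builds a weekday cron expression and attaches a Lambda target. The function names, the rule name, and the 08:00 UTC start time are assumptions made for this example and are not taken from the original deployment; `events_client` is expected to be a boto3 CloudWatch Events client.

```python
def weekday_cron(hour_utc, minute=0):
    """Return a CloudWatch Events cron expression for Monday to Friday (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour_utc must be 0-23 and minute must be 0-59")
    # Fields: minutes hours day-of-month month day-of-week year.
    # Day-of-month is '?' because day-of-week is specified.
    return f"cron({minute} {hour_utc} ? * MON-FRI *)"


def schedule_lambda(events_client, rule_name, lambda_arn, hour_utc, minute=0):
    """Create or update the scheduled rule and point it at the Lambda function."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=weekday_cron(hour_utc, minute),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": lambda_arn}],
    )
```

With boto3 installed, this could be called as `schedule_lambda(boto3.client("events"), "start-auto-scale-emr", lambda_arn, hour_utc=8)`. The Lambda function also needs a resource-based permission that lets the Events service invoke it, which is not shown here.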

Permissions:

The “Lambda Create Stack IAM Role” grants the Lambda function the privileges it needs to fetch the template URL from S3 and to create the stack.
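A minimal sketch of what the stack-creating Lambda function might look like is shown below. The environment variable names, the default bucket and key, and the `cfn` parameter (used to inject a client for testing) are all hypothetical. `CAPABILITY_IAM` is only required if the template itself creates IAM resources.

```python
import os


def build_create_stack_params(stack_name, bucket, key, region):
    """Build the arguments for CloudFormation create_stack from an S3 template."""
    template_url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Assumption: the template creates IAM resources; drop this otherwise.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def lambda_handler(event, context, cfn=None):
    """Entry point invoked by the scheduled rule; creates the EMR stack."""
    if cfn is None:
        import boto3  # imported lazily so the helper above has no dependencies
        cfn = boto3.client("cloudformation")
    params = build_create_stack_params(
        stack_name=os.environ.get("STACK_NAME", "AutoScaleEMR"),
        bucket=os.environ.get("TEMPLATE_BUCKET", "example-bucket"),
        key=os.environ.get("TEMPLATE_KEY", "emr/auto-scale-emr.yaml"),
        region=os.environ.get("AWS_REGION", "us-east-1"),
    )
    response = cfn.create_stack(**params)
    return {"StackId": response["StackId"]}
```

The IAM role attached to this function would need at least read access to the template object in S3 and permission to create the stack and the resources it contains.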

Configurations:

  • The “AutoScaleEMR Stack” provisions the EMR cluster with all the required configuration, which can be changed as needed. The Hive metastore must be configured, and it must be kept in sync with the jobs that update the data lake on S3. This step is important; otherwise, business analysts will be querying stale data.

Auto Scaling:

Instance group config: Task nodes can be auto-scaled from 0 to 20 based on the “YARNMemoryAvailablePercentage” and “ContainerPendingRatio” metrics.
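To make the shape of such a policy concrete, here is a sketch that builds an EMR automatic scaling policy for the task instance group. Only the 0 to 20 range and the two metric names come from this post; the thresholds, cooldowns, step sizes, and rule names are illustrative assumptions that you would tune for your own workload.

```python
def scaling_rule(name, metric, unit, operator, threshold, adjustment):
    """Build one scaling rule driven by a CloudWatch alarm on an EMR metric."""
    return {
        "Name": name,
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,  # nodes to add (+) or remove (-)
                "CoolDown": 300,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": operator,
                "EvaluationPeriods": 1,
                "MetricName": metric,
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": unit,
                "Dimensions": [{"Key": "JobFlowId", "Value": "${emr.clusterId}"}],
            }
        },
    }


def task_group_policy(min_nodes=0, max_nodes=20):
    """Policy for the task group; thresholds below are assumed, not prescribed."""
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            scaling_rule("ScaleOutOnLowMemory", "YARNMemoryAvailablePercentage",
                         "PERCENT", "LESS_THAN", 15.0, 2),
            scaling_rule("ScaleOutOnPendingContainers", "ContainerPendingRatio",
                         "COUNT", "GREATER_THAN", 0.75, 2),
            scaling_rule("ScaleInOnFreeMemory", "YARNMemoryAvailablePercentage",
                         "PERCENT", "GREATER_THAN", 75.0, -1),
        ],
    }
```

The resulting dictionary follows the structure of the `AutoScalingPolicy` property on an EMR instance group, so it can be serialized into the template or passed to the EMR API.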

Yarn Scheduler:

yarn-site configuration properties have been set to use a fair scheduler.
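For reference, the fair scheduler is selected through a single yarn-site property. The sketch below builds that configuration entry in the form a CloudFormation EMR cluster resource expects, where the key is `ConfigurationProperties` (the EMR API uses `Properties` instead). The helper name is hypothetical.

```python
FAIR_SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def yarn_fair_scheduler_config(for_cloudformation=True):
    """Return a yarn-site configuration entry that enables the YARN Fair Scheduler."""
    # CloudFormation names the property map differently from the EMR API.
    props_key = "ConfigurationProperties" if for_cloudformation else "Properties"
    return {
        "Classification": "yarn-site",
        props_key: {
            "yarn.resourcemanager.scheduler.class": FAIR_SCHEDULER_CLASS,
        },
    }
```

With the fair scheduler, concurrent queries from different analysts share the cluster's resources rather than queuing behind one large job.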

Access URL:

A static private IP is assigned by a Python script stored in an S3 bucket, which runs as part of the master node bootstrapping. The script can be seen here: https://github.com/awslabs/aws-support-tools/blob/master/EMR/Assign_Private_IP/assign_private_ip.py

Authentication:

EMR performs bind authentication against the LDAP server. This ensures that all LDAP users can access the applications hosted on EMR.

    "backend": "desktop.auth.backend.LdapBackend"

    {
        "base_dn": "CN=Users,DC=domain,DC=com",
        "ldap_url": "",
        "search_bind_authentication": "true",
        "bind_dn": "CN=user_name,CN=Users,DC=domain,DC=com",
        "bind_password": "*********"
    }

The step above binds Hue to LDAP. As Zeppelin is not managed by EMR, it must be bound to LDAP explicitly. A separate Lambda function overwrites the shiro.ini file, located at /etc/zeppelin/conf/, with the updated configuration. This step must be performed after the EMR cluster is launched.
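The updated shiro.ini contents could be rendered along the lines of the sketch below. It assumes Zeppelin's Active Directory realm (`ActiveDirectoryGroupRealm`), since the cluster is integrated with Active Directory. The function name and the sample values in the usage note are hypothetical, and how the file reaches the master node is not shown.

```python
def render_shiro_ini(ldap_url, search_base, system_username, system_password):
    """Render a minimal shiro.ini that authenticates Zeppelin users against AD."""
    lines = [
        "[main]",
        "activeDirectoryRealm = org.apache.zeppelin.realm.ActiveDirectoryGroupRealm",
        f"activeDirectoryRealm.systemUsername = {system_username}",
        f"activeDirectoryRealm.systemPassword = {system_password}",
        f"activeDirectoryRealm.searchBase = {search_base}",
        f"activeDirectoryRealm.url = {ldap_url}",
        "securityManager.realms = $activeDirectoryRealm",
        "",
        "[urls]",
        # Require authentication for every Zeppelin URL.
        "/** = authc",
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_shiro_ini("ldap://ldap.example.com:389", "CN=Users,DC=domain,DC=com", "CN=user_name,CN=Users,DC=domain,DC=com", password)` produces text that can be written to /etc/zeppelin/conf/shiro.ini. Zeppelin has to be restarted before it picks up the change.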

Alerts and Notifications:

Both the success and failure notifications of the CFT launch are sent to an SNS topic.

[Image: snippet of the CFT configuration]
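One way to wire up these notifications, sketched here as an assumption rather than as the configuration used in the original deployment, is to pass the SNS topic ARN in the `NotificationARNs` argument when the stack is created. CloudFormation then publishes the stack's events, including completion and failure, to that topic. The helper name is hypothetical.

```python
def with_notifications(create_stack_params, topic_arn):
    """Return a copy of the create_stack arguments with an SNS topic attached."""
    if not topic_arn.startswith("arn:"):
        raise ValueError("topic_arn must be a full SNS topic ARN")
    updated = dict(create_stack_params)  # leave the caller's dict untouched
    updated["NotificationARNs"] = [topic_arn]
    return updated
```

Subscribers to the topic, such as an operations email list, can then filter for the stack-level status messages they care about.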

Shutdown:

At the end of the day, the CFT stack is deleted by another Lambda function, which is triggered by a scheduled CloudWatch Events rule.
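The shutdown function can be very small. The sketch below assumes the stack name is supplied through a hypothetical `STACK_NAME` environment variable. The `cfn` parameter exists only so that a client can be injected for testing.

```python
import os


def lambda_handler(event, context, cfn=None):
    """Entry point invoked by the evening rule; deletes the EMR stack."""
    if cfn is None:
        import boto3  # imported lazily; only needed when running in Lambda
        cfn = boto3.client("cloudformation")
    stack_name = os.environ.get("STACK_NAME", "AutoScaleEMR")
    # Deleting the stack terminates the EMR cluster and the other resources
    # that the template created.
    cfn.delete_stack(StackName=stack_name)
    return {"DeletedStack": stack_name}
```

Because the cluster is rebuilt from the template every morning, anything that must survive overnight, such as the data lake and the Hive metastore, needs to live outside the stack.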

Publish Access URLs:

Hadoop and other applications installed on the Amazon EMR cluster expose web user interfaces on the master node at different ports. The following table lists the web interfaces that you can view in a browser from the jump host.

Note: Replace master-public-dns-name with the static private IP.

Interface               URL
YARN ResourceManager    http://master-public-dns-name:8088/
Hadoop HDFS NameNode    http://master-public-dns-name:50070/
Spark HistoryServer     http://master-public-dns-name:18080/
Zeppelin                http://master-public-dns-name:8890/
Hue                     http://master-public-dns-name:8888/
JupyterHub              https://master-public-dns-name:9443/
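If you want to publish these links automatically, a small helper such as the one below can generate them from the static IP. The ports and schemes are those listed in the table; the function name and the sample IP address are made up for illustration.

```python
# (scheme, port) for each web interface listed in the table above.
EMR_WEB_INTERFACES = {
    "YARN ResourceManager": ("http", 8088),
    "Hadoop HDFS NameNode": ("http", 50070),
    "Spark HistoryServer": ("http", 18080),
    "Zeppelin": ("http", 8890),
    "Hue": ("http", 8888),
    "JupyterHub": ("https", 9443),
}


def access_urls(master_host):
    """Map each interface name to its URL on the given master host or IP."""
    return {
        name: f"{scheme}://{master_host}:{port}/"
        for name, (scheme, port) in EMR_WEB_INTERFACES.items()
    }


if __name__ == "__main__":
    for name, url in access_urls("10.0.0.10").items():
        print(f"{name}: {url}")
```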

About the Author

Sachindeep Thodupunoori is a Cloud Engineer on Core Compete’s Cloud Analytics and Data Engineering team.
