Flink on Amazon EKS
1 Overview
Flink supports different deployment modes when running on Kubernetes. We will show you how to deploy Flink on Kubernetes using the Native Kubernetes Deployment.
2 Amazon EKS
Amazon EKS is a fully managed Kubernetes service. EKS supports creating and managing spot instances using Amazon EKS managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads. EKS-managed node groups require less operational effort compared to using self-managed nodes.
Flink can run jobs on Kubernetes via Application and Session Modes only.
3 AWS Spot Instances
Flink is a distributed system designed to process high volumes of data. Built with failure in mind, it can run on machines with different configurations and is inherently resilient and flexible. Spot Instances can optimize runtimes by increasing throughput while spending the same (or less). Flink can tolerate interruptions using its restart and failover strategies.
The Job Manager and Task Manager are key building blocks of Flink. The Task Manager is the compute-intensive part, and the Job Manager is the orchestrator. We will run the Task Managers on Spot Instances and the Job Manager on On-Demand Instances.
Flink supports elastic scaling via Reactive Mode. This pairs well with Spot Instances, as it implements elastic scaling with higher throughput in a cost-optimized way.
4 Flink Deployment
For production use, we recommend deploying Flink applications in Application Mode, as this mode provides better isolation for the applications. For that purpose, we will bundle the user code in the Flink image and upload it to Amazon ECR. Amazon ECR is a fully managed container registry that makes it easy to store, manage, share, and deploy your container images and artifacts anywhere.
- Build the Amazon ECR Image
Log in using the following command, and don't forget to replace AWS_REGION and AWS_ACCOUNT_ID with your details.
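A sketch of the login command, assuming the AWS CLI v2 is installed and configured; AWS_REGION and AWS_ACCOUNT_ID are placeholders for your values:

```shell
# Authenticate Docker to your private ECR registry.
# AWS_REGION and AWS_ACCOUNT_ID are placeholders -- substitute your own values.
aws ecr get-login-password --region ${AWS_REGION} \
  | docker login --username AWS --password-stdin \
    ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
```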
Create a repository.
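For example (the repository name `flink-demo` is illustrative, not prescribed by the post):

```shell
# Create an ECR repository to hold the Flink application image.
aws ecr create-repository \
  --repository-name flink-demo \
  --image-scanning-configuration scanOnPush=true
```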
Build the Docker image.
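A minimal sketch, assuming the user-code jar is bundled under `/opt/flink/usrlib`, the directory Flink's Application Mode scans for user artifacts; the jar name, Flink version, and image tag are illustrative:

```shell
# Example Dockerfile (contents shown as comments; versions/names are illustrative):
#   FROM flink:1.13
#   COPY ./target/my-flink-job.jar /opt/flink/usrlib/my-flink-job.jar

# Build the image from the directory containing the Dockerfile.
docker build --tag flink-demo:latest .
```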
Tag and Push your image to Amazon ECR.
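Continuing with the placeholder names from the previous steps:

```shell
# Tag the local image with the ECR repository URI, then push it.
docker tag flink-demo:latest \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/flink-demo:latest
docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/flink-demo:latest
```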
- Create Amazon S3/Amazon Kinesis Access Policy
We must create an access policy to allow the Flink application to read/write from Amazon S3 and read from Amazon Kinesis data streams. Run the following to create the policy, and note the ARN in the output.
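A sketch of such a policy; the bucket and stream names are placeholders, and the exact actions your application needs may differ:

```shell
# flink-access-policy.json grants S3 read/write and Kinesis read.
# <your-bucket> and <your-stream> are placeholders.
cat > flink-access-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["kinesis:GetRecords", "kinesis:GetShardIterator",
                 "kinesis:DescribeStream", "kinesis:ListShards"],
      "Resource": "arn:aws:kinesis:*:*:stream/<your-stream>"
    }
  ]
}
EOF

# Create the policy and note the "Arn" field in the output.
aws iam create-policy \
  --policy-name flink-demo-policy \
  --policy-document file://flink-access-policy.json
```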
- Cluster and node groups deployment
Create an EKS cluster. The cluster takes approximately 15 minutes to launch.
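For example (cluster name and region are illustrative; the node groups are created separately in the next step):

```shell
# Create the EKS control plane only; node groups are added afterwards.
eksctl create cluster \
  --name flink-demo \
  --region us-east-1 \
  --without-nodegroup
```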
Create the node groups using the nodeGroup config file. We are using multiple nodeGroups of different sizes to adopt the Spot best practice of diversification. Replace the <<Policy ARN>> string with the ARN string from the previous step.
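A sketch of such a config file, assuming one On-Demand managed node group for the JobManager and a diversified Spot node group for the TaskManagers; all instance types, names, and sizes are illustrative:

```yaml
# nodegroup.yaml (fragment) -- names, instance types, and sizes are illustrative
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: flink-demo
  region: us-east-1
managedNodeGroups:
  - name: on-demand-jobmanager
    instanceTypes: ["m5.xlarge"]
    minSize: 1
    desiredCapacity: 1
    maxSize: 2
  - name: spot-taskmanager-4vcpu-16gb
    # Diversified instance types of the same size, per Spot best practices.
    instanceTypes: ["m5.xlarge", "m5d.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true
    minSize: 1
    desiredCapacity: 2
    maxSize: 10
    iam:
      attachPolicyARNs:
        # When attachPolicyARNs is set, the default node policies must be listed too.
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - <<Policy ARN>>
```

Then apply it with `eksctl create nodegroup --config-file=nodegroup.yaml`.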
Download the Cluster Autoscaler and edit it to add the cluster-name.
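For example, using the stock autodiscovery manifest from the Kubernetes autoscaler repository (the cluster name `flink-demo` is the placeholder used throughout this sketch):

```shell
# Fetch the example manifest.
curl -O https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# In the --node-group-auto-discovery flag, replace the <YOUR CLUSTER NAME>
# placeholder with the cluster name.
sed -i 's/<YOUR CLUSTER NAME>/flink-demo/' cluster-autoscaler-autodiscover.yaml
```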
- Install the Cluster Autoscaler using the following command: kubectl apply -f cluster-autoscaler-autodiscover.yaml
Using EKS managed node groups requires significantly less operational effort compared to self-managed node groups, and enables: 1) automatic enforcement of Spot best practices, 2) Spot Instance lifecycle management, and 3) automatic labeling of Pods.
eksctl has integrated amazon-ec2-instance-selector to enable auto selection of instances based on the criteria passed.
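For example (flag values are illustrative), eksctl can select all instance types matching given vCPU and memory criteria, which is a convenient way to build a diversified Spot node group:

```shell
# Let eksctl's integrated instance selector pick instance types
# matching the vCPU/memory criteria (values are illustrative).
eksctl create nodegroup \
  --cluster flink-demo \
  --name spot-taskmanager \
  --managed --spot \
  --instance-selector-vcpus=4 \
  --instance-selector-memory=16 \
  --nodes-min 1 --nodes-max 10
```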
- Create service accounts for Flink
```shell
kubectl create serviceaccount flink-service-account
kubectl create clusterrolebinding flink-role-binding-flink \
  --clusterrole=edit --serviceaccount=default:flink-service-account
```
- Deploy Flink
The install folder has all the YAML files required to deploy a standalone Flink cluster. Run the install.sh file. This deploys the cluster with a JobManager, a pool of TaskManagers, and a Service exposing the JobManager's ports.
This is a High-Availability (HA) deployment of Flink using the Kubernetes high-availability service.
The JobManager runs on On-Demand Instances and the TaskManagers on Spot Instances. As the cluster is launched in Application Mode, if a node is interrupted, only one job will be restarted.
Autoscaling is enabled through Reactive Mode. A Horizontal Pod Autoscaler monitors the CPU load and scales the TaskManager pods accordingly.
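A sketch of the Horizontal Pod Autoscaler setup; the deployment name and thresholds are examples, not the post's exact values:

```shell
# Scale the TaskManager deployment between 1 and 15 replicas,
# targeting 40% average CPU utilization.
kubectl autoscale deployment flink-taskmanager \
  --min=1 --max=15 --cpu-percent=40
```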
Checkpointing is enabled, which allows Flink to save state and be fault tolerant.
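A minimal flink-conf.yaml fragment for S3-backed checkpointing, assuming an S3 filesystem plugin is on the classpath; the bucket name is a placeholder and the interval is illustrative:

```yaml
# flink-conf.yaml (fragment) -- <your-bucket> is a placeholder
state.backend: rocksdb
state.checkpoints.dir: s3://<your-bucket>/checkpoints
execution.checkpointing.interval: 1min
```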
5 Conclusion
In this post, we demonstrated how you can run Flink workloads on a Kubernetes Cluster using Spot Instances, achieving scalability, resilience, and cost optimization.
Reference
Kinnar Sen. Optimizing Apache Flink on Amazon EKS using Amazon EC2 Spot Instances. AWS Compute Blog. 11 NOV 2021. https://aws.amazon.com/blogs/compute/optimizing-apache-flink-on-amazon-eks-using-amazon-ec2-spot-instances/.