EMR on ACK is newly released to help enterprises efficiently build big data platforms

Alibaba Cloud EMR on ACK gives users a new way to build a big data platform: deploying open source big data services on Alibaba Cloud Container Service for Kubernetes (ACK). Because ACK handles service deployment and provides high-performance, scalable container application management, users only need to focus on the big data jobs themselves. Spark, Presto, and Flink jobs can be run directly on the ACK cluster. The engines are 100% compatible with their open source versions and deliver better performance than them.

1. Background

Technology Trends

Separation of storage and computing, evolving toward cloud native

Online business, AI, and big data all run on one ACK cluster, with peak-shifting scheduling and offline/online co-location to improve machine utilization

A unified operation and maintenance entry point, tool chain, and monitoring system

A shift from cluster-centric to job-centric operation

Multi-version support: for example, Spark 2.x and Spark 3.x can run at the same time

Challenges of Going Cloud Native

Separating computing and storage: how to build an HCFS (Hadoop Compatible File System) on top of OSS object storage

It must be fully compatible with existing HDFS usage

Its performance must be comparable to HDFS, at lower cost

Separating computing from shuffle data storage: how to handle the mixed, heterogeneous machine models in ACK

Heterogeneous machine models may have no local disks

The community discussed this problem in [SPARK-25299], and support for Spark dynamic resources has become an industry consensus

ACK Scheduling Capabilities: How to Solve Scheduling Performance Bottlenecks

Scheduling performance should be on par with YARN

Multi-level queue management

Peak-shifting scheduling

Leveraging the capabilities of the K8s operating system to orchestrate the peaks and troughs of various businesses

Advantages of EMR on ACK

Remote Shuffle Service provides a storage and computing separation solution for intermediate shuffle data

Compute nodes no longer need local disks or cloud disks

It allows Spark dynamic resources to be enabled, serving as the ultimate solution to SPARK-25299

JindoFS provides lake acceleration solutions for OSS storage

In block mode, the 1 TB TPC-DS scenario shows a performance improvement of more than 15%

The scheduling layer supports Scheduler Framework V2

Scheduling performance is more than 3x higher than that of the community version

Multi-level queue management is provided

Engine Capability Enhancement

In the 10 TB TPC-DS benchmark scenario, EMR Spark delivers a 3x performance improvement over community Spark

Hudi and Delta Lake offer enhanced functionality and performance compared with the community versions

Complete peak-shifting scheduling solution
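
To make the Spark dynamic resources point above concrete, here is a minimal sketch of how the relevant settings could be assembled for a job. The `spark.dynamicAllocation.*` keys are standard open source Spark properties. The article does not describe how the EMR Remote Shuffle Service is wired into a job, so the helper names `dynamic_allocation_conf` and `to_submit_args` and the `extra` parameter are assumptions made only for illustration, and any service-specific settings would have to be passed in through `extra`.

```python
from typing import Dict, List, Optional


def dynamic_allocation_conf(min_executors: int, max_executors: int,
                            extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build Spark settings that turn on dynamic resource allocation.

    `extra` is a hypothetical hook for shuffle-service-specific settings,
    which the article does not list and which are therefore not guessed here.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    conf = {
        # Standard open source Spark properties
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if extra:
        conf.update(extra)
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as spark-submit --conf arguments, sorted by key."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Calling `to_submit_args(dynamic_allocation_conf(1, 10))` yields a flat argument list that can be appended to a `spark-submit` command line. Because shuffle data no longer lives on the executor's own disk in this architecture, executors can be added and released as the job's needs change.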

2. EMR containerized architecture

EMR on ACK Architecture

Lightweight management and control that connects to existing data platforms

Jobs are submitted to different execution platforms through data development clusters or scheduling platforms

Peak-shifting scheduling is adjusted according to the business's peak and off-peak strategies

A cloud-native data lake architecture, in which ACK provides strong elastic scale-out and scale-in capabilities

ACK manages heterogeneous clusters with good flexibility
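
The peak-shifting idea in this architecture can be sketched as a small routing rule that a scheduling platform might apply before submitting a batch job. This is only an illustration under stated assumptions: the names `PeakWindow` and `choose_platform` are hypothetical, as are the platform labels `"existing-cluster"` and `"ack"` and the policy itself (batch jobs avoid the shared ACK cluster while online business is at its peak).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakWindow:
    """Hours [start, end) during which online business is at its peak."""
    start: int
    end: int

    def contains(self, hour: int) -> bool:
        if self.start <= self.end:
            return self.start <= hour < self.end
        # The window wraps past midnight, for example 22 -> 6
        return hour >= self.start or hour < self.end


def choose_platform(hour: int, peak: PeakWindow) -> str:
    """Pick an execution platform for a batch job submitted at `hour` (0-23).

    Assumed policy: during the online peak, batch jobs stay on the existing
    data platform; off-peak, they use idle capacity on the shared ACK cluster.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return "existing-cluster" if peak.contains(hour) else "ack"
```

With `PeakWindow(9, 21)`, a job submitted at 10:00 is routed to the existing cluster and a job submitted at 23:00 goes to ACK. A real deployment would base this decision on live utilization rather than on a fixed clock window.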

3. Product Introduction

Product Home

Reference link: https://www.aliyun.com/product/emapreduce

Create a new cluster

Region: currently available in Hangzhou, Shanghai, Beijing, Shenzhen, and other regions, with more being opened over time

Cluster type: Spark, Shuffle Service, Presto

Spark: a general-purpose distributed big data processing engine that provides ETL, offline batch processing, data modeling, and other capabilities

Shuffle Service: provides an optimized shuffle service for EMR computing engines, removing the dependency on local disks under Kubernetes

Relieves the network and disk I/O bottlenecks of large-scale computing clusters

Supports the computing and storage separation architecture and can serve multiple EMR clusters

Presto: a memory-based distributed SQL engine for interactive queries that supports multiple data sources

Suitable for complex analysis of PB-scale data and for queries across data sources

Component version: Spark (3.1.1)
Dedicated nodes:

Use an existing ACK cluster and share some of its nodes with EMR

Create a new ACK cluster and select the entire cluster as dedicated nodes

OSS Bucket: used to store jobs, logs, jar packages, and other information
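
The options on the creation page can be summarized as a small data structure. The sketch below is not an Alibaba Cloud API: the class `ClusterRequest`, its field names, and the sample values in the usage note are hypothetical, and only the three cluster types come from the list above.

```python
from dataclasses import dataclass

# Cluster types listed on the creation page
CLUSTER_TYPES = ("Spark", "Shuffle Service", "Presto")


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical record of the choices made when creating a cluster."""
    region: str
    cluster_type: str
    oss_bucket: str          # stores jobs, logs, jar packages, and similar files
    use_existing_ack: bool   # True: share nodes of an existing ACK cluster
                             # False: new ACK cluster used entirely as dedicated nodes

    def validate(self) -> None:
        """Raise ValueError if a required choice is missing or unknown."""
        if not self.region:
            raise ValueError("region is required")
        if self.cluster_type not in CLUSTER_TYPES:
            raise ValueError(f"unknown cluster type: {self.cluster_type}")
        if not self.oss_bucket:
            raise ValueError("an OSS bucket is required")
```

For example, `ClusterRequest("hangzhou", "Spark", "my-bucket", True).validate()` passes, while an unlisted cluster type raises `ValueError`.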

Cluster Management

Cluster ID/Name: Click to enter job management

Cluster status: Check whether the cluster is available

ACK cluster: Can be associated with an existing ACK cluster

Cluster configuration: Spark job configuration

Release: Release space
