Cloudera Developer Training for Apache Spark (CDTAS)


Course Description

This three-day course enables you to build complete, unified big data applications that combine batch, streaming, and interactive analytics on all your data. With Spark, developers can write sophisticated parallel applications that deliver faster, better decisions and real-time actions, applied to a wide variety of use cases, architectures, and industries.

Customize It:

With onsite training, courses can be scheduled on a date that is convenient for you, and because they are held at your location, you don't incur travel costs and students won't be away from home. Onsite classes can also be tailored to meet your needs: you might shorten a five-day class to three days, combine portions of several related courses into a single course, or have the instructor vary the emphasis of topics to suit your staff's and site's requirements.

Audience/Target Group

Data Engineers

Related Courses:

Duration: 3 days

Class Prerequisites:

Course examples and exercises are presented in Python and Scala, so knowledge of one of these programming languages is required.
Basic knowledge of Linux is assumed.

What You Will Learn:

Using the Spark shell for interactive data analysis
The features of Spark’s Resilient Distributed Datasets
How Spark runs on a cluster
How Spark parallelizes task execution
Writing Spark applications
Processing streaming data with Spark

Course Content:

Module 1: Introduction to Spark

What is Spark?
Review: From Hadoop MapReduce to Spark
Review: HDFS
Review: YARN
Spark Overview

Module 2: Spark Basics

Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark
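The lazy-evaluation model behind RDD transformations can be sketched in plain Python with generators (no Spark installation required); `map`, `filter`, and `collect` are real RDD operations, but the generator-based simulation below is only our illustration of their semantics, not course code.

```python
# Plain-Python sketch of Spark's lazy transformation model.
# Generators stand in for RDD transformations: nothing runs until an action.
data = range(1, 6)
doubled = (x * 2 for x in data)        # like rdd.map(lambda x: x * 2) -- lazy
big = (x for x in doubled if x > 4)    # like .filter(lambda x: x > 4) -- still lazy
result = list(big)                     # like .collect() -- the action triggers work
print(result)
```

In the Spark shell the equivalent chain is written on an RDD, and the `collect()` action is what actually ships tasks to executors.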

Module 3: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations
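Two of the general operations covered here, `flatMap` and `distinct`, can be mimicked with a list comprehension and a set in plain Python; the word data below is a made-up example, not from the course.

```python
lines = ["to be or", "not to be"]
# like lines_rdd.flatMap(lambda line: line.split()) -- one input, many outputs
words = [w for line in lines for w in line.split()]
# like words_rdd.distinct().collect(), sorted here for a stable result
unique = sorted(set(words))
```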

Module 4: Aggregating Data with Pair RDDs

Key-Value Pair RDDs
Other Pair RDD Operations
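The key-value aggregation pattern this module teaches can be sketched locally: `reduceByKey` is a real Pair RDD method, and the dictionary loop below shows the per-key merge it performs (the data is a made-up example).

```python
from collections import defaultdict

# Pair RDD sketch: (key, value) tuples, merged per key the way
# reduceByKey(lambda a, b: a + b) would merge them on a cluster.
pairs = [("spark", 1), ("hdfs", 1), ("spark", 1)]
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value
```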

Module 5: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Hands-On Exercise: Write and Run a Spark Application
Configuring Spark Properties
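Applications written outside the shell are typically launched with `spark-submit`; a minimal invocation might look like the following sketch, where the file names, class name, and master URLs are placeholders rather than course materials.

```shell
# Run a Python application locally on 2 cores (master URL is a placeholder)
spark-submit --master "local[2]" --name WordCount wordcount.py

# Submit a packaged Scala/Java application to YARN (jar and class are hypothetical)
spark-submit --master yarn --class com.example.WordCount wordcount.jar
```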

Module 6: Parallel Processing

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks
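How an RDD's data divides into partitions, each processed by one task, can be approximated locally; the helper below is our own sketch (Spark's real splitting of file-based RDDs follows HDFS block boundaries, as this module discusses).

```python
def partition(records, num_partitions):
    # Sketch: split a dataset into roughly equal partitions, each of
    # which would become one task in a parallel stage.
    records = list(records)
    size = (len(records) + num_partitions - 1) // num_partitions
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]

parts = partition(range(10), 3)
```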

Module 7: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence
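Without persistence, every action re-runs the RDD's lineage from the source; `cache()` keeps computed partitions in memory for reuse. A local analogy, using memoization in place of `rdd.cache()` (our illustration, not course code):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)           # analogous to rdd.cache(): compute once, reuse
def square(x):
    calls["n"] += 1                # each call stands in for re-running the lineage
    return x * x

first = [square(x) for x in [1, 2, 3]]
second = [square(x) for x in [1, 2, 3]]   # served from the cache, no recomputation
```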

Module 8: Basic Spark Streaming

Spark Streaming Overview
Example: Streaming Request Count
Developing Spark Streaming Applications
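Spark Streaming's micro-batch model, including the request-count example this module uses, can be sketched locally: the stream is chopped into short batches and a small job runs per batch. The request strings below are made up.

```python
# Micro-batch sketch: a "live" stream divided into fixed-size batches,
# with one count computed per batch, as in the streaming request-count example.
requests = ["GET /a", "GET /b", "GET /a", "GET /c", "GET /b"]
batch_size = 2
batches = [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
counts_per_batch = [len(b) for b in batches]
```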

Module 9: Advanced Spark Streaming

Multi-Batch Operations
State Operations
Sliding Window Operations
Advanced Data Sources
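Sliding-window operations combine several consecutive batches into one result. The helper below is our own sketch, with window and slide expressed in numbers of batches rather than the durations Spark Streaming's real `window()` takes.

```python
def window_sums(batch_counts, window, slide):
    # Sliding-window sketch: sum each window of `window` batches,
    # advancing `slide` batches at a time.
    return [sum(batch_counts[i:i + window])
            for i in range(0, len(batch_counts) - window + 1, slide)]

sums = window_sums([3, 1, 4, 1, 5], window=3, slide=1)
```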

Module 10: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means
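One step of the k-means example can be sketched in plain Python (1-D points for brevity): each point is assigned to its nearest center, and Spark repeats this step per iteration, caching the point data so it isn't re-read on every pass. The numbers below are made up.

```python
def assign_to_centers(points, centers):
    # One assignment step of k-means: index of the nearest center per point.
    return [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points]

labels = assign_to_centers([1.0, 2.0, 9.0, 10.0], centers=[1.5, 9.5])
```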

Module 11: Improving Spark Performance

Shared Variables: Broadcast Variables
Shared Variables: Accumulators
Common Performance Issues
Diagnosing Performance Problems
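The broadcast-variable pattern ships a small lookup table to every executor once, so large data can be enriched without a shuffle join. Locally the "broadcast" is just a shared dict; the country table below is a made-up example of the pattern, not course data.

```python
# Broadcast-variable pattern sketch: small reference table, large record set.
country_names = {"us": "United States", "fr": "France"}   # would be broadcast once
log_records = ["us", "fr", "us"]                          # the "big" dataset
enriched = [country_names[code] for code in log_records]  # map-side lookup, no shuffle
```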

Module 12: Spark SQL and DataFrames

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
DataFrames and RDDs
Comparing Spark SQL, Impala and Hive-on-Spark
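The kind of query this module runs through Spark SQL can be sketched with the standard library's sqlite3, assuming a hypothetical `people` table; on Spark the same SELECT would be issued against a DataFrame registered as a temporary table.

```python
import sqlite3

# Spark SQL-style query sketched locally with sqlite3 (no cluster needed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ana", 35), ("Bo", 17), ("Cy", 42)])
adults = conn.execute(
    "SELECT name FROM people WHERE age >= 18 ORDER BY name").fetchall()
```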
