Cloudera Developer Training for Spark and Hadoop Training (CDTSH1)

Introduction:

Learn how to import data into your Apache Hadoop cluster and process it with Spark, Hive, Flume, Sqoop, Impala, and other Hadoop ecosystem tools.

This four-day hands-on Cloudera Developer Training for Spark and Hadoop course delivers the key concepts and expertise you need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. Employing Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala, this course is the best preparation for the real-world challenges faced by Hadoop developers. You will learn to identify which tool is the right one to use in a given situation, and will gain hands-on experience developing with those tools.

This Cloudera Developer Training for Spark and Hadoop course is an excellent place to start for people working towards the CCA Spark & Hadoop Developer certification. Although further study is required to pass the exam, this course covers many of the subjects it tests.

Customize It:

With onsite training, courses can be scheduled on a date that is convenient for you, and because they are held at your location, you don't incur travel costs and students won't be away from home. Onsite classes can also be tailored to meet your needs: you might shorten a 5-day class into a 3-day class, combine portions of several related courses into a single course, or have the instructor vary the emphasis of topics to suit your staff's and site's requirements.

Audience/Target Group

Programmers
Developers
Engineers

Duration: 4 days

Class Prerequisites:

Apache Spark examples and hands-on exercises are presented in Scala and Python, so the ability to program in one of those languages is required.
Basic familiarity with the Linux command line is assumed.
Basic knowledge of SQL is helpful.
Prior knowledge of Hadoop is not required.

What You Will Learn:

How data is distributed, stored, and processed in a Hadoop cluster
How to use Sqoop and Flume to ingest data
How to process distributed data with Apache Spark
How to model structured data as tables in Impala and Hive
How to choose the best data storage format for different data usage patterns
Best practices for data storage

Course Content:

Module 1: Introduction to Hadoop and the Hadoop Ecosystem

Problems with Traditional Large-Scale Systems
Hadoop!
Data Storage and Ingest
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises

Module 2: Hadoop Architecture and HDFS

Distributed Processing on a Cluster
Storage: HDFS Architecture
Storage: Using HDFS
Resource Management: YARN Architecture
Resource Management: Working with YARN

Module 3: Importing Relational Data with Apache Sqoop

Sqoop Overview
Basic Imports and Exports
Limiting Results
Improving Sqoop’s Performance
Sqoop 2

Module 4: Introduction to Impala and Hive

Introduction to Impala and Hive
Why Use Impala and Hive?
Querying Data With Impala and Hive
Comparing Hive and Impala to Traditional Databases

Module 5: Modeling and Managing Data with Impala and Hive

Data Storage Overview
Creating Databases and Tables
Loading Data into Tables
HCatalog
Impala Metadata Caching
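
As a rough illustration of the table creation and data loading this module covers, the sketch below issues HiveQL from PySpark's HiveContext; in class the same statements can equally be run from the Hive or Impala shells. The database, table, and path names are invented for the example, and a Spark 1.x / CDH-era API is assumed.

    # Minimal sketch (assumed Spark 1.x HiveContext): create a Hive table
    # and load existing HDFS data into it. All names and paths are placeholders.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="ManageTablesSketch")
    hive = HiveContext(sc)

    hive.sql("CREATE DATABASE IF NOT EXISTS loudacre")
    hive.sql("""
        CREATE TABLE IF NOT EXISTS loudacre.accounts (
            acct_num INT,
            first_name STRING,
            last_name STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    """)

    # Move a file that already sits in HDFS into the table's warehouse directory.
    hive.sql("LOAD DATA INPATH '/loudacre/accounts.csv' "
             "INTO TABLE loudacre.accounts")

    sc.stop()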

Module 6: Data Formats

Selecting a File Format
Hadoop Tool Support for File Formats
Avro Schemas
Using Avro with Hive and Sqoop
Avro Schema Evolution
Compression

Module 7: Data Partitioning

Partitioning Overview
Partitioning in Impala and Hive
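
A quick hedged sketch of what a partitioned table looks like in practice: the HiveQL below is illustrative only, with made-up table and column names (accounts_staging is a hypothetical existing table with a state column), and could also be run from the Hive or Impala shells.

    # Minimal sketch (assumed Spark 1.x HiveContext): a Hive table partitioned
    # by state, populated one static partition at a time. Names are invented.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="PartitioningSketch")
    hive = HiveContext(sc)

    hive.sql("""
        CREATE TABLE IF NOT EXISTS accounts_by_state (
            acct_num INT,
            first_name STRING,
            last_name STRING)
        PARTITIONED BY (state STRING)
    """)

    # Each partition becomes its own directory under the table's location,
    # so queries that filter on state only scan the matching directories.
    hive.sql("""
        INSERT OVERWRITE TABLE accounts_by_state PARTITION (state='CA')
        SELECT acct_num, first_name, last_name
        FROM accounts_staging WHERE state = 'CA'
    """)

    sc.stop()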

Module 8: Capturing Data with Apache Flume

What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration

Module 9: Spark Basics

What is Apache Spark?
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark
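
For orientation, here is a hedged sketch of the kind of interactive work this module introduces: lines typed at the pyspark shell, using the pre-created SparkContext sc and simple functional-style operations. The file path is a placeholder.

    # Typed at the pyspark shell, where 'sc' (the SparkContext) already exists.
    # A minimal sketch of RDDs and functional-style transformations.

    # Create an RDD from a (placeholder) set of text files in HDFS.
    log_lines = sc.textFile("/loudacre/weblogs/*")

    # Transformations are lazy: nothing runs until an action is called.
    jpg_requests = log_lines.filter(lambda line: ".jpg" in line)
    ip_addresses = jpg_requests.map(lambda line: line.split(" ")[0])

    # Actions trigger the computation.
    print(ip_addresses.count())
    for ip in ip_addresses.take(5):
        print(ip)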

Module 10: Working with RDDs in Spark

Creating RDDs
Other General RDD Operations
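
A short hedged sketch of the RDD-creation routes and general-purpose operations this module covers; the data values are invented and the pyspark shell is assumed.

    # Minimal sketch of creating RDDs and applying general RDD operations
    # (pyspark shell assumed, so 'sc' already exists; data is made up).

    # Create RDDs from in-memory collections (files work via sc.textFile).
    nums = sc.parallelize([1, 2, 3, 4, 5, 4, 3])
    more_nums = sc.parallelize([10, 11, 3])

    # General operations that work on any RDD.
    distinct_nums = nums.distinct()
    combined = nums.union(more_nums)
    doubled = combined.map(lambda n: n * 2)

    print(distinct_nums.collect())
    print(doubled.take(3))
    print(combined.first())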

Module 11: Writing and Deploying Spark Applications

Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Configuring Spark Properties
Logging
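
The module builds applications in Scala and Java; since the course exercises also use Python, here is a hedged Python equivalent of the same ideas: creating the SparkContext explicitly, setting an application name, and submitting with spark-submit. The file, application, and path names are placeholders.

    # count_jpgs.py -- a minimal stand-alone Spark application sketch
    # (assumed Spark 1.x API; application and file names are placeholders).
    import sys
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.stderr.write("Usage: count_jpgs.py <input-path>\n")
            sys.exit(1)

        # Unlike the shell, an application must create its own SparkContext.
        conf = SparkConf().setAppName("Count JPG Requests")
        sc = SparkContext(conf=conf)

        count = sc.textFile(sys.argv[1]) \
                  .filter(lambda line: ".jpg" in line) \
                  .count()
        print("Number of JPG requests: %d" % count)

        sc.stop()

    # Submitted to the cluster with something like:
    #   spark-submit --master yarn count_jpgs.py /loudacre/weblogs/*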

Module 12: Parallel Processing in Spark

Review: Spark on a Cluster
RDD Partitions
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks
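
To make the partition/task relationship concrete, a hedged sketch: each RDD partition is processed by one task, and the partition count can be inspected and changed from PySpark (pyspark shell assumed; the path is a placeholder).

    # Minimal sketch: one task runs per RDD partition, so the partition
    # count controls the degree of parallelism (pyspark shell assumed).

    # File-based RDDs get at least one partition per HDFS block by default;
    # a minimum number of partitions can be requested explicitly.
    logs = sc.textFile("/loudacre/weblogs/*", 8)
    print(logs.getNumPartitions())

    # Repartitioning redistributes the data across more (or fewer) partitions,
    # which changes how many tasks run in the next stage.
    wider = logs.repartition(16)
    print(wider.getNumPartitions())

    # An action such as count() launches one task per partition in each stage.
    print(wider.count())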

Module 13: Spark RDD Persistence

RDD Lineage
RDD Persistence Overview
Distributed Persistence
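
A hedged sketch of lineage and persistence in PySpark: toDebugString() shows the lineage, and persist()/cache() keep a computed RDD around for reuse across actions (pyspark shell assumed; the path is a placeholder).

    # Minimal sketch of RDD lineage and persistence (pyspark shell assumed).
    from pyspark import StorageLevel

    lines = sc.textFile("/loudacre/weblogs/*")
    errors = lines.filter(lambda line: " 404 " in line)

    # The lineage (how to recompute this RDD from its source) can be inspected.
    print(errors.toDebugString())

    # Persist the filtered RDD so repeated actions reuse it instead of
    # re-reading and re-filtering the input each time.
    errors.persist(StorageLevel.MEMORY_ONLY)   # cache() is shorthand for this

    print(errors.count())     # first action: computes and caches
    print(errors.take(3))     # later actions: served from the cached data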

Module 14: Common Patterns in Spark Data Processing

Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Machine Learning
Example: k-means
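
As a pointer to the k-means example discussed here, a hedged sketch using MLlib's built-in clustering on made-up two-dimensional points; the class exercise may instead implement the iterations by hand. The Spark 1.x MLlib RDD API and the pyspark shell are assumed.

    # Minimal sketch: k-means clustering with MLlib on invented 2-D points
    # (assumed Spark 1.x MLlib RDD API; pyspark shell assumed).
    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # one cluster near the origin
        [9.0, 9.0], [9.1, 8.8], [8.9, 9.2],    # another cluster near (9, 9)
    ])

    # Iteratively refine k cluster centers over the distributed data.
    model = KMeans.train(points, k=2, maxIterations=10)

    print(model.clusterCenters)
    print(model.predict([0.05, 0.05]))   # which cluster a new point falls into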

Module 15: DataFrames and Spark SQL

Spark SQL and the SQL Context
Creating DataFrames
Transforming and Querying DataFrames
Saving DataFrames
Comparing Spark SQL, Impala, and Hive-on-Spark
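
Finally, a hedged sketch of the DataFrame and Spark SQL workflow this module covers, using the Spark 1.x SQLContext named in the outline; the JSON path, field names, and values are placeholders.

    # Minimal sketch of DataFrames and Spark SQL (assumed Spark 1.x SQLContext;
    # file path and field names are placeholders).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="DataFramesSketch")
    sqlContext = SQLContext(sc)

    # Create a DataFrame from a self-describing JSON file in HDFS.
    devices = sqlContext.read.json("/loudacre/devices.json")
    devices.printSchema()

    # Transform and query with the DataFrame API...
    popular = devices.where(devices.make == "Sorrento") \
                     .select("model", "release_dt")
    popular.show(5)

    # ...or with SQL against a registered temporary table.
    devices.registerTempTable("devices")
    counts = sqlContext.sql(
        "SELECT make, COUNT(*) AS n FROM devices GROUP BY make ORDER BY n DESC")
    counts.show()

    # Save results for use by Impala, Hive, or later Spark jobs.
    counts.write.parquet("/loudacre/device_counts")

    sc.stop()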
