Cloudera Developer Training for Spark and Hadoop (CDTSH1)
Learn how to import data into your Apache Hadoop cluster and process it with Spark, Hive, Flume, Sqoop, Impala, and other Hadoop ecosystem tools.
This four-day hands-on course delivers the key concepts and expertise you need to ingest and process data on a Hadoop cluster using up-to-date tools and techniques. Working with Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala, it prepares you for the real-world challenges faced by Hadoop developers. You will learn to identify which tool is the right one to use in a given situation, and will gain hands-on experience developing with those tools.
This course is an excellent place to start for people working towards the CCA Spark & Hadoop Developer certification. Although further study is required before taking the exam, the course covers many of the subjects it tests.
With onsite training, courses can be scheduled on a date that is convenient for you. Because they are held at your location, you don’t incur travel costs and students won’t be away from home. Onsite classes can also be tailored to meet your needs: you might shorten a 5-day class to a 3-day class, combine portions of several related courses into a single course, or have the instructor vary the emphasis of topics depending on your staff’s and site’s requirements.
Duration: 4 days
Prerequisites:
Apache Spark examples and hands-on exercises are presented in Scala and Python, so the ability to program in one of those languages is required.
Basic familiarity with the Linux command line is assumed.
Basic knowledge of SQL is helpful.
Prior knowledge of Hadoop is not required.
What You Will Learn:
How data is distributed, stored, and processed in a Hadoop cluster
How to use Sqoop and Flume to ingest data
How to process distributed data with Apache Spark
How to model structured data as tables in Impala and Hive
How to choose the best data storage format for different data usage patterns
Best practices for data storage
Module 1: Introduction to Hadoop and the Hadoop Ecosystem
Problems with Traditional Large-Scale Systems
Data Storage and Ingest
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises
Module 2: Hadoop Architecture and HDFS
Distributed Processing on a Cluster
Storage: HDFS Architecture
Storage: Using HDFS
Resource Management: YARN Architecture
Resource Management: Working with YARN
Module 3: Importing Relational Data with Apache Sqoop
Basic Imports and Exports
Improving Sqoop’s Performance
Module 4: Introduction to Impala and Hive
Introduction to Impala and Hive
Why Use Impala and Hive?
Querying Data With Impala and Hive
Comparing Hive and Impala to Traditional Databases
Module 5: Modeling and Managing Data with Impala and Hive
Data Storage Overview
Creating Databases and Tables
Loading Data into Tables
Impala Metadata Caching
Module 6: Data Formats
Selecting a File Format
Hadoop Tool Support for File Formats
Using Avro with Hive and Sqoop
Avro Schema Evolution
Module 7: Data Partitioning
Partitioning in Impala and Hive
Module 8: Capturing Data with Apache Flume
What is Apache Flume?
Basic Flume Architecture
Module 9: Spark Basics
What is Apache Spark?
Using the Spark Shell
RDDs (Resilient Distributed Datasets)
Functional Programming in Spark
Module 10: Working with RDDs in Spark
Other General RDD Operations
Module 11: Writing and Deploying Spark Applications
Spark Applications vs. Spark Shell
Creating the SparkContext
Building a Spark Application (Scala and Java)
Running a Spark Application
The Spark Application Web UI
Configuring Spark Properties
Module 12: Parallel Processing in Spark
Review: Spark on a Cluster
Partitioning of File-based RDDs
HDFS and Data Locality
Executing Parallel Operations
Stages and Tasks
Module 13: Spark RDD Persistence
RDD Persistence Overview
Module 14: Common Patterns in Spark Data Processing
Common Spark Use Cases
Iterative Algorithms in Spark
Graph Processing and Analysis
Module 15: DataFrames and Spark SQL
Spark SQL and the SQL Context
Transforming and Querying DataFrames
Comparing Spark SQL, Impala and Hive-on-Spark