Data Science at Scale using Spark and Hadoop Training (DSSH)


Data Science at Scale using Spark and Hadoop Training Course Description

Data Science at Scale using Spark and Hadoop training is a 3-day instructor-led class in which you will learn how data scientists use data to solve problems and become familiar with the tools and techniques they use. Through in-class simulations, participants apply data science methods to real-world challenges in different industries and prepare for data scientist roles in the field.

Upon completion of the Data Science at Scale using Spark and Hadoop Training course, attendees are encouraged to continue their study and register for the Cloudera Certified Professional: Data Scientist (CCP-DS) exam.

Customize It:

With onsite training, courses can be scheduled on a date that is convenient for you. Because they are held at your location, you don’t incur travel costs and students won’t be away from home. Onsite classes can also be tailored to meet your needs: you might shorten a 5-day class into a 3-day class, combine portions of several related courses into a single course, or have the instructor vary the emphasis of topics depending on your staff’s and site’s requirements.

Audience/Target Group

Data analysts

Duration: 3 days

Class Prerequisites:

Proficiency in a scripting language (Python is strongly preferred; Perl or Ruby is sufficient)
Basic knowledge of Apache Hadoop
Experience working in Linux environments

What You Will Learn:

How to identify potential business use cases where data science can provide impactful results
How to obtain, clean and combine disparate data sources to create a coherent picture for analysis
What statistical methods to leverage for data exploration that will provide critical insight into your data
Where and when to leverage Hadoop streaming and Apache Spark for data science pipelines
Which machine learning technique to use for a particular data science project
How to implement and manage recommenders using Spark’s MLlib, and how to set up and evaluate data experiments
What pitfalls to avoid when deploying new analytics projects to production at scale

Course Content:

Module 1: Data Science Overview

What Is Data Science?
The Growing Need for Data Science
The Role of a Data Scientist

Module 2: Use Cases

Defense and Intelligence
Telecommunications and Utilities
Healthcare and Pharmaceuticals

Module 3: Project Lifecycle

Steps in the Project Lifecycle
Lab Scenario Explanation

Module 4: Data Acquisition

Where to Source Data
Acquisition Techniques

Module 5: Evaluating Input Data

Data Formats
Data Quantity
Data Quality

Module 6: Data Transformation

File Format Conversion
Joining Data Sets

Module 7: Data Analysis and Statistical Methods

Relationship Between Statistics and Probability
Descriptive Statistics
Inferential Statistics
Vectors and Matrices

Module 8: Fundamentals of Machine Learning

The Three C’s of Machine Learning
Importance of Data and Algorithms
Spotlight: Naive Bayes Classifiers

Module 9: Recommender Overview

What Is a Recommender System?
Types of Collaborative Filtering
Limitations of Recommender Systems
Fundamental Concepts

Module 10: Introduction to Apache Spark and MLlib

What Is Apache Spark?
Comparison to MapReduce
Fundamentals of Apache Spark
Spark’s MLlib Package

Module 11: Implementing Recommenders with MLlib

Overview of ALS Method for Latent Factor Recommenders
Hyperparameters for ALS Recommenders
Building a Recommender in MLlib
Tuning Hyperparameters

Module 12: Experimentation and Evaluation

Designing Effective Experiments
Conducting an Effective Experiment
User Interfaces for Recommenders

Module 13: Production Deployment and Beyond

Deploying to Production
Tips and Techniques for Working at Scale
Summarizing and Visualizing Results
Considerations for Improvement
Next Steps for Recommenders
