From Beginner To Intermediate

Big Data with Amazon Cloud, Hadoop/Spark and Docker

A class for computer-literate people with no data science background who wish to learn Big Data programming.

Switchup Rating: 4.89 / 5 (317 Reviews)
Best Data Science Bootcamp, 5 Years Running

Course Report Rating: 4.84 / 5 (322 Reviews)
Best Data Science Bootcamp, 5 Years Running

Big Data with Amazon Cloud, Hadoop/Spark and Docker

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers these key components of the Hadoop ecosystem: HDFS, MapReduce with streaming, Hive, and Spark. All programming is done in Python, and the course begins with a review of the Python concepts needed for our examples. The course format is interactive, so students need to bring laptops to class.

Unit 1: Introduction to Hadoop
  • 1. Data Engineering Toolkits
    • Running Linux using Docker containers
    • Linux CLI commands and bash scripts
    • Python basics
  • 2. Hadoop and MapReduce
    • Big Data Overview
    • HDFS
    • YARN
    • MapReduce
Unit 2: MapReduce
  • 3. MapReduce using MRJob 1
    • Protocols for Input & Output
    • Filtering
  • 4. MapReduce using MRJob 2
    • Top n
    • Inverted Index
    • Multi-step Jobs
Unit 3: Apache Hive
  • 5. Apache Hive 1
    • Databases for Big Data
    • HiveQL and Querying Data
    • Windowing and Analytics Functions
    • MapReduce Scripts
  • 6. Apache Hive 2
    • Tables in Hive
    • Managed Tables and External Tables
    • Storage Formats
    • Partitions and Buckets
Unit 4: Apache Pig
  • 7. Apache Pig 1
    • Overview
    • Pig Latin: Data Types
    • Pig Latin: Relational Operators
  • 8. Apache Pig 2
    • More Pig Latin: Relational Operators
    • More Pig Latin: Functions
    • Compiling Pig to MapReduce
    • The Parallel Clause
    • Join Optimizations
Unit 5: Apache Spark and AWS
  • 9. Apache Spark – Spark Core
    • Spark Overview
    • Running Spark using Databricks Notebooks
    • Working with PySpark: RDDs
    • Transformations and Actions
  • 10. Apache Spark – Spark SQL
    • Spark DataFrame
    • SQL Operations using Spark SQL
  • 11. Apache Spark – Spark ML
    • ML Pipeline using PySpark
  • 12. Amazon Elastic MapReduce
    • Overview
    • Amazon Web Services: IAM, EC2, S3
    • Creating EMR Cluster
    • Submitting Jobs
    • Intro to AWS CLI
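
The MapReduce topics in Unit 2, such as the inverted index, all follow a map, shuffle, reduce pattern. The sketch below imitates that pattern in plain Python so it runs without a cluster. It is not MRJob or Hadoop code, and the function names and sample documents are invented for illustration only.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in one document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle step: group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, doc_ids):
    """Reduce step: collapse the doc ids for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def inverted_index(docs):
    """Run map, shuffle, and reduce in-process over a dict of {doc_id: text}."""
    pairs = (pair for doc_id, text in docs.items() for pair in mapper(doc_id, text))
    return dict(reducer(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample input, not course material.
docs = {"d1": "spark and hadoop", "d2": "hadoop and hive"}
index = inverted_index(docs)
print(index["hadoop"])  # ['d1', 'd2']
```

On a real cluster the framework performs the shuffle across machines; here it is a single dictionary, which is enough to show how each word ends up mapped to the documents that contain it.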

* Tuition paid for part-time courses can be applied to the Data Science Bootcamps if admitted within 9 months


Customer Reviews

This is an awesome program you will not regret attending! I was in the Jan-April 2016 cohort. The course covers everything you need to know to apply for data scientist jobs. We started from the fundamentals of stats in R, and moved into machine learning in both R and Python. In the last two weeks we also got a fair exposure to big data tools like Hadoop and Spark. The instructors we have are AMAZING!!! They are super knowledgeable and also very passionate about data science. TAs are the most hard-working group of people I know! They really try their best to help you. Students at the bootcamp are impressive as well. Most of them either have a PhD or have significant/successful work experience prior to joining the bootcamp. We did five projects during the bootcamp which you can totally show off during your job interviews! And you will have at least a few job interviews guaranteed during/after the bootcamp. They really tried their hardest to help you prepare for and secure job interviews. I personally had at least 5 interviews while I was still in the bootcamp, and was hired only two weeks after the bootcamp ended. The program wasn’t easy; you will have a ton of homework and projects to do, but they are always there to support and help you. I would recommend this 12-week bootcamp to anyone who wants to be a data scientist or is simply interested in data science.

Wendy Yu

The productivity of all the students, including myself, is one of the most remarkable things about my NYC Data Science Academy experience. The environment of top-tier professors, hands-on teaching assistants and extensive resources to learn from is inevitably going to prepare you well for a data science career if you are willing to put in the effort. You learn a wide array of tools and gain very good insight about the field itself. They're very proactive when it comes to helping you with your job search.

John Sipala

Reasons to Enroll


Our instructors are consistently highly rated by their students. They not only know their subject cold, they are also experts at teaching it.


Our curriculum is continuously updated to reflect the latest technology trends.


Learn on the latest technology. When you complete this course, you will have a solid foundation in Python and in the use of the Hadoop and Spark tools covered in the course.

Sam obtained their Bachelor’s Degree in Mathematics from Bard College, while dabbling in some computer science classes along the way. After school, they decided to make a break from traditional academic life, and worked for several years doing carpentry for places like West Elm, and events like Shakespeare in the Park. Eventually they would turn to the field of Data Science, wherein a passionate blend of creative and analytical thinking can lead to some robust outcomes. They’ve worked in industry building models for Fortune 500 companies that ease the hiring process, and allow for interviewees to be more than just their resume. Their interests include reading, sewing, carpentry, painting, understanding systems, and communicating information effectively to others.

Sam Audino, Data Science Bootcamp Instructor

Your Certificate of Completion


To get the most out of the class, you need to be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, knowing how to use the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
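
As a rough self-check for the map() prerequisite mentioned above, the following minimal sketch splits a list of strings into a nested list. The helper name `split_lines` and the sample strings are made up for illustration.

```python
def split_lines(lines, sep=None):
    """Split each string in `lines` into a list of fields using map().

    With sep=None, str.split() splits on any run of whitespace.
    """
    return list(map(lambda line: line.split(sep), lines))


# Hypothetical sample data.
lines = ["big data on wheels", "hadoop spark docker"]
nested = split_lines(lines)
print(nested)
# [['big', 'data', 'on', 'wheels'], ['hadoop', 'spark', 'docker']]

# The same helper works for delimited records, such as comma-separated rows.
rows = split_lines(["1,alice", "2,bob"], sep=",")
print(rows)
# [['1', 'alice'], ['2', 'bob']]
```

If this pattern reads naturally to you, you should be comfortable with the mapper-style functions used in the course.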


Certificates are awarded at the end of the program upon satisfactory completion of the course. Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete 80% of the homework and attend a minimum of 85% of all classes are eligible for the certificate of completion.



MapReduce using MRJob

Jake Bialer
NYC Data Science Academy instructor Jake Bialer walks through a lecture on MapReduce examples.