Processing Big Data with Pivotal HD

Timeline: 4 Days

Prerequisites

  • Willingness to participate in a demanding, high-intensity training experience
  • Comfort with Java programming and data technologies is a plus
  • Basic understanding of virtualization concepts

Topics

  • Introductions and course logistics
  • Course objectives
  • What is Hadoop?
  • The Hadoop ecosystem: Pig, Hive, HBase, ZooKeeper…
  • Understanding MapReduce and HDFS (Hadoop Distributed File System)
  • Ensuring data integrity (checksum…)
  • Saving space: input/output compression in Hadoop
  • Launching a Hadoop job
  • Configuring the Hadoop runtime
  • Design goals: running on commodity hardware, fault tolerance…
  • Scaling from one DataNode to hundreds of DataNodes
  • HDFS commands
  • Working with file paths
  • HDFS administration (UI, admin commands…)
  • Working with the Java API for HDFS
  • Working with a Secondary NameNode, Federated NameNodes, and High Availability NameNodes
  • MapReduce overview
  • Hadoop versions
  • Writing a mapper
  • Writing a reducer
  • Debugging and testing
  • The Writable hierarchy
  • Partitioners, Combiners, Shuffle
  • Object reuse and garbage collection optimization
  • MapReduce restrictions
  • Joins (map-side and reduce-side)
  • High-level alternatives to writing Java Mappers and Reducers
  • Hadoop streaming
  • Pig scripting
  • SQL in Hadoop
  • Hive overview
  • Hive tables and DDL
  • Partitions and external tables
  • Selecting data
  • Joins
  • Transforms & User Defined Functions (UDFs)
  • HAWQ Installation and Environment
  • Configuration and Operation Overview
  • Client access to HAWQ
  • Introduction to HAWQ SQL
  • Quick introduction to Spring JDBC and Test Support
  • Creating database tables
  • Queries
  • Joins
  • Functions
  • External Tables overview
  • Loading data with gpfdist/gpload
  • External tables with PXF
  • Loading & unloading data recap
  • Hadoop and Sqoop
  • Query Plans
  • Using ANALYZE and EXPLAIN
  • Distributions and partitioning
  • Data storage and I/O