Big Data management and analytics has become increasingly important for deriving valuable and actionable insights in diverse domains such as smart cities, transportation, healthcare and financial services. At the same time, platforms such as Hadoop, which are commonly deployed in Cloud computing environments, provide the capabilities for processing, managing and analyzing such Big Data in a highly scalable manner. This course is designed to equip students with the fundamentals of Big Data management & analytics (including data mining and machine learning techniques) and to help them understand how Big Data can be processed efficiently on Cloud computing platforms. The course also has a significant “hands-on” lab component, where students will gain exposure to processing and analyzing Big Data on Hadoop.
Unit 1: Introduction to Big Data and its applications
This unit introduces the concept of Big Data and explains its four dimensions (i.e., volume, velocity, variety & veracity). It then details several applications of Big Data analytics to motivate the growing importance of Big Data in today’s world. The applications span a wide range of domains, from transportation services to finance to social media. The unit also describes how Big Data can offer businesses a high value proposition as a source of competitive advantage by improving key performance metrics such as market share and profit margins.
Unit 2: Issues associated with Big Data Management
This unit discusses the key issues that arise in the processing of Big Data. Notably, many of these issues also arise when processing data that does not fall under the Big Data category. However, they are significantly exacerbated by the very large volumes and typically high complexity of Big Data. The issues include (but are not limited to) data cleaning, data heterogeneity, data integration, replication, caching, maintenance of data consistency and scalability. The unit also covers the inherent trade-offs associated with each of these issues.
Unit 3: Concepts of Cloud computing
This unit discusses the key concepts and principles of Cloud computing and introduces Cloud-related terminology in detail. The topics covered in this unit include (but are not limited to) the pros and cons of Cloud computing, Cloud architecture, Cloud service models (IaaS, PaaS, SaaS), Cloud platforms (Azure, AWS etc.), effective resource allocation and cost efficiencies in Cloud computing, and multitenancy.
Unit 4: Hadoop and MapReduce
This unit covers the key concepts of Hadoop and MapReduce for solving real-world analytics problems associated with Big Data. The topics covered in this unit include (but are not limited to) the Hadoop Distributed File System (HDFS) and several key Hadoop-related modules or software packages such as Hive, Pig, HBase, Spark, Flume, Sqoop and Oozie. Students will not only learn the concepts behind these Hadoop packages, but also engage in hands-on development work with them to gain a deeper level of expertise.
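As a flavour of the MapReduce programming model covered in this unit, the following is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the two sample input strings are hypothetical and chosen only for illustration. On a real cluster the framework would run the map and reduce steps in parallel across many nodes and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real MapReduce framework this grouping happens between the
    map and reduce steps; here it is simulated in memory.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three steps in sequence over a list of text records."""
    pairs = []
    for record in records:
        pairs.extend(map_phase(record))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data on hadoop", "big data in the cloud"])
```

The sketch keeps the map and reduce functions free of shared state, which is the property that allows a framework to distribute them across machines.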
Unit 5: Data Models & NoSQL
This unit discusses the four key data models that are important for handling Big Data: key-value DB, column-family DB, document DB and graph DB. For each of these data models, the unit will cover some of the important real-world technologies from both a theoretical and a practical, hands-on perspective. Examples include HBase, Cassandra, Hypertable, Bigtable, DynamoDB, MongoDB, Neo4j and Redis. This unit will also present the trade-offs involved in selecting an appropriate data model, based on factors such as the requirements of the application, the specific properties of the underlying data, the complexity of performing analytics and scalability.
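To make the difference between the four data models more concrete, the sketch below represents the same small piece of user data in each model using only plain Python structures. It is an illustration of the shape of the data, not of any real database's API; every key, field name and value in it (for example "user:42", "profile", "FOLLOWS") is hypothetical.

```python
import json

# Key-value model: one key maps to an opaque value that the store
# does not interpret (here, a JSON string).
key_value = {"user:42": json.dumps({"name": "Ada", "city": "Pune"})}

# Column-family model: row key -> column family -> column -> value.
# Different rows may carry different columns within a family.
column_family = {
    "user:42": {
        "profile": {"name": "Ada", "city": "Pune"},
        "activity": {"last_login": "2024-01-01"},
    }
}

# Document model: a nested, self-describing record whose fields
# can be addressed individually.
document = {"_id": 42, "name": "Ada", "address": {"city": "Pune"}}

# Graph model: nodes with properties plus explicit, typed edges.
nodes = {42: {"name": "Ada"}, 7: {"name": "Lin"}}
edges = [(42, "FOLLOWS", 7)]


def followers_of(node_id):
    """Return the ids of nodes that have a FOLLOWS edge to node_id."""
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == node_id]


# The same question ("which city does user 42 live in?") is answered
# differently depending on the model.
city_from_key_value = json.loads(key_value["user:42"])["city"]
city_from_column_family = column_family["user:42"]["profile"]["city"]
city_from_document = document["address"]["city"]
```

Relationship queries such as followers_of are where the graph model tends to be the most natural fit, whereas simple lookups by key are the natural fit for the key-value model.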
Unit 6: Big Data Strategy and Implementation
This unit examines the business and strategic perspective of Big Data. Topics covered in this unit include (but are not limited to) a brief overview of some of the fundamental concepts of business strategy & business intelligence; understanding the key requirements of the relevant stakeholder(s); defining a Big Data strategy and creating plans for implementing it; selecting appropriate Big Data tools and technologies based on stakeholder requirements and cost-benefit trade-offs; maximizing the benefits obtained by analyzing Big Data; and maintaining a sustainable competitive advantage in the market.