ABOUT THIS COURSE
Interested in increasing your knowledge of the Big Data landscape?
This course is for those new to data science and interested in understanding why the Big Data Era has come to be. It is for those who want to become conversant with the terminology and the core concepts behind big data problems, applications, and systems. It is for those who want to start thinking about how Big Data might be useful in their business or career. It provides an introduction to one of the most common frameworks, Hadoop, that has made big data analysis easier and more accessible — increasing the potential for data to transform our world!
Describe the Big Data landscape, with examples of real-world big data problems and the three key sources of Big Data: people, organizations, and sensors.
Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection, monitoring, storage, analysis and reporting.
Get value out of Big Data by using a 5-step process to structure your analysis.
Identify what are and what are not big data problems, and be able to recast big data problems as data science questions.
Provide an explanation of the architectural components and programming models used for scalable big data analysis.
Summarize the features and value of core Hadoop stack components including the YARN resource and job management system, the HDFS file system and the MapReduce programming model.
Install and run a program using Hadoop!
This course is for those new to data science. No prior programming experience is needed, although the ability to install applications and use a virtual machine is necessary to complete the hands-on assignments.
Hardware Requirements:
(A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit;
(B) 8 GB RAM;
(C) 20 GB disk free.
How to find your hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking Properties; (Mac): Open Overview by clicking on the Apple menu and clicking “About This Mac.” Most computers with 8 GB RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection because you will be downloading files up to 4 GB in size.
Software Requirements: This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded and installed free of charge. Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+; VirtualBox 5+.
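The minimum hardware figures listed earlier (quad-core processor, 8 GB RAM, 20 GB free disk) can also be checked with a short script. The following is a minimal sketch, not part of the course materials: the function and constant names are illustrative, `os.cpu_count()` reports logical rather than physical cores, and because the Python standard library has no portable way to read installed RAM, that value is entered by hand.

```python
import os
import shutil

# Minimums taken from the hardware list in this course description.
# The constant names are illustrative, not from any official tool.
MIN_CORES = 4
MIN_RAM_GB = 8
MIN_FREE_DISK_GB = 20


def meets_requirements(cores, ram_gb, free_disk_gb):
    """Return a dict that maps each requirement to True or False."""
    return {
        "cores": cores >= MIN_CORES,
        "ram": ram_gb >= MIN_RAM_GB,
        "disk": free_disk_gb >= MIN_FREE_DISK_GB,
    }


def detect_cores_and_disk(path="."):
    """Detect the logical CPU count and the free disk space in GB."""
    cores = os.cpu_count() or 0  # cpu_count() may return None
    free_gb = shutil.disk_usage(path).free / 1e9
    return cores, free_gb


if __name__ == "__main__":
    cores, free_gb = detect_cores_and_disk()
    # Assumption: RAM is typed in manually; replace 8.0 with your own value.
    ram_gb = 8.0
    print(meets_requirements(cores, ram_gb, free_gb))
```

A result of `False` for any key suggests the virtual machine used in the hands-on assignments may run slowly or fail to start on that computer.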
BIG DATA IS THE FUEL FOR TODAY'S ANALYTICS APPLICATIONS
The development of big data technologies unlocked a treasure trove of information for businesses. Before that, BI and analytics applications were mostly limited to structured data stored in relational databases and data warehouses, such as transactions and financial records. A lot of potentially valuable data that didn’t fit the relational mold was left unused. No more, though.
Big data environments can be used to process, manage and analyze many different types of data. The data riches now available to organizations include customer databases and emails, internet clickstream records, log files, images, social network posts, sensor data, medical information and much more.
Companies increasingly are trying to take advantage of all that data to help drive better business strategies and decisions. In a survey of IT and business executives from 94 large companies conducted by consultancy NewVantage Partners in late 2021, 91.7% said they’re increasing their investments in big data projects and other data and AI initiatives, while 92.1% reported that their organizations are getting measurable business results and outcomes from such initiatives.

Why is big data important for businesses?

Before big data platforms and tools were developed, many organizations could use only a small fraction of their data in operational and analytics applications. The rest often got pushed to the side as so-called dark data, which is processed and stored but not put to further use. Effective big data management processes enable businesses to better utilize their data assets. Being able to do so expands the kinds of data analytics that companies can run and the business value they can get. Big data creates increased opportunities for machine learning, predictive analytics, data mining, streaming analytics, text mining and other data science and advanced analytics disciplines. Using those disciplines, big data analytics applications help businesses better understand customers, identify operational issues, detect fraudulent transactions and manage supply chains, among other uses.
If done well, the end results include more effective marketing and advertising campaigns, improved business processes, increased revenue, reduced costs and stronger strategic planning, all of which can lead to better financial results and competitive advantages over business rivals. In addition, big data contributes to breakthroughs in medical diagnoses and treatments, scientific research and smart city initiatives, law enforcement and other government programs.
What are common big data challenges?

Because of its very nature, big data tends to be challenging to process, manage and use effectively. Big data environments typically are complex, with multiple systems and tools that need to be well orchestrated to work smoothly together. The data itself is also complex, particularly when data sets are large and varied or involve streaming data.
Those issues can be broken down into the following categories:
- Technical challenges that include selecting the right big data tools and technologies and designing big data systems so they can be scaled as needed;
- Data management challenges, from processing and storing large amounts of data to cleansing, integrating, preparing and governing them;
- Analytics challenges, such as ensuring that business needs are understood and that analytics results are relevant to an organization’s business strategy; and
- Program management challenges that include keeping costs under control and finding workers with the required big data skills.
Hiring and retaining skilled workers can be particularly difficult because key contributors such as data scientists, data architects and big data engineers are in high demand.
Key elements of big data environments

Big data management and analytics initiatives involve various components and functions. These are some of their core aspects that need to be factored into project plans upfront.
The traditional data warehouse can be incorporated into big data architectures to store structured data. More commonly, though, architectures feature data lakes, which can store different data sets in their native formats and typically are built on technologies such as Spark, Hadoop, NoSQL databases and cloud object storage services. Other architectural layers support data management and analytics processes. A solid architecture also provides the underpinnings that data engineers need to create big data pipelines to funnel data into repositories and analytics applications.
Big data systems are primarily used for analytics applications, which can range from straightforward BI and reporting to various forms of advanced analytics done by data science teams. Machine learning, in particular, has benefited from the availability of big data: once mostly a scientific pursuit, it's now widely used by businesses to find patterns and anomalies in large data sets. An article by Kathleen Walch, a principal analyst and managing partner at Cognilytica, further explains how big data and machine learning algorithms can be used together to make analytics more effective.
Before sets of big data can be processed and analyzed, they need to be collected, often from both internal systems and external data sources. That can be a complicated undertaking because of the amount of data, its variety and the number of different sources that may be involved. Data security and privacy issues add to the challenges, even more so now that businesses need to comply with GDPR, CCPA and other regulations.
Integrating data sets is also a crucial task in big data environments, and it adds new requirements and challenges compared to traditional data integration processes. For example, the volume, variety and velocity characteristics of big data may not lend themselves to conventional extract, transform and load procedures. As a result, data management teams often must adopt new integration techniques for big data. Once data is integrated and ready for use, it needs to be prepared for analysis, a process that includes data discovery, cleansing, modeling, validation and other steps. In data lakes that store data in its raw form, data preparation is often done by data scientists or data engineers to fit the needs of individual analytics applications.
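The preparation steps named above (cleansing, validation and removal of duplicates) can be illustrated on a handful of records. This is a minimal single-machine sketch under stated assumptions: the field names `id` and `amount`, the rules applied, and the function names are all hypothetical, and a real big data pipeline would run equivalent logic on a distributed engine rather than in a plain Python loop.

```python
def cleanse(record):
    """Trim whitespace from strings and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def is_valid(record, required=("id", "amount")):
    """Valid means all required fields are present and amount is numeric."""
    if any(record.get(field) is None for field in required):
        return False
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        return False
    return True


def prepare(raw_records):
    """Cleanse, validate and de-duplicate by id.

    Returns a pair: (clean records, rejected raw records).
    """
    seen, clean, rejected = set(), [], []
    for raw in raw_records:
        rec = cleanse(raw)
        if not is_valid(rec) or rec["id"] in seen:
            rejected.append(raw)
            continue
        seen.add(rec["id"])
        rec["amount"] = float(rec["amount"])
        clean.append(rec)
    return clean, rejected


# Hypothetical raw input: one good row, one bad amount,
# one duplicate id and one blank id.
raw = [
    {"id": "a1", "amount": " 19.99 "},
    {"id": "a2", "amount": "n/a"},
    {"id": "a1", "amount": "19.99"},
    {"id": "  ", "amount": "5"},
]
clean, rejected = prepare(raw)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows instead of discarding them lets the same records feed the data quality checks that governance programs rely on.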
Effective data governance is also vital to help ensure that collections of big data are consistent and get used properly in compliance with privacy regulations and internal data standards alike. But governing big data poses new challenges for data governance managers because of the wide variety of data they often need to oversee now. Frequently done as part of data governance programs, data quality management is an important facet of big data deployments, too. And likewise, the combination of big data and data quality requires new processes for identifying and fixing errors and other quality issues.
Big data technologies and tools

The big data era began in earnest when the Hadoop distributed processing framework was first released in 2006, providing an open source platform that could handle diverse sets of data. A broad ecosystem of supporting technologies was built up around Hadoop, including the Spark data processing engine. In addition, various NoSQL databases were developed, offering more platforms for managing and storing data that SQL-based relational databases weren’t equipped to handle.
While Hadoop’s built-in MapReduce processing engine has been partially eclipsed by Spark and other newer technologies, it and other Hadoop components are still used by many organizations. Overall, the technologies that now are common options for big data environments include the following categories:
- Processing engines. Examples include Spark, Hadoop MapReduce and stream processing platforms such as Flink, Kafka, Samza, Storm and Spark’s Structured Streaming module.
- Storage repositories. Examples include the Hadoop Distributed File System and cloud object storage services such as Amazon Simple Storage Service and Google Cloud Storage.
- NoSQL databases. Examples include Cassandra, Couchbase, CouchDB, HBase, MarkLogic Data Hub, MongoDB, Redis and Neo4j.
- SQL query engines. Examples include Drill, Hive, Presto and Trino.
- Data lake and data warehouse platforms. Examples include Amazon Redshift, Delta Lake, Google BigQuery, Kylin and Snowflake.
- Commercial platforms and managed services. Examples include Amazon EMR, Azure HDInsight, Cloudera Data Platform and Google Cloud Dataproc.
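To make the MapReduce programming model mentioned above more concrete, here is a minimal word-count sketch in plain Python. It imitates the map, shuffle and reduce phases in a single process and does not use the Hadoop API. The function names are illustrative, and a real Hadoop job would distribute these phases across a cluster and read its input from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "Data lakes"])
print(counts)
```

Because each call to `map_phase` depends on only one line and each call to `reduce_phase` on only one key, both phases can in principle run in parallel on many machines, which is the property that lets the model scale.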
What are future trends in big data?

Increasingly, organizations are running big data systems in the cloud, often using vendor-managed platforms that provide big data as a service to simplify deployments and ongoing management. Moving to the cloud enables businesses to deal with almost limitless amounts of new data and pay for storage and compute capability on demand without having to maintain their own large and complex data centers.
The following are also notable trends:
- increasing data diversity, driven in particular by growing data volumes from IoT devices that are leading more organizations to adopt edge computing to better handle processing workloads;
- further increases in enterprise use of machine learning and other AI technologies, both for data analytics and to enable chatbots to provide better customer support with more personalized interactions; and
- wider adoption of DataOps practices for managing data flows, as well as a heightened focus on data stewardship to help organizations deal with data governance, security and privacy issues.

Big Data in Manufacturing and Natural Resources
In the natural resources industry, Big Data supports predictive modeling for decision making by ingesting and integrating large amounts of geospatial, graphical, text, and temporal data. Areas where this has been applied include seismic interpretation and reservoir characterization.
Big data has also been used in solving today’s manufacturing challenges and to gain a competitive advantage, among other benefits.
In the graphic below, a study by Deloitte shows the use of supply chain capabilities from Big Data currently in use and their expected use in the future.

Big Data in the Banking and Securities Industry
The Securities and Exchange Commission (SEC) is using Big Data to monitor financial market activity. It currently uses network analytics and natural language processing to catch illegal trading activity in the financial markets.
Retail traders, big banks, hedge funds, and other so-called ‘big boys’ in the financial markets use Big Data for trade analytics used in high-frequency trading, pre-trade decision-support analytics, sentiment measurement, predictive analytics, etc.
This industry also heavily relies on Big Data for risk analytics, including anti-money laundering, demand enterprise risk management, Know Your Customer, and fraud mitigation.
Big Data providers specific to this industry include 1010data, Panopticon Software, Streambase Systems, Nice Actimize, and Quartet FS.

Big Data in the Healthcare Sector
Some hospitals, like Beth Israel, are using data collected from a cell phone app, from millions of patients, to allow doctors to use evidence-based medicine as opposed to administering several medical/lab tests to all patients who go to the hospital. A battery of tests can be efficient, but it can also be expensive and usually ineffective.
Free public health data and Google Maps have been used by the University of Florida to create visual data that allows for faster identification and efficient analysis of healthcare information, used in tracking the spread of chronic disease. Obamacare has also utilized Big Data in a variety of ways. Big Data Providers in this industry include Recombinant Data, Humedica, Explorys, and Cerner.

Big Data in the Communications, Media and Entertainment Industry
Organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to:
- Create content for different target audiences
- Recommend content on demand
- Measure content performance
A case in point is the Wimbledon Championships, which leverages Big Data to deliver detailed sentiment analysis on the tennis matches to TV, mobile, and web users in real time.
Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.
Amazon Prime, which is driven to provide a great customer experience by offering video, music, and Kindle books in a one-stop-shop, also heavily utilizes Big Data.
Big Data Providers in this industry include Infochimps, Splunk, Pervasive Software, and Visible Measures.

Big Data in the Insurance Industry
Big data has been used in the industry to provide customer insights for transparent and simpler products, by analyzing and predicting customer behavior through data derived from social media, GPS-enabled devices, and CCTV footage. Big Data also allows insurance companies to improve customer retention.
When it comes to claims management, predictive analytics from Big Data has been used to offer faster service since massive amounts of data can be analyzed mainly in the underwriting stage. Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of claims throughout the claims cycle has been used to provide insights.
Big Data Providers in this industry include Sprint, Qualcomm, Octo Telematics, and The Climate Corp.

Big Data in the Transportation Industry
Some applications of Big Data by governments, private organizations, and individuals include:
- Government use of Big Data: traffic control, route planning, intelligent transport systems, congestion management (by predicting traffic conditions)
- Private-sector use of Big Data in transport: revenue management, technological enhancements, logistics and for competitive advantage (by consolidating shipments and optimizing freight movement)
- Individual use of Big Data includes route planning to save on fuel and time, for travel arrangements in tourism, etc.

Big Data in the Retail and Wholesale Industry
Big data from customer loyalty programs, POS systems, store inventory, and local demographics continues to be gathered by retail and wholesale stores.
At New York’s Big Show retail trade conference in 2014, companies like Microsoft, Cisco, and IBM pitched the need for the retail industry to utilize Big Data for analytics and other uses, including:
- Optimized staffing through data from shopping patterns, local events, and so on
- Reduced fraud
- Timely analysis of inventory
Social media also has a lot of potential and continues to be slowly but surely adopted, especially by brick-and-mortar stores. Social media is used for customer prospecting, customer retention, promotion of products, and more.
Big Data Providers in this industry include First Retail, First Insight, Fujitsu, Infor, Epicor, and Vistex.

Big Data in Education
Big data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, as well as the overall progress of a student over time.
In a different use case in education, Big Data is also used to measure teachers’ effectiveness to ensure a pleasant experience for both students and teachers. Teachers’ performance can be fine-tuned and measured against student numbers, subject matter, student demographics, student aspirations, behavioral classification, and several other variables.
On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using Big Data to develop analytics to help course-correct students who are going astray while using online Big Data certification courses. Click patterns are also being used to detect boredom.
Big Data Providers in this industry include Knewton, Carnegie Learning, and MyFit/Naviance.