Hadoop cluster:
Hadoop clusters integrate MapReduce and HDFS. This is core Hadoop. Each portion of a Hadoop cluster has a specific job and runs on a specific machine. A Hadoop cluster runs on a typical master/slave architecture with the master controlling all the data management, communication, and processing. In Hadoop, the master is called the name node. The slave is called the data node and does all the work. A cluster usually has one name node. However, in newer versions of Hadoop, there can be a secondary name node used for backup and for failover. The name node controls the whole cluster. It contains the master schema and knows the name and location in the cluster of every file on the HDFS system. The name node does not contain any data or files itself. Any client application coordinates communication to the cluster through the name node when locating, moving, copying, or deleting a file. The name node will respond by returning a list of data nodes where the data can be found. For processing MapReduce jobs, the name node contains the job tracker. The job tracker assigns the processing for MapReduce jobs to nodes across the cluster. Slave or data nodes hold all the data from the HDFS cluster and perform all the computation. Each data node has a task tracker. The task tracker monitors all the MapReduce processing within its data node and communicates results back to its name node. Data nodes do not communicate with each other. Only the name node knows how many data nodes are in a Hadoop cluster.
Hadoop clusters integrate MapReduce and HDFS. A Hadoop cluster runs on a typical master/slave architecture. The master controls all the data management. The name node controls the whole cluster. The name node does not contain any files itself. The name node contains the job tracker. The job tracker assigns the processing for MapReduce jobs to data nodes across the cluster. The slave (data) nodes hold all the data from the HDFS. Each data node has a task tracker. The task tracker monitors all MapReduce processing. Data nodes do not communicate with each other.
Hadoop, by default, is configured to run in local standalone mode or as a single node cluster. This is the easiest way to learn Hadoop. For more advanced uses, Hadoop can be configured in pseudo-distributed mode or fully-distributed mode. As you may guess, both these modes require more complex configuration than running as a single node cluster. Also, when running in fully-distributed mode, the data nodes need to be configured to communicate with the name node. When running in pseudo-distributed mode, Hadoop runs each of its daemons in its own Java thread. This simulates running a true cluster, but on a small scale. Data nodes and their associated task trackers are virtual. They exist on the same physical machine as the name node. Pseudo-distributed mode also differs from local standalone mode in that the HDFS is used rather than the native file system used in standalone mode. In pseudo-distributed mode, there will be a single master node with a single job tracker, which is responsible for scheduling the tasks of a job on the data nodes. The job tracker also monitors the progress of the slave nodes and reassigns unsuccessful tasks for re-execution. For a single slave node, there will be a task tracker to execute the tasks as per the master node's directions. When setting up Hadoop in pseudo-distributed mode, three configuration changes need to be made – one to each of the core-site, hdfs-site, and mapred-site configuration files (a sketch of these files follows the summary below).
Hadoop, by default, is configured to run in the local standalone mode. Hadoop may be configured in the pseudo-distributed mode. Hadoop may be configured in the fully distributed mode. When running in the fully distributed mode, data nodes need to be configured to communicate with the name node. In the pseudo-distributed mode, Hadoop runs each of its daemons in its own Java thread. Data nodes and their associated task trackers are virtual. The pseudo-distributed mode also differs from the local standalone mode. In the pseudo-distributed mode, the HDFS is used rather than the native file system. In the pseudo-distributed mode, there will be a single master node. The job tracker also monitors the progress of slave nodes. For a single slave node, there will be a single task tracker.
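As a rough sketch of the three files mentioned above – assuming a classic Hadoop 1.x style single-node setup, where exact property names, ports, and values vary by version – the configuration changes typically look like the following:
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
The first entry points clients at the HDFS running on the local machine, the second drops block replication to one because there is only one physical machine, and the third tells MapReduce where the job tracker runs.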
Hadoop creation and evolution:
Since core Hadoop is so sparse, there are a myriad of tools and technologies that are used to augment and, in some cases, improve Hadoop's core technologies. There are so many of these tools, it is impossible to count them all. The virtual space where these tools are developed for Hadoop has a name – it's called the Hadoop ecosystem. In the Hadoop ecosystem, there are commercial vendors creating proprietary tools. There are also developers and researchers creating small niche applications. When you start to develop in Hadoop, and perhaps start to customize your own Hadoop solution, you become part of this Hadoop ecosystem as well. The HDFS has alternatives and supplements that work within the Hadoop ecosystem. For example, Red Hat offers a product for large-scale, network-attached file storage named GlusterFS. It was originally developed by a company named Gluster Incorporated. Red Hat acquired GlusterFS through its purchase of Gluster Incorporated in 2011. The next year, Red Hat Storage Server was released as an enterprise-supported version of GlusterFS integrated with Red Hat Enterprise Linux. In addition, Ceph is a storage platform architected to provide file, block, and object storage – all from a single cluster of servers. The main objective of Ceph is to be completely distributed and eliminate any single point of failure. Ceph also attempts to be exponentially scalable, into the exabyte level. Like many tools in the ecosystem, Ceph is free. Ceph replicates data, making it fault tolerant. A drawback of Ceph is that it does not run well on certain versions of Hadoop.
In the Hadoop ecosystem, there are tools that augment or replace portions of Hadoop. These tools exist in a virtual space called the Hadoop ecosystem. In the Hadoop ecosystem, there are large commercial vendors creating proprietary tools. There are also developers and researchers creating small niche applications. The HDFS has alternatives and supplements. Red Hat offers a product for large-scale network storage called GlusterFS. Ceph is a storage platform architected to provide file, block, and object storage. Ceph attempts to be exponentially scalable. Ceph replicates data, making it fault tolerant.
Apache MapReduce also has many supplementary tools. Apache Pig is an engine used for executing parallelized data flows on Hadoop. Pig even has its own language, aptly named Pig Latin, which is used to express these data flows. Pig Latin gives developers the ability to develop their own functions for reading, processing, and writing data. Pig Latin also includes familiar operators for legacy data operations such as sort, filter, and join. Pig runs over Hadoop and by default uses the Hadoop Distributed File System and Hadoop MapReduce. There are also tools for developers who like to use SQL. For example, Apache Hive is a data warehousing infrastructure developed by Facebook. Hive is an OLAP tool used for data aggregation, deep queries, and data analysis. It also has a SQL-like language called HiveQL. Facebook has recently open sourced Presto. Presto is a SQL engine that is up to five times faster than Hive for querying very large files in the Hadoop Distributed File System and other file systems in the Hadoop ecosystem. Scheduling software can also be found in the Hadoop ecosystem. Apache Oozie is a workflow scheduling system for MapReduce jobs. Oozie uses directed acyclic graphs. Oozie Coordinator can launch MapReduce jobs one at a time or by data availability. LinkedIn also has a workflow manager called Azkaban. Azkaban is a batch scheduler that combines the functionality of the cron and make utilities and provides a graphical user interface.
MapReduce also has many supplementary tools. Apache Pig is an engine used for analyzing multiple data flows on Hadoop. Pig Latin gives developers the ability to develop their own functions. Pig Latin also includes familiar operators for legacy data operations. Pig Latin runs over Hadoop. Apache Hive is a data warehousing infrastructure developed by Facebook. Apache Hive is an OLAP tool used for data aggregation. It has a SQL-like language called HiveQL. Facebook has recently open sourced Presto. Presto is a SQL engine that is up to five times faster than Apache Hive. Scheduling software can also be found in the Hadoop ecosystem. Apache Oozie is a workflow scheduling system for MapReduce jobs. LinkedIn also has a workflow scheduling system called Azkaban. Azkaban is a batch scheduler. Azkaban has a user-friendly graphical user interface.
Use of YARN:
Hadoop YARN stands for Yet Another Resource Negotiator. To understand YARN, you first need to understand the evolution of Hadoop. Apache Hadoop has been the de facto open source Big Data platform since 2005. Large companies, such as Yahoo!, Facebook, and eBay, have used Hadoop at the core of their Big Data operations. The widespread use of Hadoop has led to the identification of limitations within the core framework. Many of these limitations were solved in the next large Hadoop release – commonly called Hadoop 2. Hadoop 2 contains significant improvements in the HDFS distributed store layer, especially in the areas of snapshots, NFS, and high availability. YARN is the next generation of the MapReduce computing framework, built from the ground up based on the lessons learned from the first version of Hadoop. YARN has been running in Hadoop installations for about two years. Yahoo! and Facebook run two notable large clusters that use YARN. The first version of Hadoop was built for large Big Data batch applications. Every application, no matter if it was batch, interactive, or online, was forced to use the same infrastructure. This forced the creation of silos to manage the mixed workload. In this scenario, the job tracker manages cluster resources and job scheduling, and the per-node task tracker manages the tasks of its node. This design has its limitations; everything is forced to look like MapReduce. There is a theoretical limit to the size of the cluster and the tasks it can process. Cluster failure can kill running or queued jobs, and resource utilization is not optimal.
Hadoop YARN stands for "Yet Another Resource Negotiator." The widespread use of Hadoop led to the identification of limitations within the core framework. Many of these limitations were solved in the next large Hadoop release. YARN is the next generation of the MapReduce computing framework. Yahoo! and Facebook are two notable large clusters that use YARN. The first version of Hadoop was built for large Big Data batch applications. All usage patterns must use the same infrastructure. This forced the creation of silos to manage the mixed workload. The job tracker manager clusters resources and job scheduling.
Hadoop 2 solved many problems of the original Hadoop release. Hadoop 2 is a multipurpose platform that supports many different kinds of jobs, such as batch, interactive, streaming, and online jobs. In Hadoop 1, MapReduce was responsible for data processing as well as cluster resource management. In Hadoop 2, MapReduce is only responsible for data processing. YARN has been added to Hadoop 2 to handle cluster resource management. The YARN architecture includes: a resource manager (RM) that acts as a central agent, managing and allocating cluster resources; a node manager (NM) that manages and enforces node resources; and an application master (AM) that manages each application's life cycle and task scheduling. YARN takes Hadoop beyond batch and allows all of the data to be stored in one place. All types of applications – batch, interactive, online, streaming, et cetera – can now be run natively within Hadoop. YARN is the layer that sits between the running applications and HDFS2's redundant, reliable storage. The benefits of YARN are many. With YARN, new applications and services are possible. Cluster utilization is also greatly improved since Hadoop 1. Since cluster utilization is optimized, there is no theoretical limit to the size a cluster can grow or how many tasks can run on it. YARN also makes the cluster more agile and able to share services. Finally, since YARN separates MapReduce from the HDFS, the Hadoop environment is much more suitable for applications that can't wait for other batch jobs to finish.
Hadoop 2 solved many of the problems of the original Hadoop release. MapReduce was responsible for data processing as well as cluster resource management. Now MapReduce is only responsible for data processing. The YARN architecture includes a resource manager. YARN takes Hadoop beyond batch processing and allows all of the data to be stored in one place. All types of applications, from batch and interactive to online and streaming, can now run natively in Hadoop. YARN is the layer that sits between the running applications and the HDFS2's redundant and reliable storage. The benefits of YARN are many. Cluster utilization is also greatly improved from Hadoop 1. There is no theoretical limit to the size a cluster can grow. YARN also makes the cluster more agile. The Hadoop environment is much more suitable for applications that cannot wait for other batch jobs to finish.
Components of Hadoop:
You may be surprised how sparse core Hadoop might seem; there are only a few core components. One of the most alluring features of Hadoop is how extensible it is. Very few organizations run only core Hadoop. In this topic, I illustrate and describe the components of core Hadoop. By the end of this topic, you will know the different parts of Hadoop and what purposes they serve. Core Hadoop keeps its data in what is called the Hadoop Distributed File System, also known as the HDFS. In the HDFS, data is stored as large blocks inside the Hadoop cluster across different physical systems or nodes. Data is also replicated in the HDFS, meaning that if a node containing a block of data goes down or becomes otherwise unreachable, that block of data can be obtained from a different data node because data is redundantly distributed across several nodes. The HDFS ensures data protection, performance, and reliability. The HDFS can be standalone, meaning it has other uses outside of Hadoop. MapReduce is a software framework and programming model first developed by Google in 2004. The model is inspired by the map and reduce functions that are common in functional programming. MapReduce is used to simplify the processing of very large datasets over large clusters – thousands of nodes of inexpensive off-the-shelf computers – in a fault-tolerant, reliable way. Computational processing takes place on all types of data – structured, semi-structured, and unstructured. Hadoop MapReduce has been a highly popular and widely accepted parallel computing algorithm. Because of this, MapReduce is one of the core components of Hadoop and is generally accepted to be a large reason for the success of the Hadoop framework. Hadoop MapReduce is at the center of many third-party commercial distributions.
Core Hadoop keeps its data in what is called the Hadoop Distributed File System. Data is stored as large blocks. Data is also replicated in the HDFS. Data is redundantly distributed across several nodes. The HDFS can be standalone. MapReduce is a programming model and software framework. It is inspired by the map and reduce functions commonly used in functional programming. Computational processing occurs on both structured and unstructured data. It is a core component of the Hadoop platform.
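The functional-programming inspiration can be illustrated with plain Java; the snippet below is only an analogy using the standard streams library, not Hadoop code:
import java.util.Arrays;
import java.util.List;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("big", "data", "hadoop");
        // "Map" step: transform each element independently (here, each word to its length).
        // "Reduce" step: combine the mapped values into a single result (here, a total).
        int totalLetters = words.stream()
                .map(String::length)
                .reduce(0, Integer::sum);
        System.out.println(totalLetters); // prints 13
    }
}
Hadoop MapReduce applies the same two-step idea, except that the map and reduce steps run in parallel across the data nodes of a cluster instead of inside a single program.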
Hive – an Apache project – is an OLAP framework built over the Hadoop HDFS and is designed for scalable data analysis, querying, and management of large amounts of data. Hive is not part of core Hadoop. Its data structure is similar to that of traditional database management systems, with tables that support queries such as unions, joins, selects, and subqueries. HiveQL – a query language similar to SQL – is used to query the Hive database. HiveQL supports DDL operations and the creation and insertion of records, but does not allow DML operations such as the deletion and updating of records. Access methods such as the CLI, JDBC, ODBC, and Thrift are used to access the data in the Apache Hive framework and to interface with external systems for business continuity. In Hive, the basic structural unit is a table, which has an associated HDFS directory. Each table can have one or more partitions, and the data is evenly distributed within the HDFS through subdirectories. Each partition is further divided into buckets, which represent a hash of column data and are stored as files inside a partition (a short HiveQL sketch of this structure follows the summary below). Hive is popularly used by organizations for data mining, ad hoc querying, reporting, and research into the latest trends. Mostly used for batch jobs over large sets of append-only data like web logs, Hive is known for scalability, extensibility, fault tolerance, and loose coupling with its input format. Like Hive, Apache Pig is a tool used for data extraction. Pig is not part of core Hadoop and must be installed separately. Pig has its own language called Pig Latin and uses a SQL-like interface to allow the developer to access data. Users like Pig as it is simple and has the look and feel of tools for legacy database management systems. Developers can use languages such as Java, JavaScript, and Python to write custom operations called UDFs, or user-defined functions. Pig and Hive are the two most popular data management tools in the Apache Hadoop ecosystem.
Apache Hive, an Apache project, is an OLAP framework built over the Hadoop HDFS. Apache Hive is not part of core Hadoop. Its data structure is similar to traditional relational databases with tables. HiveQL is a query language similar to SQL. HiveQL supports DDL operations, such as the creation and insertion of records. In Apache Hive, the basic structural unit is a table, which has an associated HDFS directory. Each partition is further divided into buckets. Apache Hive is popularly used by organizations for data mining. Apache Hive is known for scalability, extensibility, and fault tolerance. Apache Pig is a tool used for data extraction. Apache Pig is not part of core Hadoop and must be installed separately. Multiple versions of Apache Pig exist. Users like Apache Pig as it is simple and has the look and feel of legacy tools. Apache Pig and Apache Hive are the two most popular data management tools in the Apache Hadoop ecosystem.
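The table, partition, and bucket structure described above can be sketched in HiveQL. The table and column names below are invented for illustration and are not from the course:
CREATE TABLE weblogs (ip STRING, url STRING, bytes INT)
PARTITIONED BY (log_date STRING)
CLUSTERED BY (ip) INTO 32 BUCKETS;

SELECT url, COUNT(*)
FROM weblogs
WHERE log_date = '2014-10-01'
GROUP BY url;
Each value of log_date becomes an HDFS subdirectory of the table's directory, and each partition is hashed on ip into 32 bucket files, which is exactly the table, partition, and bucket hierarchy described above.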
Different types of data:
In this demo, we are going to talk about the three different types of data – structured data, semi-structured data, and unstructured data – and talk about how understanding the differences between the three is important for your understanding of Big Data and Hadoop in general. Let's start with structured data. I'm going to connect to Oracle and look at an Oracle table. Let's look at a table called ledger. So, if I type in desc ledger;, I get some metadata about the table itself. Now this is structured data. Structured data has a schema and it has a specific order in which it's stored. Here in a relational database management system, we have structured data in tables, where we have a table name, and we have the names of the columns and their data types. Up until just very recently, we would just call this data; but in the world of Big Data, we call this structured data because it has a predefined format, it has a predefined place where it is stored, and it has some schema information and some metadata information. And it also has a predetermined way that we can get data out of it. For example, if I run the SQL statement select person from ledger;, I get the PERSON column displayed back to me from the ledger table. So in addition to having metadata and having predefined formats, we also have, basically, a predefined way in which we can interact with the data, in this case being SQL. Okay, let's Close this,
The presenter opens the Run SQL Command Line window. In the window, the presenter types the following lines of code:
SQL> connect system/december23
Connected.
SQL> desc ledger;
A table showing information about the names and types of structured data appears.
The presenter then types the following command in the window:
SQL> select person from ledger;
The window now shows information about several people.
and let's talk about semi-structured data. What I have here is an XML document – that's a decent example of something that we would call semi-structured. Now we have some sort of information about the data itself. For example, if you look at the root element of this XML document, we could say that we have an XML document of BradyGirls that has specific Girl elements, and within each Girl element we've got an FName, an LName, an Age, a Hobby, and a School. So we have some sort of structure about what this data is, but it's not as detailed as what we would expect to see when we are looking at the Oracle table. Again, when we looked at the Oracle table, we had structure. We had metadata that described not only what these data elements were, but exactly what data types they were. And also within Oracle, we had a predefined way in which the data was actually stored. Here, not so much, because if you look at each of these Girl elements, it doesn't really define, you know, where they came from. They could come from different sources. For example, this semi-structured data actually could have come from a more structured data source, such as a database table, or it could have come from something a little bit more unstructured such as, maybe, a PDF file or, maybe, even an Excel spreadsheet. So semi-structured data has some sort of identification of the different data elements, but it doesn't have anything that shows what the data types are or, maybe, the source or the destination. So let me Close this.
The presenter opens the semistructured-Notepad file. The file contains the following xml code:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="https://yorkville.rice.iit.edu/fall2014/529/firstxsl.xsl"?>
<BradyGirls>
<Girl>
<FName>Marsha</FName>
<LName>Brady</LName>
<Age>17</Age>
<Hobby>Cheerleading</Hobby>
<School>Westdale High</School>
</Girl>
<Girl>
<FName>Jane</FName>
<LName>Brady</LName>
<Age>15</Age>
<Hobby>Sports</Hobby>
<School>Fillmore Jr</School>
</Girl>
<Girl>
<FName>Cindy</FName>
<LName>Brady</LName>
<Age>9</Age>
<Hobby>Doll</Hobby>
<School>Clinton Elementary</School>
</Girl>
</BradyGirls>
The third type of data is called unstructured data. Let me click this icon and a photograph will open. Now by definition, unstructured data is data that's not structured or semi-structured, which means it's kind of a huge catch-all for things that we can't categorize. Now remember, in a structured data system, we have an overlying schema or data dictionary that tells us a lot about our data. In the Oracle example we went over, we had a schema that described the table names, the column names, the data types of the columns, and even some areas that described the relationships between the tables. In the XML document – the example of semi-structured data we went over – we had a little bit less. We had only elements that described what the name of the data was, nothing about the source or the destination of the data. Now here in an unstructured system, we have no overlying schema at all. There is nothing that describes this data, or any way to describe this data, without some kind of application.
The presenter then opens a photo titled unstructured. The photo shows a husband and wife with two kids.
So what we try to do is we try to make unstructured data a little bit more structured. For example, what we can do, perhaps, is take this photograph and insert it into a structured system such as a relational database management table. But, if we were to do that, we would lose all meaning of the photo itself. We would just know that we have a photo saved in a table. We would have zero context. So what we can do is try to derive some properties about this photo. We can say this photo has four people in it and we could describe...maybe put in some columns that would describe if they are wearing shoes or not, or if they are wearing jewelry or not, or if they are wearing a hat or some sort of ribbon in their hair. And we would be able to measure those and actually put them in a more structured system. But even if we did that, we would still lose a lot of the intangible content and the intangible context. For example, what is the weather? Is it raining? Is it snowing? Is the sun shining? What is the mood of the people in this photo? Look at each individual person and determine what their mood is. Also, we would lose things such as whether this was taken in a rural setting or an urban setting. What kind of clothes are they wearing? Are they wearing professional clothes? Are they wearing casual clothes? Are they wearing formal clothes? Now these are all the intangibles that Big Data needs, because these intangibles drive product offerings and they drive marketing campaigns. Now this is something that Hadoop really helps us with – the processing of unstructured data.
Four types of NoSQL databases:
NoSQL databases are basically classified into four different types based on their data-storage methodology. First are the key-value store databases – they store data as large hash tables with unique keys and specific values indexed to them. Second are the column family store databases – they store data in blocks built from single columns. These columns can be grouped together, and they have a unique key identifier. Third are the document databases – they store data in documents composed of tagged elements. Last is the graph family of databases. Graph databases store data in graphical format, using nodes and edges to represent the data in the form of a network. Key-value databases use a schema-less, or no-schema, format for storing data. Keys can be synthetic or automatically generated. Values can be in almost any format, such as strings, JSON, or BLOBs. These databases use the hash-table concept, where there is a unique key and a pointer that points to the item's data in the table. A collection of keys is called a bucket. There may be identical keys in different buckets. Performance is increased due to the caching of the mapping between the keys, the pointers, and the data. Since the model only involves keys, values, and buckets, scalability in both the horizontal and vertical dimensions is comparatively easy. Since there is no schema-defined structure, querying or updating a particular value will not be as efficient, particularly when there are multiple simultaneous transactions. If the volume of data increases, you may need to generate complex character strings to maintain the uniqueness of the keys among an extremely large dataset.
The four main types of NoSQL database are key-value stores, column family stores, document databases, and graph databases. Key-value stores implement the hash table methodology. They have unique keys and values with index pointers. They are simple and easy to implement. The pros of key-value databases are performance, scalability, and flexibility. The con of key-value databases is that they have no schema. Key-value databases work best with huge data sets that are less complex.
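The hash-table idea behind a key-value store can be illustrated with an ordinary Java map. This is purely an analogy and not the API of any particular NoSQL product; the keys and values are made up:
import java.util.HashMap;
import java.util.Map;

public class KeyValueAnalogy {
    public static void main(String[] args) {
        // A key-value store behaves like a large, distributed hash table:
        // a unique key maps to an opaque value (a string, JSON, a BLOB, and so on).
        Map<String, String> bucket = new HashMap<>();
        bucket.put("user:1001", "{\"name\":\"Marsha\",\"school\":\"Westdale High\"}");
        bucket.put("user:1002", "{\"name\":\"Jane\",\"school\":\"Fillmore Jr\"}");

        // Lookup by key is cheap. Querying by value (for example, finding every
        // user at a given school) is not, which is the trade-off noted above.
        System.out.println(bucket.get("user:1001"));
    }
}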
In the column family of data stores, data is stored and accessed through column cells instead of rows. Columns are clustered into column families. There are no limitations on the number of columns in a column family. Columns can be created either at runtime or at compile time when defining the schema. Unlike in an RDBMS, where different rows are stored in different locations on the disk, performance is increased because the columns or cells related to a specific column family are stored in contiguous disk entries. This enables faster searches and access of the data. The portions of the column family data model are: the column, which is a list of data with a name and a value; the key, which is a unique identifier comprising a set of columns; the column family, which is a group of columns; and the super family, which is a group of column families.
The column family store stores data using unique keys and multiple columns. The main concept of the column family store is creating columns using column families and super columns. The column family store works best with complex data sets. The pros of the column family store are scalability and performance. The con of the column family store is its flexibility.
In the document family of data stores, data is stored in the form of documents, which are themselves a compressed form of key-value pairs. This is very similar to key-value pair databases. The important difference is that the data stored in documents comprises complex structures that manage data in the form of XML, JSON, BSON, et cetera. Performance improves since the document database stores all of the metadata about the stored data and its attributes. This enables content-driven querying of data across different documents. This feature is not available in key-value databases. Scalability is also easy since there is no defined schema. It is easy to add fields to JSON and BSON documents without having to actually define them. One of the cons is querying, which tends to be very slow. The graph family of databases uses a graphical representation with nodes and edges to address the flexibility and scalability concerns. Graph properties are used for index-free adjacency to represent the relationships between the nodes via edges. Nodes are defined and labeled with certain properties. Edges are directional and labeled, and they represent a relation. Graph databases are best with data that has complex but definitive relations. They are flexible, and they scale very well.
In document databases, each document can be identified with a unique key. Each document, for example a JSON document, acts as a value. Document databases are also said to hold semi-structured data. They have interrelated values associated with keys. Document databases work best with file- and document-related complex data. The pros of document databases are flexibility, scalability, and performance. The con of document databases is querying. Graph theory is implemented in a flexible graph model. Data can be represented with any number (N) of interrelated relations as edges between nodes or documents. There are no columns or rows in the graph model. The pros of the graph model are flexibility and scalability. The cons of the graph model are clustering and performance.
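For illustration only, a single record in a document database might look like the following JSON; the field names are invented and the unique-key convention varies by product:
{
  "id": "order-1001",
  "customer": { "name": "Jane", "city": "Westdale" },
  "items": [
    { "sku": "A12", "qty": 2 },
    { "sku": "B7", "qty": 1 }
  ]
}
Unlike a plain key-value entry, the store keeps the field names as metadata, which is what makes content-driven queries across documents possible.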
HDFS overview:
One of the core pieces of Hadoop is the Hadoop Distributed File System, or the HDFS. The HDFS is architected to hold very large files and data, and to offer very fast transaction rates when interacting with the data. File storage in the HDFS is redundant. Each file segment is stored in at least three places. This redundancy leads to high availability of the data. The HDFS is designed to operate even if a data node goes down or becomes otherwise unreachable. In this lesson, you will understand the basics of the Hadoop Distributed File System. By the end of this lesson, you should understand the basics of the Hadoop Distributed File System and how it is used. You should also know how the HDFS is architected and structured, and know how the HDFS compares to other file system models. Like most other file systems, the HDFS breaks its files down into blocks. All files are split up into blocks of a fixed size. These blocks are huge – about 64 megabytes. Each of these blocks is stored in multiple places across the Hadoop cluster. Each computer on a Hadoop cluster that stores data is called a data node. A file, of course, is made up of many blocks, and each block is stored on multiple data nodes. This distribution of blocks means that there are multiple ways to reconstruct a file. Since the HDFS has such large block sizes, it is a good fit for Big Data applications that need a massive amount of storage capacity across multiple machines. More machines mean more data, more storage capacity, and more data redundancy.
One of the core pieces of Hadoop is the Hadoop Distributed File System. File storage in HDFS is redundant. Each file segment is stored in at least three places. HDFS is designed to operate even if data nodes go down.
The Google File System – the GFS – was the inspirational design for the HDFS. The Google File System incorporates the concept of having multiple computers serving a single file. But the GFS did not fully account for the loss of one of these computers. In the case of such a failure, a file on the GFS would be useless. The HDFS improves on the GFS design by introducing data replication. The same block of a file may be distributed across multiple data nodes; the default is three replicas. The HDFS is also scalable. Need more storage capacity? Just add more data nodes. The HDFS is architected for Big Data applications. It has few other obvious uses. Many performance and usability tradeoffs have been made to this effect. The HDFS does not pretend to be everything to everybody. The HDFS contrasts in design with the Network File System, or the NFS. The NFS is one of the most popular file systems. It has been around for a long time. The NFS is pretty easy to understand, but it also has some limitations that do not lend themselves to Big Data applications. In the NFS, remote access to a single logical volume is provided by accessing storage on a single machine. Remote clients can access files as though they were local. This works well for most applications, but not for Big Data applications, where the HDFS is a much better fit. All of this data needs to be managed. In the HDFS, data management occurs on a single computer called the name node. The name node only needs to know a few things about the files it controls on the data nodes, such as the name of the file, permissions, and the locations of the file blocks. If an application needs to open a file, a request is made to the name node. The name node retrieves the file – perhaps in parallel – assembles it, and returns it to the client. Client applications only deal with the name node.
The Google File System was the inspiration for the design of the HDFS. HDFS improves on the GFS design by introducing data replication. HDFS is scalable. HDFS contrasts in design to the Network File System. The Network File System is one of the most popular file systems. The Network File System is pretty easy to understand and works well with most applications. HDFS is a better fit for Big Data applications. The name node manages all the data. The name node only needs to know a few things about the files. The name node controls the data node. If an application needs to open a file, a request is made to the name node.
Basics of HDFS:
In this lesson, you will learn about the HDFS and learn basic HDFS navigation operations. You'll learn how to navigate the command line and look inside the HDFS. The first step is to start the HDFS with the following command – user@namenode:hadoop$ bin/start-dfs.sh. There is a lot happening in the above script. First it will start the name node machine. Most likely, this is the machine that you are logged into. Then the data node instances on each of the slave nodes will be started. If this installation is a single machine cluster, the data nodes are virtual and exist on the same physical machine as the name node. On a real Hadoop cluster, this startup script will also SSH into each of the data nodes and start them. Let's start with a few commands. Issue someone@anynode:hadoop$ bin/hadoop dfs -ls and look to see if there are any contents in the HDFS. If this is a new installation, there should be nothing there. The -ls command should look familiar to many of you. If you issue it here, it returns nothing, as no files are yet present in your Hadoop cluster. The command attempted to list the contents of your home directory in the HDFS. Remember that your home directory in the HDFS is completely separate from your home directory on the host computer. The Hadoop Distributed File System operates in its own namespace, separate from the local file system. Also note that the HDFS has no concept of a current working directory, so the cd command doesn't even exist here. In this example, if you run the -ls command without an argument, there may be a few initial directory contents. These are system objects created in the installation process. Each of these objects is created by a username. The username hadoop is the system name under which the Hadoop daemons – data node, name node, et cetera – were started. The name supergroup is a group whose members include the username that started the HDFS instance.
The command to start the HDFS is user@namenode:hadoop$ bin/start-dfs.sh. The command to list the contents of the HDFS is someone@anynode:hadoop$ bin/hadoop dfs -ls. The name "hadoop" is the system name under which the Hadoop daemons were started. The name "supergroup" is a group whose members include the username that started the HDFS instance.
If it does not exist, create the user directory – someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user. The /user directory may already exist. If it does, you don't need to issue the command to create it. The next step is to create your home directory – someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/person. Of course, replace person in /user/person with your username or, if you like, repeat this process with other user IDs. Of course, there are more commands in bin/hadoop dfs. We only covered a few of them, just the basics to get you started. If you want to see all the commands that can be executed in FS Shell, just execute bin/hadoop dfs with no parameters. Also, if you forget what a command does, just type the following – bin/hadoop dfs -help commandName. The system will display a short explanation of what that command does. The commands we have used so far are in the DFS module. This module has all the common commands for file and directory manipulation, and they will work for all objects in the file system. For administrative control, the dfsadmin module commands are used to query and manipulate the whole system. For system status, a status report for the Hadoop Distributed File System can be obtained with the report command. This report will retrieve information about the running status of the HDFS cluster as well as other diagnostic metrics. To shut down the Hadoop Distributed File System, log in to the name node (if not logged in already) and issue the stop-dfs command (examples of the report and shutdown commands follow the steps below).
The steps to create a home directory are as follows:
1. someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
2. someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/person
You will execute the "bin/hadoop dfs" command with no parameters to see all commands. If you forget what a command does, type bin/hadoop dfs -help commandName.
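For example, the administrative commands mentioned in this lesson look like the following; the prompts match the earlier examples and the output will vary by installation:
someone@anynode:hadoop$ bin/hadoop dfsadmin -report
user@namenode:hadoop$ bin/stop-dfs.sh
The first command prints the status report for the HDFS cluster, and the second shuts the HDFS down from the name node.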
HDFS file operations:
In this topic, you will learn how to perform file operations within the HDFS. You will understand how to add files, how to list files, and how to see their properties. We will start by inserting data into the cluster. There are a few differences you might notice. In Linux and UNIX, an individual user's files are in the /home/user folder. In the Hadoop HDFS, a user's files are in the /user/user folder. Also, if a command line command, like ls, requires a directory name and one is not supplied, the HDFS assumes the default directory name. Some commands require you to explicitly specify the full source and destination paths. If you use any relative paths as arguments to the Hadoop Distributed File System or MapReduce, the paths are considered to be relative to the base directory. You will find that moving data around in the HDFS is fairly easy and intuitive. If you have experience using FTP or even DOS, the HDFS will come easy to you. For example, to upload – or really, to insert – a new file into the HDFS, you can use the put command. And to list files, you can use the ls command. By using these two commands, you can easily add files into the HDFS, list them, and list their associated file properties.
Data management within HDFS is simple and is somewhat similar to DOS or FTP. You use the put command to upload files. The put command inserts files into HDFS. The ls command lists files in HDFS. The put command can also be used to move more than single files. Entire folders can be uploaded with the put command.
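For example, assuming your HDFS home directory is /user/person as created in the previous lesson, and using placeholder local names, an upload and a listing look like this:
someone@anynode:hadoop$ bin/hadoop dfs -put myfile.txt /user/person/
someone@anynode:hadoop$ bin/hadoop dfs -put mydir /user/person/mydir
someone@anynode:hadoop$ bin/hadoop dfs -ls /user/person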
The put command can also be used to move more than one single file. It can be used to upload multiple files or even entire folders or directory trees into the HDFS. To do this, use the put command on the folder. A listing then shows whether the file tree was uploaded correctly. Notice that in addition to the file path, the number of replicas of the file is listed. This is denoted by the "R" and the number that comes after it. Thus R1 means that there is one replica of the file, and R3 means that there are three replicas of the file. Also present are the file size, upload time, owner, and file permission information. An alternative to the put command is copyFromLocal. The functionality of put and copyFromLocal is identical. The put command on a file either fully succeeds or fully fails. When the command uploads a file into the HDFS, it first copies the contents of the file onto the data nodes. When the data nodes have indicated that the file content is transferred, the put command completes and the file is made available to the system. There will never be a state where the file is partially uploaded. When the file is available, the data nodes will indicate to the system that the file is ready.
The put command on a file is either fully successful or fully failed. The file is first copied to data nodes and the data nodes monitor the transfer. Files are never partially uploaded. Data nodes indicate when the transfer is successful.
There are a few ways to retrieve data from the Hadoop Distributed File System. An easy way is to use stdout with the cat command to display the contents of a file. The cat command can also be used to pipe data to other destinations or applications. If you have not done so already, upload a few files into the HDFS. This example assumes that you have a file named "foo" in your home directory in the HDFS. Display the data with cat. Next, use get to copy a file from the Hadoop Distributed File System to the local file system. The get command is the opposite of put. The get command will copy a file or directory structure from the HDFS into a folder in the local file system. Another way to do the same thing is by using the copyToLocal command. Like the put command, get will operate on directories in addition to individual files.
Data can also be retrieved from HDFS by sending it to stdout with the cat command. The cat command can also be used to pipe data to other destinations.
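Continuing the example above with a file named foo in your HDFS home directory, and using localfoo as a placeholder name for the local copy:
someone@anynode:hadoop$ bin/hadoop dfs -cat foo
someone@anynode:hadoop$ bin/hadoop dfs -get foo localfoo
someone@anynode:hadoop$ bin/hadoop dfs -copyToLocal foo localfoo
The get and copyToLocal commands produce the same result; both copy foo out of the HDFS into the local file system.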
Basic principles of MapReduce:
MapReduce is an algorithm used to divide up Big Data jobs. The MapReduce algorithm has been around for a long time. Although Hadoop did not invent MapReduce, one of the core components in Hadoop is the MapReduce engine and how it distributes work to nodes in a cluster. In this topic, you will learn the basic principles of MapReduce and general mapping issues. You will also be able to describe the two steps in MapReduce using mappers and reducers. In short, you will learn how MapReduce is part of the Hadoop framework and how you use it to solve Big Data applications. In this example, we are going to go over the anatomy of a MapReduce job, some programming considerations, and also a bit of history about MapReduce and why it's so important. Now what you see here is Eclipse opened up with a MapReduce job within it. Now Apache MapReduce is Java based, so you are going to have to know Java. If you don't know Java, you're going to have to learn it to work with MapReduce, although there are some tools that we could use down the road to help, such as Pig and Hive, which are non-Java-based tools for writing MapReduce jobs. But here let's look at some project settings. I'm going to go to Project - Properties and show you the two .jar files that you need to have in your path for this to work. So I have the hadoop-common-2.3.0.jar and I have the hadoop-mapreduce-client-core-2.3.0.jar. These are the only two that you need to get started.
The Eclipse SDK window is open and shows a MapReduce example. The Eclipse SDK window has the File, Edit, Source, Refactor, Navigate, Search, Project, Run, Window, and Help menus. Below the menus, the window is divided into three areas. The area to the left shows the Package Explorer directory structure. The area in the middle shows the Dictionary.java tabbed page. The area to the right shows the Outline section. The presenter clicks the Project menu to open the Project shortcut menu options and selects the Properties shortcut menu option to display the "Properties for MapReduce Example" dialog box. The dialog box shows the necessary jar files that are needed to run the MapReduce example.
Now as far as MapReduce goes, Hadoop did not invent MapReduce. MapReduce is a distributed programming algorithm that's been around for at least ten years or so. And let's talk a little bit about what it does. Up until fairly recently – around the year 2000 or so – there was always an exponential increase in the processing speed and power of CPUs; from the 80s to the 2000s, there seemed to be exponential growth in how fast computers got. So the way we process information really didn't change that much, because we would always assume that we would get an exponential leap in processing power with the new batch of processors that came out. Now right around the year 2000 or so, there seemed to be some stagnation in how fast these chips could get. So people started looking at different ways that we could process data. Rather than assuming that CPUs are going to get faster and faster, we started looking at the way we process data. In simple words, maybe we should look more at distributing the processing and the computation rather than having a single CPU do it all. And that's really the genesis of MapReduce. MapReduce has been around since approximately 2004 and, as the saying goes, "Hadoop did not invent MapReduce, but it did make it famous."
The presenter discusses various classes in the Dictionary.java tabbed page.
So what Apache MapReduce does is take the MapReduce algorithm and distribute the work across many different nodes in the cluster. Now when you're just starting out, you might just be using a single node cluster. So you might ask yourself what the real utility of this is, you know, if it's only running on a single node. And you are right that we are not actually distributing this across multiple nodes as you are learning it, but it does at least illustrate the basics of how the whole thing works. So let's go over the code. Again, it's a simple program and there is really not a lot to it. We have our imports, and as you can see we have some standard Java classes that we are importing. So we are going to need the java.io.IOException class and the java.util.StringTokenizer class. And these other imports here are from those two .jar files that I showed you previously. We are getting our MapReduce classes and we are getting some of our core Hadoop classes. Now as far as the code goes, MapReduce programs are Java applications. So being a Java application, we are going to have a main method.
The presenter discusses the main method in the Dictionary.java tabbed page.
So I'm going to navigate down to the main method and, as you can see, there are only really a few lines of code. When we're writing MapReduce programs, most of the functionality that is executed is hidden from us – sort of like when we are doing anything with XML, where an XML parser might actually hide most of its implementation. MapReduce works in the same way. The two main things that we're concerned with are the mappers and the reducers. Now in our main method – again, the main method is where the job actually starts running – we do really just a couple of things. We set some information that has to do with the system configuration, and then we set some job parameters in these lines of code here. Then we use the FileInputFormat to determine exactly the format of the data that is going into our mapper, and the opposite is also true where we set a FileOutputFormat for the output coming from our reducer.
The presenter discusses the WordMapper class in the Dictionary.java tabbed page. The WordMapper class reads as follows:
public static class WordMapper extends Mapper <Text, Text, Text, Text>
So as you can see, the code needed to actually get this to run is, as far as we're concerned, not that large or that vast. Now going up to the top of the program, we really have two static classes. We have our static class that does the mapping, as you can see highlighted here. Anytime that we define a mapper, what we want to do is extend the Mapper class, and you can see we have our code for reading our key-value pairs here. And then we have the code for our reducer.
The presenter discusses the Reducer class in the Dictionary.java tabbed page. The Reducer class reads as follows:
public static class AllTranslationsReducer extends Reducer <Text, Text, Text, Text>
And when we build a reducer, we always extend the Reducer class. That's pretty much it as far as what is exposed to the developer: we have our main method, we have our map section, and we have a reduce section. Now again, even if you are running this on a single node cluster, this is still a very good illustration of how MapReduce works. Now deploying this – actually building this MapReduce program, jarring it up, and installing it – is a much more complicated task, and it's a lot more difficult than just running this as a regular Java application. So actually jarring this up and deploying it is a whole different task from actually programming a MapReduce job.
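To make the structure the presenter walks through concrete, here is a minimal, self-contained sketch of a complete MapReduce job. It is illustrative only – it is not the presenter's Dictionary.java – and it uses the canonical word-count logic, but it has the same three visible parts: a mapper that extends Mapper, a reducer that extends Reducer, and a main method that sets the job parameters and the input and output formats.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper: emits a (word, 1) pair for every token it reads from the input.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer: sums the counts that arrive for each distinct word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The main method: sets the job parameters and the input/output formats, then submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Packaged into a jar, a job like this is launched with something like bin/hadoop jar wordcount.jar WordCount input output, where the jar name and the input and output paths are placeholders – that packaging step is the jarring and deployment work the presenter mentions.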