What is Apache Hadoop?

Apache Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. Hadoop is an Apache project and is released under the Apache License.
MapReduce is a programming model for large-scale data processing: it processes and generates large data sets with a parallel, distributed algorithm on a cluster. In a MapReduce program, the map function is applied to each input record to produce a set of intermediate key/value pairs, and the reduce function is then applied to all the intermediate values that share the same key to produce the output. Hadoop's MapReduce implementation takes advantage of data locality: tasks are scheduled on the nodes where their input data is already stored, so the computation moves to the data rather than the data moving across the network, which improves performance.
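To make the model concrete, here is the classic word-count program written against Hadoop's Java MapReduce API. It is a minimal sketch: the input and output paths are supplied as command-line arguments, and class names such as WordCount are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum all the counts emitted for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The combiner is an optional optimization: it pre-aggregates counts on each mapper's node before anything is sent over the network. The job is launched with the hadoop jar command, passing the input and output directories as arguments.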

Apache Hadoop's HDFS is a distributed file system that splits files into large blocks and replicates each block across multiple machines (three copies by default), which provides high availability and reliability. Because HDFS stores data on commodity machines and serves it from many nodes in parallel, it delivers very high aggregate bandwidth across the cluster. HDFS is designed to work with MapReduce: among other things, it exposes the location of each block so that the scheduler can run tasks near their input data.
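Applications talk to HDFS through the org.apache.hadoop.fs.FileSystem API. The following is a minimal sketch that writes and then reads back a small file; it assumes a cluster already configured via core-site.xml, and the path /tmp/hello.txt is purely illustrative.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // illustrative path

        // Write: the client streams data to DataNodes; HDFS replicates the blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode supplies block locations; bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
          IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
      }
    }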

The wider Hadoop ecosystem includes a number of other components, such as a cluster resource manager (YARN, which is part of Hadoop itself), a column-oriented database built on top of HDFS (HBase), and a machine learning library (Apache Mahout).

Apache Hadoop is often used in conjunction with Apache Spark, a fast, general-purpose cluster computing engine that can read data from HDFS and run on YARN. Hadoop can be deployed on a single node for development and testing or on a cluster of thousands of nodes in production. It is fault-tolerant, horizontally scalable, and straightforward to operate.
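As an illustration of the single-node case, the standard pseudo-distributed setup needs only a couple of configuration entries. This is a minimal sketch; port 9000 and a replication factor of 1 follow the values commonly shown in the Hadoop single-node setup documentation rather than hard requirements.

    <!-- core-site.xml: point clients at the local NameNode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: a single node can only hold one copy of each block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>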


