
BigData - Apache Hadoop


Apache Hadoop


history

Doug Cutting: Lucene -> Nutch


Doug Cutting + Yahoo: Nutch -> Hadoop: NDFS + MapReduce



basic

before


nowadays


data processing becomes a problem


hadoop

Hadoop

  • consists of 3 components: HDFS (storage), MapReduce (processing), and YARN (resource management)



Hadoop core

HDFS + MapReduce

HDFS - Storage unit


HDFS

  • replication
  • fault-tolerant
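
To make replication and fault tolerance concrete, here is a minimal sketch that splits a file into fixed-size blocks and puts several copies of each block on different nodes. The names `place_blocks` and `readable_after_failure` are hypothetical, the 128 MiB block size and replication factor of 3 are assumed values, and the round-robin placement is a simplification, not HDFS's actual placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)
REPLICATION = 3                 # assumed number of copies kept per block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Assign each block of a file to `replication` distinct nodes.

    Hypothetical round-robin placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a copy on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# A 300 MiB file on a four-node cluster is split into three blocks.
placement = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on more than one node, losing any single node in this sketch still leaves at least one copy of each block available.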



MapReduce


  • Apache Hadoop
    • an open-source framework for storing and processing big data across a cluster of machines.
    • It is based on the MapReduce programming model, which Google invented and published.
      • "Map function"
        • runs in parallel across a massive dataset to produce intermediate results.
      • "Reduce function"
        • builds the final result set from all of those intermediate results.
    • The term “Hadoop” is often used informally to cover Apache Hadoop itself as well as related projects such as Apache Spark, Apache Pig, and Apache Hive.
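
The map and reduce steps described above can be sketched in plain Python as a word count. This is a single-process illustration of the programming model, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example, and a real cluster would run the map calls on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map: emit an intermediate (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: combine all intermediate values for one key into a final result."""
    return word, sum(counts)


def word_count(lines):
    # Each line is mapped independently, so this loop is what could run in parallel.
    intermediate = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(intermediate)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


result = word_count(["big data big cluster", "data node"])
print(result)
```

The point of the split is that `map_words` only ever sees one line and `reduce_counts` only ever sees one key, so neither needs the whole dataset in one place.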



YARN



usages


  • used by many large companies for:
    • Data warehousing
    • recommendation system
    • fraud detection


This post is licensed under CC BY 4.0 by the author.
