ETL stands for extract, transform, and load.
A typical ETL pipeline consists of a data source, followed by a transformation step that filters or cleans the data, and ends in a data sink.
In the case of Hadoop and Spark, an ETL flow can be described as follows:
- Data comes from various sources such as databases, Kafka, Twitter, etc.
- To extract meaningful insights, the data is filtered or cleaned using Spark, MapReduce, Hive, Pig, etc.
- Finally, after processing (transformation), the data is stored in a data sink such as HDFS or a Hive table.
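The three steps above can be sketched in plain Python. This is a toy illustration of the extract → transform → load pattern only; a real pipeline would use Spark DataFrames, MapReduce jobs, or Hive queries, and all function names and records here are hypothetical:

```python
# Toy ETL sketch (illustrative only; not a real Spark/Hadoop job).

def extract():
    # Extract: in practice this would read from a database, a Kafka
    # topic, the Twitter API, etc. Here we return hardcoded records.
    return [
        {"user": "alice", "clicks": 10},
        {"user": "bob", "clicks": None},   # dirty record (missing value)
        {"user": "carol", "clicks": 7},
    ]

def transform(records):
    # Transform: filter/clean the data by dropping records with
    # missing values (the kind of job Spark or Pig would do at scale).
    return [r for r in records if r["clicks"] is not None]

def load(records, sink):
    # Load: in practice the sink would be HDFS or a Hive table;
    # here it is just an in-memory list.
    sink.extend(records)

sink = []
load(transform(extract()), sink)
print(len(sink))  # → 2 (only the clean records reach the sink)
```

The key point is that each stage is independent: the sink never sees raw data, only records that have passed the transformation step.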
Hope this helps.