Traditional relational database management systems (RDBMS) and Hadoop-based systems offer similar functionality for collecting, storing, processing, retrieving, extracting and manipulating data. However, they take radically different approaches to data processing and set out to solve different problems.
RDBMS systems focus on traditional problems, taking an operational, general-purpose approach to structured data such as banking or sales transactions, location information, and so on. Hadoop, which was created more recently, targets less conventional problems and can therefore handle structured, semi-structured or even unstructured data, for example text, video, audio, Facebook posts, or click tracking.
RDBMS technology has been extensively tested, is highly consistent, and is available in very mature systems backed by companies with long experience, such as Oracle, IBM, Teradata or Microsoft, to name just a few. Highly reliable systems based on free software also exist, as is the case with PostgreSQL. Hadoop, by contrast, has a much shorter history and is open source; thanks to the demand for systems that can deal with big data problems, it has gained significant traction in recent years.
RDBMS is very well suited to well-structured data held in relatively small or medium-sized tables, with a schema defined by an entity-relationship (ER) model, which makes these systems particularly consistent and appropriate for online transaction processing (OLTP). But this emphasis on availability and consistency is also their main weakness when handling large volumes of information, in particular when the limits of the hardware force you to partition the data. As the CAP theorem (or Brewer's theorem) warns, a distributed data system has to compromise: it can guarantee at most two of the three desirable properties of consistency, availability and partition tolerance. Having favored the first two, RDBMS-type systems do not count scaling out to very large data volumes among their strengths.
Hadoop, on the other hand, aims for consistency and the ability to handle extremely large volumes of data through horizontal scaling and logical partitioning of the data, at the expense of availability. That is why Hadoop is generally used to process information in batches or streams, where the data is handled sequentially in large blocks, and is not intended to provide operational services on that data.
The Hadoop architecture lets you easily add storage and processing nodes in a distributed way, and these nodes can process data blocks in parallel. It also replicates each data block across several nodes. Allowing for the obvious differences, this is analogous to what a RAID disk system does, although here the blocks are much larger (on the order of 64 MB to 128 MB) than a typical disk cluster, which is on the order of 4 KB.
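To make the block sizes above concrete, the following minimal sketch estimates how many blocks a file is split into and how much raw storage its replicas take up. The function name block_layout is hypothetical, and the replication factor of 3 is an assumption chosen for illustration; the text above only says "several nodes".

```python
import math

def block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for a single file.

    block_size_mb follows the 64 MB to 128 MB range mentioned in the text;
    replication=3 is an assumed value, not taken from the article.
    """
    # The last block may be only partially filled, hence the ceiling.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every byte of the file is stored once per replica.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb

blocks, raw_mb = block_layout(1024)
print(blocks, raw_mb)  # 8 3072
```

Under these assumptions a 1 GB file becomes 8 blocks that can be processed in parallel, at the cost of three times the raw disk space.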
There are also notable differences in hardware scalability: Hadoop is not only highly scalable, it also scales on standard or low-cost hardware, with each new node expanding storage and processing capacity at the same time. This is not the case in RDBMS-based systems, where scaling processing capacity and data volume is generally achieved by buying more expensive hardware.
MapReduce is a distributed programming model for processing massive amounts of data in parallel. It was developed as a scalable and fault-tolerant alternative for massive data processing (Parrales Bravo, et al., 2010). The MapReduce model is based on two fundamental functions:
The map() function works on large volumes of data, which are divided into two or more parts, each containing a set of records or lines of text. A map() function is executed separately for each part to compute a set of intermediate key/value pairs from each record. MapReduce then groups the values by intermediate key and sends them to the reduce() function.
The reduce() function is executed for each intermediate key, over the list of values grouped under that key. The final result is obtained by collecting and interpreting the results of all the executed processes.
MapReduce operates exclusively with key/value pairs, that is, the framework sees the work input as a set of key/value pairs and produces sets of key/value pairs as work output.
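The flow described above can be sketched in a few lines of plain Python. This is a single-process illustration of the model using a word count, not Hadoop's actual API; the names map_fn, shuffle, reduce_fn and run_job are hypothetical, and in a real cluster each split would be handled by a separate map task on a different node.

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit one (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce step: collapse the values of one key into a single output pair."""
    return key, sum(values)

def run_job(splits):
    """Run map over every split, group by key, then reduce each group."""
    intermediate = []
    for split in splits:  # each split stands in for one independent map task
        for line in split:
            intermediate.extend(map_fn(line))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())

# Two input splits, each a list of text lines.
splits = [["big data big"], ["data processing"]]
result = run_job(splits)
print(result)  # {'big': 2, 'data': 2, 'processing': 1}
```

Both the input to reduce_fn and its output are key/value pairs, which is what allows the framework to distribute the work without knowing anything about the data itself.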