The open-source Hadoop Distributed File System (HDFS) resembles an ordinary file server, but its storage is distributed across several servers.

Files can be stored redundantly on several servers. If one of these servers fails, the lost copies are automatically re-replicated to another server, which makes HDFS very fault-tolerant. HDFS runs on inexpensive commodity servers; expensive server hardware is not needed.
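The degree of redundancy is controlled by the replication factor in Hadoop's `hdfs-site.xml` configuration file; the default is three copies of each block. A minimal sketch:

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

With a replication factor of 3, the cluster tolerates the loss of two servers holding the same block without losing data.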

Scalability and Price

HDFS is designed to run on very large clusters of up to thousands of machines. Since there are no license fees and commodity hardware is cheap, the system is inexpensive to scale.

Difference from an RDBMS

A classic Relational Database Management System (RDBMS) is designed for structured data: the tabular structure of the database must already be fixed at design time. The advantage of an RDBMS is that the query language (usually SQL) is executed on the servers, close to the data. This is very efficient because the data does not have to be transferred to the client for every query.

A classic file system is based on files and thus on unstructured data: documents, videos, music, or graphics can be stored on a file server. Programs that process these files do not run on the file server but on application servers or clients. If parallel processing is required, it must be programmed explicitly.

HDFS enables data to be processed directly on the servers that store it. The programs developed for this are automatically designed for parallel processing, because the MapReduce libraries encapsulate the parallelism. As the volume of data grows constantly, the cluster has to be extended. It is a great advantage that you only need to add new servers to the cluster while the software continues to run without any changes.
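The MapReduce model described above can be sketched with a toy word-count job. This is a local simulation, not the Hadoop API: in a real cluster the framework would distribute the map and reduce tasks across the servers holding the data, and the function names here are illustrative only.

```python
from itertools import groupby

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each word."""
    # The shuffle/sort step groups equal keys together before reducing.
    pairs = sorted(pairs)
    return {
        word: sum(count for _, count in group)
        for word, group in groupby(pairs, key=lambda p: p[0])
    }

documents = ["big data on hdfs", "hdfs stores big files"]
# Hadoop would run one map task per input split, in parallel.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(intermediate)
print(counts["hdfs"])  # 2
print(counts["big"])   # 2
```

Because the developer only writes the map and reduce functions, the same program runs unchanged whether the cluster has ten servers or a thousand; the framework handles distribution, scheduling, and fault recovery.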

There are extensions to Hadoop, such as Apache Hive and Apache HBase, that provide the capabilities of a Relational Database Management System on the Hadoop platform.