Antonino Battaglia : 3 September 2025 07:39
Today, for all large companies, data represents a strategic resource of primary importance. The insights extracted from it inform decision-making strategies that, in every sector, can improve a company's operations and profits. Similar considerations apply within the Public Administration, where data is not only a tool for improving internal efficiency and decision-making processes but, if made available to the community, can also generate social and economic value.
Two fundamental phenomena emerge in this context, different in methods and purposes but united by the need to manage large amounts of information: Open Data and Big Data. Like all technologies that handle a large quantity and variety of data, both must guarantee the security, integrity, and availability of the information. When we talk about Open Data, we are referring to public information published in an open, standardized, and freely available format, free of copyright restrictions.
Given its open nature, the data must be compatible with different technologies, relying on open, interoperable formats such as CSV or JSON, on APIs for accessing datasets, or on Linked Open Data. Linked Open Data deserves a closer look: it leverages semantic technologies to structure data, enabling a deeper understanding of the meaning and context of the information. This approach rests on a few fundamental principles: URIs (Uniform Resource Identifiers) uniquely identify each resource, and resources must be interconnected via links, allowing navigation between them.
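As a brief aside before the semantic layer, here is a minimal sketch of the simpler access path: fetching an open dataset published as JSON through an HTTP API. The URL and record layout are purely illustrative placeholders, not a real portal endpoint.

```python
import requests

# Hypothetical open data endpoint: replace with a real portal URL.
URL = "https://opendata.example.org/api/datasets/air-quality.json"

response = requests.get(URL, timeout=10)
response.raise_for_status()          # fail loudly on HTTP errors
records = response.json()            # assume the portal returns a JSON list of records

# A typical next step: inspect a few records before further processing.
for record in records[:5]:
    print(record)
```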
To ensure that information can be interpreted correctly, various standards are used, including, without going into detail, RDF (Resource Description Framework), which allows for the structured representation of information on the web, and OWL (Web Ontology Language), which defines the relationships between data, facilitating their understanding.
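To make the Linked Open Data principles concrete, here is a minimal sketch using the rdflib library: every resource is identified by a URI, statements are RDF triples, and a link between two URIs allows navigation from one resource to another. The example namespace and property names are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

# Illustrative namespace: a real dataset would use its own dereferenceable URIs.
EX = Namespace("http://example.org/opendata/")

g = Graph()
dataset = EX["dataset/air-quality"]
publisher = EX["org/city-council"]

g.add((publisher, RDF.type, FOAF.Organization))
g.add((publisher, FOAF.name, Literal("City Council")))
g.add((dataset, EX.publishedBy, publisher))   # the link connecting two resources

# Serialize the graph in Turtle, one of the standard RDF syntaxes.
print(g.serialize(format="turtle"))
```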
Open Data faces several challenges, primarily data quality, since published data is often incomplete or outdated. Moreover, the mere publication of data (in the case of services offered by the public administration) does not guarantee its use by citizens. In addition, since the data often relates to individuals, their privacy must be respected and protected. Training and raising public awareness about the importance and use of open data can, however, contribute to greater engagement and more effective use of the available information.
When the data is not necessarily public and requires technology capable of managing large quantities, often in raw format, we can talk about Big Data. Reference is commonly made to the 5Vs: volume (the sheer amount of data), velocity (the speed at which data is generated and processed), variety (the heterogeneity of formats and sources), veracity (the reliability of the data source), and value (the usefulness of the data for the party analyzing it).
Among the most important technologies in the context of Big Data infrastructure, we find NoSQL databases (such as MongoDB), Apache Hadoop, and Apache Spark, which we examine below.
Without claiming to be exhaustive, we will try to provide some insight into the differences between these technologies. Unlike relational databases, NoSQL databases do not impose a rigid schema: the burdensome constraints of the relational model are traded for faster, more dynamic operations. These databases are built on distributed architectures, in which data can be stored across multiple processing nodes, and the strict transaction-based model gives way to one that favors data flexibility.
These technologies store data according to several models. One is the Key/Value model, in which objects are stored in structures called buckets as lists of key-value pairs. Another is the Document model, in which objects are saved as documents whose structure mirrors that of Object-Oriented (OO) programming.
The main advantage of the Document model is its close fit with the data structures of modern applications; MongoDB is among its best-known implementations. A final model is the Columnar model, in which data is organized by column rather than by row. Although the change may seem merely formal, it improves how data is distributed within the storage space.
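To fix ideas, here is a toy, in-memory Python sketch of how the same kind of record might be laid out under the three models just described; it is a conceptual illustration, not a real database engine.

```python
# Key/Value model: an opaque value stored in a "bucket" under a key.
bucket = {
    "user:42": '{"name": "Ada", "city": "Rome"}',
}

# Document model: the value is a structured document whose nested shape
# mirrors the objects the application already manipulates.
documents = {
    "user:42": {"name": "Ada", "address": {"city": "Rome", "zip": "00100"}},
}

# Columnar model: values are grouped per column rather than per row,
# which changes how the data is distributed within the storage space.
columns = {
    "name": ["Ada", "Grace"],
    "city": ["Rome", "Turin"],
}

print(bucket["user:42"])                          # whole value retrieved by key
print(documents["user:42"]["address"]["city"])    # navigate inside the document
print(columns["city"])                            # scan a single attribute
```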
We mentioned MongoDB, a document-oriented database designed to manage large volumes of unstructured or semi-structured data. MongoDB uses BSON (Binary JSON), a binary representation that allows the documents in a collection to be homogeneous or non-homogeneous in structure. Documents are organized into collections, on which indexes can be created to improve performance.
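A minimal sketch with the pymongo driver, assuming a MongoDB instance reachable at localhost:27017; the database, collection, and field names are illustrative.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
datasets = client["open_data_demo"]["datasets"]

# Documents in the same collection may have non-homogeneous structures.
datasets.insert_many([
    {"title": "Air quality 2024", "format": "CSV", "tags": ["environment"]},
    {"title": "City budget", "format": "JSON", "rows": 12840},
])

# An index on a frequently queried field improves read performance.
datasets.create_index([("format", ASCENDING)])

for doc in datasets.find({"format": "CSV"}):
    print(doc["title"])
```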
Recall that, to date, Apache Hadoop is one of the most widely used frameworks for managing and archiving large amounts of data. Its main feature is its ability to break down large problems into smaller elements, dividing large amounts of data into smaller, manageable pieces. Architecturally, Hadoop leverages the principle that failures must be handled at the application level and uses server clusters or virtual servers for its implementation.
Hadoop is primarily made up of two fundamental components:
The first module is the Hadoop Distributed File System (HDFS), a distributed file system that stores data on commodity hardware, providing very high aggregate bandwidth. A distributed file system is necessary when the amount of data to be stored exceeds what a single machine can hold; the data must therefore be partitioned across multiple machines, with recovery mechanisms in case a machine fails. Compared to ordinary file systems, distributed ones require network communication and are more complex to develop.
HDFS has a master/slave architecture. Within a cluster, there is a single master node (the NameNode), a server that governs the file system namespace and manages client access. File data is stored in DataNodes, typically one per node in the cluster. Internally, a file is broken into one or more blocks, and these blocks are stored across a set of DataNodes.
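As an illustration, here is a minimal sketch of interacting with HDFS from Python through pyarrow, assuming a NameNode reachable at namenode:8020 and a locally configured libhdfs client; the hostname and paths are placeholders.

```python
from pyarrow import fs

# Connect to the NameNode, which governs the namespace and client access.
hdfs = fs.HadoopFileSystem("namenode", port=8020)

# List files in a directory; their blocks physically live on the DataNodes.
for info in hdfs.get_file_info(fs.FileSelector("/data")):
    print(info.path, info.size)

# Read part of one file back through the distributed file system.
with hdfs.open_input_stream("/data/air_quality.csv") as stream:
    print(stream.read(200))
```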
The second module is MapReduce, a programming model for processing large amounts of data, designed so that computations can be performed in parallel. It is based on the "divide and conquer" principle, in which a large problem is broken down into independent subproblems that can be solved in parallel. Essentially, the Map phase generates intermediate key-value pairs, while the Reduce phase aggregates the values associated with each key into the final result.
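The classic word count illustrates the model. The sketch below is a single-machine Python simulation of the three logical phases (map, shuffle, reduce), not Hadoop code; on a cluster, the framework would run the map and reduce functions in parallel over the data blocks.

```python
from collections import defaultdict

def map_phase(document):
    # The mapper emits an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # The framework groups intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # The reducer aggregates all values that share the same key.
    return key, sum(values)

documents = ["open data for all", "big data and open data"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)   # {'open': 2, 'data': 3, 'for': 1, 'all': 1, 'big': 1, 'and': 1}
```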
In addition to the two modules just described, we can also highlight the presence of the YARN and Hadoop Common modules. YARN is the cluster's resource manager and schedules jobs, while Common provides the shared libraries the other modules need to run.
Emerging from the Hadoop ecosystem, Apache Spark has specific features designed to deliver the computational speed, scalability, and programmability needed for big data, particularly streaming data, graph data, analytics, and machine learning. Unlike Hadoop, Spark allows for fast, optimized queries by leveraging in-memory processing. Scalability is achieved by dividing processing workflows across large clusters of computers.
Spark features a hierarchical primary/secondary architecture. The primary node, called the driver, coordinates the secondary (worker) nodes and returns results to the application client. The main program of a Spark application (the driver program) contains a SparkContext object, which communicates with the cluster's resource manager to request a set of resources (RAM, cores, and so on) for the executors.
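A minimal PySpark sketch of this setup: the driver program creates a SparkSession whose SparkContext negotiates executor resources with the cluster manager. The local master URL and the memory/core values are illustrative; on a real cluster the master would point to YARN, Kubernetes, or a standalone master.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-sketch")
    .master("local[4]")                     # illustrative: a local 4-core "cluster"
    .config("spark.executor.memory", "2g")  # RAM requested per executor
    .config("spark.executor.cores", "2")    # cores requested per executor
    .getOrCreate()
)

sc = spark.sparkContext                     # the SparkContext described above
print(sc.applicationId)

spark.stop()
```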
A key aspect of Spark's operation is the creation of distributed datasets, known as Resilient Distributed Datasets (RDDs). RDDs are resilient (and therefore rebuildable in the event of a failure), immutable (each operation creates a new RDD), and partitioned to allow parallel processing. Spark loads data by referencing a data source or by parallelizing an existing collection with SparkContext's parallelize method, storing the data in an RDD for processing. Once the data is loaded into an RDD, Spark performs transformations and actions on it in memory, which is the key to Spark's speed. Spark also keeps the data in memory unless the system runs out of memory or the user explicitly chooses to persist it to disk.
Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes in the cluster. Users can perform two types of operations on RDDs: transformations and actions. Transformations are operations applied to create a new RDD, while actions are used to instruct Apache Spark to apply the computation and return the result to the driver.
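The following sketch shows these ideas in PySpark: parallelize distributes a local collection into a partitioned RDD, map and filter are lazy transformations that each return a new RDD, cache keeps the partitions in memory, and collect and reduce are actions that trigger the computation and return results to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").master("local[2]").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection into an RDD with 4 logical partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy: each returns a new, immutable RDD.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Keep the result in memory for reuse across actions.
evens.cache()

# Actions trigger the computation and bring results back to the driver.
print(evens.collect())                     # [4, 16, 36, 64, 100]
print(evens.reduce(lambda a, b: a + b))    # 220

spark.stop()
```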
An operational feature of Spark is the Directed Acyclic Graph (DAG). Unlike Hadoop's two-part MapReduce execution process, a graph is created to orchestrate and schedule the worker nodes within the cluster. This is attractive because failures and errors can be recovered by replaying the operations recorded in the graph from a previous state. At the core of parallel data processing is the Spark Core module, which handles scheduling and optimization and provides the core functionality on which Spark's libraries, including GraphX for graph data processing, are built.
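The lineage that the DAG scheduler relies on can be inspected directly: toDebugString prints the chain of operations from which any lost partition can be recomputed. A short, self-contained sketch (the word-count pipeline is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").master("local[2]").getOrCreate()
sc = spark.sparkContext

rdd = (
    sc.parallelize(["open data", "big data", "open data"])
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# The recorded lineage: if a partition is lost, Spark replays these steps.
print(rdd.toDebugString().decode())
print(rdd.collect())   # [('open', 2), ('data', 3), ('big', 1)] in some order

spark.stop()
```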
“Spark vs. Hadoop” is a frequently searched term on the web, but, as noted above, Spark is more of an enhancement to Hadoop, and more specifically, to Hadoop’s native data processing component, MapReduce. In fact, Spark is based on the MapReduce framework, and most Hadoop distributions today support Spark. Like MapReduce, Spark allows programmers to write applications that process massive data sets faster by processing portions of the data set in parallel across large computer clusters. However, while MapReduce processes data on disk, adding read and write times that slow down processing, Spark performs calculations in memory. As a result, Spark can process data up to 100 times faster than MapReduce.
We can say that, although Open Data and Big Data are conceptually distinct, their integration can generate significant synergies. Combining data from public datasets, accessible to all, with heterogeneous information collected from different sources offers unique opportunities for analysis and innovation. This combination not only enriches the information landscape but also enables the development of more effective and targeted solutions that can address specific societal and market needs. In an era of exponential growth in data, the ability to integrate and analyze these resources becomes crucial to making informed decisions and promoting sustainable progress.