What is Web Scraping? Let's Understand It Better

Redazione RHC : 11 November 2025 21:15

We have often reported on huge user databases being sold on underground forums and attributed them to web scraping.

In April 2021, the records of 533 million Facebook users were exposed, while in June 2021 the records of 700 million LinkedIn users were put up for sale: practically its entire user base, which currently amounts to 756 million users.

LinkedIn immediately clarified:

“Our teams have been investigating a series of alleged LinkedIn data leaks that were made available for sale. We want to be clear that this is not a data breach and that no private LinkedIn member data was exposed.”

But then, if everything is in order and there hasn’t been a breach, people rightly wonder: where did all this information come from? Has someone discovered a flaw in a database or a web API from which an attacker has extracted millions of user profiles?

These are more than legitimate questions, but to better understand the phenomenon and draw the right conclusions, we need to begin by understanding what web scraping is, which we will do in this article.

What is web scraping?

Web scraping is the process of automatically extracting data or gathering information from the World Wide Web. It is an evolving field that combines several techniques: downloading legitimately accessible information from web platforms, processing the text, interpreting its meaning, and using artificial intelligence to organize the information correctly and coherently in a database.
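To make the "download, parse, organize into a database" pipeline concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how any particular scraper works: the HTML snippet, the `profile`, `name`, and `job` class names, and the `ProfileParser` and `scrape_to_db` helpers are all hypothetical, and the page is an inline string rather than a real download.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="profile"><span class="name">Alice Example</span><span class="job">Engineer</span></li>
  <li class="profile"><span class="name">Bob Example</span><span class="job">Analyst</span></li>
</ul>
"""


class ProfileParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="job"> elements."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished (name, job) rows
        self._field = None   # field currently being read, if any
        self._current = {}   # fields collected for the current <li>

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "profile":
            self._current = {}
        elif tag == "span" and cls in ("name", "job"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside one of the targeted spans.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.records.append((self._current.get("name"), self._current.get("job")))
            self._current = {}


def scrape_to_db(html):
    """Parse the page and store the extracted rows in an in-memory SQLite table."""
    parser = ProfileParser()
    parser.feed(html)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE profiles (name TEXT, job TEXT)")
    db.executemany("INSERT INTO profiles VALUES (?, ?)", parser.records)
    return db


db = scrape_to_db(SAMPLE_HTML)
rows = db.execute("SELECT name, job FROM profiles ORDER BY name").fetchall()
print(rows)
```

Once the data sits in a table, it can be queried like any other database, which is what makes bulk-collected public information so easy to search.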

In fact, it’s not much different from what search engines constantly do with their “crawling” activities, downloading pages, analyzing them, and classifying them using advanced artificial intelligence algorithms within their immense distributed databases.

Crawling refers to the process used by major search engines when they send their crawlers (like Googlebot) online to index web content. Scraping, on the other hand, is a structured activity designed to extract targeted data from a given website.

Web scraping activities are performed for statistical surveys, marketing purposes, and to gain a competitive advantage, as companies can learn about their competitors’ strategies well in advance.

Some websites use methods to prevent web scraping, such as detecting bots and blocking them from viewing their pages.
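One simple way a site might detect bot-like behavior is to count how many requests each client makes within a time window. The sketch below is a hypothetical sliding-window rate limiter; the `RateLimiter` class, the thresholds, and the example client address are assumptions chosen for illustration, and real bot detection typically combines many more signals.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests: treat as bot-like
        hits.append(now)
        return True


# Hypothetical client making requests at the given timestamps (in seconds).
limiter = RateLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 12.0)]
print(decisions)
```

In this run the fourth request is refused because three requests already fall inside the ten-second window, and the client is allowed again once those have expired.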

Illegal Uses

As with all things, especially on the internet, there are activities carried out for legitimate purposes and those carried out for illicit purposes, and web data scraping is no exception. In the illicit case, it is used to build entire databases of information for commercial exploitation, or vast databases of user profiles that are classified, indexed, and categorized.

This typically happens on social networks, where most people actually share information about their private lives, including sensitive information like addresses, phone numbers, jobs, and more.

As we’ve seen, scraping involves downloading web pages from a specific site in bulk. In theory this can be done manually, but at scale that is virtually impossible, so the activity is automated through purpose-built “bots” that download public information from profiles and then organize it so it can be easily searched within a database.

Data correlation

As we have seen previously in several articles (for example in the article “every data leak is everyone’s problem”), such information can be “enriched” by using the famous collections of previous data breaches, and then correlating them with each other to generate a precise “fingerprint” of a given identity.

As mentioned, some information about a person doesn’t change very easily, so a data leak/breach, even five years old, is an invaluable source of information and correlations. For example, who changes their phone number, home address, or email address every five years?
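To see why old leaks keep their value, consider how little it takes to correlate two datasets on a stable field. The following sketch uses entirely fictional records and a hypothetical `enrich` helper that joins on the email address; it is a toy illustration of the risk, assuming both datasets share that field.

```python
def enrich(scraped, leaked, key="email"):
    """Merge leaked records into scraped ones that share the same key value."""
    leaked_by_key = {}
    for record in leaked:
        # Compare keys case-insensitively, since emails are often stored inconsistently.
        leaked_by_key.setdefault(record[key].lower(), {}).update(record)

    merged = []
    for record in scraped:
        combined = dict(leaked_by_key.get(record[key].lower(), {}))
        combined.update(record)  # scraped (more recent) values win on conflict
        merged.append(combined)
    return merged


# Fictional data: a freshly scraped public profile and an entry from an older leak.
scraped = [{"email": "alice@example.com", "name": "Alice Example", "job": "Engineer"}]
leaked = [{"email": "Alice@Example.com", "phone": "555-0100"}]

profiles = enrich(scraped, leaked)
print(profiles)
```

The phone number from the older leak ends up attached to the current public profile, which is exactly why a field that rarely changes makes such a strong link between datasets.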

Indeed, it’s not uncommon to find scraped website data for sale already enriched with other data leaks or breaches. All this is used to learn more about a person and then attack them through targeted phishing, SIM swapping, credential theft, and much more. And since cybercrime prizes speed and low cost, having access to a vast collection of information on specific user groups is highly attractive.

How to protect ourselves from web scraping

There is virtually no protection from web scraping.

The important thing to understand is that when you publish information online, it stays there forever, and in truth, anyone who stumbles upon that particular web page will be able to read, download, archive, and analyze it.

The only way to stop web scraping is to avoid posting content on a website or social network, and to stop information leaking beyond your own connections, for example by not keeping your profile public.

On the website side, an advanced bot management solution can help block scrapers almost completely, although in practice it is not that simple.

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.
