Ever since the invention of the World Wide Web, web scraping has been one of its most integral facets. It is how search engines are able to gather and display hundreds of thousands of results instantaneously. It is also how companies build databases, develop marketing strategies, generate leads, and so on.
While its potential is immense, there are also concerns regarding the legality of web scraping. Thanks to some high-profile cases (which we’ll look at later in this article) and some recurring legal issues, “Is web scraping legal?” is one of the most frequently asked questions. The answer? Well, it depends on the use case.
For SaaS and DaaS providers, and data-driven businesses alike, it is important to have a clear understanding of all aspects of web scraping. In this post, we look at its legal aspect and try to give you an overview of:
- Common misconceptions
- Things to consider
- Current legislation
- Frequent legal issues
- Some high-profile cases
Commonly held misconceptions
The general belief is that everything you see online is free to scrape and reuse. This is probably the biggest misconception about web scraping, and acting on it could land any individual or company in legal hot water.
The question of web scraping’s legality isn’t as black and white as one might assume — there’s also an ethical side that one must be aware of and familiar with. Knowing what kind of data is legal, illegal, or somewhere in between will help your decision-making. It can help you avoid any unintended and unnecessary consequences.
Things to consider
There are a few things one needs to consider before and after scraping any data.
Types of data
In most cases, the degree of ease with which any web data is accessible more or less determines where the data lies on the legality spectrum.
Public data
Scraping data from public sites is generally legal. Public data is the information on a website that can be accessed without logging in or authenticating one’s identity. Some examples of such websites are ecommerce platforms like Amazon and Best Buy.
Although these data sources might try to protect public information by posing various obstacles to scrapers and crawlers, extracting data points off them is absolutely fine.
Private or personal data
Any data that can reveal a person’s identity, such as their name, address, date of birth, medical and financial details, and contact information, is called Personally Identifiable Information, or PII.
As a general rule, it is illegal to scrape any personal information without the person’s consent or another lawful reason. The EU and California currently have the strictest laws governing web scraping in this respect.
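One practical way to reduce exposure is to strip personal fields from scraped records before storing them. The sketch below is illustrative only: the field names in `PII_FIELDS` and the `redact_record` helper are hypothetical, and the email pattern is deliberately simple. It is not a complete PII detector and does not replace legal review.

```python
import re

# Hypothetical field names treated as personal data; adjust to your own schema.
PII_FIELDS = {"name", "address", "date_of_birth", "email", "phone"}

# Deliberately simple pattern; it will not catch every valid email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_record(record: dict) -> dict:
    """Return a copy of `record` without PII fields and with emails masked in text."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS:
            continue  # drop the personal field entirely
        if isinstance(value, str):
            # mask email addresses that appear inside free-text fields
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned
```

Calling `redact_record` on each scraped item before it is written to storage keeps fields such as `email` out of the dataset while leaving product data like `price` untouched.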
Copyrighted data
It is illegal to scrape and reuse openly accessible content such as images, songs, and articles that is the intellectual property of a business or individual. Because the owners have full control over its use and reproduction, scrapers require explicit consent in order to extract it. As a workaround, you can use snippets of the content, or cite and credit the sources when you use it.
Website Terms of Service (ToS)
Before scraping any website for its data, one needs to be aware of its policies regarding its access. If they explicitly contain any scraping restrictions, then it is worth assuming that scraping would constitute a breach of their ToS. In addition, even if there aren’t any such policies, one should be wary that their content may still be copyrighted.
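A website's ToS has to be read by a person, but many sites also publish a machine-readable `robots.txt` file that signals which paths they want crawlers to avoid. The sketch below uses Python's standard `urllib.robotparser` to check a URL against such rules. Note the assumptions: `robots.txt` is not the ToS and passing this check does not settle any legal question, and the `EXAMPLE_ROBOTS` rules, the `example-bot` user agent, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS)
print(is_allowed(parser, "example-bot", "https://example.com/products/1"))
print(is_allowed(parser, "example-bot", "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of the inlined text.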
Scraping behind a subscription or login
Services like LinkedIn require users to have an account before any data is visible. When signing up with these services, you’re almost always agreeing to their terms which forbid scraping their data.
Since scraper bots and crawlers use your account credentials to gain access to the data, the service provider can easily identify and ban you from their platform altogether. Hence, it is advised to stay away from this option and instead try to find publicly available data.
Current legislation
As there currently aren’t any clear laws determining the legality or otherwise of web scraping, lawsuits are handled on a case-by-case basis. Having said that, the General Data Protection Regulation (GDPR) and the US Privacy Act are referred to in most cases in Europe and the US respectively.
GDPR
The GDPR came into effect in May 2018 and protects the personal details of people within the European Economic Area (EEA). Some examples of personal data include people’s names, emails, phone numbers, dates of birth, IP addresses, credit card and bank details, medical records, and multimedia like photos, audio, and videos.
The GDPR classifies the protection of personal data as a “fundamental right”. As such, it prohibits the processing of personal data unless it is done under one of six lawful bases: consent, contract, public task, vital interest, legitimate interest, or legal requirement. When the processing is based on consent, the data subject has the right to revoke it at any time.
Furthermore, data controllers must clearly disclose any data collection, declare the lawful basis and purpose, and state how long they plan to retain the data. They must also disclose whether the data will be shared with third parties or transferred outside of the EEA.
US Privacy Act
While the US doesn’t have one federal regulation that legislates data privacy and protection like the EU, there are several industry-specific legal acts, such as the GLBA for finance, HIPAA for healthcare, and COPPA for children’s data.
In 2018, however, California passed a state law, the California Consumer Privacy Act (CCPA), which took effect in January 2020. It requires companies collecting personal data to explicitly disclose how they intend to use that data. It also allows consumers to remove their information or opt out of data collection. The same rules also apply to data scraping companies.
Comparison
The GDPR and CCPA both allow consumers to access and remove their personal information and to opt out altogether at any given time. However, users can edit their data under the GDPR, but not under the CCPA. Similarly, the CCPA only asks for a privacy notice on websites, while the GDPR requires explicit user consent.
Frequent legal issues
The following are some of the most common offenses and issues in the context of web scraping.
Copyright infringement
As mentioned above, although scraping any openly accessible data may be legal, there may be certain restrictions and legal consequences if the data is copyright-protected. If you want to abide by the legal framework, you cannot publish or use any such data for commercial purposes.
Any infringement of copyrighted data is prosecutable, regardless of how you access and collect the data.
Computer Fraud and Abuse Act violation
The CFAA was enacted in 1986, as an amendment to a 1984 computer fraud statute, to prohibit unauthorized access to computers and networks. It was originally drafted to protect military, financial, and other sensitive data, and was later extended to cover private information more broadly.
The CFAA doesn’t apply to web crawlers and scraping techniques that access only publicly available information.
Trespass to chattels
A trespass to chattels (or site security) occurs when a website or its servers are interfered with or harmed in any way. In web scraping, a crawler that sends requests too frequently can affect the target website’s performance by slowing down or even crashing its server.
From a legal standpoint, the site owners might consider the frequent requests as an intentional attack on their system. As a result, it is important and morally responsible for DaaS providers to build scrapers that don’t harm the target website.
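One common way to keep a scraper from overloading a site is to enforce a minimum delay between consecutive requests. The following is a minimal sketch of that idea, not a complete politeness policy: the `Throttle` class is hypothetical, and a suitable interval depends on the target site. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Sleep until `min_interval` has passed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # record when this request is allowed to proceed
        self._last_request = self._clock()
        return delay
```

A scraper would create one instance per target site, for example `throttle = Throttle(min_interval=2.0)`, and call `throttle.wait()` immediately before each request. The two-second value is an arbitrary illustration.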
High-profile cases
As mentioned earlier, there are a few historical cases which have set legal precedents in web scraping lawsuits.
eBay vs Bidder’s Edge (1999)
Bidder’s Edge, a website that gathered auction listings, sent 100,000 daily requests to eBay’s servers to access its ongoing auctions. eBay argued that this traffic burdened its systems. In late 1999, eBay sought an injunction against Bidder’s Edge, alleging trespass to chattels.
Although both parties later settled the case out of court for an undisclosed amount, it set a legal precedent for future cases.
hiQ Labs vs LinkedIn (2019)
This historic case started when hiQ Labs, a data analytics company, sued LinkedIn for prohibiting it from scraping public profiles on LinkedIn. hiQ Labs used the data to advise employers about job applicants.
In 2019, the Ninth Circuit Court of Appeals ruled that the CFAA did not apply since the data was publicly available and not copyrighted. As a result, LinkedIn wasn’t able to prevent hiQ Labs from accessing its public profiles. LinkedIn did, however, restrict user profiles so that they are accessible only after logging in.
It is worth mentioning that, at the time of the ruling, the case was far from concluded, as LinkedIn continued to pursue the matter with the US Supreme Court.
Update (April 2022):
In its second ruling on 18 April 2022, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act. The CFAA governs what constitutes computer hacking under U.S. law. [via TechCrunch]
Update (December 2022):
On 6 December 2022, hiQ Labs and LinkedIn reached a confidential settlement agreement, thus ending their long-running litigation.
Summary
Since the question of web scraping isn’t black or white, you must analyze each use case thoroughly to avoid any unintended consequences. You need to consider existing legislation, the types of data you collect, the data sources’ terms and policies, and the ethical use of the data after extraction.
Here at Grepsr, we take our web scraping responsibilities extremely seriously. We adhere to all legal frameworks before, during, and after taking on a data acquisition project. We also follow the best ethical practices and stay within the bounds of the legality of web scraping. That means we avoid disrupting our target websites’ performance, all whilst continuing to deliver the most accurate and reliable data to all our customers.