Written by Asmit Joshi on May 17, 2021
Ever since the invention of the world wide web, web scraping has been one of its most integral facets. It is how search engines are able to gather and display hundreds of thousands of results instantaneously, and how companies build databases, develop marketing strategies, generate leads and so on.
While its potential is immense, there are also concerns regarding the legality of web scraping. Thanks to some high-profile cases (which we’ll look at later in this article) and some recurring issues, “Is web scraping legal?” is one of the most frequently asked questions. The answer? Well, it depends on each use case.
For SaaS and DaaS providers, and data-driven businesses alike, it is important to have a clear understanding of all aspects of web scraping. In this post, we look at its legal aspect and give you an overview of:
- Common misconceptions
- Things to consider
- Current legislation
- Frequent legal issues
- Some high-profile cases
Commonly held misconceptions
The general belief is that everything you see online is free to scrape and re-use. This is probably the biggest misconception out there regarding web scraping, and could land any individual or company in legal hot water.
The question of web scraping’s legality isn’t as black and white as one might assume — there’s also an ethical side that one must be aware of and familiar with. Knowing what kind of data is legal, illegal or somewhere in between will help your decision making, and help you avoid any unintended and unnecessary consequences.
Things to consider
There are a few things one needs to consider before and after scraping any data.
Types of data
In most cases, how easily web data can be accessed largely determines where it falls on the legality spectrum.
Public data
Scraping data from public sites is generally legal. This refers to the data and information on websites that can be obtained without the need to log in or authenticate one’s identity. Some examples of such websites are ecommerce platforms like Amazon and BestBuy.
Although these data sources might try to protect their public information by placing various obstacles in the way of scrapers and crawlers, extracting data points from them is generally permissible, provided the data is neither personal nor copyrighted.
Private or personal data
Any data that can reveal a person’s identity, such as their name, address, date of birth, medical and financial details, and contact information, is called Personally Identifiable Information, or PII.
As a general rule, it is illegal to scrape any personal information without the person’s consent or another lawful basis. The EU and California currently have the strictest laws in this respect.
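One way to reduce the risk of collecting PII by accident is to filter scraped records before they are stored. The following is a minimal sketch, not a compliance tool: the field names in `PII_FIELDS`, the two regular expressions and the `strip_pii` helper are all assumptions made for illustration, and real PII detection needs far broader coverage than this.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical field names that a scraper might treat as personal data.
PII_FIELDS = {"name", "email", "phone", "address", "date_of_birth"}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record with PII fields dropped and with
    email addresses and phone numbers masked in remaining text values."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_RE.sub("[email removed]", value)
            value = PHONE_RE.sub("[phone removed]", value)
        cleaned[key] = value
    return cleaned
```

For example, a product review record that carries an `email` field and a phone number inside its free text would come out with the field removed and the number masked, while `product` and `price` pass through unchanged.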
Copyrighted data
It is generally illegal to scrape and reuse openly accessible content like images, songs, articles, etc. that is the intellectual property of a business or individual. Because the owners control its use and reproduction, scrapers need their explicit consent to extract and republish it. Using only short snippets and citing and crediting the source can reduce the risk, although attribution on its own does not replace permission.
Website Terms of Service (ToS)
Before scraping any website for its data, one needs to be aware of its policies on data access. If they explicitly contain any scraping restrictions, then it is worth assuming that scraping would constitute a breach of the ToS. In addition, even if there aren’t any such policies, one should be aware that the content may still be copyrighted.
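A ToS has to be read by a person, but many sites also publish machine-readable crawling rules in a robots.txt file. The sketch below is an addition for illustration, not something this article prescribes: it uses Python’s standard `urllib.robotparser` to check whether a URL is allowed for a given user agent. The sample rules, the `example-bot` user agent and the example.com URLs are all made up. Note that robots.txt is not the ToS, so respecting it does not by itself establish ToS compliance.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would fetch the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# True when the rules allow this user agent to fetch the URL.
allowed_public = parser.can_fetch("example-bot", "https://example.com/products/1")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/x")

# Requested delay in seconds between requests, or None if not specified.
delay = parser.crawl_delay("example-bot")
```

With the sample rules above, the product URL is allowed, the `/private/` URL is not, and the requested crawl delay is 5 seconds.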
Scraping behind a subscription or login
Services like LinkedIn require users to have an account before any data is visible. When signing up with these services, you’re almost always agreeing to terms that forbid scraping their data.
Since scraper bots and crawlers use your account credentials to gain access to the data, the service provider can easily identify and ban you from their platform altogether. Hence, it is advised to stay away from this option and instead try to find publicly available data.
Current legislation
As there currently aren’t any clear laws determining the legality or otherwise of web scraping, lawsuits are handled on a case-by-case basis. Having said that, the General Data Protection Regulation (GDPR) in Europe and privacy laws such as the California Consumer Privacy Act (CCPA) in the US are the ones most often referred to.
GDPR
The GDPR came into effect in May 2018 and protects the personal details of people within the European Economic Area (EEA). Some examples of personal data include people’s names, emails, phone numbers, dates of birth, IP addresses, credit card and bank details, medical records and multimedia like photos, audio and videos.
The GDPR classifies the protection of personal data as a “fundamental right”. As such, it prohibits the processing of personal data unless it is done under one of six lawful bases: consent, contract, public task, vital interest, legitimate interest or legal requirement. When the processing is based on consent, the data subject has the right to revoke it at any time.
Furthermore, data controllers must clearly disclose any data collection, declare the lawful basis and purpose, and state how long the data will be retained and whether it will be shared with third parties or transferred outside of the EEA.
US privacy laws
While the US doesn’t have a single federal law governing data privacy and protection as the EU does, there are several industry-specific acts, such as the GLBA for finance, HIPAA for healthcare and COPPA for children’s data.
California, however, has its own state law: the California Consumer Privacy Act (CCPA), which was passed in 2018 and took effect in January 2020. It requires companies collecting personal data to explicitly disclose how they intend to use that data, and allows consumers to have their information deleted or to opt out of its sale. The same rules also apply to data scraping companies.
The GDPR and CCPA both allow consumers to access and delete their personal information, and to opt out at any given time. However, users can correct their data under the GDPR, a right the CCPA did not include as originally enacted. Similarly, the CCPA mainly asks for a privacy notice on websites, while the GDPR requires a lawful basis, such as explicit user consent, before any processing takes place.
Frequent legal issues
The following are some of the most common offenses and issues in the context of web scraping.
Copyright infringement
As mentioned above, although scraping openly accessible data may be legal, there may be restrictions and legal consequences if the data is copyright-protected. Such data cannot be republished or used for commercial purposes without permission if you want to stay within the legal framework.
Any infringement of copyrighted data is actionable, regardless of how you access and collect the data.
Computer Fraud and Abuse Act violation
The CFAA was enacted in 1986 to prohibit unauthorized access to computers and networks. Originally drafted to protect military, financial and other sensitive data, it has since been broadened to cover virtually any computer connected to the internet.
Courts have held that the CFAA is unlikely to apply to web crawlers and scraping techniques that access only publicly available information.
Trespass to chattels
A trespass to chattels occurs when someone intentionally interferes with another party’s property, in this case a website or its servers, and causes harm. In the context of web scraping, a crawler repeatedly sending requests can affect the target website’s performance by slowing down or crashing its server.
From a legal standpoint, the site owners might consider the frequent requests an intentional attack on their system. As a result, it is important and morally responsible for DaaS providers to build scrapers that do not harm the target website.
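One common way to keep a scraper from overloading a target server is to enforce a minimum delay between requests. The sketch below shows the idea under stated assumptions: the `Throttle` class, its parameter names and the two-second interval in the usage comment are illustrative choices, not a description of any particular provider’s implementation or a legally sufficient safeguard.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (the URL list and fetch step are placeholders):
# throttle = Throttle(min_interval=2.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url here...
```

The clock and sleep functions are passed in as parameters so that the delay logic can be checked without actually waiting. A suitable interval depends on the target site, for instance on any crawl delay it publishes.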
Some high-profile cases
As mentioned earlier, there are a few historical cases which have set legal precedents in web scraping lawsuits.
eBay vs Bidder’s Edge (1999)
Bidder’s Edge, a website that aggregated auction listings, sent up to 100,000 daily requests to eBay’s servers to access its ongoing auctions, which eBay argued was harming its systems. In late 1999, eBay sought an injunction against Bidder’s Edge, alleging trespass to chattels.
Although both parties later settled the case out of court for an undisclosed amount, it set a legal precedent for future cases.
HiQ Labs vs LinkedIn (2019)
This historic case started when hiQ Labs, a data analytics company, sued LinkedIn for prohibiting it from scraping public profiles on LinkedIn. HiQ Labs used the data to provide workforce analytics to employers.
In 2019, the Ninth Circuit Court of Appeals ruled that the CFAA did not apply since the data was publicly available and not copyrighted. As a result, LinkedIn wasn’t able to prevent hiQ Labs from accessing its public profiles. LinkedIn did, however, restrict user profiles so that they were accessible only after logging in.
It is worth mentioning that, at the time this article was first published, the case was far from concluded, as LinkedIn continued to pursue the matter with the US Supreme Court.
Update (April 2022): In its second ruling on 18 April 2022, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law. [via TechCrunch]
Update (December 2022): On 6 December 2022, hiQ Labs and LinkedIn reached a confidential settlement agreement, thus ending their long-running litigation.
Since the question of web scraping isn’t black and white, you must analyze each use case thoroughly to avoid any unintended consequences. You need to consider existing legislation, the types of data being collected, the data source’s terms and policies, and also the ethical usage of the data after extraction.
Here at Grepsr, we take our web scraping responsibilities extremely seriously and adhere to all legal frameworks before, during and after taking on a data acquisition project. We also follow the best ethical practices in order to avoid disrupting our target websites’ performance while continuing to deliver the most accurate and reliable data to all our customers.