Web scraping has become an essential tool for extracting data from websites in various industries.
However, understanding the terminology associated with web scraping can sometimes be challenging.
In this blog post, we have put together a comprehensive glossary of terms to help you navigate the world of web scraping with ease.
Whether you are new to web scraping or a seasoned professional, this glossary will serve as a handy reference guide to ensure you stay well-informed.
1. Account
An account represents an individual customer account, a business, or even a partner organization with whom we do business. It serves as the basis for managing and organizing data scraping projects.
2. Account Owner
The Account Owner is a designated point of contact from Grepsr responsible for delivery, support, and account expansion. This role is reserved for certain account types and ensures smooth communication and coordination between the customer and Grepsr.
3. Data Platform
The Data Platform is Grepsr’s proprietary, enterprise-grade system for data project management. It consists of two complementary pieces: the backend infrastructure that handles data extraction and management, and the frontend interface that enables users to configure and monitor their scraping projects.
4. Data Project
A project is a vehicle through which customer requirements are translated into workable data, and value is delivered. It includes data requirements such as URLs and data points to extract, as well as additional instructions required to pull data effectively.
5. Data Report
Project requirements are grouped into sets called Reports. A Report represents a use case or a granular set of data and delivery requirements that can be executed at once and delivered together. Each Report is associated with a set of programmatic instructions to source data known as a Crawler or Service.
6. Data Crawler (or Spider)
A Crawler programmatically opens and interacts with a website to parse content and extract data. It is versioned to reflect changes in the data scope over time. A successful Project has at least one Report associated with a unique Crawler version.
7. Crawler Run
A Run is the execution of a Crawler. It retrieves data from the target website based on the defined instructions and configuration.
8. Dataset
A Dataset is the data output resulting from a Run. It contains the extracted data in a structured format ready for analysis and processing.
9. Dataset Page
Pages within a Dataset are similar to sheets in a spreadsheet. Each Dataset consists of at least one Page, which allows the final output to be normalized, much like tables in a relational database.
10. Dataset Column
Columns are the extracted fields in a Dataset (or in a Page within a Dataset). They organize the data and provide a clear structure to the extracted information.
11. Indexed Column
Indexing a Column means the output generated for that Column is stored in a way that allows filtering, sorting, and searching across millions of records without noticeable delay.
12. Dataset Row
Each record in a Dataset is referred to as a Row. Rows contain the extracted data for each specific instance or entry.
13. JSON Object
In JSON output, a record is referred to as an Object. Unlike a Row, an Object can be nested, allowing for a more complex structure of data representation.
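To illustrate the difference, here is a minimal Python sketch comparing a flat Row with a nested Object; the field names are purely illustrative, not an actual Grepsr schema:

```python
import json

# A Row is flat: one value per Column.
row = {"product": "Laptop", "price": 999.99, "in_stock": True}

# An Object can be nested, grouping related data (e.g. seller details and
# multiple offers) under a single record.
obj = {
    "product": "Laptop",
    "price": 999.99,
    "seller": {"name": "Acme Store", "rating": 4.7},
    "offers": [
        {"condition": "new", "price": 999.99},
        {"condition": "refurbished", "price": 799.99},
    ],
}

print(json.dumps(obj, indent=2))
```

A nested Object like this would need two or more flat Rows (or separate Pages) to represent the same information.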
14. Data Quality
Quality is an umbrella term used to quantitatively and qualitatively measure the overall health of a Report. It takes various factors into consideration, including Accuracy, Completeness, Data Distribution, Rows, and Requests.
15. Data Accuracy
Accuracy is a numeric score, expressed as a percentage, that measures whether the sourced data complies with the expected data format. Compliance is validated by rules assigned to different Columns in a Dataset. Higher Accuracy indicates better adherence to data standards.
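As a rough illustration of how such a score could be computed, the Python sketch below validates each cell against hypothetical per-Column format rules; the rules and the sample data are made up for this example:

```python
import re

# Hypothetical validation rules: each maps a Column name to a predicate
# that returns True when a cell complies with the expected format.
rules = {
    "price": lambda v: re.fullmatch(r"\d+\.\d{2}", v) is not None,
    "sku":   lambda v: re.fullmatch(r"[A-Z]{3}-\d{4}", v) is not None,
}

dataset = [
    {"price": "19.99", "sku": "ABC-1234"},
    {"price": "N/A",   "sku": "XYZ-9876"},  # "N/A" fails the price rule
]

checks = [rule(row[col]) for row in dataset for col, rule in rules.items()]
accuracy = 100.0 * sum(checks) / len(checks)
print(f"Accuracy: {accuracy:.1f}%")  # 3 of 4 cells comply -> 75.0%
```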
16. Data Completeness
Completeness refers to the state where the data contains all the information available to extract from the source. It is measured using a Fill Rate, which calculates the data density within the Dataset.
17. Fill Rate
Fill Rate is a numeric score, expressed as a percentage, that measures the data density within a Dataset. It indicates the proportion of cells containing data versus empty cells. A higher Fill Rate signifies a more complete Dataset.
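A minimal sketch of how a Fill Rate might be computed over a small Dataset (the data is illustrative):

```python
dataset = [
    {"name": "Widget", "brand": "Acme", "price": "9.99"},
    {"name": "Gadget", "brand": "",     "price": "14.50"},
    {"name": "Gizmo",  "brand": None,   "price": ""},
]

# Count every cell, then count only those holding actual data.
total_cells = sum(len(row) for row in dataset)  # 3 rows x 3 columns = 9
filled = sum(
    1 for row in dataset for v in row.values() if v not in (None, "")
)
fill_rate = 100.0 * filled / total_cells
print(f"Fill rate: {fill_rate:.1f}%")  # 6 of 9 cells filled -> 66.7%
```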
18. Data Distribution
Data Distribution measures the occurrence of a certain value in a Column. It is particularly useful for Indexed Columns and acts as a proxy for data quality. If the data distribution deviates from the norm, it may indicate potential issues with the sourced data.
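The idea can be sketched as a simple value count compared against a baseline from earlier Runs; the Column name, values, and baseline figures below are invented for illustration:

```python
from collections import Counter

# Value counts for an (assumed) indexed "category" Column across a Dataset.
values = ["electronics", "electronics", "toys",
          "electronics", "unknown", "unknown"]
distribution = Counter(values)

# A crude deviation check: flag values whose share differs sharply from a
# baseline observed in earlier Runs.
baseline = {"electronics": 0.55, "toys": 0.40, "unknown": 0.05}
total = len(values)
for value, count in distribution.items():
    share = count / total
    if abs(share - baseline.get(value, 0.0)) > 0.15:
        # Here "toys" and "unknown" deviate, hinting at a sourcing issue.
        print(f"'{value}' share {share:.0%} deviates from baseline")
```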
19. Data Crawler Requests
A Request is an HTTP request made to the server to retrieve content. The Crawler makes a series of Requests to load and interact with a web page to extract the necessary data. Requests can succeed, meaning the requested content is served by the server, or fail, indicating an error occurred.
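As a simplified illustration, a crawler might bucket Request outcomes by HTTP status code along these lines (a real crawler also handles timeouts, retries, and blocking):

```python
def classify(status_code: int) -> str:
    """Crude classification of a Request outcome by HTTP status code."""
    if 200 <= status_code < 300:
        return "success"      # content was served
    if status_code in (429, 503):
        return "retryable"    # rate-limited or temporarily unavailable
    return "failed"           # client or server error

print(classify(200))  # success
print(classify(429))  # retryable
print(classify(404))  # failed
```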
20. Team
A Team refers to a set of users belonging to the same Account. Teams can have different roles, such as Team Manager or Viewer. The Team Manager has administrative rights and access to all Projects in the Account, while the Viewer has limited rights and access only to specific added Projects.
Whether you are a beginner or an experienced user, having a clear understanding of these terms will empower you to effectively leverage web scraping in your data-driven projects.