
A Comprehensive Glossary of Terms for Web Scraping


Web scraping has become an essential tool for extracting data from websites in various industries. 

However, understanding the terminology associated with web scraping can sometimes be challenging.

In this blog post, we provide a comprehensive glossary of terms to help you navigate the world of web scraping with ease.

Whether you are new to data extraction or a seasoned professional, this glossary will serve as a handy reference to ensure you stay well-informed.

1. Account

An account represents an individual customer account, a business, or even a partner organization with whom we do business. It serves as the basis for managing and organizing data scraping projects.

2. Account Owner

The Account Owner is a designated point of contact at Grepsr responsible for delivery, support, and account expansion. The role is available on certain account types and ensures smooth communication and coordination between the customer and Grepsr.

3. Data Platform

The Data Platform is Grepsr’s proprietary, enterprise-grade system for data project management. It consists of two complementary parts: the backend infrastructure that handles data extraction and management, and the frontend interface that lets users configure and monitor their scraping projects.

4. Data Project

A project is a vehicle through which customer requirements are translated into workable data, and value is delivered. It includes data requirements such as URLs and data points to extract, as well as additional instructions required to pull data effectively.


5. Data Report

Project requirements are grouped into sets called Reports. A Report represents a use case, or a granular set of data and delivery requirements, that executes and delivers as a unit. Each Report is associated with a set of programmatic instructions for sourcing data, known as a Crawler or Service.

6. Data Crawler (or Spider)

A Crawler programmatically opens and interacts with a website to parse content and extract data. It is versioned to reflect changes in the data scope over time. Accordingly, every successful Project has at least one Report associated with a unique Crawler version.
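As a toy illustration of a Crawler's parsing step, the sketch below uses Python's standard-library `HTMLParser` to pull product titles out of fetched HTML. In a real Crawler the HTML would come from HTTP requests to the target site; the markup and class names here are purely illustrative, not part of any Grepsr API.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects text inside <h2 class="product-title"> tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "product-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

# A static snippet stands in for a fetched page so the example is self-contained.
html = """
<div><h2 class="product-title">Widget A</h2>
<h2 class="product-title">Widget B</h2></div>
"""
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # ['Widget A', 'Widget B']
```

Production crawlers typically add request handling, pagination, retries, and more robust parsing on top of this core extract step.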

7. Run

A Run is the execution of a Crawler. It retrieves data from the target website based on the defined instructions and configuration.

8. Dataset

A Dataset is the data output resulting from a Run. It contains the extracted data in a structured format ready for analysis and processing.

9. Page

Pages within a Dataset are similar to sheets in a spreadsheet. Each Dataset consists of at least one Page, which allows the final output to be normalized, much like tables in a relational database that each handle a separate concern.

10. Columns

Columns are the extracted fields in a Dataset or a Page in a Dataset. They organize the data and provide a clear structure to the extracted information.

11. Indexed Column

Indexing a column is a standard technique in database management. It means that the generated output for that column is stored in a way that allows filtering, sorting, and searching across millions of records without delay.
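To illustrate the idea, here is a minimal SQLite sketch in Python: after an index is created on a hypothetical `brand` column, an equality filter on that column is answered with an index search rather than a full table scan. The table and column names are illustrative, not part of Grepsr's platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Widget A", "Acme"), ("Widget B", "Globex")],
)

# Indexing the brand column lets equality filters use the index
# instead of scanning every row.
conn.execute("CREATE INDEX idx_brand ON products (brand)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM products WHERE brand = 'Acme'"
).fetchall()
print(plan)  # the plan mentions idx_brand, i.e. an index search
```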

12. Rows

Each line of record in a Dataset is a Row. Rows contain the extracted data for each specific instance or entry.

13. Object

In a JSON output, a Row of records is an Object. Unlike a Row, an Object can be layered, allowing for a more complex structure of data representation.
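The difference shows up clearly in a small Python sketch: a flat Row holds exactly one value per Column, while a JSON Object can nest a list of sub-records inside a single field. The record below is purely illustrative.

```python
import json

# A flat Row: one value per Column, no nesting.
row = {"name": "Widget A", "price": "9.99", "review_count": "2"}

# The same record as a layered JSON Object: reviews nest inside it.
obj = {
    "name": "Widget A",
    "price": 9.99,
    "reviews": [  # a nested layer a flat Row cannot represent
        {"rating": 5, "text": "Great"},
        {"rating": 4, "text": "Good"},
    ],
}
print(json.dumps(obj, indent=2))
```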

14. Data Quality

Quality is an umbrella term for the quantitative, qualitative, and overall health of a Report. It takes several factors into account, including Accuracy, Completeness, Data Distribution, Rows, and Requests.

15. Data Accuracy

Accuracy is a numeric score, expressed as a percentage, that measures whether the sourced data complies with the expected data format. Rules assigned to the Columns in a Dataset validate this compliance, so a higher Accuracy score indicates better adherence to data standards.
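A simplified sketch of how such a score might be computed: each Column gets a validation rule, every cell is checked against its Column's rule, and Accuracy is the percentage of checks that pass. The rules and sample data are hypothetical, not Grepsr's actual validation logic.

```python
import re

# Hypothetical per-Column validation rules (names are illustrative).
rules = {
    "price": lambda v: re.fullmatch(r"\d+\.\d{2}", v) is not None,
    "sku":   lambda v: re.fullmatch(r"[A-Z]{3}-\d{4}", v) is not None,
}

rows = [
    {"price": "19.99", "sku": "ABC-1234"},
    {"price": "free",  "sku": "ABC-0042"},  # "free" fails the price rule
]

# Check every cell against its Column's rule, then score the pass rate.
checks = [rule(row[col]) for row in rows for col, rule in rules.items()]
accuracy = 100 * sum(checks) / len(checks)
print(f"Accuracy: {accuracy:.0f}%")  # 3 of 4 checks pass -> 75%
```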

16. Data Completeness

Completeness refers to the state where the data contains all the information available to extract from the source. It is measured by the Fill Rate, which calculates the data density within the Dataset.

17. Fill Rate

The Fill Rate is a numeric score, expressed as a percentage, that measures data density within a Dataset: the proportion of cells containing data versus empty cells. A higher Fill Rate signifies a more complete Dataset.
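Computing a Fill Rate reduces to counting non-empty cells, as in this minimal sketch (the sample rows are illustrative):

```python
rows = [
    {"name": "Widget A", "price": "9.99", "brand": "Acme"},
    {"name": "Widget B", "price": "",     "brand": None},  # two empty cells
]

# Flatten the Dataset into cells and count the ones that hold data.
cells = [value for row in rows for value in row.values()]
filled = sum(1 for value in cells if value not in (None, ""))
fill_rate = 100 * filled / len(cells)
print(f"Fill Rate: {fill_rate:.1f}%")  # 4 of 6 cells filled -> 66.7%
```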

18. Data Distribution

Data Distribution measures how often a given value occurs in a Column. It is particularly useful for Indexed Columns and acts as a proxy for data quality: if the distribution deviates from the norm, it may indicate issues with the sourced data.

19. Data Crawler Requests

A Request is an HTTP request made to a server to retrieve content. The Crawler makes a series of Requests to load and interact with a web page and extract the necessary data. Each Request is either served by the server or fails, indicating an error.
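The following self-contained Python sketch spins up a throwaway local HTTP server and issues a single Request against it, checking whether the server served the content (status 200). A real Crawler's Requests would, of course, target the source websites rather than localhost.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a tiny HTML body for any GET Request.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>Hello</h1>")

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The Request: fetch the page and record status and body.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    status, body = resp.status, resp.read()
server.shutdown()
print(status)  # 200 -> the Request was served, not failed
```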

20. Team

A Team refers to a set of users belonging to the same Account. Teams can have different roles, such as Team Manager or Viewer. The Team Manager has administrative rights and access to all Projects in the Account, while the Viewer has limited rights and access only to specific added Projects.

In conclusion

Web scraping is a powerful technique for extracting data from websites, and understanding the associated terminology is essential. This glossary provides a comprehensive list of terms to help you navigate the world of web scraping with confidence.

Whether you are a beginner or an experienced user, a clear understanding of these terms will empower you to leverage web scraping effectively in your data-driven projects.
