```mermaid
graph LR
    Crawler_Orchestration["Crawler Orchestration"]
    Execution_Engine["Execution Engine"]
    Request_Management["Request Management"]
    Downloader_System["Downloader System"]
    Scraping_Parsing["Scraping & Parsing"]
    Spider_Logic["Spider Logic"]
    Item_Processing_Pipelines["Item Processing Pipelines"]
    Configuration_Settings["Configuration & Settings"]
    Extensibility_Signals["Extensibility & Signals"]
    Crawler_Orchestration -- "initiates" --> Execution_Engine
    Execution_Engine -- "manages flow to" --> Request_Management
    Execution_Engine -- "manages flow to" --> Downloader_System
    Execution_Engine -- "manages flow to" --> Scraping_Parsing
    Request_Management -- "receives requests from" --> Execution_Engine
    Request_Management -- "provides requests to" --> Downloader_System
    Downloader_System -- "receives requests from" --> Execution_Engine
    Downloader_System -- "sends responses to" --> Execution_Engine
    Downloader_System -- "intercepts/modifies requests/responses between" --> Execution_Engine
    Scraping_Parsing -- "receives responses from" --> Execution_Engine
    Scraping_Parsing -- "sends items to" --> Item_Processing_Pipelines
    Scraping_Parsing -- "generates new requests back to" --> Execution_Engine
    Spider_Logic -- "generates initial requests for" --> Execution_Engine
    Spider_Logic -- "processes responses from" --> Scraping_Parsing
    Spider_Logic -- "produces items/requests for" --> Scraping_Parsing
    Item_Processing_Pipelines -- "receives items from" --> Scraping_Parsing
    Configuration_Settings -- "configures" --> Crawler_Orchestration
    Configuration_Settings -- "configures" --> Execution_Engine
    Configuration_Settings -- "configures" --> Request_Management
    Configuration_Settings -- "configures" --> Downloader_System
    Configuration_Settings -- "configures" --> Scraping_Parsing
    Configuration_Settings -- "configures" --> Spider_Logic
    Configuration_Settings -- "configures" --> Item_Processing_Pipelines
    Configuration_Settings -- "configures" --> Extensibility_Signals
    Extensibility_Signals -- "extends/communicates with" --> Crawler_Orchestration
    Extensibility_Signals -- "extends/communicates with" --> Execution_Engine
    Extensibility_Signals -- "extends/communicates with" --> Request_Management
    Extensibility_Signals -- "extends/communicates with" --> Downloader_System
    Extensibility_Signals -- "extends/communicates with" --> Scraping_Parsing
    Extensibility_Signals -- "extends/communicates with" --> Spider_Logic
    Extensibility_Signals -- "extends/communicates with" --> Item_Processing_Pipelines
    Extensibility_Signals -- "extends/communicates with" --> Configuration_Settings
    click Crawler_Orchestration href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Crawler Orchestration.md" "Details"
    click Execution_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Execution Engine.md" "Details"
    click Request_Management href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Request Management.md" "Details"
    click Downloader_System href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Downloader System.md" "Details"
    click Scraping_Parsing href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Scraping & Parsing.md" "Details"
    click Spider_Logic href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Spider Logic.md" "Details"
    click Item_Processing_Pipelines href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Item Processing Pipelines.md" "Details"
    click Configuration_Settings href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Configuration & Settings.md" "Details"
    click Extensibility_Signals href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Extensibility & Signals.md" "Details"
```
## Component Details

Scrapy is an open-source web crawling framework designed for fast, high-level screen scraping and web crawling. It allows users to define how to crawl websites and extract structured data using Spiders, which interact with a core Execution Engine that manages requests, responses, and item processing through a series of pluggable components like Downloader Middlewares and Item Pipelines. The framework is highly extensible via signals and extensions, and its behavior is controlled through a comprehensive settings system.

### Crawler Orchestration

Manages the overall lifecycle of crawling processes, including initiating, executing, and terminating crawls. It provides the user interface for starting and managing Scrapy projects and individual spiders.

Related Classes/Methods:

### Execution Engine

The central orchestrator of the Scrapy framework. It manages the flow of requests and responses between the Scheduler, Downloader, and Scraper, ensuring efficient and concurrent crawling operations.

Related Classes/Methods:

### Request Management

Handles the queuing, prioritization, and deduplication of web requests. It ensures that only unique and relevant requests are processed by the Downloader, utilizing specific data structures for HTTP communication.

Related Classes/Methods:

### Downloader System

Responsible for fetching web content from various sources and handling different protocols. A chain of downloader middlewares is applied to each request before it is sent and to each response as it is received.

Related Classes/Methods:

### Scraping & Parsing

Processes downloaded responses, extracts structured data (items) based on spider logic, and can generate new requests for further crawling. It also includes tools for selecting data from HTML/XML.

Related Classes/Methods:

### Spider Logic

Defines the custom crawling behavior for specific websites. Spiders specify how to initiate requests, parse responses, and yield extracted items or new requests. Spider middlewares allow for pre- and post-processing of spider input/output.

Related Classes/Methods:

### Item Processing Pipelines

A series of components that process extracted items sequentially. This includes data validation, cleaning, persistence to various storage backends, and handling associated media files (images, files).

Related Classes/Methods:

### Configuration & Settings

Centralized management of all configurable parameters for a Scrapy project. It allows users to customize the behavior of various components through a hierarchical settings system.

Related Classes/Methods:

  - scrapy.settings.Settings (full file reference)
  - scrapy.settings.BaseSettings (full file reference)
  - scrapy.utils.conf (full file reference)

### Extensibility & Signals

Provides a flexible system for extending Scrapy's core functionality and enabling communication between different components through a robust signal dispatching mechanism. Extensions can hook into various stages of the crawling process.

Related Classes/Methods: