graph LR
Crawler_Orchestration["Crawler Orchestration"]
Execution_Engine["Execution Engine"]
Request_Management["Request Management"]
Downloader_System["Downloader System"]
Scraping_Parsing["Scraping & Parsing"]
Spider_Logic["Spider Logic"]
Item_Processing_Pipelines["Item Processing Pipelines"]
Configuration_Settings["Configuration & Settings"]
Extensibility_Signals["Extensibility & Signals"]
Crawler_Orchestration -- "initiates" --> Execution_Engine
Execution_Engine -- "manages flow to" --> Request_Management
Execution_Engine -- "manages flow to" --> Downloader_System
Execution_Engine -- "manages flow to" --> Scraping_Parsing
Request_Management -- "receives requests from" --> Execution_Engine
Request_Management -- "provides requests to" --> Downloader_System
Downloader_System -- "receives requests from" --> Execution_Engine
Downloader_System -- "sends responses to" --> Execution_Engine
Downloader_System -- "applies middleware to requests/responses exchanged with" --> Execution_Engine
Scraping_Parsing -- "receives responses from" --> Execution_Engine
Scraping_Parsing -- "sends items to" --> Item_Processing_Pipelines
Scraping_Parsing -- "generates new requests back to" --> Execution_Engine
Spider_Logic -- "generates initial requests for" --> Execution_Engine
Spider_Logic -- "processes responses from" --> Scraping_Parsing
Spider_Logic -- "produces items/requests for" --> Scraping_Parsing
Item_Processing_Pipelines -- "receives items from" --> Scraping_Parsing
Configuration_Settings -- "configures" --> Crawler_Orchestration
Configuration_Settings -- "configures" --> Execution_Engine
Configuration_Settings -- "configures" --> Request_Management
Configuration_Settings -- "configures" --> Downloader_System
Configuration_Settings -- "configures" --> Scraping_Parsing
Configuration_Settings -- "configures" --> Spider_Logic
Configuration_Settings -- "configures" --> Item_Processing_Pipelines
Configuration_Settings -- "configures" --> Extensibility_Signals
Extensibility_Signals -- "extends/communicates with" --> Crawler_Orchestration
Extensibility_Signals -- "extends/communicates with" --> Execution_Engine
Extensibility_Signals -- "extends/communicates with" --> Request_Management
Extensibility_Signals -- "extends/communicates with" --> Downloader_System
Extensibility_Signals -- "extends/communicates with" --> Scraping_Parsing
Extensibility_Signals -- "extends/communicates with" --> Spider_Logic
Extensibility_Signals -- "extends/communicates with" --> Item_Processing_Pipelines
Extensibility_Signals -- "extends/communicates with" --> Configuration_Settings
click Crawler_Orchestration href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Crawler Orchestration.md" "Details"
click Execution_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Execution Engine.md" "Details"
click Request_Management href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Request Management.md" "Details"
click Downloader_System href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Downloader System.md" "Details"
click Scraping_Parsing href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Scraping & Parsing.md" "Details"
click Spider_Logic href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Spider Logic.md" "Details"
click Item_Processing_Pipelines href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Item Processing Pipelines.md" "Details"
click Configuration_Settings href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Configuration & Settings.md" "Details"
click Extensibility_Signals href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/scrapy/Extensibility & Signals.md" "Details"
Scrapy is an open-source web crawling framework designed for fast, high-level screen scraping and web crawling. It allows users to define how to crawl websites and extract structured data using Spiders, which interact with a core Execution Engine that manages requests, responses, and item processing through a series of pluggable components like Downloader Middlewares and Item Pipelines. The framework is highly extensible via signals and extensions, and its behavior is controlled through a comprehensive settings system.
**Crawler Orchestration**

Manages the overall lifecycle of crawling processes: initiating, executing, and terminating crawls. It also provides the command-line entry points for starting and managing Scrapy projects and individual spiders.

Related Classes/Methods:

- `scrapy.crawler.Crawler` (58:322)
- `scrapy.crawler.CrawlerRunnerBase` (325:375)
- `scrapy.crawler.CrawlerRunner` (378:464)
- `scrapy.crawler.AsyncCrawlerRunner` (467:560)
- `scrapy.crawler.CrawlerProcessBase` (563:632)
- `scrapy.crawler.CrawlerProcess` (635:707)
- `scrapy.crawler.AsyncCrawlerProcess` (710:788)
- `scrapy.cmdline` (full file reference)
- `scrapy.commands.bench.Command` (19:33)
- `scrapy.commands.check.Command` (42:115)
- `scrapy.commands.crawl.Command` (12:34)
- `scrapy.commands.edit.Command` (10:47)
- `scrapy.commands.fetch.Command` (20:97)
- `scrapy.commands.genspider.Command` (48:226)
- `scrapy.commands.list.Command` (12:24)
- `scrapy.commands.parse.Command` (38:410)
- `scrapy.commands.runspider.Command` (32:64)
- `scrapy.commands.settings.Command` (8:64)
- `scrapy.commands.shell.Command` (24:101)
- `scrapy.commands.startproject.Command` (35:141)
- `scrapy.commands.version.Command` (8:35)
- `scrapy.commands.view.Command` (11:28)
- `scrapy.commands.ScrapyCommand` (full file reference)
- `scrapy.commands.BaseRunSpiderCommand` (full file reference)
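To make the orchestration role concrete, here is a hypothetical, minimal sketch of the `CrawlerProcess` pattern in plain Python: one top-level runner owns many crawlers, each binding a single spider class to one crawl lifecycle. The class and method names mirror Scrapy's but this is not Scrapy's actual API.

```python
class StubSpider:
    """A stand-in spider (hypothetical) that yields one request."""
    name = "stub"

    def start_requests(self):
        yield {"url": "https://example.com"}

class Crawler:
    """Binds one spider class to one crawl lifecycle."""
    def __init__(self, spider_cls):
        self.spider_cls = spider_cls
        self.stats = {"requests": 0}

    def crawl(self):
        spider = self.spider_cls()
        for request in spider.start_requests():
            self.stats["requests"] += 1  # a real engine would download here

class CrawlerProcess:
    """Top-level orchestrator: collects crawlers, then runs them all."""
    def __init__(self):
        self.crawlers = []

    def crawl(self, spider_cls):
        self.crawlers.append(Crawler(spider_cls))

    def start(self):
        for crawler in self.crawlers:
            crawler.crawl()

process = CrawlerProcess()
process.crawl(StubSpider)
process.start()
print(process.crawlers[0].stats)  # {'requests': 1}
```

In real Scrapy the same shape appears as `CrawlerProcess(settings)`, `process.crawl(MySpider)`, `process.start()`, with the reactor and engine doing the actual work.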
**Execution Engine**

The central orchestrator of the Scrapy framework. It manages the flow of requests and responses between the Scheduler, Downloader, and Scraper, ensuring efficient, concurrent crawling.

Related Classes/Methods:

- `scrapy.core.engine.ExecutionEngine` (89:520)
- `scrapy.core.engine._Slot` (52:86)
- `scrapy.utils.engine` (full file reference)
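The engine's data flow can be sketched as a simple pump loop: take the next request from the scheduler, download it, hand the response to the scraper, and feed any follow-up requests back into the scheduler. This toy version (not Scrapy code, which is asynchronous and slot-based) uses plain URLs and dicts as stand-ins for Request/Response objects.

```python
from collections import deque

def engine_loop(start_requests, download, scrape):
    """Pump requests until the scheduler drains; collect scraped items."""
    scheduler, items = deque(start_requests), []
    while scheduler:
        request = scheduler.popleft()
        response = download(request)
        for result in scrape(response):
            if isinstance(result, dict):   # an extracted item
                items.append(result)
            else:                          # a follow-up request (a URL here)
                scheduler.append(result)
    return items

def download(url):
    # Stub downloader: pretend "/start" links to "/next".
    return {"url": url, "links": ["/next"] if url == "/start" else []}

def scrape(response):
    yield {"page": response["url"]}
    yield from response["links"]

print(engine_loop(["/start"], download, scrape))
# → [{'page': '/start'}, {'page': '/next'}]
```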
**Request Management**

Handles the queuing, prioritization, and deduplication of web requests, ensuring that only unique, relevant requests reach the Downloader. It also defines the Request and Response classes used for HTTP communication throughout the framework.

Related Classes/Methods:

- `scrapy.core.scheduler.BaseScheduler` (55:127)
- `scrapy.core.scheduler.Scheduler` (130:498)
- `scrapy.pqueues.ScrapyPriorityQueue` (52:233)
- `scrapy.pqueues.DownloaderAwarePriorityQueue` (254:367)
- `scrapy.dupefilters.BaseDupeFilter` (29:61)
- `scrapy.dupefilters.RFPDupeFilter` (64:155)
- `scrapy.http.request.Request` (full file reference)
- `scrapy.http.request.form.FormRequest` (38:93)
- `scrapy.http.request.json_request.JsonRequest` (22:77)
- `scrapy.http.request.rpc.XmlRpcRequest` (23:40)
- `scrapy.http.response.Response` (full file reference)
- `scrapy.http.response.text.TextResponse` (42:294)
- `scrapy.http.response.html.HtmlResponse` (11:12)
- `scrapy.http.response.json.JsonResponse` (11:12)
- `scrapy.http.response.xml.XmlResponse` (11:12)
- `scrapy.http.headers.Headers` (23:130)
- `scrapy.http.cookies.CookieJar` (27:108)
- `scrapy.utils.request` (full file reference)
- `scrapy.utils.response` (full file reference)
- `scrapy.utils.url` (full file reference)
- `scrapy.utils.httpobj` (full file reference)
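Deduplication here works by fingerprinting requests, in the spirit of `RFPDupeFilter`. The sketch below is simplified: Scrapy's real fingerprinting (in `scrapy.utils.request`) also canonicalizes URLs and can optionally include headers, while this toy version only hashes method, URL, and body.

```python
import hashlib

class DupeFilter:
    """Toy request-fingerprint dedup filter (simplified RFPDupeFilter)."""
    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url, body=b""):
        digest = hashlib.sha1()
        for part in (method.encode(), url.encode(), body):
            digest.update(part)
        return digest.hexdigest()

    def request_seen(self, method, url, body=b""):
        """Return True if an identical request was already scheduled."""
        fp = self.fingerprint(method, url, body)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False

df = DupeFilter()
print(df.request_seen("GET", "https://example.com/"))   # False: first time
print(df.request_seen("GET", "https://example.com/"))   # True: duplicate
print(df.request_seen("POST", "https://example.com/"))  # False: different method
```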
**Downloader System**

Responsible for fetching web content over various protocols, applying a chain of downloader middlewares to requests before they are sent and to responses as they are received.

Related Classes/Methods:

- `scrapy.core.downloader.Downloader` (full file reference)
- `scrapy.core.downloader.handlers.DownloadHandlers` (full file reference)
- `scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler` (67:125)
- `scrapy.core.downloader.handlers.http2.H2DownloadHandler` (29:51)
- `scrapy.core.downloader.handlers.ftp.FTPDownloadHandler` (85:151)
- `scrapy.core.downloader.handlers.file.FileDownloadHandler` (16:24)
- `scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler` (15:28)
- `scrapy.core.downloader.webclient.ScrapyHTTPClientFactory` (94:239)
- `scrapy.core.downloader.webclient.ScrapyHTTPPageGetter` (24:87)
- `scrapy.core.http2.agent.H2Agent` (118:158)
- `scrapy.core.http2.agent.H2ConnectionPool` (32:115)
- `scrapy.core.http2.protocol.H2ClientProtocol` (85:434)
- `scrapy.core.http2.stream.Stream` (78:495)
- `scrapy.core.downloader.middleware.DownloaderMiddlewareManager` (27:118)
- `scrapy.downloadermiddlewares.stats.DownloaderStats` (37:84)
- `scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware` (49:193)
- `scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware` (34:157)
- `scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware` (21:39)
- `scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` (23:93)
- `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` (25:104)
- `scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware` (26:83)
- `scrapy.downloadermiddlewares.redirect.BaseRedirectMiddleware` (81:138)
- `scrapy.downloadermiddlewares.redirect.RedirectMiddleware` (141:177)
- `scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` (180:207)
- `scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` (35:158)
- `scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware` (24:51)
- `scrapy.downloadermiddlewares.retry.RetryMiddleware` (125:174)
- `scrapy.downloadermiddlewares.useragent.UserAgentMiddleware` (17:37)
- `scrapy.downloadermiddlewares.cookies.CookiesMiddleware` (39:182)
- `scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware` (24:38)
- `scrapy.robotstxt.PythonRobotParser` (68:85)
- `scrapy.robotstxt.RerpRobotParser` (88:105)
- `scrapy.robotstxt.ProtegoRobotParser` (108:124)
- `scrapy.resolver.CachingThreadedResolver` (33:70)
- `scrapy.resolver.CachingHostnameResolver` (104:148)
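The middleware chain wraps the download call symmetrically: `process_request` hooks run in registration order before the fetch, and `process_response` hooks run in reverse order afterwards. This is a toy sketch of that contract (the real `DownloaderMiddlewareManager` is asynchronous and middlewares can also short-circuit or raise); the `DefaultHeaders` and `StatsTap` classes here are hypothetical stand-ins for `DefaultHeadersMiddleware` and `DownloaderStats`.

```python
class MiddlewareManager:
    """Apply process_request in order, then process_response in reverse."""
    def __init__(self, middlewares):
        self.middlewares = middlewares

    def download(self, request, download_func):
        for mw in self.middlewares:
            if hasattr(mw, "process_request"):
                request = mw.process_request(request)
        response = download_func(request)
        for mw in reversed(self.middlewares):
            if hasattr(mw, "process_response"):
                response = mw.process_response(response)
        return response

class DefaultHeaders:
    def process_request(self, request):
        request.setdefault("headers", {})["User-Agent"] = "sketch/1.0"
        return request

class StatsTap:
    def __init__(self):
        self.responses = 0
    def process_response(self, response):
        self.responses += 1
        return response

stats = StatsTap()
manager = MiddlewareManager([DefaultHeaders(), stats])
resp = manager.download({"url": "/a"},
                        lambda req: {"status": 200, "for": req["url"]})
print(resp, stats.responses)  # {'status': 200, 'for': '/a'} 1
```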
**Scraping & Parsing**

Processes downloaded responses, extracts structured data (items) based on spider logic, and can generate new requests for further crawling. It also includes tools for selecting data from HTML/XML.

Related Classes/Methods:

- `scrapy.core.scraper.Scraper` (99:453)
- `scrapy.core.scraper.Slot` (55:96)
- `scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor` (164:284)
- `scrapy.linkextractors.lxmlhtml.LxmlParserLinkExtractor` (60:157)
- `scrapy.selector.unified.Selector` (39:101)
- `scrapy.selector.unified.SelectorList` (32:36)
- `scrapy.loader.ItemLoader` (full file reference)
- `scrapy.exporters.BaseItemExporter` (38:108)
- `scrapy.exporters.JsonLinesItemExporter` (111:121)
- `scrapy.exporters.JsonItemExporter` (124:162)
- `scrapy.exporters.XmlItemExporter` (165:219)
- `scrapy.exporters.CsvItemExporter` (222:292)
- `scrapy.exporters.PickleItemExporter` (295:303)
- `scrapy.exporters.MarshalItemExporter` (306:320)
- `scrapy.exporters.PprintItemExporter` (323:330)
- `scrapy.exporters.PythonItemExporter` (333:373)
- `scrapy.shell.Shell` (36:212)
- `scrapy.contracts.Contract` (full file reference)
- `scrapy.contracts.ContractsManager` (full file reference)
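To illustrate the "responses in, links out" shape of this component, here is a minimal link-extraction sketch using only the standard library. Scrapy's `LxmlLinkExtractor` and `Selector` are far richer (XPath/CSS selection, allow/deny rules, encoding handling); this only shows the basic idea.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p><a href="/page/1">one</a> and <a href="/page/2">two</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/page/1', '/page/2']
```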
**Spider Logic**

Defines the custom crawling behavior for specific websites. Spiders specify how to initiate requests, parse responses, and yield extracted items or new requests. Spider middlewares allow pre- and post-processing of spider input and output.

Related Classes/Methods:

- `scrapy.spiders.Spider` (full file reference)
- `scrapy.spiders.feed.XMLFeedSpider` (23:108)
- `scrapy.spiders.feed.CSVFeedSpider` (111:161)
- `scrapy.spiders.crawl.CrawlSpider` (93:219)
- `scrapy.spiders.sitemap.SitemapSpider` (26:132)
- `scrapy.spiders.init.InitSpider` (16:63)
- `scrapy.core.spidermw.SpiderMiddlewareManager` (52:524)
- `scrapy.spidermiddlewares.offsite.OffsiteMiddleware` (37:112)
- `scrapy.spidermiddlewares.depth.DepthMiddleware` (29:97)
- `scrapy.spidermiddlewares.base.BaseSpiderMiddleware` (17:110)
- `scrapy.spidermiddlewares.httperror.HttpErrorMiddleware` (37:81)
- `scrapy.spidermiddlewares.referer.RefererMiddleware` (329:403)
- `scrapy.spidermiddlewares.urllength.UrlLengthMiddleware` (26:55)
- `scrapy.spiderloader.get_spider_loader` (25:30)
- `scrapy.spiderloader.SpiderLoader` (51:131)
- `scrapy.spiderloader.DummySpiderLoader` (135:149)
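The spider contract described above can be sketched as a generator: `parse()` yields both extracted items and follow-up requests from the same method. This hypothetical toy uses dicts and URLs where real Scrapy spiders yield item/`Request` objects and receive `Response` objects.

```python
class QuotesSpider:
    """A toy spider: parse() yields items (dicts) and follow-up URLs."""
    name = "quotes"
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        for text in response["quotes"]:
            yield {"quote": text}      # an extracted item
        if response.get("next"):
            yield response["next"]     # a follow-up request (a URL here)

spider = QuotesSpider()
fake_response = {"quotes": ["a", "b"], "next": "https://example.com/page/2"}
print(list(spider.parse(fake_response)))
# → [{'quote': 'a'}, {'quote': 'b'}, 'https://example.com/page/2']
```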
**Item Processing Pipelines**

A series of components that process extracted items sequentially. This includes data validation, cleaning, persistence to various storage backends, and handling of associated media files (images, files).

Related Classes/Methods:

- `scrapy.pipelines.images.ImagesPipeline` (43:274)
- `scrapy.pipelines.files.FilesPipeline` (414:746)
- `scrapy.pipelines.media.MediaPipeline` (48:336)
- `scrapy.pipelines.files.FSFilesStore` (104:152)
- `scrapy.pipelines.files.S3FilesStore` (155:274)
- `scrapy.pipelines.files.GCSFilesStore` (277:349)
- `scrapy.pipelines.ItemPipelineManager` (full file reference)
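Sequential item processing follows a simple contract, which Scrapy's pipelines implement: each stage's `process_item` either returns the (possibly transformed) item for the next stage or raises `DropItem` to discard it. A minimal sketch, with hypothetical `StripWhitespace` and `RequirePrice` stages:

```python
class DropItem(Exception):
    """Raised by a pipeline stage to discard the current item."""

class StripWhitespace:
    def process_item(self, item):
        return {k: v.strip() for k, v in item.items()}

class RequirePrice:
    def process_item(self, item):
        if not item.get("price"):
            raise DropItem("missing price")
        return item

def run_pipelines(items, pipelines):
    """Push each item through every stage; skip items that get dropped."""
    kept = []
    for item in items:
        try:
            for stage in pipelines:
                item = stage.process_item(item)
        except DropItem:
            continue
        kept.append(item)
    return kept

items = [{"name": " widget ", "price": "9.99"},
         {"name": "gadget", "price": ""}]
print(run_pipelines(items, [StripWhitespace(), RequirePrice()]))
# → [{'name': 'widget', 'price': '9.99'}]
```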
**Configuration & Settings**

Centralized management of all configurable parameters for a Scrapy project. It lets users customize the behavior of every component through a hierarchical settings system.

Related Classes/Methods:

- `scrapy.settings.Settings` (full file reference)
- `scrapy.settings.BaseSettings` (full file reference)
- `scrapy.utils.conf` (full file reference)
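"Hierarchical" here means each value carries a priority, and a write only takes effect if its priority is at least as high as the stored one. A sketch loosely modeled on `BaseSettings` (the priority names below are a subset of Scrapy's; treat the exact numbers as illustrative):

```python
# Illustrative priority levels, lowest to highest.
PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}

class Settings:
    """Store (value, priority) pairs; lower-priority writes are ignored."""
    def __init__(self):
        self._values = {}   # name -> (value, priority)

    def set(self, name, value, priority="project"):
        pri = PRIORITIES[priority]
        if name not in self._values or pri >= self._values[name][1]:
            self._values[name] = (value, pri)

    def get(self, name, default=None):
        return self._values[name][0] if name in self._values else default

s = Settings()
s.set("CONCURRENT_REQUESTS", 16, "default")
s.set("CONCURRENT_REQUESTS", 32, "project")   # overrides the default
s.set("CONCURRENT_REQUESTS", 8, "default")    # ignored: lower priority
print(s.get("CONCURRENT_REQUESTS"))  # 32
```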
**Extensibility & Signals**

Provides a flexible system for extending Scrapy's core functionality and enabling communication between components through a robust signal-dispatching mechanism. Extensions can hook into every stage of the crawling process.

Related Classes/Methods:

- `scrapy.extension.ExtensionManager` (18:23)
- `scrapy.signalmanager.SignalManager` (12:106)
- `scrapy.signals` (full file reference)
- `scrapy.extensions.telnet.TelnetConsole` (41:118)
- `scrapy.extensions.memusage.MemoryUsage` (37:163)
- `scrapy.extensions.httpcache.DummyPolicy` (38:59)
- `scrapy.extensions.httpcache.RFC2616Policy` (62:247)
- `scrapy.extensions.httpcache.DbmCacheStorage` (250:310)
- `scrapy.extensions.httpcache.FilesystemCacheStorage` (313:392)
- `scrapy.extensions.postprocessing.PostProcessingManager` (118:166)
- `scrapy.extensions.spiderstate.SpiderState` (18:51)
- `scrapy.extensions.throttle.AutoThrottle` (21:129)
- `scrapy.extensions.periodic_log.PeriodicLog` (30:163)
- `scrapy.extensions.debug.StackTraceDump` (33:68)
- `scrapy.extensions.logstats.LogStats` (26:103)
- `scrapy.extensions.statsmailer.StatsMailer` (25:48)
- `scrapy.extensions.corestats.CoreStats` (20:60)
- `scrapy.extensions.closespider.CloseSpider` (36:153)
- `scrapy.extensions.feedexport.FeedExporter` (450:744)
- `scrapy.extensions.feedexport.S3FeedStorage` (205:275)
- `scrapy.extensions.feedexport.GCSFeedStorage` (278:323)
- `scrapy.extensions.feedexport.FTPFeedStorage` (326:369)
- `scrapy.extensions.memdebug.MemoryDebugger` (24:47)
- `scrapy.utils.signal` (full file reference)
- `scrapy.utils.log` (full file reference)
- `scrapy.mail.MailSender` (55:245)
- `scrapy.logformatter.LogFormatter` (37:202)
- `scrapy.statscollectors.StatsCollector` (22:68)
- `scrapy.statscollectors.MemoryStatsCollector` (71:77)
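The signal mechanism boils down to connect-and-send: components register callbacks for named signals, and a send call fans out to every receiver. A minimal sketch of that pattern (Scrapy's real `SignalManager` builds on pydispatcher, uses signal objects rather than strings, and supports asynchronous receivers):

```python
from collections import defaultdict

class SignalManager:
    """Toy dispatcher: map signal names to lists of receiver callables."""
    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver registered for this signal, collecting results.
        return [receiver(**kwargs) for receiver in self._receivers[signal]]

signals = SignalManager()
log = []
signals.connect(lambda spider: log.append(f"opened {spider}"), "spider_opened")
signals.connect(lambda spider: log.append(f"closed {spider}"), "spider_closed")
signals.send("spider_opened", spider="quotes")
signals.send("spider_closed", spider="quotes")
print(log)  # ['opened quotes', 'closed quotes']
```

In real Scrapy an extension does the equivalent in its `from_crawler` classmethod, e.g. connecting a handler to `scrapy.signals.spider_opened` via `crawler.signals.connect`.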