docs/fetching/dynamic.md (80 additions, 79 deletions)
@@ -77,7 +77,7 @@ Scrapling provides many options with this fetcher and its session classes. To ma
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
@@ -89,12 +89,12 @@ Scrapling provides many options with this fetcher and its session classes. To ma
| additional_args | Additional arguments to be passed to Playwright's context as extra settings; they take higher priority than Scrapling's settings. | ✔️ |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
| block_ads | Block requests to ~3,500 known ad/tracking domains. Can be combined with `blocked_domains`. | ✔️ |
| dns_over_https | Route DNS queries through Cloudflare's DNS-over-HTTPS to prevent DNS leaks when using proxies. | ✔️ |
| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
| retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
| capture_xhr | Pass a regex URL pattern string to capture XHR/fetch requests matching it during page load. Captured responses are available via `response.captured_xhr`. Defaults to `None` (disabled). | ✔️ |
| executable_path | Absolute path to a custom browser executable to use instead of the bundled Chromium. Useful for non-standard installations or custom browser builds. | ✔️ |
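The subdomain matching that `blocked_domains` describes can be sketched in plain Python. This is illustrative only — `is_blocked` is a hypothetical helper, not Scrapling's internal implementation:

```python
from urllib.parse import urlparse

def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Sketch: a hostname is blocked if it equals a blocked domain
    or is a subdomain of one (suffix match on '.' + domain)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blocked_domains)

blocked = {"example.com"}
print(is_blocked("https://example.com/page", blocked))   # True
print(is_blocked("https://sub.example.com/x", blocked))  # True
print(is_blocked("https://notexample.com/", blocked))    # False: not a subdomain
```

Note the leading-dot check: it is what keeps `notexample.com` from matching `example.com` while still catching every depth of subdomain.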
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `page_setup`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
@@ -107,6 +107,65 @@ In session classes, all these arguments can be set globally for the session. Sti
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which matches what standard browsers report in their latest versions.
## Session Management

To keep the browser open while you make multiple requests with the same configuration, use the `DynamicSession`/`AsyncDynamicSession` classes. These classes accept all the arguments that the `fetch` function can take, which enables you to specify a config for the entire session.

```python
from scrapling.fetchers import DynamicSession

# Create a session with default configuration
with DynamicSession(
    headless=True,
    disable_resources=True,
    real_chrome=True
) as session:
    # Make multiple requests with the same browser instance
    page1 = session.fetch('https://example1.com')
    page2 = session.fetch('https://example2.com')
    page3 = session.fetch('https://dynamic-site.com')

# All requests reuse the same tab on the same browser instance
```
### Async Session Usage

```python
import asyncio

from scrapling.fetchers import AsyncDynamicSession


async def scrape_multiple_sites():
    async with AsyncDynamicSession(
        network_idle=True,
        timeout=30000,
        max_pages=3
    ) as session:
        # Make async requests with shared browser configuration
        pages = await asyncio.gather(
            session.fetch('https://spa-app1.com'),
            session.fetch('https://spa-app2.com'),
            session.fetch('https://dynamic-content.com')
        )
        return pages
```
You may have noticed the `max_pages` argument. This new argument enables the fetcher to create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages/tabs that can be open at once. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is lower than the maximum allowed, then:

1. If you are within the allowed range, the fetcher creates a new tab for you, and everything proceeds as normal.
2. Otherwise, it keeps checking every sub-second whether creating a new tab is allowed, for up to 60 seconds, then raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

This logic allows multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources but, most importantly, is very fast :)

In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
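The pool's admission logic described above can be sketched roughly as follows. The names (`wait_for_free_slot`, the `tab` dicts) are hypothetical placeholders for illustration, not Scrapling's internals:

```python
import time

def wait_for_free_slot(open_tabs: list, max_pages: int,
                       timeout: float = 60.0, poll: float = 0.1) -> None:
    """Sketch: prune finished tabs, then wait until a new tab is allowed."""
    deadline = time.monotonic() + timeout
    while True:
        # Close and drop all tabs that have finished their task
        open_tabs[:] = [tab for tab in open_tabs if not tab["finished"]]
        if len(open_tabs) < max_pages:
            return  # within the allowed range: a new tab can be created
        if time.monotonic() >= deadline:
            # e.g., the target website became unresponsive
            raise TimeoutError("no tab slot freed up within the time limit")
        time.sleep(poll)  # re-check every sub-second

tabs = [{"finished": True}, {"finished": False}]
wait_for_free_slot(tabs, max_pages=2)  # prunes the finished tab and returns
```

The polling loop is the fallback path from step 2: it only spins when every allowed slot is occupied by a still-running tab.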
### Session Benefits

- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
- **Cookie persistence**: Automatic cookie and session state handling, as any browser does.
- **Consistent fingerprint**: Same browser fingerprint across all requests.
- **Memory efficiency**: Better resource usage compared to launching a new browser with each fetch.
## Examples
It's easier to understand with examples, so let's take a look.
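Before the browser examples, here is what the URL filtering behind `capture_xhr` amounts to: a regex search over each XHR/fetch request's URL during page load. This standalone sketch uses made-up URLs and a hypothetical pattern:

```python
import re

# Hypothetical pattern: capture only requests under /api/
xhr_pattern = re.compile(r"/api/")

urls_seen = [
    "https://site.com/api/items?page=1",  # XHR: matches
    "https://site.com/static/app.js",     # static asset: ignored
    "https://site.com/api/user/42",       # XHR: matches
]
# Roughly the filtering the fetcher applies while the page loads
captured = [u for u in urls_seen if xhr_pattern.search(u)]
print(captured)  # ['https://site.com/api/items?page=1', 'https://site.com/api/user/42']
```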
### Proxy Rotation

```python
from scrapling.fetchers import DynamicSession, ProxyRotator

# Set up proxy rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])

# Use with session - rotates proxy automatically with each request
with DynamicSession(proxy_rotator=rotator, headless=True) as session:
    page1 = session.fetch('https://example1.com')
    page2 = session.fetch('https://example2.com')
```
!!! warning

    Remember that by default, all browser-based fetchers and sessions use a persistent browser context with a pool of tabs. However, since browsers can't set a proxy per tab, when you use a `ProxyRotator`, the fetcher will automatically open a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed.