fix: hash request kwargs and headers correctly #255
Conversation
I did a quick local check against this patch and the behavior matches the PR description:

```python
from scrapling.spiders.request import Request

assert Request("https://example.com", timeout=1).update_fingerprint(include_kwargs=True) != Request("https://example.com", timeout=2).update_fingerprint(include_kwargs=True)
assert Request("https://example.com", headers={"X-Test": "A"}).update_fingerprint(include_headers=True) != Request("https://example.com", headers={"X-Test": "a"}).update_fingerprint(include_headers=True)
```

One small suggestion: it would be worth adding these as regression tests.
Hey @samrusani, I went ahead and added tests. Thanks for the suggestions!
```python
kwargs = (key.lower() for key in self._session_kwargs.keys() if key.lower() not in ("data", "json"))
data["kwargs"] = "".join(set(_convert_to_bytes(key).hex() for key in kwargs))
filtered_kwargs = {
    key.lower(): str(value)
```
Using str(value) for kwarg values is fragile if someone passes a non-primitive.
Thanks! Just fixed this.
Good catch as always @yetval
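On the `str(value)` point above: one way to serialize arbitrary kwarg values deterministically is to try `json.dumps` with sorted keys and fall back to `repr()` for non-JSON-serializable objects. This is a standalone sketch of that idea, not the code actually merged in this PR; the function names are illustrative.

```python
import hashlib
import json


def _stable_value(value):
    """Serialize a kwarg value deterministically.

    json.dumps with sort_keys handles nested dicts/lists of primitives;
    default=repr covers anything json cannot serialize directly.
    """
    try:
        return json.dumps(value, sort_keys=True, default=repr)
    except (TypeError, ValueError):
        return repr(value)


def fingerprint_kwargs(kwargs):
    # Hash each key *together with* its value, so timeout=1 and
    # timeout=2 produce different fingerprints. Keys are sorted for
    # order-independence; "data"/"json" stay excluded as in the patch.
    h = hashlib.sha256()
    for key in sorted(kwargs):
        if key.lower() in ("data", "json"):
            continue
        h.update(key.lower().encode())
        h.update(_stable_value(kwargs[key]).encode())
    return h.hexdigest()
```

With this shape, a non-primitive value (say, a custom retry-policy object) still contributes a stable token via its `repr`, instead of raising or collapsing to an uninformative string.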
Summary
Request fingerprint collisions: `Request.update_fingerprint()` previously hashed only kwarg names and lowercased header values. Distinct requests could collapse to the same fingerprint when `fp_include_kwargs` or `fp_include_headers` were enabled, which can silently break scheduler deduplication, cache replay, and checkpoint restore. Fix: hash kwarg names together with their values and preserve header values as-is.

Repro
```python
from scrapling.spiders.request import Request

r1 = Request("https://example.com", timeout=1)
r2 = Request("https://example.com", timeout=2)
assert r1.update_fingerprint(include_kwargs=True) != r2.update_fingerprint(include_kwargs=True)
```
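The header side of the fix ("preserve header values as-is") can be sketched in isolation like this. This is an illustrative standalone function, not the PR's actual implementation: it assumes sha256 and follows the usual HTTP convention that header names are case-insensitive while values may be case-sensitive.

```python
import hashlib


def fingerprint_headers(headers):
    # Header *names* are case-insensitive in HTTP, so normalize them to
    # lowercase. Header *values* can be case-sensitive (tokens, etags),
    # so they are hashed exactly as given rather than lowercased.
    h = hashlib.sha256()
    for name in sorted(headers, key=str.lower):
        h.update(name.lower().encode())
        h.update(str(headers[name]).encode())
    return h.hexdigest()
```

Under this scheme `{"X-Test": "A"}` and `{"X-Test": "a"}` fingerprint differently, while `{"x-test": "A"}` and `{"X-Test": "A"}` do not, which matches the behavior the review comment verified.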
Files changed
scrapling/spiders/request.py — fingerprint kwargs/header handling