After calling Dataset.exportToJSON(fileName) we sometimes end up with the same URL twice: once with statusCode: 200 and once with statusCode: 0. I assume that this happens when a request takes very long, almost until the timeout, say 29.5 seconds in the default case. By then the framework has already called requestHandler, and if the pushData call takes longer than 0.5 seconds, the framework will also call failedRequestHandler.
I think this is the corresponding code in the timeout package, where you can see that the requestHandler invocation is part of the 30-second timeout:
Hello @HJK181 and thanks for reporting this! It is indeed possible for Crawlee to behave like this - the request handler does not get interrupted when it reaches a timeout, but it is considered a failure.
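To illustrate the race described in this report, here is a minimal self-contained sketch in TypeScript. It does not use Crawlee: simulateRequest, the delays, and the in-memory dataset array are hypothetical stand-ins, and retries are left out. It only models the one property confirmed above, namely that a handler which hits the timeout is counted as failed but keeps running.

```typescript
type Row = { url: string; statusCode: number };

const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical model of one request: the handler races against a timeout,
// but nothing stops the handler once the timeout has won the race.
async function simulateRequest(
    url: string,
    navigationMs: number,
    pushMs: number,
    timeoutMs: number,
): Promise<Row[]> {
    const dataset: Row[] = [];

    // Stand-in for a slow storage write.
    const pushData = async (row: Row): Promise<void> => {
        await sleep(pushMs);
        dataset.push(row);
    };
    // Stand-in for the user's requestHandler: navigate, then store the result.
    const requestHandler = async (): Promise<void> => {
        await sleep(navigationMs);
        await pushData({ url, statusCode: 200 });
    };
    // Stand-in for the user's failedRequestHandler.
    const failedRequestHandler = async (): Promise<void> => {
        dataset.push({ url, statusCode: 0 });
    };

    const handlerRun = requestHandler();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('requestHandler timed out')), timeoutMs);
    });

    try {
        await Promise.race([handlerRun, timeout]);
    } catch {
        // The timeout won, so the request is treated as failed.
        await failedRequestHandler();
    } finally {
        clearTimeout(timer);
    }

    // The original handler was never interrupted; wait for its late write.
    await handlerRun;
    return dataset;
}
```

With these stand-ins, simulateRequest('https://example.com/slow', 30, 40, 50) should resolve to two rows for the same URL, one with statusCode: 0 and one with statusCode: 200, which matches the duplicate seen in the exported JSON.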
We should probably just block storage accesses from request handlers that timed out.
If you need a quick workaround, you can import tryCancel from @apify/timeout and call it before your pushData call. That will ensure that no data will be pushed after the timeout.
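A rough sketch of the checkpoint idea behind this workaround, written so that it runs without any dependencies. In a real handler the check would be the tryCancel() call from @apify/timeout placed right before pushData, as suggested above. Here it is replaced by an explicit AbortSignal and a hypothetical helper named throwIfTimedOut; crawlOnce, the delays, and the dataset array are illustrative assumptions too, not the library's implementation.

```typescript
type Row = { url: string; statusCode: number };

const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for tryCancel(): throws once the timeout has fired.
function throwIfTimedOut(signal: AbortSignal): void {
    if (signal.aborted) throw new Error('Handler cancelled after timeout');
}

// Handler with a checkpoint right before the storage write.
async function guardedHandler(
    signal: AbortSignal,
    dataset: Row[],
    url: string,
    workMs: number,
): Promise<void> {
    await sleep(workMs); // simulated navigation
    throwIfTimedOut(signal); // skip the write if we already timed out
    dataset.push({ url, statusCode: 200 });
}

async function crawlOnce(url: string, workMs: number, timeoutMs: number): Promise<Row[]> {
    const dataset: Row[] = [];
    const controller = new AbortController();
    const work = guardedHandler(controller.signal, dataset, url, workMs);

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
            controller.abort(); // mark the handler as cancelled
            reject(new Error('requestHandler timed out'));
        }, timeoutMs);
    });

    try {
        await Promise.race([work, timeout]);
    } catch {
        // Stand-in for failedRequestHandler.
        dataset.push({ url, statusCode: 0 });
    } finally {
        clearTimeout(timer);
    }

    // Let the late handler finish; its cancellation error is expected here.
    await work.catch(() => undefined);
    return dataset;
}
```

In this sketch the checkpoint only helps when the timeout has already fired by the time the check is reached. A write that starts just before the deadline and finishes after it would still go through, so the check belongs as close to the write as possible.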
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/core
Issue description
I have no code to reproduce, but I have some suspicion about how this might happen. We crawl a static list of unique URLs with
and the following handlers:
After calling Dataset.exportToJSON(fileName) we sometimes end up with the same URL twice: once with statusCode: 200 and once with statusCode: 0. I assume that this happens when a request takes very long, almost until the timeout, say 29.5 seconds in the default case. By then the framework has already called requestHandler, and if the pushData call takes longer than 0.5 seconds, the framework will also call failedRequestHandler.
I think this is the corresponding code in the timeout package, where you can see that the requestHandler invocation is part of the 30-second timeout:
Code sample
Package version
3.11.5
Node.js version
The one from apify/actor-node-playwright-chrome:20
Operating system
apify/actor-node-playwright-chrome:20 Docker
Apify platform
I have tested this on the next release
No response
Other context
No response