Race condition in requestHandler/failedRequestHandler #2889

Open · HJK181 opened this issue Mar 19, 2025 · 1 comment
Labels: bug, t-tooling

HJK181 commented Mar 19, 2025

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

I have no code to reproduce this, but I have a suspicion about how it might happen. We crawl a static list of unique URLs with

return crawler.run(URLs);

and the following handlers:

async requestHandler({ pushData, request, response }) {
  let statusCode = response.statusCode;
  if (
    response.statusCode === 200 &&
    decodeURI(request.url) !== decodeURI(request.loadedUrl)
  ) {
    log.debug(
      `Request URL: ${request.url} doesn't match loaded URL: ${request.loadedUrl}, treating as 302...`,
    );
    statusCode = 302;
  }

  await pushData({
    url: request.url,
    statusCode,
  });
},
async failedRequestHandler({ pushData, request, response }) {
  log.error(`Request for URL "${request.url}" failed.`);
  await pushData({
    url: request.url,
    statusCode: response?.statusCode ?? 0,
  });
},
async errorHandler(_req, { message }) {
  log.error(`Request failed with ${message}`);
},

After calling Dataset.exportToJSON(fileName) we sometimes end up with the same URL twice: once with statusCode: 200 and once with statusCode: 0. I assume this happens when a request takes very long, almost until the timeout, say 29.5 seconds with the default 30-second limit. The framework then calls requestHandler, and if pushData takes longer than the remaining 0.5 seconds, the timeout fires and the framework calls failedRequestHandler as well, so both handlers push a record for the same URL.

I think this is the corresponding code in the timeout package, where you can see that the requestHandler invocation runs inside the 30-second timeout:

async function addTimeoutToPromise(handler, timeoutMillis, errorMessage) {
  const context = storage.getStore() ?? {
    cancelTask: new AbortController(),
  };
  let returnValue;
  const wrap = async () => {
    try {
      returnValue = await handler();
    } catch (e) {
      if (!(e instanceof InternalTimeoutError)) {
        throw e;
      }
    }
  };
  return new Promise((resolve, reject) => {
    const timeout = setTimeout(() => {
      // The timeout only aborts the context and rejects the outer promise;
      // wrap() itself keeps running unless the handler calls tryCancel().
      context.cancelTask.abort();
      const error = errorMessage instanceof Error ? errorMessage : new TimeoutError(errorMessage);
      reject(error);
    }, timeoutMillis);
    storage.run(context, () => {
      wrap()
        .then(() => resolve(returnValue))
        .catch(reject)
        .finally(() => clearTimeout(timeout));
    });
  });
}
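
This is easy to reproduce in isolation. Below is a minimal standalone sketch (the script and its names are mine, not from the issue): the outer promise rejects after 100 ms, which is the rejection that marks the request as failed, yet the handler body keeps running and completes its write anyway.

import { addTimeoutToPromise } from '@apify/timeout';

async function slowHandler() {
  // Simulates a request handler that is still busy when the timeout fires.
  await new Promise((resolve) => setTimeout(resolve, 150));
  console.log('handler finished its pushData equivalent after the timeout');
}

addTimeoutToPromise(slowHandler, 100, 'handler timed out').catch((err) => {
  // This rejection fires at 100 ms; in Crawlee it triggers failedRequestHandler.
  console.error(err.message);
});
// About 50 ms later the handler's log line still appears: both paths ran.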

Code sample

Package version

3.11.5

Node.js version

The one from apify/actor-node-playwright-chrome:20

Operating system

apify/actor-node-playwright-chrome:20 Docker

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

HJK181 added the bug label on Mar 19, 2025
The github-actions bot added the t-tooling label on Mar 19, 2025
janbuchar (Contributor) commented

Hello @HJK181 and thanks for reporting this! It is indeed possible for Crawlee to behave like this: the request handler is not interrupted when it reaches its timeout, but the request is still considered a failure.

We should probably just block storage accesses from request handlers that timed out.

If you need a quick workaround, you can import tryCancel from @apify/timeout and call it before your pushData call. That will ensure that no data will be pushed after the timeout.
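
For example, applied to the requestHandler from this issue (a sketch based on the handler fragment above; the import goes at the top of the module):

import { tryCancel } from '@apify/timeout';

async requestHandler({ pushData, request, response }) {
  // ... compute statusCode as in the original handler ...

  // Throws if this handler's timeout has already fired, so no duplicate
  // record is written after failedRequestHandler has taken over.
  tryCancel();
  await pushData({ url: request.url, statusCode });
},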
