feat: add respect_robots_txt_file option #1162
base: master
Conversation
Pull Request Overview
This PR introduces a new boolean flag, respect_robots_txt_file, to automatically skip crawling disallowed URLs based on a site's robots.txt rules. Key changes include the addition of tests for robots.txt handling across multiple crawler implementations, integration of robots.txt checking in the crawling pipeline, and the implementation of a RobotsTxtFile utility.
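For orientation, here is a minimal usage sketch of the new flag; passing it as a crawler constructor argument is an assumption based on the changed files listed below, not something the PR text states explicitly.

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Assumption: respect_robots_txt_file is accepted by the crawler constructor.
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # With the flag enabled, links disallowed by the site's robots.txt
        # should be skipped automatically during enqueueing.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```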
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Summary per file:
File | Description |
---|---|
tests/unit/server_endpoints.py | Added a static ROBOTS_TXT response to simulate a robots.txt file. |
tests/unit/server.py | Introduced a new endpoint to serve robots.txt and updated routing logic. |
tests/unit/crawlers/_playwright/test_playwright_crawler.py | Added tests verifying that the PlaywrightCrawler correctly respects robots.txt. |
tests/unit/crawlers/_parsel/test_parsel_crawler.py | Introduced tests for the ParselCrawler to validate robots.txt respect. |
tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py | Added tests to ensure BeautifulSoupCrawler adheres to robots.txt rules. |
tests/unit/_utils/test_robots.py | New tests for generating, parsing, and validating robots.txt file behavior. |
src/crawlee/crawlers/_playwright/_playwright_crawler.py | Integrated robots.txt enforcement in the link extraction logic. |
src/crawlee/crawlers/_basic/_basic_crawler.py | Updated request adding and session handling to respect robots.txt directives. |
src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Added robots.txt checking in link extraction for HTTP-based crawling. |
src/crawlee/_utils/robots.py | Implemented the RobotsTxtFile class for parsing and handling robots.txt data. |
pyproject.toml | Added dependency for protego to support robots.txt parsing. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@@ -40,6 +40,7 @@ dependencies = [
    "eval-type-backport>=0.2.0",
    "httpx[brotli,http2,zstd]>=0.27.0",
    "more-itertools>=10.2.0",
+   "protego>=0.4.0",
It's fun to see another scrapy project here, but I guess that it guarantees some stability, so... all good.
Yes, I was planning to use RobotFileParser, but it doesn't support Google's specification. 😞
src/crawlee/_utils/robots.py
Outdated
self._robots = robots
self._original_url = URL(url).origin()

@staticmethod
I'd prefer using @classmethod and the Self return type annotation.
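A minimal sketch of the suggested shape (the from_content name is assumed for illustration; the attributes mirror the outdated snippet above):

```python
from __future__ import annotations

from protego import Protego
from typing_extensions import Self
from yarl import URL


class RobotsTxtFile:
    def __init__(self, url: str, robots: Protego) -> None:
        self._robots = robots
        self._original_url = URL(url).origin()

    @classmethod
    def from_content(cls, url: str, content: str) -> Self:
        # With @classmethod and a Self return type, subclasses inherit a
        # correctly typed constructor helper instead of a hard-coded class.
        return cls(url, Protego.parse(content))
```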
robots_txt_file = await self._get_robots_txt_file_for_url(url)
return not robots_txt_file or robots_txt_file.is_allowed(url)

async def _get_robots_txt_file_for_url(self, url: str) -> RobotsTxtFile | None:
I believe we should use some synchronization mechanism so that we don't fetch the same robots.txt file multiple times in parallel.
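One possible shape for that, sketched below (the RobotsTxtFile.find helper and the cache class are illustrative assumptions, not the PR's actual code): cache the parsed file per origin and guard the fetch with a per-origin asyncio.Lock so concurrent requests to the same host trigger only one download.

```python
from __future__ import annotations

import asyncio
from typing import Any

from yarl import URL

from crawlee._utils.robots import RobotsTxtFile


class RobotsTxtCache:
    """Sketch: deduplicate robots.txt fetches per origin."""

    def __init__(self, http_client: Any) -> None:
        self._http_client = http_client
        self._cache: dict[str, RobotsTxtFile | None] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def get_for_url(self, url: str) -> RobotsTxtFile | None:
        origin = str(URL(url).origin())
        # One lock per origin: concurrent callers for the same host wait for
        # the first fetch instead of each downloading robots.txt again.
        lock = self._locks.setdefault(origin, asyncio.Lock())
        async with lock:
            if origin not in self._cache:
                # Assumed helper: fetch and parse <origin>/robots.txt.
                self._cache[origin] = await RobotsTxtFile.find(url, self._http_client)
            return self._cache[origin]
```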
Co-authored-by: Jan Buchar <Teyras@gmail.com>
Nice! I have a few details... And also, could you please write a new guide/example regarding this feature?
src/crawlee/_utils/robots.py
Outdated
from typing import TYPE_CHECKING

from protego import Protego  # type: ignore[import-untyped]
Could we please update the project toml rather than using a type ignore?
src/crawlee/_utils/robots.py
Outdated
"""Create a RobotsTxtFile instance from the given content. | ||
Args: | ||
url: the URL of the robots.txt file |
Could you please use sentences in the arg descriptions? (applies to all occurrences)
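For instance, the outdated docstring above could read as follows (the exact wording and the content parameter name are just an illustration):

```python
"""Create a RobotsTxtFile instance from the given content.

Args:
    url: The URL of the robots.txt file.
    content: The raw string content of the robots.txt file to be parsed.
"""
```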
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Description
- Adds the respect_robots_txt_file option.

Issues
- Implement the respectRobotsTxtFile crawler option #1144

Testing
- Tests for respect_robots_txt_file functioning in EnqueueLinksFunction for crawlers
- Tests for RobotsTxtFile
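For reference, a rough sketch of exercising the RobotsTxtFile utility directly (is_allowed appears in the diff above; the from_content constructor name and signature are assumptions):

```python
from crawlee._utils.robots import RobotsTxtFile

content = '\n'.join([
    'User-agent: *',
    'Disallow: /private/',
])

# Assumed constructor: parse raw robots.txt text for a given origin.
robots = RobotsTxtFile.from_content('https://example.com/robots.txt', content)

assert robots.is_allowed('https://example.com/')
assert not robots.is_allowed('https://example.com/private/page.html')
```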