Docs Scraper

title	emoji	colorFrom	colorTo	sdk	sdk_version	app_file	pinned	license
Docs Scraper	📙	yellow	red	gradio	5.0.1	app.py	false	mit

Docs Scraper

This project offers a Python script designed to map and scrape all URLs from a specified website using the Firecrawl API. The scraped content is saved into markdown files, which can then be provided to AI systems to offer context. This is particularly beneficial for AI code editors that need to gather comprehensive information from various websites. By analyzing the scraped content, the AI can better understand the structure and details of the information, thereby enhancing its ability to provide accurate code suggestions and improvements.

Types of sites that would be useful to scrape include:

Documentation websites
API reference sites
Technical blogs
Tutorials and guides
Knowledge bases

The scraped content is saved into a markdown file named after the domain of the base URL, making it easy to reference and utilize.

Prerequisites

Python 3.11
Firecrawl API key
Virtual environment (recommended)

Setup

Clone the Repository

Clone this repository to your local machine:

git clone https://github.com/patrickacraig/docs-scraper.git
cd docs-scraper

Create a Virtual Environment

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

Install Dependencies

Install the required Python packages:
```
pip install -r requirements.txt
```

Set Up Environment Variables

Rename the .env.example file to .env and enter your own variables:

FIRECRAWL_API_KEY=your_actual_api_key  # Your unique API key for accessing the Firecrawl API. Obtain it from your Firecrawl account settings.

BASE_URL="https://docs.example.com/"  # The starting point URL of the website you want to scrape. Replace with your target URL.

LIMIT_RATE=True  # Set to "True" to enable rate limiting (10 scrapes per minute), or "False" to disable it.

Rate Limiting

The script is designed to adhere to a rate limit of 10 scrapes per minute in adherence with the Firecrawl API free tier. To disable it, set the LIMIT_RATE environment variable to False in your .env file.

Usage

Run the Script

Execute the script to start mapping and scraping the URLs:
```
python core.py
```
Output

The script will generate a markdown file named after the domain of the base URL (e.g., example.com.md) containing the scraped content.

Alternative Usage: Web UI

Run the Script

Alternatively, you can execute the script to run the web-based interface using Gradio:
```
python app.py
```
This will launch a web interface in your default browser where you can enter the base URL, your Firecrawl API key, and choose whether to enable rate limiting. The output will be displayed directly in the browser.

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request for any improvements or bug fixes.

Contact

For any questions or issues, please open an issue in the repository.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.github/workflows		.github/workflows
images		images
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
core.py		core.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Docs Scraper

Prerequisites

Setup

Rate Limiting

Usage

Alternative Usage: Web UI

License

Contributing

Contact

About

Releases

Packages

Languages

patrickacraig/docs-scraper

Folders and files

Latest commit

History

Repository files navigation

Docs Scraper

Prerequisites

Setup

Rate Limiting

Usage

Alternative Usage: Web UI

License

Contributing

Contact

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages