kairos-company/internal_link_auditor

Link Auditor

Automated internal link auditing tool for websites. Detects broken links, tracking parameters, empty filters, and other URL issues.

Features

  • Broken link detection - Identifies 404 errors and unreachable pages
  • Tracking parameter detection - Flags URLs with UTM, analytics, or custom tracking params
  • Empty filter detection - Finds URLs with empty filter values
  • Useless parameter detection - Flags URLs with unnecessary or malformed parameters
  • Sitemap support - Parse XML sitemaps (including sitemap indexes)
  • Multiple notification channels - Slack and Microsoft Teams webhooks
  • Configurable - Custom selectors, excluded domains, parameter patterns

Installation

# Clone the repository
git clone https://github.com/kairos-company/internal_link_auditor.git
cd internal_link_auditor

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Copy environment file
cp .env.example .env

Configuration

Environment Variables

Edit .env to configure webhook URLs:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
TEAMS_WEBHOOK_URL=https://outlook.office.com/webhook/YOUR/WEBHOOK/URL
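
As a rough sketch of how these variables might be consumed, the snippet below reads a webhook URL from the environment and builds a minimal Slack message from a report's stats. The helper names `get_webhook_url` and `build_slack_payload` are hypothetical and not taken from this project. The `{"text": ...}` body is the basic payload accepted by Slack incoming webhooks. With python-dotenv installed, calling `load_dotenv()` first would copy the values from .env into the environment.

```python
import json
import os
from typing import Optional


def get_webhook_url(name: str) -> Optional[str]:
    """Return the webhook URL stored in the environment, or None if unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or None


def build_slack_payload(domain: str, stats: dict) -> str:
    """Serialize a one-line audit summary as a Slack incoming-webhook JSON body."""
    text = (
        f"Link audit for {domain}: "
        f"{stats['problematic_links_count']} problematic links "
        f"out of {stats['total_links_analyzed']} analyzed"
    )
    return json.dumps({"text": text})


if __name__ == "__main__":
    url = get_webhook_url("SLACK_WEBHOOK_URL")
    body = build_slack_payload(
        "example.com",
        {"problematic_links_count": 3, "total_links_analyzed": 150},
    )
    # Only show what would be sent; the actual POST is left to the tool itself.
    print("webhook configured:", url is not None)
    print(body)
```

Returning None for a missing variable lets the caller skip a channel cleanly instead of posting to an empty URL.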

Config File

Edit config/config.json to customize:

  • timeout - HTTP request timeout in seconds
  • content_selectors - CSS selectors for main content areas
  • excluded_selectors - CSS selectors to exclude (nav, footer, etc.)
  • excluded_domains - Domains to ignore (social media, etc.)
  • tracking_params - Parameters to flag as tracking
  • filter_params_prefixes - Prefixes for filter parameters
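
To illustrate how these keys could fit together, here is a minimal loader sketch that overlays a JSON file on a set of defaults. The default values shown are placeholders chosen for the example, not the values shipped in config/config.json, and `load_config` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Illustrative defaults only; the real values live in config/config.json.
DEFAULT_CONFIG = {
    "timeout": 10,
    "content_selectors": ["main", "article"],
    "excluded_selectors": ["nav", "footer"],
    "excluded_domains": ["facebook.com", "twitter.com"],
    "tracking_params": ["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"],
    "filter_params_prefixes": ["filter_"],
}


def load_config(path: str) -> dict:
    """Overlay the keys found in a JSON file on top of the defaults."""
    config = dict(DEFAULT_CONFIG)
    file = Path(path)
    if file.is_file():
        with file.open(encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config


if __name__ == "__main__":
    # Falls back to the defaults when the file does not exist.
    print(load_config("config/config.json")["timeout"])
```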

Usage

Audit a single URL

python main.py --url https://example.com/page

Audit multiple URLs

python main.py --urls https://example.com/page1 https://example.com/page2

Audit from a file

python main.py --urls-file urls.txt
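
The file is expected to hold one URL per line. A reader for that format might look like the sketch below. Skipping blank lines, skipping lines starting with `#`, and dropping duplicates are assumptions made for the example, not documented behavior of the tool. `read_urls_file` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def read_urls_file(path: str) -> List[str]:
    """Return the URLs in a text file, one per line, in first-seen order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not url or url.startswith("#"):
            continue
        # Assumption: repeated URLs are audited only once.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```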

Audit from sitemap

python main.py --sitemap https://example.com/sitemap.xml

Filter sitemap URLs

python main.py --sitemap https://example.com/sitemap.xml --sitemap-filter "/blog/" --sitemap-limit 100
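
The sitemap options can be pictured as three steps: extract the `<loc>` entries, keep those matching the pattern, then truncate to the limit. The sketch below uses only the standard library. `extract_locs` and `select_urls` are hypothetical names, and using `re.search` (a match anywhere in the URL) is an assumption about how the filter is applied.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> Tuple[str, List[str]]:
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    child = "sm:sitemap" if kind == "index" else "sm:url"
    locs = [
        loc.text.strip()
        for loc in root.findall(f"{child}/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return kind, locs


def select_urls(
    urls: List[str], pattern: Optional[str] = None, limit: Optional[int] = None
) -> List[str]:
    """Apply the regex filter first, then cap the result at `limit` URLs."""
    if pattern:
        regex = re.compile(pattern)
        urls = [u for u in urls if regex.search(u)]
    return urls[:limit] if limit is not None else list(urls)
```

For a sitemap index, each returned location is itself a sitemap that would be fetched and passed through `extract_locs` again.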

Send report to Slack

python main.py --url https://example.com/page --slack

Send report to Microsoft Teams

python main.py --url https://example.com/page --teams

Save JSON output

python main.py --url https://example.com/page --output audit.json

Full example

python main.py --sitemap https://example.com/sitemap.xml \
    --sitemap-filter "/products/" \
    --sitemap-limit 50 \
    --slack --teams \
    --output audit_report.json \
    --log-level DEBUG \
    --log-file audit.log

CLI Options

Option            Description
--url             Single URL to analyze
--urls            List of URLs to analyze
--urls-file       Text file containing URLs (one per line)
--sitemap         XML sitemap URL to parse
--sitemap-filter  Regex pattern to filter sitemap URLs
--sitemap-limit   Maximum number of URLs from sitemap
--config          Path to JSON config file
--output          JSON output file path
--slack           Send report to Slack
--teams           Send report to Microsoft Teams
--log-level       Logging level (DEBUG, INFO, WARNING, ERROR)
--log-file        Log file path
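
For readers who want to see how such an option table maps onto `argparse`, the following sketch declares the same flags. It is not the project's actual main.py. The `INFO` default for `--log-level` and the integer type for `--sitemap-limit` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults here are illustrative guesses."""
    parser = argparse.ArgumentParser(prog="main.py", description="Audit internal links")
    parser.add_argument("--url", help="Single URL to analyze")
    parser.add_argument("--urls", nargs="+", help="List of URLs to analyze")
    parser.add_argument("--urls-file", help="Text file containing URLs (one per line)")
    parser.add_argument("--sitemap", help="XML sitemap URL to parse")
    parser.add_argument("--sitemap-filter", help="Regex pattern to filter sitemap URLs")
    parser.add_argument("--sitemap-limit", type=int, help="Maximum number of URLs from sitemap")
    parser.add_argument("--config", help="Path to JSON config file")
    parser.add_argument("--output", help="JSON output file path")
    parser.add_argument("--slack", action="store_true", help="Send report to Slack")
    parser.add_argument("--teams", action="store_true", help="Send report to Microsoft Teams")
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",  # assumed default
        help="Logging level",
    )
    parser.add_argument("--log-file", help="Log file path")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```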

Issue Types

The auditor detects the following issue types:

  1. Tracking Parameters - Links containing tracking parameters (utm_*, fbclid, gclid, etc.)
  2. Broken Links - Links that return a 404 or are unreachable
  3. Empty Filters - Links with filter parameters but empty values
  4. Useless Parameters - Links with unnecessary or malformed parameters
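
The two checks that depend only on the URL itself (tracking parameters and empty filters) can be sketched with `urllib.parse`. The parameter lists, the `filter_` prefix, and the short issue labels below are placeholders for illustration. In the tool, the parameter lists are read from the `tracking_params` and `filter_params_prefixes` settings. Broken-link detection is not shown because it requires an HTTP request.

```python
from typing import List
from urllib.parse import parse_qsl, urlsplit

# Placeholder values; the real lists are configured in config/config.json.
TRACKING_EXACT = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)
FILTER_PREFIXES = ("filter_",)


def classify_url(url: str) -> List[str]:
    """Return the URL-level issues found in a link's query string."""
    issues = []
    # keep_blank_values=True preserves parameters such as "filter_color=".
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if any(
        name in TRACKING_EXACT or name.startswith(TRACKING_PREFIXES)
        for name, _ in params
    ):
        issues.append("tracking_parameters")
    if any(
        name.startswith(FILTER_PREFIXES) and value == ""
        for name, value in params
    ):
        issues.append("empty_filter")
    return issues


if __name__ == "__main__":
    print(classify_url("/products/item?utm_source=blog"))
    print(classify_url("https://example.com/shop?filter_color=&size=m"))
```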

Output Example

{
  "domain_name": "example.com",
  "success": true,
  "data": {
    "problematic_links": [
      {
        "source_page": "https://example.com/blog/article",
        "internal_link": "/products/item?utm_source=blog",
        "full_url": "https://example.com/products/item?utm_source=blog",
        "anchor_text": "Check out this product",
        "context": "In this article we discuss...",
        "issue_type": "Link with tracking parameters (clean URL recommended)",
        "http_status": null,
        "scan_date": "2024-01-15"
      }
    ],
    "stats": {
      "total_links_analyzed": 150,
      "internal_links_count": 150,
      "problematic_links_count": 3,
      "pages_analyzed": 10,
      "pages_success": 10,
      "pages_failed": 0,
      "issues_by_type": {
        "Link with tracking parameters (clean URL recommended)": 2,
        "Broken link (page not found)": 1
      },
      "duration_seconds": 45.23
    }
  },
  "metadata": {
    "source": "link_auditor",
    "version": "1.0.0",
    "timestamp": "2024-01-15T14:30:00"
  }
}
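
A report saved with --output can be post-processed with a few lines of Python. The sketch below groups the problematic links by source page and by issue type, assuming the structure shown in the example above. `summarize_report` is a hypothetical helper, not part of the project.

```python
import json
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Count problematic links per issue type and per source page."""
    links = report["data"]["problematic_links"]
    return {
        "success": report["success"],
        "by_type": dict(Counter(link["issue_type"] for link in links)),
        "by_page": dict(Counter(link["source_page"] for link in links)),
    }


def load_report(path: str) -> dict:
    """Read a JSON report written by the --output option."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `summarize_report(load_report("audit.json"))` would then give a quick view of which pages carry the most issues.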

Project Structure

link-auditor/
|-- main.py              # CLI entry point
|-- config/
|   |-- config.json      # Default configuration
|-- core/
|   |-- __init__.py
|   |-- models.py        # Pydantic models
|   |-- scraper.py       # Link auditor logic
|   |-- sitemap_parser.py
|   |-- formatters/
|       |-- __init__.py
|       |-- json_formatter.py
|       |-- slack_formatter.py
|       |-- teams_formatter.py
|-- .env.example         # Environment template
|-- requirements.txt     # Python dependencies
|-- LICENSE              # MIT License
|-- README.md

Requirements

  • Python 3.8+
  • requests
  • beautifulsoup4
  • pydantic
  • tenacity
  • python-dotenv

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
