Mastering Website Crawling: An In-Depth Look at the Spider Scanning Tool

In the realm of website analysis and security, the Spider Scanning Tool stands as a foundational instrument. Imagine a digital spider meticulously traversing the web, not to spin webs, but to weave a comprehensive map of a website’s resources. This is precisely the function of a spider tool, also known as a web crawler or web spider. Starting with a seed list of URLs, this automated tool embarks on a journey to discover and catalog every nook and cranny of a website, following hyperlinks and uncovering valuable information.

The effectiveness of a spider scanning tool lies in its ability to automatically discover new resources (URLs) within a website. The process begins with the initial “seed” URLs, which define the spider’s starting point. From these seeds, the spider systematically visits each URL, parsing the page’s content to identify all hyperlinks embedded within. Newly discovered hyperlinks are then added to the queue of URLs to be visited, creating a recursive process that continues as long as new resources are unearthed.
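
To make that recursive loop concrete, here is a minimal sketch of the idea in Python (standard library only). It illustrates the seed/queue/visited pattern described above rather than the implementation of any particular tool; the page limit, timeout, and same-host restriction are simplifying assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href/src attribute values found while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(seeds, max_pages=50):
    """Breadth-first crawl: seeds feed a queue, new links feed it back."""
    queue, visited = deque(seeds), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue  # non-HTML responses are handled differently (see below)
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == urlparse(url).netloc:  # stay on the same host
                queue.append(absolute)
    return visited
```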

To initiate and customize your web crawling expedition, the spider scanning tool typically offers a configuration dialog, allowing users to tailor the scan to their specific needs.

How the Spider Scanning Tool Processes Different Response Types

During its exploration, the spider scanning tool encounters various types of server responses. To effectively map a website, it’s crucial to understand how the tool handles each response type:

HTML Processing

HTML, the backbone of most web pages, is analyzed in detail by the spider scanning tool. It dissects specific HTML tags to extract links pointing to new resources. This process includes the following (a simplified extraction sketch follows the list):

  • Base Tag Handling: Ensuring correct resolution of relative URLs using the <base> tag.
  • Link Extraction from Common Tags: Identifying URLs within the href attribute of tags like <a>, <link>, <area>, and <base>.
  • Source Attribute Analysis: Extracting URLs from the src attribute of multimedia and embedding tags such as <applet>, <audio>, <embed>, <iframe>, <input>, <script>, <img>, and <video>.
  • Citation URLs: Recognizing URLs in the cite attribute of the <blockquote> tag.
  • Meta Tag Directives: Processing <meta> tags for redirects (http-equiv for ‘location’, ‘refresh’) and security policies (Content-Security-Policy), as well as application configurations (name for ‘msapplication-config’).
  • Applet Attributes: Analyzing codebase and archive attributes in <applet> tags.
  • Image Attributes: Parsing longdesc, lowsrc, dynsrc, and srcset attributes within <img> tags.
  • Form Action URLs: Handling <form> tags, including both GET and POST methods, and intelligently generating valid field values, encompassing HTML 5.0 input types. It also respects attributes like form, formaction, and formmethod associated with buttons.
  • Comment Analysis (Optional): If configured, the spider can even delve into HTML comments to find valid tags, as specified in the tool’s options.
  • Import Directives: Processing the implementation attribute in <import> tags.
  • Inline String Parsing: Scanning inline text within tags like <p>, <title>, <li>, <h1> to <h6>, and <blockquote> for potential URLs.
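
The list above largely boils down to a mapping from tags to the attributes worth inspecting, plus special handling for the <base> tag and the srcset attribute. The sketch below illustrates that mapping for a subset of the tags; it is a simplified illustration, not the tool’s actual parser, and the chosen subset is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A simplified subset of the tag/attribute pairs described above.
URL_ATTRIBUTES = {
    "a": ["href"], "link": ["href"], "area": ["href"],
    "script": ["src"], "iframe": ["src"], "embed": ["src"],
    "audio": ["src"], "video": ["src"], "input": ["src"],
    "img": ["src", "longdesc", "lowsrc", "dynsrc", "srcset"],
    "blockquote": ["cite"], "applet": ["codebase", "archive"],
    "form": ["action"],
}

class TagLinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url      # may be overridden by a <base> tag
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]   # later relative URLs resolve against this
        for name in URL_ATTRIBUTES.get(tag, []):
            value = attrs.get(name)
            if not value:
                continue
            if name == "srcset":
                # srcset is a comma-separated list of "URL descriptor" candidates
                candidates = [c.strip().split()[0] for c in value.split(",") if c.strip()]
            else:
                candidates = [value]
            for candidate in candidates:
                self.urls.add(urljoin(self.base, candidate))
```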

Robots.txt File Analysis

The robots.txt file, often found at the root of a website, provides instructions to web crawlers about which parts of the site should not be accessed. A sophisticated spider scanning tool can optionally analyze this file to identify potential resources. It’s important to note that while a spider can analyze robots.txt, it might not always obey the directives, especially if configured for comprehensive site mapping or security auditing where bypassing restrictions might be necessary to identify hidden or unprotected areas.
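
As an illustration of how a crawler can mine robots.txt for candidate resources rather than (or in addition to) obeying it, the sketch below collects the paths named in Allow, Disallow, and Sitemap lines. It is a deliberately simplified parser; wildcard rules are skipped and the timeout is an assumption.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def paths_from_robots(site_root):
    """Collect paths mentioned in Allow/Disallow/Sitemap lines of robots.txt."""
    robots_url = urljoin(site_root, "/robots.txt")
    try:
        with urlopen(robots_url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    found = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()        # drop comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value and "*" not in value:
            found.append(urljoin(site_root, value))  # a path the site owner considered worth naming
        elif field == "sitemap" and value:
            found.append(value)
    return found
```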

Sitemap.xml File Processing

sitemap.xml files are designed to help search engines understand the structure of a website. A spider scanning tool can leverage these sitemaps, if configured, to efficiently discover website resources. By parsing the XML structure, the tool can quickly identify and add URLs listed in the sitemap to its crawling queue, potentially speeding up the discovery process and ensuring comprehensive coverage.
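
Parsing a sitemap is mostly a matter of reading its <loc> elements. The sketch below does so with the standard XML library and works for both a <urlset> and a <sitemapindex> document; the feed URL and timeout are assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url):
    """Return every <loc> entry in a sitemap or sitemap index."""
    with urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # Works for both <urlset> and <sitemapindex> documents.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```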

Metadata File Analysis (SVN, Git, .DS_Store)

For development and version control contexts, spider scanning tools can be configured to parse metadata files from SVN (.svn), Git (.git), and even macOS .DS_Store files. These files, if exposed on a web server, can inadvertently reveal sensitive information about the website’s structure, internal organization, and potentially even security vulnerabilities. Analyzing these files can be a valuable aspect of a comprehensive security assessment.
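
A full analysis parses the contents of these files to recover internal paths; the simpler sketch below only probes whether the well-known metadata locations are reachable at all, which is a common first step in such an assessment. The path list is an illustrative assumption.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Well-known metadata paths that should normally not be reachable over HTTP.
METADATA_PATHS = [".git/HEAD", ".git/config", ".svn/entries", ".svn/wc.db", ".DS_Store"]

def find_exposed_metadata(site_root):
    """Report which version-control / OS metadata files answer with HTTP 200."""
    exposed = []
    for path in METADATA_PATHS:
        url = urljoin(site_root, path)
        try:
            with urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    exposed.append(url)
        except OSError:      # 404s, timeouts, connection errors
            continue
    return exposed
```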

OData Atom Format Support

Websites utilizing the OData protocol with Atom format for data exchange are also within the scope of a capable spider scanning tool. The tool can parse OData Atom feeds, extracting both relative and absolute links to further explore the data exposed through the OData service.
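
A minimal way to pull links out of an OData Atom feed is to read the href attribute of every Atom <link> element and resolve it against the feed URL, as in this sketch (which ignores xml:base and other refinements).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def links_from_odata_atom(feed_url):
    """Extract href values from <link> elements of an Atom-format feed."""
    with urlopen(feed_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    hrefs = set()
    for link in root.iter(ATOM_NS + "link"):
        href = link.get("href")
        if href:
            hrefs.add(urljoin(feed_url, href))   # relative links resolve against the feed URL
    return hrefs
```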

SVG File Analysis

Scalable Vector Graphics (SVG) files, commonly used for images and icons on the web, can also contain hyperlinks. A spider scanning tool can parse SVG files to identify HREF attributes within SVG elements, extracting and resolving any links embedded in these vector graphics.
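
Both the modern href attribute and the legacy xlink:href form appear in SVG documents. A simple extraction pass might look like the sketch below; fragment-only references are skipped, and the function is assumed to receive the raw response bytes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def links_from_svg(svg_bytes, svg_url):
    """Return absolute URLs from href / xlink:href attributes in an SVG document."""
    root = ET.fromstring(svg_bytes)
    urls = set()
    for element in root.iter():
        href = element.get("href") or element.get(XLINK_HREF)
        if href and not href.startswith("#"):   # skip in-document fragment references
            urls.add(urljoin(svg_url, href))
    return urls
```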

Non-HTML Text Response Handling

When encountering text responses that are not HTML, a basic spider scanning tool can still be configured to scan for URL patterns within the text content. This allows for the discovery of resources even in plain text files or responses that are not structured as HTML documents.
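
For non-HTML text (JSON, JavaScript, plain text, logs), a crude but effective approach is a regular-expression scan for absolute URLs, sketched here; real tools use a more careful URL grammar than this pattern.

```python
import re

# A deliberately simple pattern; production crawlers use a stricter URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def urls_from_plain_text(text):
    """Scan a non-HTML text response for absolute URLs."""
    return sorted(set(URL_PATTERN.findall(text)))

# Example:
# urls_from_plain_text('{"next": "https://example.com/api/items?page=2"}')
# -> ['https://example.com/api/items?page=2']
```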

Non-Text Response Limitation

Currently, many spider scanning tools are primarily designed to process text-based responses and may not be equipped to handle non-textual resources directly. This means binary files, images (unless links are extracted from HTML or SVG), or other non-textual content might not be processed for further link discovery by the spider itself, although they would be noted as resources of the website.

Further Considerations for Effective Spider Scanning

  • URL Handling Configuration: The behavior of the spider scanning tool when checking if a URL has already been visited is configurable. Options typically include how URL parameters are treated – whether to ignore them, treat them as distinct URLs, or use specific rules. This is crucial for managing session IDs and tracking parameters.
  • Parameter Ignoring: To optimize crawling efficiency and avoid redundant visits, spider scanning tools often ignore common session-related or tracking parameters like jsessionid, phpsessid, aspsessionid, and utm_* when determining if a URL has been previously processed. A canonicalization sketch follows this list.
  • Cookie Handling: The way a spider scanning tool manages cookies depends on its configuration and how it’s initiated. Options usually range from completely ignoring cookies to storing and sending them like a regular web browser. Proper cookie handling is essential for crawling websites that rely on sessions or authentication.
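
As an illustration of the parameter-ignoring behavior, the sketch below normalizes a URL so that session and tracking variants map to a single “visited” key. The ignored-parameter list is illustrative, not any tool’s actual default.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that typically do not change page content -- an illustrative list.
IGNORED_PARAMS = {"jsessionid", "phpsessid", "aspsessionid"}

def canonical_url(url):
    """Normalize a URL so session/tracking variants map to one 'visited' key."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS and not name.lower().startswith("utm_")
    ]
    # Also drop ";jsessionid=..." path parameters used by some Java servers.
    path = parts.path.split(";jsessionid=", 1)[0]
    return urlunsplit(parts._replace(path=path, query=urlencode(kept)))
```

With this normalization, https://example.com/a;jsessionid=XYZ?id=1&utm_source=mail and https://example.com/a?id=1 are treated as the same resource.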

The configuration and behavior of the spider scanning tool are typically managed through an options screen, allowing users to fine-tune the crawling process to their specific requirements.

See Also

Spider Options screen for an overview of the Spider Options

Official Videos

ZAP In Ten: Explore Your Applications (10:36)
ZAP Deep Dive: Exploring Applications: Standard Spider (34:35)
