The crawl phase of a scan involves navigating around the application, following links, submitting forms, and logging in where necessary, to catalog the content of the application and the navigational paths within it. This seemingly simple task presents a variety of challenges that Burp's crawler is able to meet, to create an accurate map of the application.
By default, Burp's crawler navigates around a target application like a user with a browser, clicking links and submitting input where possible. It constructs a map of the application's content and functionality in the form of a directed graph, representing the different locations in the application and the links between those locations.
The crawler makes no assumptions about the URL structure used by the application. Locations are identified (and re-identified later) based on their contents, not the URL that was used to reach them. This enables the crawler to reliably handle modern applications that place ephemeral data, such as CSRF tokens or cache busters, into URLs. Even if the entire URL within each link changes on every occasion, the crawler still constructs an accurate map.
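The idea of identifying a location by its contents rather than its URL can be illustrated with a minimal sketch. This is not Burp's actual algorithm, which is not described here. The choice of features (title, top-level heading, link text, visible input names) and the names `location_fingerprint` and `_StructureParser` are assumptions made purely for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _StructureParser(HTMLParser):
    """Collects title, h1 and link text plus the names of non-hidden inputs.

    Link URLs and hidden field values are deliberately ignored, because
    that is where ephemeral data such as CSRF tokens tends to appear.
    """

    def __init__(self):
        super().__init__()
        self.features = []
        self._capture = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1", "a"):
            self._capture = tag
        elif tag == "input" and attrs.get("type") != "hidden":
            self.features.append(("input", attrs.get("name", "")))

    def handle_data(self, data):
        text = data.strip()
        if self._capture and text:
            self.features.append((self._capture, text))

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None


def location_fingerprint(html: str) -> str:
    """Return a hash of the page's stable features (hypothetical helper)."""
    parser = _StructureParser()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(repr(parser.features).encode("utf-8")).hexdigest()
```

With this sketch, two copies of a page whose links differ only in a token value produce the same fingerprint, while a page with a different title produces a different one.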
The approach also allows the crawler to handle applications that use the same URL to reach different locations based on the state of the application or the user's interaction with it.
As the crawler navigates around and builds up coverage of the target application, it tracks the edges in the graph that have not been completed. These represent the links (or other navigational transitions) that have been observed within the application but not yet visited. However, the crawler never "jumps" to a pending link and visits it out of context. Instead, it either navigates directly from its current location, or reverts to the start location and navigates from there. This replicates as closely as possible the actions of a normal user browsing the website.
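The "never jump out of context" rule can be sketched as a route-planning step over the graph. The graph representation (a dict mapping each location to its outgoing links) and the names `find_path` and `plan_route` are assumptions for illustration, not Burp's internals.

```python
from collections import deque


def find_path(graph, origin, target):
    """Breadth-first search: list of link labels from origin to target, or None."""
    queue = deque([(origin, [])])
    seen = {origin}
    while queue:
        location, links = queue.popleft()
        if location == target:
            return links
        for label, nxt in graph.get(location, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, links + [label]))
    return None


def plan_route(graph, current, start, pending_source):
    """Choose how to reach the location that holds a pending link.

    Prefer walking there from the current location. If that is not
    possible, revert to the start location and walk from there.
    """
    direct = find_path(graph, current, pending_source)
    if direct is not None:
        return ("from_current", direct)
    restart = find_path(graph, start, pending_source)
    if restart is not None:
        return ("from_start", restart)
    return None
```

In both cases the pending link is only ever followed after a sequence of clicks that a real user could have performed.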
Crawling in a way that makes no assumptions about URL structure is highly effective in dealing with modern web applications, but can potentially lead to problems in seeing "too much" content. Modern websites often contain a mass of superfluous navigational paths (via page footers, burger menus, etc.), meaning that everything is directly linked to everything else. Burp's crawler employs a variety of techniques to address this issue: it builds up fingerprints of links to already visited locations to avoid visiting them redundantly; it crawls in a breadth-first order that prioritizes discovery of new content; and it has configurable cutoffs that constrain the extent of the crawl. These measures also help to deal correctly with "infinite" applications, such as calendars.
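A breadth-first crawl with cutoffs can be sketched as follows. The parameter names `max_depth` and `max_locations` are hypothetical stand-ins for the configurable cutoffs mentioned above, and the `visited` set stands in for the fingerprints of already visited locations. This is a simplified illustration, not Burp's implementation.

```python
from collections import deque


def crawl(start, get_links, max_depth=3, max_locations=50):
    """Visit locations breadth-first and return them in discovery order.

    get_links(location) returns the locations linked from that location.
    """
    visited = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        location, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cutoff: do not expand this location further
        for target in get_links(location):
            if target in visited:
                continue  # already seen, e.g. reached again via a footer link
            if len(visited) >= max_locations:
                return order  # size cutoff reached
            visited.add(target)
            order.append(target)
            queue.append((target, depth + 1))
    return order
```

For an "infinite" calendar in which every page links to the next day and back to the first page, for example `crawl(0, lambda day: [day + 1, 0])`, the depth cutoff stops the crawl after a bounded number of pages.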
As Burp's crawler navigates around a target application like a user, it is able to automatically deal with practically any session-handling mechanism that modern browsers can. There is no need to record macros or configure session-handling rules telling Burp how to obtain a session or verify that the current session is valid.
The crawler employs multiple crawler "agents" to parallelize its work. Each agent represents a distinct user of the application navigating around with their own browser. Each agent has its own cookie jar, which is updated when the application issues it with a cookie. When an agent returns to the start location to begin crawling from there, its cookie jar is cleared, to simulate a completely fresh browser session.
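The per-agent cookie handling described above can be modeled with a very small sketch. The class name `CrawlerAgent`, its methods, and the use of a plain dict as the cookie jar are all assumptions for illustration.

```python
class CrawlerAgent:
    """One simulated user with a private cookie jar (here a plain dict)."""

    def __init__(self, name, start_location="start"):
        self.name = name
        self.start_location = start_location
        self.location = start_location
        self.cookies = {}

    def receive_cookies(self, issued):
        """Store cookies that the application issued in a response."""
        self.cookies.update(issued)

    def return_to_start(self):
        """Simulate a completely fresh browser session at the start location."""
        self.cookies.clear()
        self.location = self.start_location
```

Because each agent owns its jar, a session issued to one agent never leaks into another agent's requests.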
The requests that the crawler makes as it navigates around are constructed dynamically based on the preceding response, so CSRF tokens in URLs or form fields are handled automatically. This allows the crawler to correctly navigate functions that use complex session handling, with zero configuration by the user.
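Building a request from the preceding response can be illustrated with a short sketch that reads a form's fields, including any hidden token, from the HTML that was just received. The names `FormFieldParser` and `build_form_request` are hypothetical, and the sketch handles only the first form and simple `input` elements.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Records the first form's action and the name/value of each input."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
        elif tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_request(response_html, user_input):
    """Return (action, encoded body) for submitting the form in response_html.

    Field values present in the response (such as a hidden CSRF token) are
    carried over, then overridden by any values supplied in user_input.
    """
    parser = FormFieldParser()
    parser.feed(response_html)
    parser.close()
    fields = dict(parser.fields)
    fields.update(user_input)
    return parser.action, urlencode(fields)
```

Since the token is read from each fresh response, a token that changes on every page load is submitted correctly without any separate configuration.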
Modern web applications are heavily stateful, and it is common for the same application function to return different content on different occasions, as a result of actions that were performed by the user in the meantime. Burp's crawler is able to detect changes in application state that result from actions that it has performed during crawling.
Consider an example in which navigating the path BC causes the application to transition from state 1 to state 2. Link D goes to a logically different location in state 1 versus state 2. So the path AD goes to the empty cart, while ABCD goes to the populated cart. Rather than just concluding that link D is non-deterministic, the crawler is able to identify the state-changing path that link D depends on. This allows the crawler to reliably reach the populated cart location in future, to access the other functions that are available from there.
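The cart example can be reproduced with a toy model. The `ShopApp` class, the helper names, and the strategy of replaying candidate paths in fresh sessions are assumptions chosen for illustration. They show one way a state-changing path could be identified, not how Burp does it.

```python
class ShopApp:
    """Toy application: following link C adds an item to the cart."""

    def __init__(self):
        self.cart_items = 0  # state 1 while zero, state 2 once an item is added

    def follow(self, link):
        if link == "C":
            self.cart_items += 1
            return "item added"
        if link == "D":
            return "populated cart" if self.cart_items else "empty cart"
        return f"page {link}"


def destination_after(path):
    """Replay a path of links in a fresh session and return the final location."""
    app = ShopApp()
    result = None
    for link in path:
        result = app.follow(link)
    return result


def find_state_dependencies(link, baseline_prefix, candidate_prefixes):
    """Find the prefixes after which `link` leads somewhere other than the baseline."""
    baseline = destination_after(baseline_prefix + [link])
    dependencies = {}
    for prefix in candidate_prefixes:
        destination = destination_after(prefix + [link])
        if destination != baseline:
            dependencies[tuple(prefix)] = destination
    return baseline, dependencies
```

Here the path A, B does not change where D leads, but A, B, C does, so the populated cart is recorded as reachable via A, B, C, D rather than link D being written off as random.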
Burp's crawler begins with an unauthenticated phase in which no credentials are submitted. When this is complete, Burp will have discovered any login and self-registration functions within the application.
If the application supports self-registration, Burp will attempt to register a user. You can also configure the crawler to use one or more pre-existing logins.
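The crawler then proceeds to an authenticated phase. It will visit the login function multiple times and submit each available set of credentials in turn, such as those of the self-registered account (if any) and those of each configured pre-existing login.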
For each set of credentials submitted to the login, Burp will then crawl the content that is discovered behind the login. This allows the crawler to capture the different functions that are available to different types of user.
Modern web applications frequently contain volatile content, where the "same" location or function will return responses that differ substantially on different occasions, not necessarily as the result of any action by the user. This behavior can result from factors such as feeds from social media channels, user comments, inline advertising, or genuinely randomized content (message of the day, A/B testing, etc.).
Burp's crawler is able to identify many instances of volatile content, and correctly re-identify the same location on different visits, despite the differing responses. This allows the crawler to focus attention on the "core" elements within a set of application responses, which are likely to be the most important in terms of discovering the key navigational paths to interesting application content and functionality.
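One simple way to picture the "core" of a volatile page is to keep only what two visits have in common and to match later responses against that. The line-based comparison, the 0.8 threshold, and the names `stable_core` and `same_location` are assumptions for this sketch. Burp's own technique is not specified here.

```python
def stable_core(response_a, response_b):
    """Return the non-empty lines of response_a that also occur in response_b."""
    lines_b = {line.strip() for line in response_b.splitlines()}
    core = []
    for line in response_a.splitlines():
        stripped = line.strip()
        if stripped and stripped in lines_b:
            core.append(stripped)
    return core


def same_location(core, response, threshold=0.8):
    """Treat a response as the same location if most of the core is present."""
    if not core:
        return False
    lines = {line.strip() for line in response.splitlines()}
    matched = sum(1 for line in core if line in lines)
    return matched / len(core) >= threshold
```

Volatile parts such as an advertisement or a message of the day drop out of the core, so they no longer affect whether a later visit is recognized as the same location.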
In some cases, visiting a given link on different occasions will return responses that differ too much to be treated as the "same". In this situation, Burp's crawler will capture both versions of the response as two different locations, and will plot a non-deterministic edge in the graph. Provided the extent of non-determinism across the application is not too great, Burp can still crawl the associated content, and reliably find its way to content that is behind the non-deterministic link.