Version: 1.7

RequestQueue

Represents a queue of URLs to crawl.

Can be used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.

Each URL is represented by an instance of the Request class. The queue can only contain unique URLs. More precisely, it can only contain request dictionaries with distinct uniqueKey properties. By default, uniqueKey is generated from the URL, but it can also be overridden. To add a single URL to the queue multiple times, the corresponding request dictionaries must have different uniqueKey properties.

Do not instantiate this class directly; use the Actor.open_request_queue() function instead.

RequestQueue stores its data either on local disk or in the Apify cloud, depending on whether the APIFY_LOCAL_STORAGE_DIR or APIFY_TOKEN environment variables are set.

If the APIFY_LOCAL_STORAGE_DIR environment variable is set, the data is stored in the local directory in the following files:

{APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/{REQUEST_ID}.json

Note that {QUEUE_ID} is the name or ID of the request queue. The default request queue has the ID default, unless you override it by setting the APIFY_DEFAULT_REQUEST_QUEUE_ID environment variable. {REQUEST_ID} is the ID of the request.

If the APIFY_TOKEN environment variable is set but APIFY_LOCAL_STORAGE_DIR is not, the data is stored in the Apify Request Queue cloud storage.
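The storage selection described above can be sketched in plain Python. Note that resolve_storage is a hypothetical helper written for illustration; the SDK performs this decision internally:

```python
import os

def resolve_storage(env: dict) -> str:
    """Illustrative sketch of where RequestQueue data ends up,
    based on the two environment variables described above."""
    if env.get("APIFY_LOCAL_STORAGE_DIR"):
        # Local storage takes precedence; requests are stored as
        # individual JSON files under the queue's directory.
        queue_id = env.get("APIFY_DEFAULT_REQUEST_QUEUE_ID", "default")
        return os.path.join(env["APIFY_LOCAL_STORAGE_DIR"], "request_queues", queue_id)
    if env.get("APIFY_TOKEN"):
        # No local directory configured, but a token is available:
        # data goes to the Apify Request Queue cloud storage.
        return "apify-cloud"
    raise RuntimeError("Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN is set")
```

Passing a plain dict instead of reading os.environ directly keeps the sketch testable; the SDK reads the real environment.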

Index

Methods

add_request

  • async add_request(request, *, forefront, keep_url_fragment, use_extended_unique_key): dict
  • Adds a request to the RequestQueue while managing deduplication and positioning within the queue.

    The deduplication of requests relies on the uniqueKey field within the request dictionary. If uniqueKey exists, it remains unchanged; if it does not, it is generated based on the request's url, method, and payload fields. The generation of uniqueKey can be influenced by the keep_url_fragment and use_extended_unique_key flags, which dictate whether to include the URL fragment and the request's method and payload, respectively, in its computation.

    The request can be added to the forefront (beginning) or the back of the queue based on the forefront parameter. Information about the request's addition to the queue, including whether it was already present or handled, is returned in an output dictionary.

    Returns: A dictionary containing information about the operation, including:

    • requestId (str): The ID of the request.
    • uniqueKey (str): The unique key associated with the request.
    • wasAlreadyPresent (bool): Indicates whether the request was already in the queue.
    • wasAlreadyHandled (bool): Indicates whether the request was already processed.

    Parameters

    • request: dict

      The request object to be added to the queue. Must include at least the url key. Optionally, it can also include the method, payload and uniqueKey keys.

    • optional keyword-only forefront: bool = False

      If True, adds the request to the forefront of the queue; otherwise, adds it to the end.

    • optional keyword-only keep_url_fragment: bool = False

      Determines whether the URL fragment (the part of the URL after '#') should be retained in the unique key computation.

    • optional keyword-only use_extended_unique_key: bool = False

      Determines whether to use an extended unique key, incorporating the request's method and payload into the unique key computation.

    Returns dict
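The effect of the keep_url_fragment and use_extended_unique_key flags can be sketched as follows. This is a simplified model of the deduplication key, not the SDK's exact algorithm (the real hash and formatting may differ):

```python
import hashlib
from urllib.parse import urldefrag

def compute_unique_key(
    url: str,
    method: str = "GET",
    payload: bytes = b"",
    *,
    keep_url_fragment: bool = False,
    use_extended_unique_key: bool = False,
) -> str:
    """Simplified sketch of uniqueKey generation as described above."""
    if not keep_url_fragment:
        # By default the '#fragment' part is dropped, so URLs that
        # differ only in the fragment deduplicate to the same key.
        url, _fragment = urldefrag(url)
    if not use_extended_unique_key:
        return url
    # The extended key mixes in the method and a payload digest, so a
    # GET and a POST to the same URL count as distinct requests.
    payload_hash = hashlib.sha256(payload).hexdigest()[:8]
    return f"{method.upper()}({payload_hash}):{url}"
```

With both flags at their defaults, two requests for `https://example.com/page#a` and `https://example.com/page#b` would collapse into one queue entry.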

drop

  • async drop(): None
  • Remove the request queue either from the Apify cloud storage or from the local directory.


    Returns None

fetch_next_request

  • async fetch_next_request(): dict | None
  • Return the next request in the queue to be processed.

    Once you successfully finish processing the request, you need to call RequestQueue.mark_request_as_handled to mark the request as handled in the queue. If there was some error in processing the request, call RequestQueue.reclaim_request instead, so that the queue will give the request to some other consumer in another call to the fetch_next_request method.

    Note that a None return value does not mean the queue processing has finished; it means there are currently no pending requests. To check whether all requests in the queue were finished, use RequestQueue.is_finished instead.


    Returns dict | None

    dict, optional: The request or None if there are no more pending requests.
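The fetch / mark-as-handled / reclaim lifecycle described above can be modeled with a toy in-memory queue. InMemoryQueue and consume below are illustrative names written for this sketch, not part of the SDK:

```python
import asyncio
from collections import deque

class InMemoryQueue:
    """Toy model of the request lifecycle: pending -> in progress ->
    handled (or back to pending via reclaim). Not the real RequestQueue."""

    def __init__(self, requests):
        self._pending = deque(requests)
        self._in_progress = {}
        self.handled = set()

    async def fetch_next_request(self):
        if not self._pending:
            return None  # no pending requests *right now*
        request = self._pending.popleft()
        self._in_progress[request["uniqueKey"]] = request
        return request

    async def mark_request_as_handled(self, request):
        key = request["uniqueKey"]
        if key not in self._in_progress:
            return None  # the request was not in progress
        del self._in_progress[key]
        self.handled.add(key)
        return {"uniqueKey": key, "wasAlreadyHandled": False}

    async def reclaim_request(self, request, forefront=False):
        key = request["uniqueKey"]
        if key not in self._in_progress:
            return None
        del self._in_progress[key]
        # A reclaimed request becomes pending again, at the head or tail.
        (self._pending.appendleft if forefront else self._pending.append)(request)
        return {"uniqueKey": key, "wasAlreadyHandled": False}

async def consume(queue, process):
    """Typical consumer loop: fetch, process, then mark or reclaim.
    (A real crawler would also cap retries to avoid reclaiming forever.)"""
    while (request := await queue.fetch_next_request()) is not None:
        try:
            process(request)
        except Exception:
            await queue.reclaim_request(request)
        else:
            await queue.mark_request_as_handled(request)
```

The same loop shape applies to the real RequestQueue, with Actor.open_request_queue() supplying the queue instance.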

get_info

  • async get_info(): dict | None
  • Get an object containing general information about the request queue.


    Returns dict | None

    dict, optional: The object returned by calling the GET request queue API endpoint.

get_request

  • async get_request(request_id): dict | None
  • Retrieve a request from the queue.


    Parameters

    • request_id: str

      ID of the request to retrieve.

    Returns dict | None

    dict, optional: The retrieved request, or None if it does not exist.

is_empty

  • async is_empty(): bool
  • Check whether the queue is empty.


    Returns bool

    bool: True if the next call to RequestQueue.fetch_next_request would return None, otherwise False.

is_finished

  • async is_finished(): bool
  • Check whether the queue is finished.

    Due to the nature of distributed storage used by the queue, the function might occasionally return a false negative, but it will never return a false positive.


    Returns bool

    bool: True if all requests were already handled and there are no more left. False otherwise.
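The difference between is_empty and is_finished comes down to in-progress requests. The state model below is an assumption made for illustration, not the SDK's internals:

```python
# A queue's requests are conceptually pending, in progress, or handled.
pending: set = set()
in_progress: set = {"https://example.com/a"}  # fetched, not yet handled

def is_empty() -> bool:
    # True when a fetch right now would return None...
    return not pending

def is_finished() -> bool:
    # ...but the queue is finished only once nothing is in progress either:
    # an in-progress request may still fail and be reclaimed into pending.
    return not pending and not in_progress
```

So a consumer loop that stops on an empty fetch may stop too early in a multi-consumer setup; is_finished is the authoritative check.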

mark_request_as_handled

  • async mark_request_as_handled(request): dict | None
  • Mark a request as handled after successful processing.

    Handled requests will never again be returned by the RequestQueue.fetch_next_request method.


    Parameters

    • request: dict

      The request to mark as handled.

    Returns dict | None

    dict, optional: Information about the queue operation with keys requestId, uniqueKey, wasAlreadyPresent, wasAlreadyHandled. None if the given request was not in progress.

open

  • async open(*, id, name, force_cloud, config): RequestQueue
  • Open a request queue.

    A request queue represents a queue of URLs to crawl, which is stored either on the local filesystem or in the Apify cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.


    Parameters

    • optional keyword-only id: str | None = None

      ID of the request queue to be opened. If neither id nor name is provided, the method returns the default request queue associated with the actor run. If the request queue with the given ID does not exist, it raises an error.

    • optional keyword-only name: str | None = None

      Name of the request queue to be opened. If neither id nor name is provided, the method returns the default request queue associated with the actor run. If the request queue with the given name does not exist, it is created.

    • optional keyword-only force_cloud: bool = False

      If set to True, it will open a request queue on the Apify Platform even when running the actor locally. Defaults to False.

    • optional keyword-only config: Configuration | None = None

      A Configuration instance, uses global configuration if omitted.

    Returns RequestQueue

    RequestQueue: An instance of the RequestQueue class for the given ID or name.

reclaim_request

  • async reclaim_request(request, forefront): dict | None
  • Reclaim a failed request back to the queue.

    The request will be returned for processing later again by another call to RequestQueue.fetch_next_request.


    Parameters

    • request: dict

      The request to return to the queue.

    • optional forefront: bool = False

      Whether to add the request to the head or the end of the queue.

    Returns dict | None

    dict, optional: Information about the queue operation with keys requestId, uniqueKey, wasAlreadyPresent, wasAlreadyHandled. None if the given request was not in progress.