Web Spider Version 4
Learn to use the queue to update our spider application.
Now that we have our generic queue to execute tasks in a limited parallel flow, let’s use it straightaway to refactor our web spider application.
We’re going to use an instance of TaskQueue as a work backlog; every URL that we want to crawl needs to be appended to the queue as a task. The starting URL will be added as the first task, then every other URL discovered during the crawling process will be added as well. The queue will manage all the scheduling for us, making sure that the number of tasks in progress (that is, the number of pages being downloaded or read from the filesystem) at any given time is never greater than the concurrency limit configured for the given TaskQueue instance.
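As a minimal sketch of this idea (assuming the TaskQueue built in the previous lesson takes the concurrency limit in its constructor and exposes a pushTask() method, that the starting URL is passed on the command line, and that the crawling task lives in an illustrative spiderTask module), the backlog could be wired up like this:

```js
import { TaskQueue } from './TaskQueue.js'    // the limited-concurrency queue from the previous lesson
import { spiderTask } from './spiderTask.js'  // illustrative module holding the crawling task

const url = process.argv[2]   // starting URL provided on the command line
const nesting = 1             // illustrative crawl-depth limit
const concurrency = 2         // at most two pages downloaded or read at once

const queue = new TaskQueue(concurrency)

// The starting URL becomes the first task in the backlog; every link found
// while crawling will be pushed onto the same queue by the task itself.
queue.pushTask(done => spiderTask(url, nesting, queue, done))
```

Because every task goes through the same queue instance, the concurrency limit applies uniformly to the initial download and to all the downloads triggered by discovered links.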
Adding queue as a new parameter
We’ve already defined the logic to crawl a given URL inside our spider() function. We can consider this to be our generic crawling task. For clarity, it’s best to rename this function to spiderTask().
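To make the rename concrete, here is a simplified, self-contained outline of what the renamed function might look like. The filename logic is a placeholder, link extraction is omitted, and the helper shape is illustrative rather than the course’s actual code; the point is the new signature, which now also receives the queue so that newly discovered links can be scheduled as tasks:

```js
import fs from 'fs'
import { get } from 'http'

// Illustrative outline of the renamed spider() -> spiderTask().
// nesting: remaining crawl depth; queue: where tasks for found links will be pushed.
export function spiderTask (url, nesting, queue, cb) {
  const filename = new URL(url).hostname + '.html'   // placeholder for the real filename logic
  fs.readFile(filename, 'utf8', (err, fileContent) => {
    if (!err) {
      // The page is already on disk, so this task completes right away
      return cb(null, fileContent)
    }
    // Otherwise download the page, save it, and then signal task completion
    get(url, res => {
      let body = ''
      res.on('data', chunk => { body += chunk })
      res.on('end', () => fs.writeFile(filename, body, writeErr => cb(writeErr, body)))
    }).on('error', cb)
  })
}
```

The callback passed in as cb is the done callback handed over by the queue, so calling it is what tells the TaskQueue that this slot is free and the next pending task can start.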