Configure the Scraping Pipeline
Learn to configure the Broadway pipeline and use the handle_message/3 callback.
Pipeline configuration
We’ll use the processors of Broadway to refactor the logic that checks each website. For this, we have to define :processors in start_link/1 and use handle_message/3:
def start_link(_args) do
  options = [
    name: ScrapingPipeline,
    producer: [
      module: {PageProducer, []},
      # transform/2 (sketched below) turns raw events into Broadway.Message structs
      transformer: {ScrapingPipeline, :transform, []}
    ],
    processors: [
      # up to two concurrent processors, each demanding one message at a time
      default: [max_demand: 1, concurrency: 2]
    ]
  ]

  Broadway.start_link(__MODULE__, options)
end
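Because the producer options reference a :transformer, the pipeline module also needs a transform/2 function that wraps each raw event coming from PageProducer in a %Broadway.Message{} struct. A minimal sketch, assuming the events need no custom acknowledgement:

def transform(event, _options) do
  %Broadway.Message{
    data: event,
    acknowledger: Broadway.NoopAcknowledger.init()
  }
end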
def handle_message(_processor, message, _context) do
  if Scraper.online?(message.data) do
    # To do...
  else
    # mark offline pages as failed so they are discarded
    Broadway.Message.failed(message, "offline")
  end
end
We can discard offline websites using Broadway.Message.failed/2. Successful messages go to the next step, which would do the scraping work.
Define a batcher and batch processor
This is where batchers from Broadway come in handy. To maintain our previous logic, we define a batcher with :batch_size set to 1, and two batch processors in the following way:
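A minimal sketch of what this could look like, assuming a single :default batcher and a hypothetical Scraper.scrape/1 helper that performs the actual scraping. First, in start_link/1, alongside :producer and :processors:

batchers: [
  default: [batch_size: 1, concurrency: 2]
]

Then the batch processor callback:

def handle_batch(_batcher, messages, _batch_info, _context) do
  # with batch_size: 1, each batch contains a single page to scrape
  Enum.each(messages, fn message ->
    Scraper.scrape(message.data)
  end)

  messages
end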