Tutorial

The primary benefit of working with mirai is the ability to write asynchronous code much the same way you already write synchronous code. We’ll illustrate this by writing a simple web scraper, step-by-step, with and without mirai.

Fetching a Page

We begin with the most basic task for any web scraper – fetching a single web page. Rather than directly returning the page’s contents, we’ll return a Success container indicating that our request went through successfully. Similarly, we’ll return an Error container if the request failed.
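
These containers come from the tutorial’s commons module. As a rough sketch, they could be as simple as named tuples that record the URL alongside the outcome (a hypothetical illustration, not the module’s actual definitions):

from collections import namedtuple

# Hypothetical stand-ins for the result containers used throughout this tutorial.
Success = namedtuple("Success", ["url", "response"])
Error   = namedtuple("Error",   ["url", "error"])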

Using a function urlget(), which returns a response if a request succeeds and raises an exception if a request fails, we can start with the following two fetch functions,

from commons import *


def fetch_sync(url):
  try:
    response = urlget(url)
    return Success(url, response)
  except Exception as e:
    return Error(url, e)


def fetch_async(url):
  return (
    Promise
    .call  (urlget, url)
    .map   (lambda response : Success(url, response))
    .handle(lambda error    : Error  (url, error   ))
  )
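
Both functions are called the same way; the difference is that fetch_async returns a Promise immediately, which we can block on later to obtain the same Success or Error value. A minimal usage sketch, assuming mirai’s Promise exposes a blocking get() accessor (check the library’s documentation for the exact method name):

url     = "http://example.com"

result  = fetch_sync(url)      # blocks until the request finishes
promise = fetch_async(url)     # returns a Promise right away
result  = promise.get()        # block until the Promise resolves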

Retrying on Failure

Sometimes a fetch failure is transient; if we simply retry, we may be able to fetch the page. Using recursion, let’s add an optional retries argument to our fetch functions,

from commons import *


def fetch_sync(url, retries=3):
  try:
    response = urlget(url)
    return Success(url, response)
  except Exception as e:
    if retries > 0:
      return fetch_sync(url, retries-1)
    else:
      return Error(url, e)


def fetch_async(url, retries=3):
  return (
    Promise
    .call     (urlget, url)
    .map      (lambda response : Success(url, response))
    .rescue   (lambda error    :
      fetch_async(url, retries-1)
      if   retries > 0
      else Promise.value(Error(url, error))
    )
  )
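
Note that the asynchronous version swapped handle for rescue. Whereas handle recovers from a failure with a plain value, rescue recovers with another Promise, which is exactly what the recursive fetch_async call produces; Promise.value wraps the terminal Error so that both branches of the conditional return a Promise.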

Handling Timeouts

Another common concern is time – if a page takes too long to fetch, we may prefer to count it as a loss rather than wait for it to finish downloading. Let’s construct a new container called Timeout that will indicate that a page took too long to retrieve. We’ll also give fetch a finish_by argument specifying the point in time by which we want the function to return, passing the same deadline along to urlget.

You’ll notice that rather than telling the fetch functions how much time they have, we give them a deadline by which they must finish. This is because relative durations are easily muddled in asynchronous code, where a function may not start running until some time after it is called.
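
For example, to give a fetch five seconds in total, we compute the deadline once and pass that same absolute timestamp to every call, no matter when it actually runs (a hypothetical usage sketch):

import time

deadline = time.time() + 5.0                       # five seconds from now
result   = fetch_sync("http://example.com", deadline)

With a deadline in hand, the updated fetch functions look like this,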

import time
import requests

from commons import *


def fetch_sync(url, finish_by, retries=3):
  remaining = finish_by - time.time()

  if remaining <= 0:
    return Timeout(url, None)

  try:
    response = urlget(url, finish_by)
    return Success(url, response)
  except Exception as e:
    if retries > 0:
      return fetch_sync(url, finish_by, retries-1)
    else:
      if isinstance(e, requests.exceptions.Timeout):
        return Timeout(url, e)
      else:
        return Error(url, e)


def fetch_async(url, finish_by, retries=3):
  remaining = finish_by - time.time()

  if remaining <= 0:
    return Promise.value(Timeout(url, None))

  return (
    Promise
    .call     (urlget, url, finish_by)
    .map      (lambda response : Success(url, response))
    .rescue   (lambda error    :
      fetch_async(url, finish_by, retries-1)
      if   retries > 0
      else Promise.value(Timeout(url, error))
           if   isinstance(error, requests.exceptions.Timeout)
           else Promise.value(Error(url, error))
    )
  )
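
This is where the asynchronous version pays off: we can launch every fetch at once and only then wait for the results. A minimal usage sketch, again assuming mirai’s Promise exposes a blocking get() accessor:

import time

urls     = ["http://example.com", "http://example.org", "http://example.net"]
deadline = time.time() + 10.0

# Launch all fetches concurrently; each call returns a Promise immediately.
promises = [fetch_async(url, deadline) for url in urls]

# Block for each result in turn. The fetches proceed in parallel regardless
# of the order in which we wait on them.
results  = [promise.get() for promise in promises]

for result in results:
  print(result)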

Wrapping Up

We now have a fully functional web scraper, capable of handling timeouts and retrying on failure. To try the scraper out, download the code in the [tutorial folder](https://github.com/duckworthd/mirai/tree/develop/docs/_tutorial) and see for yourself how mirai can make your life easier!