Tutorial

The primary benefit of working with mirai is the ability to write asynchronous code much the same way you already write synchronous code. We’ll illustrate this by writing a simple web scraper, step-by-step, with and without mirai.

Fetching a Page

We begin with the most basic task for any web scraper – fetching a single web page. Rather than directly returning the page’s contents, we’ll return a Success container indicating that our request went through successfully. Similarly, we’ll return an Error container if the request failed.
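
These containers come from the tutorial’s commons module. As a rough sketch, they could be as simple as named tuples that record the URL alongside the outcome (a hypothetical illustration, not the module’s actual definitions):

from collections import namedtuple

# Hypothetical stand-ins for the result containers used throughout this tutorial.
Success = namedtuple("Success", ["url", "response"])
Error   = namedtuple("Error",   ["url", "error"])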

Using a function urlget(), which returns a response if a request succeeds and raises an exception if a request fails, we can start with the following two fetch functions,

from commons import *


def fetch_sync(url):
  try:
    response = urlget(url)
    return Success(url, response)
  except Exception as e:
    return Error(url, e)


def fetch_async(url):
  return (
    Promise
    .call  (urlget, url)
    .map   (lambda response : Success(url, response))
    .handle(lambda error    : Error  (url, error   ))
  )
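
Both functions are called the same way; the difference is that fetch_async returns a Promise immediately, which we can block on later to obtain the same Success or Error value. A minimal usage sketch, assuming mirai’s Promise exposes a blocking get() accessor (check the library’s documentation for the exact method name):

url     = "http://example.com"

result  = fetch_sync(url)      # blocks until the request finishes
promise = fetch_async(url)     # returns a Promise right away
result  = promise.get()        # block until the Promise resolves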

Retrying on Failure

Sometimes a fetch failure is transient; if we simply retry, we may be able to fetch the page. Using recursion, let’s add an optional retries argument to our fetch functions,

from commons import *


def fetch_sync(url, retries=3):
  try:
    response = urlget(url)
    return Success(url, response)
  except Exception as e:
    if retries > 0:
      return fetch_sync(url, retries-1)
    else:
      return Error(url, e)


def fetch_async(url, retries=3):
  return (
    Promise
    .call     (urlget, url)
    .map      (lambda response : Success(url, response))
    .rescue   (lambda error    :
      fetch_async(url, retries-1)
      if   retries > 0
      else Promise.value(Error(url, error))
    )
  )
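
Note that the asynchronous version swapped handle for rescue. Whereas handle recovers from a failure with a plain value, rescue recovers with another Promise, which is exactly what the recursive fetch_async call produces; Promise.value wraps the terminal Error so that both branches of the conditional return a Promise.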

Handling Timeouts

Another common concern is time – if a page takes too long to fetch, we may prefer to count it as a loss rather than wait for it to finish downloading. Let’s construct a new container called Timeout that will indicate that a page took too long to retrieve. We’ll also give fetch a finish_by argument specifying the point in time by which we want the function to return, passing the same deadline along to urlget.

You’ll notice that rather than telling the fetch functions how much time they have, we give them a deadline by which they must finish. This is because relative durations are easily muddled in asynchronous code, where a function may not start running until some time after it is called.
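
For example, to give a fetch five seconds in total, we compute the deadline once and pass that same absolute timestamp to every call, no matter when it actually runs (a hypothetical usage sketch):

import time

deadline = time.time() + 5.0                       # five seconds from now
result   = fetch_sync("http://example.com", deadline)

With a deadline in hand, the updated fetch functions look like this,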

import time
import requests

from commons import *


def fetch_sync(url, finish_by, retries=3):
  remaining = finish_by - time.time()

  if remaining <= 0:
    return Timeout(url, None)

  try:
    response = urlget(url, finish_by)
    return Success(url, response)
  except Exception as e:
    if retries > 0:
      return fetch_sync(url, finish_by, retries-1)
    else:
      if isinstance(e, requests.exceptions.Timeout):
        return Timeout(url, e)
      else:
        return Error(url, e)


def fetch_async(url, finish_by, retries=3):
  remaining = finish_by - time.time()

  if remaining <= 0:
    return Promise.value(Timeout(url, None))

  return (
    Promise
    .call     (urlget, url, finish_by)
    .map      (lambda response : Success(url, response))
    .rescue   (lambda error    :
      fetch_async(url, finish_by, retries-1)
      if   retries > 0
      else Promise.value(Timeout(url, error))
           if   isinstance(error, requests.exceptions.Timeout)
           else Promise.value(Error(url, error))
    )
  )
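
This is where the asynchronous version pays off: we can launch every fetch at once and only then wait for the results. A minimal usage sketch, again assuming mirai’s Promise exposes a blocking get() accessor:

import time

urls     = ["http://example.com", "http://example.org", "http://example.net"]
deadline = time.time() + 10.0

# Launch all fetches concurrently; each call returns a Promise immediately.
promises = [fetch_async(url, deadline) for url in urls]

# Block for each result in turn. The fetches proceed in parallel regardless
# of the order in which we wait on them.
results  = [promise.get() for promise in promises]

for result in results:
  print(result)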

Wrapping Up

We now have a fully functional web scraper, capable of handling timeouts and retrying on failure. To try the scraper out, download the code in the [tutorial folder](https://github.com/duckworthd/mirai/tree/develop/docs/_tutorial) and see for yourself how mirai can make your life easier!