RUMEs

One of epymorph’s main features is the ability to run a model starting from some initial state and proceeding forward through time. This is known as “forward simulation”. The configuration of our model is set before we begin and remains fixed throughout the simulation. (This is in contrast with other features like parameter fitting, where algorithms dynamically adjust the model as it proceeds through time in order to fit model configuration to known data.)

What’s In a RUME?

We already saw a simple forward simulation in Getting Started; let’s revisit that in more detail.

from epymorph.kit import *
from epymorph.adrio import us_tiger


rume = SingleStrataRUME.build(
    ipm=ipm.SIRS(),
    mm=mm.Centroids(),
    scope=StateScope.in_states(["AZ", "CO", "NM", "UT"], year=2020),
    time_frame=TimeFrame.rangex("2020-01-01", "2021-01-01"),
    init=init.SingleLocation(location=0, seed_size=100),
    params={
        # intentionally left empty for now
    },
)

A Runnable Modeling Experiment (RUME) is epymorph’s basic building block. You can think of it as the complete specification for your model, broken up into modular components each with their own specialization.

Single-strata vs. Multi-strata RUMEs

A single-strata RUME is the simplest kind of RUME, as opposed to a multi-strata RUME. Strata are used to subdivide populations into distinct groups, and model differences and interactions between the groups. A stratified model could represent diseases that affect young patients differently from old patients, smokers from non-smokers, or even multi-species systems like those involving humans, birds, and mosquitos. To keep things simple we’ll stick with single-strata models for the time being.

The Intra-population Model (IPM) is our model of disease progression. It determines what stages the disease progresses through and at what rates. If you’re familiar with the concept of compartmental models from epidemiology, the IPM encodes a compartmental model.

The Movement Model (MM) controls how individuals move among the locations present in our model geography. If we’re modeling US States, for instance, we need to know how many people should commute from California to Oregon and from Oregon to California on a given time step. These movement dynamics give rise to the spread of pathogens over distance, as infectious individuals interact with susceptible individuals in other populations. A Movement Model encodes a function (or several) describing these movement vectors. The MM can also break up our simulation day into fractions called tau steps. This allows it to model intra-day movement patterns like commuting back and forth to work. Each fractional day is called a simulation step, tau step, or sometimes tick for short (like the ticking of a clock). Multiply the number of simulation days by the number of tau steps per day and you get the total number of time steps that will be simulated.

The GeoScope determines exactly which locations we would like to model. Each location (or geo node) is modeled as a well-mixed population containing a certain number of individuals. In their most basic form a GeoScope is nothing more than a list of unique identifiers, one for each node. To make it convenient to model real world locations, however, we provide GeoScope implementations that model real world geography. (At the moment, we have implemented US Census Bureau delineations – States, Counties, Tracts, and Block Groups.) This example uses a StateScope to select the states Arizona, Colorado, New Mexico, and Utah. Four states, four locations, four nodes. This geo scope implies a geographic area and a sense of granularity, but know that this is only “true” for epymorph if all of the data obeys that implication.

The TimeFrame describes how many days should be simulated and where those fall in the calendar. Calendar dates by themselves don’t directly influence the disease simulation, but may influence the workings of the movement model or the data which in turn drives the simulation (especially if the data is being fetched automatically from a time-series source, like weather data). If you don’t have a particular calendar date in mind – say you want to model something in the abstract – you can usually safely make one up.

The Initializer is a function that sets the starting conditions of our simulation. The logic involved can be entirely arbitrary, but we have some basic implementations in our initializer library for common use-cases. This SingleLocation initializer does two things: 1. set the initial susceptible population of each location according to a population data value (we’ll talk more about data requirements later), and then 2. convert a fixed number of people in a single place from susceptible to infected. I’ve selected location index 0 (Arizona) and 100 individuals, so Arizona will start out with 100 Infectious and the rest of its population Susceptible. When the initializer is done with its job, we know for all locations and for all of our IPM compartments exactly how many individuals are in each.

Constructing a RUME therefore is a matter of making component selections and wrapping them all up into a single configuration. This modular design gives rise to one of epymorph’s super powers – rapid experimentation through variation. In other words, it’s super easy to take a RUME and swap its IPM while leaving all of the other components as they were. You could try a dozen different IPMs modeling subtly different hypotheses of disease transmission and evaluate which one produces better results. Same goes for the movement model. Or different geographies. We believe this modularity will power innovation in epidemiology.

epymorph includes a library of built-in IPMs, MMs, and Initializers for you to get started with, but you can also create custom implementations to fit your use-case.

RUME Parameters

We’ve brushed past one last important detail. Various epymorph components will require externally provided data attributes – a.k.a., parameters – in order to work.

The IPM we chose is an SIRS model, short for Susceptible, Infectious, Recovered. The final “S” indicates that Recovered individuals can eventually become Susceptible again, perhaps because of waning immunity. The rate at which individuals transition from the Susceptible to the Infectious state is described (in this IPM) by the equation:

\[\Delta_{S \to I} = \frac{\beta S I}{N}\]

\(S\) is the number of Susceptible individuals, \(I\) is the number of Infectious individuals, \(N\) is the total number of individuals. \(\beta\) on the other hand is commonly called the infectivity parameter. It describes (generally) how infectious the disease is. When a Susceptible individual comes into close contact with an Infectious individual, a large \(\beta\)-value suggests there is a high likelihood of transmission. A low \(\beta\)-value suggests transmission is less likely. For this model, “beta” is one of the parameters we need to provide. It wouldn’t be very reusable if the IPM encoded a predetermined value for beta. Different diseases, even different strains, are likely to have unique beta values. And beta may not even be constant – it could vary over time, or from place to place, or both. There are many ways to provide a value, but the first challenge is knowing exactly which values we must provide. If we don’t know, we can ask the RUME.

print(rume.params_description())
ipm::beta (type: float, shape: TxN)
    infectivity

ipm::gamma (type: float, shape: TxN)
    progression from infected to recovered

ipm::xi (type: float, shape: TxN)
    progression from recovered to susceptible

mm::population (type: int, shape: N)
    The total population at each node.

mm::centroid (type: [(longitude, float), (latitude, float)], shape: N)
    The centroids for each node as (longitude, latitude) tuples.

mm::phi (type: float, shape: S, default: 40.0)
    Influences the distance that movers tend to travel.

mm::commuter_proportion (type: float, shape: S, default: 0.1)
    The proportion of the total population which commutes.

init::population (type: int, shape: N)
    The population at each geo node.

The RUME’s params_description method lists all of the data attributes required by our assembled RUME. Notice “ipm::beta” is there right at the top. This tells us that our IPM is asking for a value called beta, it accepts floating point numbers as values, and requires a shape described as “TxN”. I mentioned beta could vary over time and from place to place, and that’s what TxN indicates – time-by-node – beta can be a two-dimensional array of values where the first axis is time (the number of simulation days) and the second axis is geo node (the number of simulated locations). The term shape is borrowed from numpy’s terminology and refers to the dimensions of possibly-multi-dimensional arrays of data. epymorph tries to be flexible with inputs so although this is the maximally-allowed shape, you can also provide compatible shapes which are “smaller”, dimensionally speaking, than TxN. The simplest value we can provide here is a scalar value, a single number, which would then be used for all days and all places for the whole simulation.

Running quickly through the rest of the attributes here, there’s “gamma” and “xi” which are needed by the IPM similar to “beta”.

Then there’s the movement model parameters. “population” is the number of individuals at each node so it takes integer values of shape N. “centroid” is a longitude/latitude coordinate marking the effective center of each location; the movement model uses this to determine the distances between locations. Its type expression is a bit more complicated: it wants tuples of floats where the first value in the tuple is longitude and the second is latitude. It also needs N values. “phi” and “commuter_proportion” are parameters that modulate the movement activity in this model, like a pair of dials you can turn up or down. Notice they are shape S (short for scalar), meaning this value cannot vary in time or location. They also both have default values. If an attribute has a default value you don’t have to override it.

Lastly, there is an initialization parameter, also called “population”. This is interesting because both our movement model and our initializer want to know the starting population of all of our geo nodes. Typically, especially with something like population, we would provide one value that satisifes both requirements. This works as long as the type/shape requirements are compatible. But if we really needed to we could provide separate values. epymorph is flexible, you get to decide how to use it.

Providing Parameter Values

If I try to use this RUME in a simulation now, I’m going to get an error:

try:
    BasicSimulator(rume).run()
except Exception as e:
    print(e)
RUME attribute requirements were not met. See errors:
- Cannot evaluate parameters, there are missing values:
gpm:all::ipm::beta
gpm:all::ipm::gamma
gpm:all::ipm::xi
gpm:all::mm::population
gpm:all::mm::centroid
gpm:all::init::population

I didn’t provide any values! This RUME is partially specified. It’s valid to construct such a RUME, but if we want to use it to run a simulation we’re going to have to fill in the blanks. One way to do that is to re-create our RUME, now that we know more about our required parameters.

rume_full = SingleStrataRUME.build(
    ipm=ipm.SIRS(),
    mm=mm.Centroids(),
    init=init.SingleLocation(location=0, seed_size=100),
    scope=StateScope.in_states(["AZ", "CO", "NM", "UT"], year=2020),
    time_frame=TimeFrame.rangex("2020-01-01", "2021-01-01"),
    params={
        "beta": 0.3,
        "gamma": 1/5,
        "xi": 1/90,
        "population": [
            7_151_502, # Arizona
            5_773_714, # Colorado
            2_117_522, # New Mexico
            3_271_616, # Utah
        ],
        "centroid": us_tiger.InternalPoint(),
    },
)

I’ve chosen to provide scalar values for beta, gamma, and xi.

For population I entered an N-valued list; my scope has four nodes in it, so I gave four values. Furthermore, I have to ensure that the ordering of values matches the ordering of nodes in the geo scope. That can be a bit tricky, but you can always print the node IDs in your scope if you’re not sure.

print(rume_full.scope.node_ids)
['04' '08' '35' '49']

Note that the US-Census-based scopes use FIPS codes (a.k.a., GEOID) even if you construct the scope using names or postal codes as we’ve done in this example. “04” is the FIPS code for Arizona, and so on, so I can check that my population values are in the right order.

I did something different for “centroid” though. Instead of providing hard-coded values, I’ve passed in an object that contains logic for fetching data from an outside source. These dynamic data fetching classes are called ADRIOs, short for Abstract Dynamic Resource Interface Object. This ADRIO fetches a so-called “internal point”: a Census-determined coordinate which is inside each US State and is approximately in the center. It may not be the same as the geometric center, but it’s close enough for our purposes. The advantage of using ADRIOs is that they are smart enough to look at the context of our RUME and use that to produce appropriate values. This one looks at the RUME scope and figures out that it needs to retrieve four internal points for the four states I’ve requested, and ensures they are returned in the correct order. If I changed the geo scope by adding Nevada and reconstructed the RUME, the InternalPoint ADRIO would adapt automatically to fetch five centroids instead. epymorph provides a library of useful ADRIOs. In fact, I could load population numbers from the US Census data by replacing my list with the Population ADRIO that interfaces with ACS5 data.

If a RUME uses ADRIOs for some of its parameter values, the actual data fetching will happen when we use it to run a simulation. Of course that means we have to be connected to the internet at the time, and those data sources have to be available. We may also need API keys or other credentials. In addition, fetching data can take quite a while if there’s a lot to fetch. Some ADRIOs will attempt to cache data to your machine to speed up repeat usage. With the estimate_data method you can ask a RUME to estimate the amount of data it will need to fetch, report whether or not it will be sourced from your machine cache, and approximate the amount of time it will take to download.

rume_full.estimate_data()
ADRIO data usage estimation:
- epymorph.adrio.us_tiger.InternalPoint will be pulled from cache
In total we will:
- Download no additional data
- Write no new data to disk cache

It’s worth noting that these are very rough estimates, and are not available for all ADRIOs at this time.