Geography and Data

As mentioned, geo scopes themselves don’t contain much data about the geography. However, many of the types of data we are interested in are spatial in nature. How many people live and work in Kansas? How much rain fell in Grant County, Indiana every day last year? How many people commute to work from San Francisco County, California to New York County, New York and vice versa? These data are potentially useful for epidemiological modeling, and all have a geographic component. We must understand how epymorph makes use of geographic data in order to correctly provide such data for our simulation workflows.

RUMEs and GeoScopes

In previous chapters we assembled RUMEs, and one of the things we included with each RUME was a GeoScope instance. The geo scope is a critical part of our “simulation context”: the context in which a simulation takes place. Many systems in epymorph refer to the geo scope in order to do their job. One place this shows up is in a RUME’s data attribute requirements.

Let’s revisit our example from the Getting Started chapter and print its requirements:

from epymorph.kit import *
from epymorph.adrio import us_tiger


rume = SingleStrataRUME.build(
    ipm=ipm.SIRS(),
    mm=mm.Centroids(),
    init=init.SingleLocation(location=0, seed_size=100),
    scope=StateScope.in_states(["AZ", "CO", "NM", "UT"], year=2020),
    time_frame=TimeFrame.rangex("2020-01-01", "2021-01-01"),
    params={
        "beta": 0.3,
        "gamma": 1/5,
        "xi": 1/90,
        "population": [
            7_174_064, # Arizona
            5_684_926, # Colorado
            2_097_021, # New Mexico
            3_151_239, # Utah
        ],
        "centroid": us_tiger.InternalPoint(),
    },
)

print(rume.params_description())
ipm::beta (type: float, shape: TxN)
    infectivity

ipm::gamma (type: float, shape: TxN)
    progression from infected to recovered

ipm::xi (type: float, shape: TxN)
    progression from recovered to susceptible

mm::population (type: int, shape: N)
    The total population at each node.

mm::centroid (type: [(longitude, float), (latitude, float)], shape: N)
    The centroids for each node as (longitude, latitude) tuples.

mm::phi (type: float, shape: S, default: 40.0)
    Influences the distance that movers tend to travel.

mm::commuter_proportion (type: float, shape: S, default: 0.1)
    The proportion of the total population which commutes.

init::population (type: int, shape: N)
    The population at each geo node.

Focus on the “population” requirement. Both our movement model and our initializer want to know how many people live in the places included in our geo scope. The shape of the requirement is “N”, which is shorthand for the number of nodes in our scope. In other words: when we provide a value for population, we need to provide an array containing one number per node. In this case our geo scope has declared four nodes: the states of Arizona, Colorado, New Mexico, and Utah. Provide more or fewer values and you’ll get an error when you attempt to run the simulation:

bad_rume = SingleStrataRUME.build(
    # ...
    params={
        "population": [1, 2, 3, 4, 5],  # too many values
        # ...
    },
)

BasicSimulator(bad_rume).run()
RUME attribute requirements were not met. See errors:
- Attribute 'gpm:all::mm::population' (parameter value '*::*::population') is not properly specified: Not a compatible shape.
- Attribute 'gpm:all::init::population' (parameter value '*::*::population') is not properly specified: Not a compatible shape.
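
The one-value-per-node rule can also be checked by hand before building a RUME. The following is a minimal sketch using plain NumPy; `check_n_shape` is a hypothetical helper written for this illustration (it is not part of epymorph), and the list of nodes is copied by hand from the example scope.

```python
import numpy as np

# The four states from the example scope (copied by hand for illustration).
nodes = ["AZ", "CO", "NM", "UT"]


def check_n_shape(values, n_nodes):
    """Hypothetical helper (not part of epymorph): True if `values` is N-shaped."""
    return np.asarray(values).shape == (n_nodes,)


good = [7_174_064, 5_684_926, 2_097_021, 3_151_239]
bad = [1, 2, 3, 4, 5]  # too many values

print(check_n_shape(good, len(nodes)))  # one value per node
print(check_n_shape(bad, len(nodes)))   # five values for four nodes
```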

So we know we need to provide four values in this example, but there’s one more critical detail: the order of the values matters too. epymorph associates the nodes in the geo scope with the values along any N-shaped axis purely by position. It assumes they’re in the same order.

How do we know the correct order? Examining the node IDs is the best source of truth:

print(rume.scope.node_ids)
['04' '08' '35' '49']

We do have to know our FIPS codes, but this is in fact the order mentioned: Arizona (04), Colorado (08), New Mexico (35), and Utah (49). You may also notice that these are in alphanumeric sort order by FIPS code. This is always true for all CensusScopes.

If we had gotten mixed up and put the values in a different order, like so:

    # ...
        "population": [
            2_097_021, # New Mexico
            5_684_926, # Colorado
            3_151_239, # Utah
            7_174_064, # Arizona
        ],
    # ...

the simulation would still run, but we would get invalid results.
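
One way to guard against this kind of mix-up is to key the values by node ID and build the list from the scope's node order. A minimal sketch follows; the `node_ids` list is copied by hand from the printed output above, and `population_by_fips` is a hypothetical name used only for illustration.

```python
# Node IDs in scope order, copied from the printed output above.
node_ids = ["04", "08", "35", "49"]

# Values keyed by FIPS code; the order the entries are written in does not matter.
population_by_fips = {
    "35": 2_097_021,  # New Mexico
    "08": 5_684_926,  # Colorado
    "49": 3_151_239,  # Utah
    "04": 7_174_064,  # Arizona
}

# Build the N-shaped list by looking up each node in scope order.
# A missing code raises KeyError instead of silently misaligning values.
population = [population_by_fips[node] for node in node_ids]

print(population)
```

The resulting list has one value per node, in the same order as the node IDs, regardless of how the dictionary was typed in.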

ADRIOs

Much of the time, though, we won’t need to provide an array of values “by hand” like this. ADRIOs were designed to load data for our simulation context and they already do the work of sorting things into the proper order.

For example, rather than typing in a bunch of population values, we could use the ACS5 Population ADRIO.

from epymorph.adrio import acs5

acs5.Population().with_context(scope=rume.scope).evaluate()
array([7174064, 5684926, 2097021, 3151239])

Those numbers should look familiar!

Notice we had to pass in the scope to evaluate this ADRIO outside of a RUME. While a RUME contains a “whole” context, most ADRIOs really only use part of the context. Population, for instance, needs only the scope. If we don’t provide all of the context needed, we’ll get an error:

acs5.Population().with_context().evaluate()
Invalid context for epymorph.adrio.acs5.Population: Missing function context 'scope' during evaluation.
The simulation function tried to access 'scope' but this has not been provided. Call `with_context()` first, providing all context that is required by this function. Then call `evaluate()` on the returned object to compute the value.

(Note that not just any old scope will do in this case; it has to be a CensusScope.)

Other shapes

N-shaped data (data that varies over space) is pretty straightforward, but some data attributes are more complicated. There’s TxN-shaped data, which varies over time and space, like total precipitation. And there’s NxN-shaped data, which is associated with pairs of locations, like commuting data. (Just to name the most common shapes.)

In any case, these are all treated using the same set of principles. Any axis which relates to N should be in the same order as the nodes in the geo scope and contain the same number of values as the number of nodes in the geo scope.

For example, TxN-data might be arranged like this (“days” on the rows, “nodes” on the columns):

|            |  AZ |  CO |  NM |  UT |
|------------|-----|-----|-----|-----|
| 2020-01-01 | 1.0 | 0.0 | 9.1 | 0.0 |
| 2020-01-02 | 1.3 | 0.0 | 3.2 | 2.1 |
| 2020-01-03 | 0.2 | 0.3 | 0.0 | 0.0 |
| ...        | ... | ... | ... | ... |
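
As a rough sketch, the first three rows of this table could be held in a NumPy array with time on the first axis and nodes on the second. The values are the illustrative ones from the table, and the variable names are hypothetical.

```python
import numpy as np

nodes = ["AZ", "CO", "NM", "UT"]
days = ["2020-01-01", "2020-01-02", "2020-01-03"]

# Rows are days (T), columns are nodes (N), in geo scope order.
txn = np.array([
    [1.0, 0.0, 9.1, 0.0],
    [1.3, 0.0, 3.2, 2.1],
    [0.2, 0.3, 0.0, 0.0],
])

print(txn.shape)  # (number of days, number of nodes)

# All days for one node: select a column by the node's position.
nm_series = txn[:, nodes.index("NM")]
print(nm_series)

# All nodes for one day: select a row by the day's position.
jan_2 = txn[days.index("2020-01-02"), :]
print(jan_2)
```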

And NxN-data might be arranged like this (“origin” on the rows, “destination” on the columns):

|    |  AZ |  CO |  NM |  UT |
|----|-----|-----|-----|-----|
| AZ |  10 |  32 |   7 |  21 |
| CO |  13 |  27 |  34 |  62 |
| NM |   2 |  19 |   9 |  12 |
| UT |  17 |  35 |   6 |   8 |
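
To make the row and column convention concrete, here is a minimal NumPy sketch of the same table. The values are the illustrative ones shown above, and the variable names are hypothetical.

```python
import numpy as np

nodes = ["AZ", "CO", "NM", "UT"]

# Rows are origins, columns are destinations, both in geo scope order.
nxn = np.array([
    [10, 32,  7, 21],
    [13, 27, 34, 62],
    [ 2, 19,  9, 12],
    [17, 35,  6,  8],
])

az, co = nodes.index("AZ"), nodes.index("CO")

print(nxn[az, co])  # value for origin AZ, destination CO
print(nxn[co, az])  # the reverse direction is a different cell

# Row sums total over destinations for each origin;
# column sums total over origins for each destination.
print(nxn.sum(axis=1))
print(nxn.sum(axis=0))
```

Because both axes relate to N, reordering the nodes means reordering the rows and the columns together.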