epymorph.adrio.processing
Data processing utilities for ADRIOs.
FixLikeInt
module-attribute
A value which can be coerced into a Fix
for integers.
FixLikeFloat
module-attribute
A value which can be coerced into a Fix
for floats.
FillLikeInt
module-attribute
A value which can be coerced into a Fill
for integers.
FillLikeFloat
module-attribute
A value which can be coerced into a Fill
for floats.
HasRandomness
Fix
A method for fixing data issues as part of a DataPipeline
. Fix
instances act as
functions (they have call semantics).
Fix
is generic in the dtype of the data it fixes (DataT
).
__call__
abstractmethod
__call__(
rng: HasRandomness,
replace: DataT,
columns: tuple[str, ...],
data_df: DataFrame,
) -> DataFrame
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
data_df
(DataFrame
) –The data to fix.
Returns:
-
DataFrame
–The data with the fix applied (a copy if modified).
of_int64
staticmethod
of_int64(fix: FixLikeInt) -> Fix[int64]
Convenience constructor for a Fix
for int64
data. The type of Fix
returned depends on the type of argument provided.
Parameters:
-
fix
(FixLikeInt
) –A value which implies the type of fix to apply:
- Fix[np.int64]: is returned unchanged (no-op)
- int: returns a ConstantFix, to replace bad values with a constant
- Callable[[], int]: returns a FunctionFix, to replace bad values with values obtained from the given callable
- False: returns a DontFix, indicating not to replace bad values
Returns:
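As a rough illustration of the type-based dispatch that of_int64 describes, the sketch below classifies a FixLikeInt-style value. It is not epymorph's implementation: describe_fix_like is a hypothetical helper that returns the name of the implied Fix class as a string instead of constructing one, it omits the pass-through case for existing Fix instances, and rejecting True is an assumption of this sketch.

```python
from typing import Callable, Union


def describe_fix_like(fix: Union[int, bool, Callable[[], int]]) -> str:
    """Name the kind of Fix that a FixLikeInt-style value would imply (hypothetical helper)."""
    # Test for False before testing for int: bool is a subclass of int
    # in Python, so isinstance(False, int) is True.
    if fix is False:
        return "DontFix"
    if isinstance(fix, bool):
        # Assumption of this sketch: True has no meaning as a fix value.
        raise ValueError("True is not a supported fix value")
    if isinstance(fix, int):
        return "ConstantFix"
    if callable(fix):
        return "FunctionFix"
    raise ValueError(f"unsupported fix value: {fix!r}")


kinds = [describe_fix_like(x) for x in (0, lambda: 3, False)]
```

The ordering of the checks matters: testing for False first keeps it from being treated as the integer constant 0.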
of_float64
staticmethod
of_float64(fix: FixLikeFloat) -> Fix[float64]
Convenience constructor for a Fix
for float64
data. The type of Fix
returned depends on the type of argument provided.
Parameters:
-
fix
(FixLikeFloat
) –A value which implies the type of fix to apply:
- Fix[np.float64]: is returned unchanged (no-op)
- float: returns a ConstantFix, to replace bad values with a constant
- Callable[[], float]: returns a FunctionFix, to replace bad values with values obtained from the given callable
- False: returns a DontFix, indicating not to replace bad values
Returns:
ConstantFix
dataclass
ConstantFix(with_value: DataT)
A Fix
which replaces values with a constant value.
ConstantFix
is generic in the dtype of the data it fixes (DataT
).
Parameters:
-
with_value
(DataT
) –The value to use to replace bad values.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
data_df
(DataFrame
) –The data to fix.
Returns:
-
DataFrame
–The data with the fix applied (a copy if modified).
FunctionFix
dataclass
A Fix
which replaces values with values generated by the given function.
FunctionFix
is generic in the dtype of the data it fixes (DataT
).
Parameters:
with_function
instance-attribute
The function that generates replacement values.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
data_df
(DataFrame
) –The data to fix.
Returns:
-
DataFrame
–The data with the fix applied (a copy if modified).
apply
staticmethod
apply(
data_df: DataFrame,
replace: DataT,
columns: tuple[str, ...],
with_function: Callable[[], DataT],
) -> DataFrame
A static method that performs the work of a FunctionFix
. This method can be
useful in creating other Fix
instances, when their replacement value logic
can be expressed as a no-parameter function.
Parameters:
-
data_df
(DataFrame
) –The data to fix.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
with_function
(Callable[[], DataT]
) –The function used to generate replacement values.
Returns:
-
DataFrame
–A copy of the data with bad values fixed.
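The behavior described for FunctionFix.apply can be approximated with plain pandas. This is a minimal sketch under stated assumptions, not the library's code: replace_with_function is a hypothetical name, the example column names and the value -999 are invented, and calling the function once per bad value and copying only when something changes are choices made for this illustration.

```python
from typing import Callable

import pandas as pd


def replace_with_function(
    data_df: pd.DataFrame,
    replace: int,
    columns: tuple[str, ...],
    with_function: Callable[[], int],
) -> pd.DataFrame:
    """Replace each occurrence of `replace` in `columns` with a generated value."""
    result = data_df
    for col in columns:
        is_bad = data_df[col] == replace
        n_bad = int(is_bad.sum())
        if n_bad > 0:
            if result is data_df:
                # Copy lazily so the caller's DataFrame is never mutated.
                result = data_df.copy()
            # Assumption: the function is called once per bad value.
            result.loc[is_bad, col] = [with_function() for _ in range(n_bad)]
    return result


df = pd.DataFrame({"geoid": ["04", "35", "08"], "cases": [12, -999, 7]})
fixed = replace_with_function(df, -999, ("cases",), lambda: 0)
```

Because the replacement logic is just a no-parameter function, the same helper could back a constant fix (as here) or a randomized one that closes over a random number generator.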
RandomFix
dataclass
A Fix
which replaces values with randomly-generated values.
RandomFix
is generic in the dtype of the data it fixes (DataT
).
Parameters:
-
with_random
(Callable[[Generator], DataT]
) –A function for generating replacement values using the given numpy random number generator.
with_random
instance-attribute
A function for generating replacement values using the given numpy random number generator.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
data_df
(DataFrame
) –The data to fix.
Returns:
-
DataFrame
–The data with the fix applied (a copy if modified).
from_range
staticmethod
from_range_float
staticmethod
Convenience constructor for a RandomFix
which replaces values
with values sampled uniformly from a continuous range.
Parameters:
-
low
(float
) –The low end of the range of replacement values.
-
high
(float
) –The high end of the range of replacement values. (Not included in the possible values.)
Returns:
DontFix
dataclass
A special Fix
which does not fix values and simply returns the data as-is (no-op).
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
replace
(DataT
) –The value to replace.
-
columns
(tuple[str, ...]
) –The names of the columns to fix.
-
data_df
(DataFrame
) –The data to fix.
Returns:
-
DataFrame
–The data with the fix applied (a copy if modified).
Fill
A method for filling-in missing data as part of a DataPipeline
. Fill
instances
act as functions (they have call semantics).
Fill
is generic in the dtype of the data it fixes (DataT
).
__call__
abstractmethod
__call__(
rng: HasRandomness,
data_np: NDArray[DataT],
missing_mask: NDArray[bool_],
) -> tuple[NDArray[DataT], NDArray[bool_] | None]
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
Returns:
of_int64
staticmethod
of_int64(fill: FillLikeInt) -> Fill[int64]
Convenience constructor for a Fill
for int64
data. The type of Fill
returned depends on the type of argument provided.
Parameters:
-
fill
(FillLikeInt
) –A value which implies the type of fill to apply:
- Fill[np.int64]: is returned unchanged (no-op)
- int: returns a ConstantFill, to replace missing values with a constant
- Callable[[], int]: returns a FunctionFill, to replace missing values with values obtained from the given callable
- False: returns a DontFill, indicating not to replace missing values
Returns:
of_float64
staticmethod
of_float64(fill: FillLikeFloat) -> Fill[float64]
Convenience constructor for a Fill
for float64
data. The type of Fill
returned depends on the type of argument provided.
Parameters:
-
fill
(FillLikeFloat
) –A value which implies the type of fill to apply:
- Fill[np.float64]: is returned unchanged (no-op)
- float or int: returns a ConstantFill, to replace missing values with a constant
- Callable[[], float]: returns a FunctionFill, to replace missing values with values obtained from the given callable
- False: returns a DontFill, indicating not to replace missing values
Returns:
ConstantFill
dataclass
ConstantFill(with_value: DataT)
A Fill
which replaces missing values with a constant value.
ConstantFill
is generic in the dtype of the data it fixes (DataT
).
Parameters:
-
with_value
(DataT
) –The value to use to replace missing values.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
Returns:
FunctionFill
dataclass
A Fill
which replaces missing values with values generated by the given function.
FunctionFill
is generic in the dtype of the data it fixes (DataT
).
Parameters:
with_function
instance-attribute
The function that generates replacement values.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
Returns:
apply
staticmethod
apply(
data_np: NDArray[DataT],
missing_mask: NDArray[bool_],
with_function: Callable[[], DataT],
) -> tuple[NDArray[DataT], NDArray[bool_] | None]
A static method that performs the work of a FunctionFill
. This method can be
useful in creating other Fill
instances, when their replacement value logic
can be expressed as a no-parameter function.
Parameters:
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
-
with_function
(Callable[[], DataT]
) –The function used to generate replacement values.
Returns:
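To make the shape of FunctionFill.apply more concrete, here is a small numpy sketch. It is not taken from epymorph: fill_with_function is a hypothetical name, and the sketch assumes that the second element of the returned tuple is the mask of values still missing after the fill, with None meaning nothing remains missing. The source does not spell that out.

```python
from typing import Callable, Optional

import numpy as np


def fill_with_function(
    data_np: np.ndarray,
    missing_mask: np.ndarray,
    with_function: Callable[[], int],
) -> tuple[np.ndarray, Optional[np.ndarray]]:
    """Fill masked entries with generated values; return (data, remaining mask)."""
    result = data_np.copy()  # leave the input array untouched
    n_missing = int(missing_mask.sum())
    if n_missing > 0:
        # Assumption: one function call per missing value.
        result[missing_mask] = [with_function() for _ in range(n_missing)]
    # Assumption: every missing value was filled, so no mask remains.
    return result, None


data = np.array([1, 0, 3, 0])
mask = np.array([False, True, False, True])
filled, remaining = fill_with_function(data, mask, lambda: 9)
```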
RandomFill
dataclass
A Fill
which replaces missing values with randomly-generated values.
RandomFill
is generic in the dtype of the data it fixes (DataT
).
Parameters:
-
with_random
(Callable[[Generator], DataT]
) –A function for generating replacement values using the given numpy random number generator.
with_random
instance-attribute
A function for generating replacement values using the given numpy random number generator.
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
Returns:
from_range
staticmethod
from_range(low: int, high: int) -> RandomFill[int64]
Convenience constructor for a RandomFill
which replaces values
with values sampled uniformly from a discrete range of integers.
Parameters:
Returns:
-
RandomFill[int64]
–The fill instance.
from_range_float
staticmethod
from_range_float(
low: float, high: float
) -> RandomFill[float64]
Convenience constructor for a RandomFill
which replaces values
with values sampled uniformly from a continuous range.
Parameters:
-
low
(float
) –The low end of the range of replacement values.
-
high
(float
) –The high end of the range of replacement values. (Not included in the possible values.)
Returns:
-
RandomFill[float64]
–The fill instance.
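The with_random function that such a constructor wraps might look like the sketch below. This is an assumption about the general approach, not the library's code: uniform_sampler is a hypothetical helper, and it relies on numpy's Generator.uniform, which samples from the half-open interval [low, high).

```python
import numpy as np


def uniform_sampler(low: float, high: float):
    """Build a function that draws one value from the half-open range [low, high)."""

    def with_random(rng: np.random.Generator) -> np.float64:
        return np.float64(rng.uniform(low, high))

    return with_random


rng = np.random.default_rng(42)
sample = uniform_sampler(0.0, 5.0)
draws = [sample(rng) for _ in range(1000)]
```

Passing the generator in at call time, rather than capturing it in the closure, keeps the sampler reusable across differently seeded runs.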
DontFill
dataclass
A special Fill
which does not replace missing values and simply returns the data
as-is (no-op).
DontFill
is generic in the dtype of the data it fixes (DataT
).
__call__
Apply this fix to some data.
Parameters:
-
rng
(HasRandomness
) –A source of randomness.
-
data_np
(NDArray[DataT]
) –The data to fix.
-
missing_mask
(NDArray[bool_]
) –A mask indicating values which should be considered missing.
Returns:
PipelineResult
dataclass
An object containing the result of processing data through a DataPipeline
.
PipelineResult
is generic in the dtype of the resulting numpy array (DataT
).
Parameters:
-
value
(NDArray[DataT]
) –The resulting numpy array. In this form, the array will never be masked, even if there are issues. If you want a masked array, see the value_as_masked property.
-
issues
(Mapping[str, NDArray[bool_]]
) –The set of outstanding issues in the underlying data, with issue-specific masks.
value
instance-attribute
The resulting numpy array. In this form, the array will never be masked, even if
there are issues. If you want a masked array, see the value_as_masked
property.
issues
instance-attribute
The set of outstanding issues in the underlying data, with issue-specific masks.
value_as_masked
property
The resulting numpy array which will be masked if-and-only-if there are issues. The mask is computed as the logical union of the individual issue masks.
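The mask computation described here can be illustrated with numpy alone. The issue names and values below are invented for the example, and the code is a sketch of the idea rather than the property's actual implementation.

```python
import numpy as np

value = np.array([10, 20, 30, 40])
issues = {
    "suppressed": np.array([False, True, False, False]),
    "unavailable": np.array([False, False, False, True]),
}

if issues:
    # Logical union of all issue masks (element-wise OR).
    combined = np.logical_or.reduce(list(issues.values()))
    value_as_masked = np.ma.masked_array(value, mask=combined)
else:
    # No issues: the plain array is returned without a mask.
    value_as_masked = value

# Masked entries are excluded from reductions such as sum().
total = value_as_masked.sum()
```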
with_issue
Updates the result by adding a data issue.
Parameters:
-
issue_name
(str
) –The name of the issue.
-
issue_mask
(NDArray[bool_] | None
) –The mask indicating which values are affected by the issue. For convenience, the mask may be None or "no mask" to indicate that the data does not in fact have the named issue, in which case the issue will not be added.
Returns:
-
Self
–The updated copy of the result.
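The convenience rule for with_issue can be sketched on a plain dictionary of issues. This is illustrative only: add_issue is a hypothetical function, and interpreting "no mask" as numpy's np.ma.nomask constant is an assumption.

```python
from typing import Optional

import numpy as np


def add_issue(issues: dict, issue_name: str, issue_mask: Optional[np.ndarray]) -> dict:
    """Return a copy of `issues` with the named issue added, unless there is no mask."""
    # Assumption: "no mask" refers to the np.ma.nomask singleton.
    if issue_mask is None or issue_mask is np.ma.nomask:
        return dict(issues)
    return {**issues, issue_name: issue_mask}


original = {}
skipped = add_issue(original, "suppressed", None)
added = add_issue(original, "suppressed", np.array([False, True]))
```

Accepting None lets a caller pass along whatever mask a previous step produced without first checking whether any values were affected.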
to_date_value
to_date_value(
dates: NDArray[datetime64],
) -> PipelineResult[DateValueType]
Converts the result to a date-value-tuple array.
Parameters:
-
dates
(NDArray[datetime64]
) –The one-dimensional array of dates.
Returns:
-
PipelineResult[DateValueType]
–The updated copy of the result.
See Also
epymorph.util.to_date_value_array for more detail on how dates and values are combined, and epymorph.util.extract_date_value for a convenient way to separate the dates and values when needed.
sum
staticmethod
sum(
left: PipelineResult[DataT],
right: PipelineResult[DataT],
*,
left_prefix: str,
right_prefix: str,
) -> PipelineResult[DataT]
Combines two PipelineResults
by summing unmasked data values.
The result will include both lists of data issues by prefixing the issue names.
Parameters:
-
left
(PipelineResult[DataT]
) –The first addend.
-
right
(PipelineResult[DataT]
) –The second addend.
-
left_prefix
(str
) –A prefix to assign to any left-side issues.
-
right_prefix
(str
) –A prefix to assign to any right-side issues.
Returns:
-
PipelineResult[DataT]
–The combined result as a new instance.
PivotAxis
Bases: NamedTuple
Describes an axis on which a DataFrame
will be pivoted to become a numpy array.
values
instance-attribute
The set of values we expect to find in the column. This will be used to expand and reorder the resulting pivot table. If values are in this set and not in the data, the table will contain missing values -- which is better than not knowing which values are missing!
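The effect of declaring the expected values can be seen with a plain pandas pivot followed by a reindex. This sketch only illustrates the idea and is not how DataPipeline is implemented; the column names and data are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "geoid": ["35", "04"],
        "variable": ["pop", "pop"],
        "value": [2100, 7100],
    }
)

# Expected values for each axis; "08" is expected but absent from the data.
row_values = ["04", "08", "35"]
col_values = ["pop"]

table = df.pivot(index="geoid", columns="variable", values="value").reindex(
    index=row_values, columns=col_values
)

# The row for "08" exists but holds a missing value, so the gap is visible.
missing_rows = table.index[table["pop"].isna()].tolist()
```

Without the reindex, the pivot table would simply have two rows, and nothing would signal that a third location was expected.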
DataPipeline
dataclass
DataPipeline(
axes: tuple[PivotAxis, PivotAxis],
ndims: Literal[1, 2],
dtype: type[DataT],
rng: HasRandomness,
pipeline_steps: Sequence[_PipelineStep] = list(),
)
DataPipeline
is a factory class for assembling data processing pipelines.
Using builder-style syntax you define the processing steps that the data should
flow through. Finalizing the pipeline yields a function that takes a DataFrame
,
executes the pipeline steps in sequence, and returns a PipelineResult
containing
the processed data and any unresolved data issues discovered along the way.
The DataPipeline
instance itself can be discarded after the processing function
is finalized.
DataPipeline
was designed to produce arrays with one or two dimensions. When
there is more than one value in the "columns" dimension, it's obvious we should
have a 2D array. But when there's only one column, 1D and 2D array layouts are
both valid. Because of this ambiguity, it's up to you to provide the number
of dimensions you expect. If you specify ndims
as 1 and the data has more than
one column, this will result in an error.
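The ambiguity is easy to see with a single-column table in numpy. The helper below is hypothetical (to_ndims is not part of epymorph) and only sketches the rule described above, including the error for multi-column data when one dimension is requested.

```python
import numpy as np


def to_ndims(table: np.ndarray, ndims: int) -> np.ndarray:
    """Return a 2D pivot table as-is, or flatten a single-column table to 1D."""
    if ndims == 2:
        return table
    if table.shape[1] != 1:
        raise ValueError("ndims=1 requires exactly one column")
    return table[:, 0]


one_column = np.array([[7100], [5800], [2100]])
as_1d = to_ndims(one_column, ndims=1)  # shape (3,)
as_2d = to_ndims(one_column, ndims=2)  # shape (3, 1)
```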
DataPipeline
is generic in the dtype of the data it processes (DataT
).
Parameters:
-
axes
(tuple[PivotAxis, PivotAxis]
) –The definition of the axes which will be used to tabulate the data. The first axis represents rows in the result, and the second columns.
-
ndims
(Literal[1, 2]
) –The number of dimensions expected in the result: 1 or 2.
-
dtype
(type[DataT]
) –The dtype of the data values in the result.
-
rng
(HasRandomness
) –A source of randomness.
-
pipeline_steps
(Sequence[_PipelineStep]
, default:list()
) –The accumulated pipeline steps.
Examples:
This example uses a DataPipeline
to process a simple DataFrame
:
axes
instance-attribute
The definition of the axes which will be used to tabulate the data. The first axis represents rows in the result, and the second columns.
ndims
instance-attribute
ndims: Literal[1, 2]
The number of dimensions expected in the result: 1 or 2.
pipeline_steps
class-attribute
instance-attribute
The accumulated pipeline steps.
map_series
Add a pipeline step that transforms a column of the DataFrame
by applying a mapping function to the series.
Parameters:
-
column
(str
) –The name of the column to transform.
-
map_fn
(Callable[[Series], Series] | None
, default:None
) –The series mapping function. As a convenience you may pass
None
, in which case this is a no-op.
Returns:
-
Self
–A copy of this pipeline with the step added.
map_column
Add a pipeline step that transforms a column of the DataFrame
by applying a mapping function to all values in the series.
Parameters:
-
column
(str
) –The name of the column to transform.
-
map_fn
(Callable | None
, default:None
) –The value mapping function. As a convenience you may pass
None
, in which case this is a no-op.
Returns:
-
Self
–A copy of this pipeline with the step added.
strip_sentinel
Add a pipeline step for dealing with sentinel values in the DataFrame
.
First we apply the given Fix
, then check for any remaining sentinel values.
If sentinel values still remain in the data, these are recorded as a data
issue with an associated mask.
Parameters:
-
sentinel_name
(str
) –The name used for the data issue if any sentinel values remain.
-
sentinel_value
(DataT
) –The value considered a sentinel.
-
fix
(Fix[DataT]
) –The fix to apply to attempt to replace sentinel values.
Returns:
-
Self
–A copy of this pipeline with the step added.
strip_na_as_sentinel
strip_na_as_sentinel(
column: str,
sentinel_name: str,
sentinel_value: DataT,
fix: Fix[DataT],
) -> Self
Add a pipeline step for dealing with NaN/NA/null values in the DataFrame
.
First replace NA values with a user-defined sentinel value, then apply the
given Fix
. Finally check for any remaining such values. If sentinel values
still remain in the data, these are recorded as a data issue with an associated
mask.
Parameters:
-
column
(str
) –The name of the column to transform.
-
sentinel_name
(str
) –The name used for the data issue if any NA/sentinel values remain.
-
sentinel_value
(DataT
) –The value to use to replace NA values. We want to replace NAs so that we can universally convert the data column to the desired type -- np.int64 doesn't support NA values like np.float64 does, so this allows the input DataFrame to start with something like Pandas' "Int64" data type while the pipeline produces np.int64 results. The sentinel value chosen for this must not already exist in the data.
-
fix
(Fix[DataT]
) –The fix to apply to attempt to replace NA/sentinel values.
Returns:
-
Self
–A copy of this pipeline with the step added.
Raises:
-
Exception
–If the data naturally contains the chosen sentinel value.
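The sequence of operations described for strip_na_as_sentinel can be walked through by hand in pandas. This is a sketch of the idea under stated assumptions, not the pipeline step itself: the data, the sentinel -999, and the constant replacement 0 (standing in for the Fix) are all invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "geoid": ["04", "08", "35"],
        # Pandas' nullable "Int64" dtype can hold NA; np.int64 cannot.
        "value": pd.array([7100, None, 2100], dtype="Int64"),
    }
)
sentinel_value = -999

# The sentinel must not occur naturally, or real values would become
# indistinguishable from missing ones.
if (df["value"] == sentinel_value).any():
    raise Exception("data already contains the sentinel value")

# 1. Replace NA with the sentinel, then convert to a plain numpy dtype.
values = df["value"].fillna(sentinel_value).astype(np.int64)

# 2. Apply a fix (here: a constant replacement of 0).
fixed = values.where(values != sentinel_value, 0)

# 3. Any sentinel values still present would be recorded as a data issue.
remaining_mask = (fixed == sentinel_value).to_numpy()
```

With a fix that declines to replace anything, step 2 would leave the sentinel in place and remaining_mask would flag the second row as an outstanding issue.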