epymorph.adrio.processing
Data processing utilities for ADRIOs.
FixLikeInt
module-attribute
A value which can be coerced into a Fix for integers.
FixLikeFloat
module-attribute
A value which can be coerced into a Fix for floats.
FillLikeInt
module-attribute
A value which can be coerced into a Fill for integers.
FillLikeFloat
module-attribute
A value which can be coerced into a Fill for floats.
HasRandomness
Fix
A method for fixing data issues as part of a DataPipeline. Fix instances act as
functions (they have call semantics).
Fix is generic in the dtype of the data it fixes (DataT).
__call__
abstractmethod
__call__(
rng: HasRandomness,
replace: DataT,
columns: tuple[str, ...],
data_df: DataFrame,
) -> DataFrame
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The names of the columns to fix.
-
data_df(DataFrame) –The data to fix.
Returns:
-
DataFrame–The data with the fix applied (a copy if modified).
of_int64
staticmethod
of_int64(fix: FixLikeInt) -> Fix[int64]
Construct a Fix for int64 data. The type of Fix
returned depends on the type of argument provided.
Parameters:
-
fix(FixLikeInt) –A value which implies the type of fix to apply:
Fix[np.int64]: is returned unchanged (no-op)
int: returns a ConstantFix, to replace bad values with a constant
Callable[[], int]: returns a FunctionFix, to replace bad values with values obtained from the given callable
False: returns a DontFix, indicating not to replace bad values
Returns:
of_float64
staticmethod
of_float64(fix: FixLikeFloat) -> Fix[float64]
Construct a Fix for float64 data. The type of Fix
returned depends on the type of argument provided.
Parameters:
-
fix(FixLikeFloat) –A value which implies the type of fix to apply:
Fix[np.float64]: is returned unchanged (no-op)
float: returns a ConstantFix, to replace bad values with a constant
Callable[[], float]: returns a FunctionFix, to replace bad values with values obtained from the given callable
False: returns a DontFix, indicating not to replace bad values
Returns:
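The coercion rules listed for of_int64 and of_float64 amount to a chain of type checks. The following is a minimal sketch of that idea, not epymorph's implementation: the classes SketchConstantFix, SketchFunctionFix, and SketchDontFix and the function coerce_fix are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchConstantFix:  # hypothetical stand-in for ConstantFix
    with_value: int


@dataclass(frozen=True)
class SketchFunctionFix:  # hypothetical stand-in for FunctionFix
    with_function: Callable[[], int]


@dataclass(frozen=True)
class SketchDontFix:  # hypothetical stand-in for DontFix
    pass


_FIX_TYPES = (SketchConstantFix, SketchFunctionFix, SketchDontFix)


def coerce_fix(fix):
    """Turn a 'fix-like' value into one of the sketch fix objects."""
    # An existing fix object is returned unchanged.
    if isinstance(fix, _FIX_TYPES):
        return fix
    # Check False before int, because bool is a subclass of int in Python.
    if fix is False:
        return SketchDontFix()
    if isinstance(fix, int) and not isinstance(fix, bool):
        return SketchConstantFix(fix)
    if callable(fix):
        return SketchFunctionFix(fix)
    raise ValueError(f"Not a valid fix: {fix!r}")
```

The order of the checks matters in this sketch: testing for False first keeps it from being treated as the integer 0.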
ConstantFix
dataclass
ConstantFix(with_value: DataT)
A Fix which replaces values with a constant value.
ConstantFix is generic in the dtype of the data it fixes (DataT).
Parameters:
-
with_value(DataT) –The value to use to replace bad values.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The names of the columns to fix.
-
data_df(DataFrame) –The data to fix.
Returns:
-
DataFrame–The data with the fix applied (a copy if modified).
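To make the "copy if modified" behavior concrete, here is a minimal pandas sketch of a constant replacement. It does not use epymorph; the function constant_fix_sketch and the column names are assumptions made for illustration.

```python
import pandas as pd


def constant_fix_sketch(data_df, replace, columns, with_value):
    """Replace every `replace` value in `columns` with `with_value`."""
    cols = list(columns)
    # Boolean frame marking the cells that hold the value to replace.
    is_bad = data_df[cols] == replace
    if not is_bad.to_numpy().any():
        return data_df  # nothing to fix, so hand back the input untouched
    fixed = data_df.copy()
    fixed[cols] = fixed[cols].mask(is_bad, with_value)
    return fixed


df = pd.DataFrame({"geoid": ["a", "b", "c"], "value": [10, -999, 30]})
result = constant_fix_sketch(df, replace=-999, columns=("value",), with_value=0)
```

In this sketch the original frame is left unchanged whenever a replacement happens, and the very same object is returned when there is nothing to replace.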
FunctionFix
dataclass
A Fix which replaces values with values generated by the given function.
FunctionFix is generic in the dtype of the data it fixes (DataT).
Parameters:
with_function
instance-attribute
The function that generates replacement values.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The names of the columns to fix.
-
data_df(DataFrame) –The data to fix.
Returns:
-
DataFrame–The data with the fix applied (a copy if modified).
apply
staticmethod
apply(
data_df: DataFrame,
replace: DataT,
columns: tuple[str, ...],
with_function: Callable[[], DataT],
) -> DataFrame
Apply a FunctionFix to a data frame.
This method can be useful in creating other Fix instances, when their
replacement value logic can be expressed as a no-parameter function.
Parameters:
-
data_df(DataFrame) –The data to fix.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The data columns to fix.
-
with_function(Callable[[], DataT]) –The function used to generate replacement values.
Returns:
-
DataFrame–A copy of the data with bad values fixed.
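The idea of drawing each replacement from a no-parameter function can be sketched with plain pandas. This is an illustration only: function_fix_sketch is a hypothetical helper, and calling the function once per bad value is an assumption, not something this page confirms about FunctionFix.apply.

```python
import itertools

import pandas as pd


def function_fix_sketch(data_df, replace, columns, with_function):
    """Replace each `replace` value in `columns` with a freshly generated value."""
    fixed = data_df.copy()
    for col in columns:
        is_bad = fixed[col] == replace
        if is_bad.any():
            # Assumption: one call to with_function per bad value.
            n_bad = int(is_bad.sum())
            fixed.loc[is_bad, col] = [with_function() for _ in range(n_bad)]
    return fixed


counter = itertools.count(100)
df = pd.DataFrame({"value": [1, -1, -1]})
result = function_fix_sketch(df, -1, ("value",), lambda: next(counter))
```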
RandomFix
dataclass
A Fix which replaces values with randomly-generated values.
RandomFix is generic in the dtype of the data it fixes (DataT).
Parameters:
-
with_random(Callable[[Generator], DataT]) –A function for generating replacement values using the given numpy random number generator.
with_random
instance-attribute
A function for generating replacement values using the given numpy random number generator.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The names of the columns to fix.
-
data_df(DataFrame) –The data to fix.
Returns:
-
DataFrame–The data with the fix applied (a copy if modified).
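A function matching the with_random signature takes a numpy Generator and returns one replacement value. The sketch below shows such a function and how binding a generator turns it into a no-parameter function. The names with_random_sketch and no_arg are hypothetical, and a plain numpy Generator is used because this page does not describe how a HasRandomness value supplies one.

```python
import numpy as np


def with_random_sketch(rng: np.random.Generator) -> np.int64:
    # Draw one integer from 1 to 9 (the upper bound of integers() is exclusive).
    return np.int64(rng.integers(1, 10))


rng = np.random.default_rng(42)

# Binding the generator yields a function that takes no parameters.
no_arg = lambda: with_random_sketch(rng)
values = [no_arg() for _ in range(5)]
```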
from_range
staticmethod
from_range_float
staticmethod
DontFix
dataclass
A special Fix which simply returns the data as-is (no-op).
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
replace(DataT) –The value to replace.
-
columns(tuple[str, ...]) –The names of the columns to fix.
-
data_df(DataFrame) –The data to fix.
Returns:
-
DataFrame–The data with the fix applied (a copy if modified).
Fill
A method for filling-in missing data as part of a DataPipeline. Fill instances
act as functions (they have call semantics).
Fill is generic in the dtype of the data it fixes (DataT).
__call__
abstractmethod
__call__(
rng: HasRandomness,
data_np: NDArray[DataT],
missing_mask: NDArray[bool_],
) -> tuple[NDArray[DataT], NDArray[bool_] | None]
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
Returns:
of_int64
staticmethod
of_int64(fill: FillLikeInt) -> Fill[int64]
Construct a Fill for int64 data. The type of Fill
returned depends on the type of argument provided.
Parameters:
-
fill(FillLikeInt) –A value which implies the type of fill to apply:
Fill[np.int64]: is returned unchanged (no-op)
int: returns a ConstantFill, to replace missing values with a constant
Callable[[], int]: returns a FunctionFill, to replace missing values with values obtained from the given callable
False: returns a DontFill, indicating not to replace missing values
Returns:
of_float64
staticmethod
of_float64(fill: FillLikeFloat) -> Fill[float64]
Construct a Fill for float64 data. The type of Fill
returned depends on the type of argument provided.
Parameters:
-
fill(FillLikeFloat) –A value which implies the type of fill to apply:
Fill[np.float64]: is returned unchanged (no-op)
float or int: returns a ConstantFill, to replace missing values with a constant
Callable[[], float]: returns a FunctionFill, to replace missing values with values obtained from the given callable
False: returns a DontFill, indicating not to replace missing values
Returns:
ConstantFill
dataclass
ConstantFill(with_value: DataT)
A Fill which replaces missing values with a constant value.
ConstantFill is generic in the dtype of the data it fixes (DataT).
Parameters:
-
with_value(DataT) –The value to use to replace missing values.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
Returns:
FunctionFill
dataclass
A Fill which replaces missing values with values generated by the given function.
FunctionFill is generic in the dtype of the data it fixes (DataT).
Parameters:
with_function
instance-attribute
The function that generates replacement values.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
Returns:
apply
staticmethod
apply(
data_np: NDArray[DataT],
missing_mask: NDArray[bool_],
with_function: Callable[[], DataT],
) -> tuple[NDArray[DataT], NDArray[bool_] | None]
Apply a FunctionFill to numpy data.
This method can be useful in creating other Fill instances, when their
replacement value logic can be expressed as a no-parameter function.
Parameters:
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
-
with_function(Callable[[], DataT]) –The function used to generate replacement values.
Returns:
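As a rough picture of filling missing values in a numpy array from a no-parameter function, consider the sketch below. It is not epymorph code: function_fill_sketch is hypothetical, and the meaning of the second tuple element (taken here to be the mask of values still missing, or None when none remain) is an assumption, since this page does not describe the return value.

```python
import numpy as np


def function_fill_sketch(data_np, missing_mask, with_function):
    """Fill the masked positions of `data_np` with generated values."""
    n_missing = int(missing_mask.sum())
    if n_missing == 0:
        return data_np, None
    filled = data_np.copy()
    # Assumption: one call to with_function per missing value.
    filled[missing_mask] = [with_function() for _ in range(n_missing)]
    # Assumption: None signals that no values remain missing.
    return filled, None


data = np.array([1, 0, 3, 0], dtype=np.int64)
mask = np.array([False, True, False, True])
filled, remaining = function_fill_sketch(data, mask, lambda: 7)
```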
RandomFill
dataclass
A Fill which replaces missing values with randomly-generated values.
RandomFill is generic in the dtype of the data it fixes (DataT).
Parameters:
-
with_random(Callable[[Generator], DataT]) –A function for generating replacement values using the given numpy random number generator.
with_random
instance-attribute
A function for generating replacement values using the given numpy random number generator.
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
Returns:
from_range
staticmethod
from_range(low: int, high: int) -> RandomFill[int64]
Construct a RandomFill which replaces values
with values sampled uniformly from a discrete range of integers.
Parameters:
Returns:
-
RandomFill[int64]–The fill instance.
from_range_float
staticmethod
from_range_float(
low: float, high: float
) -> RandomFill[float64]
Construct a RandomFill which replaces values
with values sampled uniformly from a continuous range.
Parameters:
-
low(float) –The low end of the range of replacement values.
-
high(float) –The high end of the range of replacement values. (Not included in the possible values.)
Returns:
-
RandomFill[float64]–The fill instance.
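Sampling uniformly from a continuous range that excludes its high end matches the behavior of numpy's Generator.uniform, which draws from the half-open interval [low, high). The sketch below builds a generator-taking function on that basis. Whether epymorph uses this exact call is an assumption, and from_range_float_sketch is a hypothetical name.

```python
import numpy as np


def from_range_float_sketch(low: float, high: float):
    """Return a function that draws one float from [low, high)."""
    return lambda rng: np.float64(rng.uniform(low, high))


with_random = from_range_float_sketch(0.5, 1.5)
rng = np.random.default_rng(0)
samples = [with_random(rng) for _ in range(1000)]
```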
DontFill
dataclass
A special Fill which does not replace missing values and simply returns the data
as-is (no-op).
DontFill is generic in the dtype of the data it fixes (DataT).
__call__
Apply this fix to some data.
Parameters:
-
rng(HasRandomness) –A source of randomness.
-
data_np(NDArray[DataT]) –The data to fix.
-
missing_mask(NDArray[bool_]) –A mask indicating values which should be considered missing.
Returns:
PipelineResult
dataclass
An object containing the result of processing data through a DataPipeline.
PipelineResult is generic in the dtype of the resulting numpy array (DataT).
Parameters:
-
value(NDArray[DataT]) –The resulting numpy array. In this form, the array will never be masked, even if there are issues. If you want a masked array, see the value_as_masked property.
-
issues(Mapping[str, NDArray[bool_]]) –The set of outstanding issues in the underlying data, with issue-specific masks.
value
instance-attribute
The resulting numpy array. In this form, the array will never be masked, even if
there are issues. If you want a masked array, see the value_as_masked
property.
issues
instance-attribute
The set of outstanding issues in the underlying data, with issue-specific masks.
value_as_masked
property
The resulting numpy array which will be masked if-and-only-if there are issues. The mask is computed as the logical union of the individual issue masks.
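The "logical union of the individual issue masks" can be sketched with numpy alone. This is an illustration of the idea rather than the property's actual code; masked_value_sketch and the issue names are made up for the example.

```python
import numpy as np


def masked_value_sketch(value, issues):
    """Mask `value` wherever any issue mask is True."""
    if not issues:
        return value  # no issues, so no masked array
    # Element-wise OR across all of the issue masks.
    mask = np.logical_or.reduce(list(issues.values()))
    return np.ma.masked_array(value, mask=mask)


value = np.array([1, 2, 3, 4])
issues = {
    "suppressed": np.array([False, True, False, False]),
    "missing": np.array([False, False, False, True]),
}
masked = masked_value_sketch(value, issues)
```

Operations on the masked array skip the masked positions, so summing it here adds only the first and third values.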
with_issue
Update the result by adding a data issue.
Parameters:
-
issue_name(str) –The name of the issue.
-
issue_mask(NDArray[bool_] | None) –The mask indicating which values are affected by the issue. For convenience, the mask may be None or "no mask" to indicate that the data does not in fact have the named issue, in which case the issue will not be added.
Returns:
-
Self–The updated copy of the result.
to_date_value
to_date_value(
dates: NDArray[datetime64],
) -> PipelineResult[DateValueType]
Convert the result to a date-value-tuple array.
Parameters:
-
dates(NDArray[datetime64]) –The one-dimensional array of dates.
Returns:
-
PipelineResult[DateValueType]–The updated copy of the result.
See Also
epymorph.util.to_date_value_array for more detail on how dates and values are combined, and epymorph.util.extract_date_value for a convenient way to separate the dates and values when needed.
sum
staticmethod
sum(
left: PipelineResult[DataT],
right: PipelineResult[DataT],
*,
left_prefix: str,
right_prefix: str,
) -> PipelineResult[DataT]
Combine two PipelineResults by summing unmasked data values.
The result will include both lists of data issues by prefixing the issue names.
Parameters:
-
left(PipelineResult[DataT]) –The first addend.
-
right(PipelineResult[DataT]) –The second addend.
-
left_prefix(str) –A prefix to assign to any left-side issues.
-
right_prefix(str) –A prefix to assign to any right-side issues.
Returns:
-
PipelineResult[DataT]–The combined result as a new instance.
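The issue-prefixing part of this combination can be sketched as a dictionary merge. The sketch works on bare arrays and dictionaries rather than PipelineResult objects, sum_sketch is a hypothetical name, and joining the prefix directly onto the issue name is an assumption. How epymorph treats values under an issue mask when summing is not described on this page, so the sketch simply adds the raw arrays.

```python
import numpy as np


def sum_sketch(left_value, left_issues, right_value, right_issues,
               *, left_prefix, right_prefix):
    """Add two arrays and merge their issue masks under prefixed names."""
    issues = {f"{left_prefix}{name}": mask for name, mask in left_issues.items()}
    issues.update(
        {f"{right_prefix}{name}": mask for name, mask in right_issues.items()}
    )
    return left_value + right_value, issues


total, merged_issues = sum_sketch(
    np.array([1, 2, 3]), {"suppressed": np.array([False, True, False])},
    np.array([10, 20, 30]), {"suppressed": np.array([False, False, True])},
    left_prefix="a_", right_prefix="b_",
)
```

Prefixing keeps two issues with the same name, one from each side, from overwriting each other.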
PivotAxis
Bases: NamedTuple
Describes an axis on which to pivot a DataFrame to become a numpy array.
values
instance-attribute
The set of values we expect to find in the column. This will be used to expand and reorder the resulting pivot table. If values are in this set and not in the data, the table will contain missing values -- which is better than not knowing which values are missing!
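Expanding and reordering a pivot table against an expected set of values is what pandas' reindex does. The sketch below shows the effect with plain pandas. The column names and data are invented, and nothing here is taken from how DataPipeline performs the pivot internally.

```python
import pandas as pd

df = pd.DataFrame({
    "geoid": ["04", "08", "04"],
    "age": ["young", "young", "old"],
    "count": [5, 7, 3],
})

# The values we expect on each axis, in the order we want them.
expected_geoids = ["04", "08", "35"]
expected_ages = ["young", "old"]

table = (
    df.pivot_table(index="geoid", columns="age", values="count", aggfunc="sum")
    .reindex(index=expected_geoids, columns=expected_ages)
)
```

Geoid "35" never appears in the data, so its row is present but empty, which makes the gap visible instead of silently dropping it.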
DataPipeline
dataclass
DataPipeline(
axes: tuple[PivotAxis, PivotAxis],
ndims: Literal[1, 2],
dtype: type[DataT],
rng: HasRandomness,
pipeline_steps: Sequence[_PipelineStep] = list(),
)
DataPipeline is a factory class for assembling data processing pipelines.
Using builder-style syntax you define the processing steps that the data should
flow through. Finalizing the pipeline yields a function that takes a DataFrame,
executes the pipeline steps in sequence, and returns a PipelineResult containing
the processed data and any unresolved data issues discovered along the way.
The DataPipeline instance itself can be discarded after the processing function
is finalized.
DataPipeline was designed to produce arrays with one or two dimensions. When
there is more than one value in the "columns" dimension, it's obvious we should
have a 2D array. But when there's only one column, 1D and 2D array layouts are
both valid. Because of this ambiguity, it's up to you to provide the number
of dimensions you expect. If you specify ndims as 1 and the data has more than
one column, this will result in an error.
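The one-column ambiguity can be illustrated with numpy alone. In this sketch, shape_result_sketch is a hypothetical helper and ValueError is an assumed error type; the page only says that the mismatch results in an error.

```python
import numpy as np


def shape_result_sketch(table, ndims):
    """Return a 2D table as-is, or flatten a single-column table to 1D."""
    if ndims == 2:
        return table
    if table.shape[1] != 1:
        raise ValueError("ndims=1 requires exactly one column of data")
    return table[:, 0]


one_column = np.array([[1], [2], [3]])
as_1d = shape_result_sketch(one_column, ndims=1)  # shape (3,)
as_2d = shape_result_sketch(one_column, ndims=2)  # shape (3, 1)
```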
DataPipeline is generic in the dtype of the data it processes (DataT).
Parameters:
-
axes(tuple[PivotAxis, PivotAxis]) –The definition of the axes which will be used to tabulate the data. The first axis represents rows in the result, and the second columns.
-
ndims(Literal[1, 2]) –The number of dimensions expected in the result: 1 or 2.
-
dtype(type[DataT]) –The dtype of the data values in the result.
-
rng(HasRandomness) –A source of randomness.
-
pipeline_steps(Sequence[_PipelineStep], default:list()) –The accumulated pipeline steps.
Examples:
This example uses a DataPipeline to process a simple DataFrame:
axes
instance-attribute
The definition of the axes which will be used to tabulate the data. The first axis represents rows in the result, and the second columns.
ndims
instance-attribute
ndims: Literal[1, 2]
The number of dimensions expected in the result: 1 or 2.
pipeline_steps
class-attribute
instance-attribute
The accumulated pipeline steps.
map_series
Add a pipeline step that transforms a column of the DataFrame
by applying a mapping function to the series.
Parameters:
-
column(str) –The name of the column to transform.
-
map_fn(Callable[[Series], Series] | None, default:None) –The series mapping function. As a convenience you may pass
None, in which case this is a no-op.
Returns:
-
Self–A copy of this pipeline with the step added.
map_column
Add a pipeline step that transforms a column of the DataFrame
by applying a mapping function to all values in the series.
Parameters:
-
column(str) –The name of the column to transform.
-
map_fn(Callable | None, default:None) –The value mapping function. As a convenience you may pass
None, in which case this is a no-op.
Returns:
-
Self–A copy of this pipeline with the step added.
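The difference between map_series and map_column is whether the function sees the whole Series or one value at a time. The pandas sketch below shows both styles, including the None no-op. The helpers apply_series_fn and apply_value_fn are hypothetical and are not the pipeline's own step implementations.

```python
import pandas as pd


def apply_series_fn(df, column, map_fn=None):
    """Series-level mapping: map_fn receives and returns a whole Series."""
    if map_fn is None:
        return df  # no-op
    return df.assign(**{column: map_fn(df[column])})


def apply_value_fn(df, column, map_fn=None):
    """Value-level mapping: map_fn is applied to each value in turn."""
    if map_fn is None:
        return df  # no-op
    return df.assign(**{column: df[column].map(map_fn)})


df = pd.DataFrame({"date": ["2021-01-01", "2021-01-08"], "value": ["1", "2"]})
with_dates = apply_series_fn(df, "date", pd.to_datetime)
with_ints = apply_value_fn(df, "value", int)
```

A series-level function suits vectorized conversions such as date parsing, while a value-level function is convenient for simple per-item transformations.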
strip_sentinel
Add a pipeline step for dealing with sentinel values in the DataFrame.
First we apply the given Fix, then check for any remaining sentinel values.
If sentinel values still remain in the data, these are recorded as a data
issue with an associated mask.
Parameters:
-
sentinel_name(str) –The name used for the data issue if any sentinel values remain.
-
sentinel_value(DataT) –The value considered a sentinel.
-
fix(Fix[DataT]) –The fix to apply to attempt to replace sentinel values.
Returns:
-
Self–A copy of this pipeline with the step added.
strip_na_as_sentinel
strip_na_as_sentinel(
column: str,
sentinel_name: str,
sentinel_value: DataT,
fix: Fix[DataT],
) -> Self
Add a pipeline step for dealing with NaN/NA/null values in the DataFrame.
First replace NA values with a user-defined sentinel value, then apply the
given Fix. Finally check for any remaining such values. If sentinel values
still remain in the data, these are recorded as a data issue with an associated
mask.
Parameters:
-
column(str) –The name of the column to transform.
-
sentinel_name(str) –The name used for the data issue if any NA/sentinel values remain.
-
sentinel_value(DataT) –The value to use to replace NA values. We want to replace NAs so that we can universally convert the data column to the desired type -- np.int64 doesn't support NA values like np.float64 does, so this allows the input DataFrame to start with something like Pandas' "Int64" data type while the pipeline produces np.int64 results. The sentinel value chosen for this must not already exist in the data.
-
fix(Fix[DataT]) –The fix to apply to attempt to replace NA/sentinel values.
Returns:
-
Self–A copy of this pipeline with the step added.
Raises:
-
Exception–If the data naturally contains the chosen sentinel value.
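The reason for swapping NA values for a sentinel can be seen with a nullable pandas "Int64" column, which cannot be converted to np.int64 while it still holds NA. The sketch below covers only the replacement and the guard against a sentinel that is already present. The name na_to_sentinel_sketch is hypothetical, and ValueError is an assumed, more specific error type than the Exception documented above.

```python
import numpy as np
import pandas as pd


def na_to_sentinel_sketch(series, sentinel_value):
    """Replace NA with a sentinel, then convert to plain np.int64."""
    # The sentinel must not collide with a real value in the data.
    if (series == sentinel_value).any():
        raise ValueError("the data already contains the chosen sentinel value")
    return series.fillna(sentinel_value).astype(np.int64)


raw = pd.Series([1, None, 3], dtype="Int64")
converted = na_to_sentinel_sketch(raw, sentinel_value=-999)
```

After this step the former NA values are ordinary sentinel values, so a fix can be applied to them like any other sentinel.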