preprocessing
boxcox(method='mle')
Applies the Box-Cox transformation to numeric columns in a panel DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | The method used to determine the lambda parameter of the Box-Cox transformation. Supported methods: | 'mle' |
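To make the role of lambda concrete, here is a minimal pure-Python sketch of the Box-Cox transform for a fixed lambda, plus a coarse grid search over the profile log-likelihood as one possible way to obtain a maximum-likelihood estimate. The names `boxcox_transform`, `boxcox_loglik`, and `boxcox_mle_grid` are hypothetical and not part of the library, which may well use a different optimizer.

```python
import math

def boxcox_transform(x, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in x):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    y = boxcox_transform(x, lam)
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in x)

def boxcox_mle_grid(x, lo=-2.0, hi=2.0, steps=401):
    """Return the lambda on a uniform grid that maximizes the log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))

data = [1.0, 2.0, 4.0, 8.0, 16.0]
lam_hat = boxcox_mle_grid(data)
transformed = boxcox_transform(data, lam_hat)
```

A grid search is used here only because it is easy to read; it trades precision for simplicity.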
coerce_dtypes(schema)
Coerces the column datatypes of a DataFrame using the provided schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
schema | Mapping[str, DataType] | A dictionary-like object mapping column names to the desired data types. | required |
deseasonalize_fourier(sp, K, robust=False)
Removes seasonality via residualized regression with Fourier terms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sp | int | Seasonal period. | required |
K | int | Maximum order(s) of Fourier terms. Must be less than | required |
Note: part of this transformer uses sklearn under the hood, so it is neither pure Polars nor lazy.
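As an illustration of what "Fourier terms" means here, the sketch below builds sine and cosine columns for orders 1 through K at seasonal period sp. The helper `fourier_terms` and its column names are hypothetical; the transformer would then regress each series on such columns and keep the residuals, a step this sketch leaves out.

```python
import math

def fourier_terms(n, sp, K):
    """Return sin/cos columns of length n for Fourier orders k = 1..K."""
    cols = {}
    for k in range(1, K + 1):
        angles = [2.0 * math.pi * k * t / sp for t in range(n)]
        cols[f"sin_{sp}_{k}"] = [math.sin(a) for a in angles]
        cols[f"cos_{sp}_{k}"] = [math.cos(a) for a in angles]
    return cols

# Two years of monthly data with two Fourier orders gives four columns.
features = fourier_terms(n=24, sp=12, K=2)
```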
detrend(freq, method='linear')
Removes mean or linear trend from numeric columns in a panel DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
freq | str | Offset alias supported by Polars. | required |
method | str | If | 'linear' |
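The two kinds of detrending mentioned above can be sketched for a single series as follows. This is a plain-Python illustration using a closed-form least-squares fit on the integer time index; `detrend_linear` and `detrend_mean` are hypothetical names, and the library's per-entity implementation may differ.

```python
def detrend_mean(y):
    """Subtract the series mean."""
    m = sum(y) / len(y)
    return [v - m for v in y]

def detrend_linear(y):
    """Subtract the least-squares line fitted against t = 0, 1, ..., n-1."""
    n = len(y)
    t = list(range(n))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

# A perfectly linear series has no residual after linear detrending.
residuals = detrend_linear([1.0, 3.0, 5.0, 7.0])
```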
diff(order, sp=1, fill_strategy=None)
Difference time-series in panel data given order and seasonal period.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
order | int | The order to difference. | required |
sp | int | Seasonal periodicity. | 1 |
fill_strategy | Optional[str] | Strategy to fill nulls by. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"]. | None |
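A minimal sketch of how order and sp might interact, assuming the transform applies a lag-sp difference `order` times. The helper `difference` is hypothetical, and for brevity it drops the leading values instead of producing nulls.

```python
def difference(y, order, sp=1):
    """Apply a lag-`sp` difference `order` times, dropping leading values."""
    out = list(y)
    for _ in range(order):
        out = [out[i] - out[i - sp] for i in range(sp, len(out))]
    return out

squares = [1, 4, 9, 16, 25]
first = difference(squares, order=1)    # first differences
second = difference(squares, order=2)   # differences of the differences
seasonal = difference([1, 2, 3, 11, 12, 13], order=1, sp=3)
```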
fractional_diff(d, min_weight=None, window_size=None)
Compute the fractional differential of a time series.
This particular functionality is referenced in Advances in Financial Machine Learning by Marcos Lopez de Prado (2018).
For feature creation purposes, it is suggested to use the minimum value of d that removes non-stationarity from the time series. This can be found by running the Augmented Dickey-Fuller test on the time series for different values of d and selecting the smallest value that makes the series stationary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
d | float | The fractional order of the differencing operator. | required |
min_weight | float | The minimum weight to use for calculations. If specified, the window size is computed from this value, so window_size is not needed. | None |
window_size | int | The window size of the fractional differencing operator. If specified, the minimum weight is not needed. | None |
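The relationship between d, min_weight, and window_size can be sketched with the standard binomial-expansion weights of the operator (1 - B)^d, where each weight follows from the previous one. The functions `frac_diff_weights` and `frac_diff`, and the `max_len` safety cap, are hypothetical and only illustrate the idea.

```python
def frac_diff_weights(d, window_size=None, min_weight=None, max_len=10_000):
    """Weights of (1 - B)^d, truncated by window size or by minimum weight."""
    if (window_size is None) == (min_weight is None):
        raise ValueError("specify exactly one of window_size or min_weight")
    limit = window_size if window_size is not None else max_len
    weights = [1.0]
    for k in range(1, limit):
        w = -weights[-1] * (d - k + 1) / k
        if min_weight is not None and abs(w) < min_weight:
            break
        weights.append(w)
    return weights

def frac_diff(y, weights):
    """Apply the weights to y; the first len(weights) - 1 points are dropped."""
    n = len(weights)
    return [
        sum(w * y[t - j] for j, w in enumerate(weights))
        for t in range(n - 1, len(y))
    ]

# With d = 1 the weights reduce to an ordinary first difference.
unit_weights = frac_diff_weights(1, window_size=2)
first_diff = frac_diff([1, 4, 9], unit_weights)
```

Either truncation rule yields a finite window: a fixed window_size caps the number of weights directly, while min_weight stops as soon as the next weight becomes negligible.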
impute(method)
Performs missing value imputation on numeric columns of a DataFrame grouped by entity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | Union[str, int, float] | The imputation method to use. Supported methods are: | required |
lag(lags, fill_strategy=None)
Applies lag transformation to a LazyFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lags | List[int] | A list of lag values to apply. | required |
fill_strategy | Optional[str] | Strategy to fill nulls by. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"]. | None |
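As a rough picture of what a lag transformation produces for one series, the sketch below shifts the values by each requested lag and leaves the leading positions empty, which is where a fill strategy would apply. The helper `lag_features` and the `lag_k` column names are hypothetical.

```python
def lag_features(y, lags):
    """Return one shifted copy of y per lag, padded with None at the start."""
    n = len(y)
    return {
        f"lag_{k}": [None] * min(k, n) + list(y[: max(n - k, 0)])
        for k in lags
    }

lagged = lag_features([1, 2, 3, 4], lags=[1, 2])
```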
one_hot_encode(drop_first=False)
Encode categorical features as a one-hot numeric array.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
drop_first | bool | Drop the first one-hot feature. | False |
Raises:
Type | Description |
---|---|
ValueError | if X passed into |
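The effect of drop_first can be sketched on a single categorical column. This example assumes categories are ordered by sorting, which may not match the library's ordering; `one_hot` and the `cat_` prefix are hypothetical.

```python
def one_hot(values, drop_first=False):
    """Return one 0/1 indicator column per category."""
    categories = sorted(set(values))  # assumption: sorted category order
    if drop_first:
        categories = categories[1:]
    return {f"cat_{c}": [1 if v == c else 0 for v in values] for c in categories}

full = one_hot(["a", "b", "a"])
reduced = one_hot(["a", "b", "a"], drop_first=True)
```

Dropping the first indicator avoids perfectly collinear columns when the encoded features feed a linear model with an intercept.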
reindex(drop_duplicates=False)
Reindexes the entity and time columns to have every possible combination of (entity, time).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
drop_duplicates | bool | Defaults to False. If True, duplicates are dropped before reindexing. | False |
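Reindexing to every (entity, time) combination amounts to a Cartesian product of the unique entities and unique timestamps, with missing observations left empty. A minimal sketch on plain tuples, where `reindex_panel` is a hypothetical name:

```python
from itertools import product

def reindex_panel(rows, drop_duplicates=False):
    """rows: list of (entity, time, value) tuples. Returns the full grid."""
    if drop_duplicates:
        rows = list(dict.fromkeys(rows))  # keeps first occurrence, preserves order
    lookup = {(e, t): v for e, t, v in rows}
    entities = sorted({e for e, _, _ in rows})
    times = sorted({t for _, t, _ in rows})
    return [(e, t, lookup.get((e, t))) for e, t in product(entities, times)]

panel = [("a", 1, 10), ("a", 2, 11), ("b", 2, 20)]
full_grid = reindex_panel(panel)
```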
resample(freq, agg_method, impute_method)
Resamples and transforms a DataFrame using the specified frequency, aggregation method, and imputation method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
freq | str | Offset alias supported by Polars. | required |
agg_method | str | The aggregation method to use for resampling. Supported values are 'sum', 'mean', and 'median'. | required |
impute_method | Union[str, int, float] | The method used for imputing missing values. If a string, supported values are 'ffill' (forward fill) and 'bfill' (backward fill). If an int or float, missing values will be filled with the provided value. | required |
roll(window_sizes, stats, freq, fill_strategy=None)
Performs rolling window calculations on specified columns of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
window_sizes | List[int] | A list of integers representing the window sizes for the rolling calculations. | required |
stats | List[Literal['mean', 'min', 'max', 'mlm', 'sum', 'std', 'cv']] | A list of statistical measures to calculate for each rolling window. Supported values are: | required |
freq | str | Offset alias supported by Polars. | required |
fill_strategy | Optional[str] | Strategy to fill nulls by. Nulls are not filled if None. Supported strategies include: ["backward", "forward", "mean", "zero"]. | None |
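To show how a window size and a statistic combine, the sketch below computes a trailing rolling statistic over one series. It assumes trailing windows with empty leading positions and covers only four of the listed statistics; `rolling` is a hypothetical helper, not the library's implementation.

```python
from statistics import mean

def rolling(y, window_size, stat):
    """Trailing rolling statistic; the first window_size - 1 entries are None."""
    funcs = {"mean": mean, "min": min, "max": max, "sum": sum}
    func = funcs[stat]
    head = [None] * (window_size - 1)
    tail = [
        func(y[i - window_size + 1 : i + 1])
        for i in range(window_size - 1, len(y))
    ]
    return head + tail

series = [1, 2, 3, 4]
rolling_sum = rolling(series, window_size=2, stat="sum")
rolling_mean = rolling(series, window_size=2, stat="mean")
```

The empty leading entries are what fill_strategy would fill when it is not None.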
scale(use_mean=True, use_std=True, rescale_bool=False)
Performs scaling and rescaling operations on the numeric columns of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
use_mean | bool | Whether to subtract the mean from the numeric columns. Defaults to True. | True |
use_std | bool | Whether to divide the numeric columns by the standard deviation. Defaults to True. | True |
rescale_bool | bool | Whether to rescale boolean columns to the range [-1, 1]. Defaults to False. | False |
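The three flags can be illustrated on a single column. This sketch assumes the sample standard deviation (one degree of freedom removed) and maps booleans linearly onto -1 and 1; `scale_values` and `rescale_booleans` are hypothetical names.

```python
from statistics import mean, stdev

def scale_values(x, use_mean=True, use_std=True):
    """Optionally center by the mean and divide by the sample std."""
    mu = mean(x) if use_mean else 0.0
    sigma = stdev(x) if use_std else 1.0  # assumption: sample std (ddof = 1)
    return [(v - mu) / sigma for v in x]

def rescale_booleans(flags):
    """Map False to -1 and True to 1."""
    return [2 * int(f) - 1 for f in flags]

standardized = scale_values([1.0, 2.0, 3.0])
centered_only = scale_values([1.0, 2.0, 3.0], use_std=False)
```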
time_to_arange(eager=False)
Coerces the time column into an integer range (arange) per entity.
Assumes evenly spaced time series and homogeneous start dates.
trim(direction='both')
Trims time-series in panel to have the same start or end dates as the shortest time-series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
direction | Literal['both', 'left', 'right'] | Defaults to "both". If "left", trims from the start date of the shortest time series; if "right", trims up to the end date of the shortest time series; otherwise "both" trims between the start and end dates of the shortest time series. | 'both' |
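One way to picture the three directions is sketched below. It assumes the common window is bounded by the latest start and the earliest end across entities, which is only one reading of "shortest time series"; `trim_panel` is a hypothetical helper.

```python
def trim_panel(panel, direction="both"):
    """panel: dict mapping entity -> list of (time, value), sorted by time."""
    start = max(series[0][0] for series in panel.values())   # latest start
    end = min(series[-1][0] for series in panel.values())    # earliest end
    keep = {
        "left": lambda t: t >= start,
        "right": lambda t: t <= end,
        "both": lambda t: start <= t <= end,
    }[direction]
    return {e: [(t, v) for t, v in s if keep(t)] for e, s in panel.items()}

panel = {
    "a": [(1, 1), (2, 2), (3, 3), (4, 4)],
    "b": [(2, 5), (3, 6)],
}
trimmed = trim_panel(panel, direction="both")
```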
yeojohnson(brack=(-2, 2))
Applies the Yeo-Johnson transformation to numeric columns in a panel DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
brack | 2-tuple | The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket. | (-2, 2) |
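For reference, the Yeo-Johnson transform for a fixed lambda can be written in a few lines. Unlike Box-Cox, it accepts zero and negative values. This sketch covers only the transform itself, not the search for lambda that brack controls; `yeojohnson_transform` is a hypothetical name.

```python
import math

def yeojohnson_transform(x, lam):
    """Yeo-Johnson transform of real values for a fixed lambda."""
    out = []
    for v in x:
        if v >= 0:
            if lam == 0:
                out.append(math.log1p(v))
            else:
                out.append(((v + 1.0) ** lam - 1.0) / lam)
        else:
            if lam == 2:
                out.append(-math.log1p(-v))
            else:
                out.append(-((1.0 - v) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    return out

# With lambda = 1 the transform leaves the data unchanged.
unchanged = yeojohnson_transform([-2.0, 0.0, 3.0], lam=1)
```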