causal_curve package

causal_curve.core module

Core classes (with basic methods) that are invoked when the other model classes are defined

class causal_curve.core.Core

Bases: object

Base class for causal_curve module

static calculate_z_score(ci)

Calculates the critical z-score for a desired two-sided confidence interval width.

Parameters:
ci: float, the confidence interval width (e.g. 0.95)
Returns:
Float, critical z-score value
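For a two-sided interval, the critical value is the inverse normal CDF evaluated at 1 - (1 - ci) / 2. A minimal stand-alone sketch of this computation (illustrative, not the library’s internal code), using only the Python standard library:

```python
from statistics import NormalDist

def calculate_z_score(ci):
    """Critical z-score for a two-sided confidence interval of width `ci`."""
    # e.g. ci = 0.95 -> evaluate the inverse normal CDF at 0.975
    return NormalDist().inv_cdf(1 - (1 - ci) / 2)

print(round(calculate_z_score(0.95), 2))  # 1.96
```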
static clip_negatives(number)

Helper function to clip negative numbers to zero

Parameters:
number: int or float, any number that needs a floor at zero
Returns:
Int or float of modified value
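The behavior is equivalent to taking a floor at zero; an illustrative one-liner:

```python
def clip_negatives(number):
    """Floor a numeric value at zero, leaving non-negative values unchanged."""
    return max(number, 0)

print(clip_negatives(-3.2), clip_negatives(7))  # 0 7
```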
get_params()

Returns a dict of all of the object’s user-facing parameters

Parameters:
None
Returns:
Dict of the object’s user-facing parameter names mapped to their current values
if_verbose_print(string)

Prints the input statement if verbose is set to True

Parameters:
string: str, some string to be printed
Returns:
None
static rand_seed_wrapper(random_seed=None)

Sets the random seed using numpy

Parameters:
random_seed: int, random seed number
Returns:
None

causal_curve.gps_core module

Defines the Generalized Propensity Score (GPS) Core model class

class causal_curve.gps_core.GPS_Core(gps_family=None, treatment_grid_num=100, lower_grid_constraint=0.01, upper_grid_constraint=0.99, spline_order=3, n_splines=30, lambda_=0.5, max_iter=100, random_seed=None, verbose=False)

Bases: causal_curve.core.Core

In a multi-stage approach, this computes the generalized propensity score (GPS) function and uses it in a generalized additive model (GAM) to correct the treatment’s prediction of the outcome variable. Assumes a continuous treatment; the outcome variable may be continuous or binary.
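Conceptually, with a ‘normal’ gps_family the GPS for each observation is the estimated conditional density of its observed treatment value given its covariates. A simplified numpy sketch of that idea (illustrative only; not the library’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                         # covariates
T = X @ np.array([1.0, -0.5]) + rng.normal(size=500)  # continuous treatment

# Model T | X by ordinary least squares (the 'normal' family case)
design = np.column_stack([np.ones(500), X])
beta, *_ = np.linalg.lstsq(design, T, rcond=None)
resid = T - design @ beta
sigma = resid.std(ddof=design.shape[1])

# The GPS is the normal density of each observed treatment value given X
gps = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
print(gps.shape)  # (500,)
```

The fitted GPS values would then enter the GAM alongside the treatment itself.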

WARNING:

-This algorithm assumes you’ve already performed the necessary transformations to categorical covariates (i.e. these variables are already one-hot encoded and one of the categories is excluded for each set of dummy variables).

-Please take care to ensure that the “ignorability” assumption is met (i.e. all strong confounders are captured in your covariates and there is no informative censoring), otherwise your results will be biased, sometimes strongly so.

Parameters:
gps_family: str, optional (default = None)

Is used to determine the family of the glm used to model the GPS function. Look at the distribution of your treatment variable to determine which family is more appropriate. Possible values:

  • ‘normal’
  • ‘lognormal’
  • ‘gamma’
  • None : (best-fitting family automatically chosen)
treatment_grid_num: int, optional (default = 100)

Takes the treatment, and creates a quantile-based grid across its values. For instance, if the number 6 is selected, this means the algorithm will only take the 6 treatment variable values at approximately the 0, 20, 40, 60, 80, and 100th percentiles to estimate the causal dose response curve. Higher value here means the final curve will be more finely estimated, but also increases computation time. Default is usually a reasonable number.

lower_grid_constraint: float, optional (default = 0.01)

This adds an optional constraint on the lower side of the treatment grid. Sometimes data near the minimum values of the treatment are few in number and thus generate unstable estimates. By default, this clips treatment values at or below the 1st percentile. This can be as low as 0, indicating there is no lower limit to how much treatment data is considered.

upper_grid_constraint: float, optional (default = 0.99)

Just like the above parameter, but as an upper constraint. By default, this clips treatment values at or above the 99th percentile. This can be as high as 1.0, indicating there is no upper limit to how much treatment data is considered.
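Taken together, treatment_grid_num and the two grid constraints amount to evaluating evenly spaced quantiles of the treatment between the lower and upper bounds. A rough sketch of that behavior (an assumption based on the parameter descriptions above, not the library’s exact code):

```python
import numpy as np

def make_treatment_grid(T, grid_num=100, lower=0.01, upper=0.99):
    """Evenly spaced quantiles of T between the lower and upper constraints."""
    return np.quantile(T, np.linspace(lower, upper, grid_num))

# Skewed toy treatment; the grid stays dense where the data are dense
T = np.random.default_rng(42).gamma(shape=2.0, scale=3.0, size=1000)
grid = make_treatment_grid(T, grid_num=6)
print(grid.shape)  # (6,)
```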

spline_order: int, optional (default = 3)

Order of the splines to use when fitting the final GAM. Must be an integer >= 1. The default value creates cubic splines.

n_splines: int, optional (default = 30)

Number of splines to use for the treatment and GPS in the final GAM. Must be an integer >= 2.

lambda_: int or float, optional (default = 0.5)

Strength of smoothing penalty. Must be a positive float. Larger values enforce stronger smoothing.

max_iter: int, optional (default = 100)

Maximum number of iterations allowed for the maximum likelihood algorithm to converge.

random_seed: int, optional (default = None)

Sets the random seed.

verbose: bool, optional (default = False)

Determines whether the user will get verbose status updates.

References

Galagate, D. Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions with Applications. PhD thesis, 2016.

Moodie E and Stephens DA. Estimation of dose–response functions for longitudinal data using the generalised propensity score. In: Statistical Methods in Medical Research 21(2), 2010, pp.149–166.

Hirano K and Imbens GW. The propensity score with continuous treatments. In: Gelman A and Meng XL (eds) Applied Bayesian modeling and causal inference from incomplete-data perspectives. Oxford, UK: Wiley, 2004, pp.73–84.

Examples

>>> # With continuous outcome
>>> from causal_curve import GPS_Regressor
>>> gps = GPS_Regressor(treatment_grid_num = 200, random_seed = 512)
>>> gps.fit(T = df['Treatment'], X = df[['X_1', 'X_2']], y = df['Outcome'])
>>> gps_results = gps.calculate_CDRC(0.95)
>>> point_estimate = gps.point_estimate(np.array([5.0]))
>>> point_estimate_interval = gps.point_estimate_interval(np.array([5.0]), 0.95)
>>> # With binary outcome
>>> from causal_curve import GPS_Classifier
>>> gps = GPS_Classifier()
>>> gps.fit(T = df['Treatment'], X = df[['X_1', 'X_2']], y = df['Binary_Outcome'])
>>> gps_results = gps.calculate_CDRC(0.95)
>>> log_odds = gps.estimate_log_odds(np.array([5.0]))
Attributes:
grid_values: array of shape (treatment_grid_num, )

The gridded values of the treatment variable. Equally spaced.

best_gps_family: str

If no gps_family is specified and the algorithm chooses the best glm family, this is the name of the family that was chosen.

gps_deviance: float

The GPS model deviance

gps: array of shape (number of observations, )

The calculated GPS for each observation

gam_results: `pygam.LinearGAM` class

trained model of LinearGAM class, from pyGAM library

Methods

fit: (self, T, X, y) Fits the causal dose-response model.
calculate_CDRC: (self, ci) Calculates the CDRC (and confidence interval) from trained model.
print_gam_summary: (self) Prints pyGAM text summary of GAM predicting outcome from the treatment and the GPS.
calculate_CDRC(ci=0.95)

Using the results of the fitted model, this generates a dataframe of point estimates for the CDRC at each of the values of the treatment grid. Connecting these estimates will produce the overall estimated CDRC. Confidence interval is returned as well.

Parameters:
ci: float (default = 0.95)

The desired confidence interval to produce. Default value is 0.95, corresponding to 95% confidence intervals. Bounded (0, 1.0).

Returns:
dataframe: Pandas dataframe

Contains treatment grid values, the CDRC point estimate at that value, and the associated lower and upper confidence interval bounds at that point.

self: object
fit(T, X, y)

Fits the GPS causal dose-response model. For now, this only accepts pandas columns. While the treatment variable must be continuous (or ordinal with many levels), the outcome variable may be continuous or binary. You must provide at least one covariate column.

Parameters:
T: array-like, shape (n_samples,)

A continuous treatment variable.

X: array-like, shape (n_samples, m_features)

Covariates, where n_samples is the number of samples and m_features is the number of features. Features can be a mix of continuous and nominal/categorical variables.

y: array-like, shape (n_samples,)

Outcome variable. May be continuous or binary. If continuous, this must be a series of type float, if binary must be a series of type integer.

Returns:
self : object
print_gam_summary()

Prints the GAM model summary (uses pyGAM’s output)

Parameters:
None
Returns:
self: object

causal_curve.gps_regressor module

Defines the Generalized Propensity Score (GPS) regressor model class

class causal_curve.gps_regressor.GPS_Regressor(gps_family=None, treatment_grid_num=100, lower_grid_constraint=0.01, upper_grid_constraint=0.99, spline_order=3, n_splines=30, lambda_=0.5, max_iter=100, random_seed=None, verbose=False)

Bases: causal_curve.gps_core.GPS_Core

A GPS tool that handles continuous outcomes. Inherits from the GPS_Core base class; see that base class’s code and docstring for more details.

Methods

point_estimate: (self, T) Calculates point estimate within the CDRC given treatment values. Can only be used when outcome is continuous.
point_estimate_interval: (self, T, ci) Calculates the prediction confidence interval associated with a point estimate within the CDRC given treatment values. Can only be used when outcome is continuous.
point_estimate(T)

Calculates a point estimate within the CDRC given treatment values. Can only be used when the outcome is continuous. Can be estimated for a single data point or run in batch for many observations. Extrapolation will produce untrustworthy results; the provided treatment values should be within the range of the training data.

Parameters:
T: Numpy array, shape (n_samples,)

A continuous treatment variable.

Returns:
array: Numpy array

Contains a set of CDRC point estimates

point_estimate_interval(T, ci=0.95)

Calculates the prediction confidence interval associated with a point estimate within the CDRC given treatment values. Can only be used when the outcome is continuous. Can be estimated for a single data point or run in batch for many observations. Extrapolation will produce untrustworthy results; the provided treatment values should be within the range of the training data.

Parameters:
T: Numpy array, shape (n_samples,)

A continuous treatment variable.

ci: float (default = 0.95)

The desired confidence interval to produce. Default value is 0.95, corresponding to 95% confidence intervals. Bounded (0, 1.0).

Returns:
array: Numpy array

Contains a set of CDRC prediction intervals ([lower bound, higher bound])

causal_curve.gps_classifier module

Defines the Generalized Propensity Score (GPS) classifier model class

class causal_curve.gps_classifier.GPS_Classifier(gps_family=None, treatment_grid_num=100, lower_grid_constraint=0.01, upper_grid_constraint=0.99, spline_order=3, n_splines=30, lambda_=0.5, max_iter=100, random_seed=None, verbose=False)

Bases: causal_curve.gps_core.GPS_Core

A GPS tool that handles binary outcomes. Inherits from the GPS_Core base class; see that base class’s code and docstring for more details.

Methods

estimate_log_odds: (self, T) Calculates the predicted log odds of the highest integer class. Can only be used when the outcome is binary.
estimate_log_odds(T)

Calculates the estimated log odds of the highest integer class. Can only be used when the outcome is binary. Can be estimated for a single data point or run in batch for many observations. Extrapolation will produce untrustworthy results; the provided treatment values should be within the range of the training data.

Parameters:
T: Numpy array, shape (n_samples,)

A continuous treatment variable.

Returns:
array: Numpy array

Contains a set of log odds

causal_curve.tmle_core module

Defines the Targeted Maximum Likelihood Estimation (TMLE) model class

class causal_curve.tmle_core.TMLE_Core(treatment_grid_num=100, lower_grid_constraint=0.01, upper_grid_constraint=0.99, n_estimators=200, learning_rate=0.01, max_depth=3, bandwidth=0.5, random_seed=None, verbose=False)

Bases: causal_curve.core.Core

Constructs a causal dose-response curve via a modified version of Targeted Maximum Likelihood Estimation (TMLE) across a grid of the treatment values. Gradient boosting is used for prediction of the Q and G models, simple kernel regression is used to process those model results, and a generalized additive model is used in the final step to construct the final curve. Assumes continuous treatment and outcome variables.

WARNING:

-The treatment values should be roughly normally distributed for this tool to work. Otherwise you may encounter internal math errors.

-This algorithm assumes you’ve already performed the necessary transformations to categorical covariates (i.e. these variables are already one-hot encoded and one of the categories is excluded for each set of dummy variables).

-Please take care to ensure that the “ignorability” assumption is met (i.e. all strong confounders are captured in your covariates and there is no informative censoring), otherwise your results will be biased, sometimes strongly so.

Parameters:
treatment_grid_num: int, optional (default = 100)

Takes the treatment, and creates a quantile-based grid across its values. For instance, if the number 6 is selected, this means the algorithm will only take the 6 treatment variable values at approximately the 0, 20, 40, 60, 80, and 100th percentiles to estimate the causal dose response curve. Higher value here means the final curve will be more finely estimated, but also increases computation time. Default is usually a reasonable number.

lower_grid_constraint: float, optional (default = 0.01)

This adds an optional constraint on the lower side of the treatment grid. Sometimes data near the minimum values of the treatment are few in number and thus generate unstable estimates. By default, this clips treatment values at or below the 1st percentile. This can be as low as 0, indicating there is no lower limit to how much treatment data is considered.

upper_grid_constraint: float, optional (default = 0.99)

Just like the above parameter, but as an upper constraint. By default, this clips treatment values at or above the 99th percentile. This can be as high as 1.0, indicating there is no upper limit to how much treatment data is considered.

n_estimators: int, optional (default = 200)

Optional argument to set the number of learners to use when sklearn creates TMLE’s Q and G models.

learning_rate: float, optional (default = 0.01)

Optional argument to set sklearn’s learning rate for TMLE’s Q and G models.

max_depth: int, optional (default = 3)

Optional argument to set sklearn’s maximum depth when creating TMLE’s Q and G models.

bandwidth: float, optional (default = 0.5)

Optional argument to set the bandwidth parameter of the internal kernel density estimation and kernel regression methods.
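To illustrate what this bandwidth controls, here is a minimal Nadaraya-Watson kernel regression sketch (a generic stand-in, not the library’s internal routine). Smaller bandwidths track the data more closely; larger ones smooth more aggressively.

```python
import numpy as np

def kernel_regression(train_x, train_y, x_to_pred, bandwidth=0.5):
    """Nadaraya-Watson estimator with a Gaussian kernel of the given bandwidth."""
    # Kernel weight of every training point for every prediction point
    w = np.exp(-0.5 * ((x_to_pred[:, None] - train_x[None, :]) / bandwidth) ** 2)
    return (w @ train_y) / w.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
y_hat = kernel_regression(x, y, np.array([2.5, 5.0]), bandwidth=0.5)
print(y_hat.shape)  # (2,)
```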

random_seed: int, optional (default = None)

Sets the random seed.

verbose: bool, optional (default = False)

Determines whether the user will get verbose status updates.

References

Kennedy EH, Ma Z, McHugh MD, Small DS. Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society, Series B. 79(4), 2017, pp.1229-1245.

van der Laan MJ and Rubin D. Targeted maximum likelihood learning. In: The International Journal of Biostatistics, 2(1), 2006.

van der Laan MJ and Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. In: The International Journal of Biostatistics 6(1), 2010.

Examples

>>> # With continuous outcome
>>> from causal_curve import TMLE_Regressor
>>> tmle = TMLE_Regressor()
>>> tmle.fit(T = df['Treatment'], X = df[['X_1', 'X_2']], y = df['Outcome'])
>>> tmle_results = tmle.calculate_CDRC(0.95)
>>> point_estimate = tmle.point_estimate(np.array([5.0]))
>>> point_estimate_interval = tmle.point_estimate_interval(np.array([5.0]), 0.95)
Attributes:
grid_values: array of shape (treatment_grid_num, )

The gridded values of the treatment variable. Equally spaced.

final_gam: `pygam.LinearGAM` class

trained final model of LinearGAM class, from pyGAM library

pseudo_out: array of shape (observations, )

Adjusted, pseudo-outcome observations

Methods

fit: (self, T, X, y) Fits the causal dose-response model
calculate_CDRC: (self, ci) Calculates the CDRC (and confidence interval) from TMLE estimation
calculate_CDRC(ci=0.95)

Using the results of the fitted model, this generates a dataframe of CDRC point estimates at each of the values of the treatment grid. Connecting these estimates will produce the overall estimated CDRC. Confidence interval is returned as well.

Parameters:
ci: float (default = 0.95)

The desired confidence interval to produce. Default value is 0.95, corresponding to 95% confidence intervals. Bounded (0, 1.0).

Returns:
dataframe: Pandas dataframe

Contains treatment grid values, the CDRC point estimate at that value, and the associated lower and upper confidence interval bounds at that point.

self: object
fit(T, X, y)

Fits the TMLE causal dose-response model. For now, this only accepts pandas columns. You must provide at least one covariate column.

Parameters:
T: array-like, shape (n_samples,)

A continuous treatment variable

X: array-like, shape (n_samples, m_features)

Covariates, where n_samples is the number of samples and m_features is the number of features

y: array-like, shape (n_samples,)

Outcome variable

Returns:
self : object
one_dim_estimate_density(series)

Takes in a numpy array, returns grid values for KDE and predicted probabilities

pred_from_loess(train_x, train_y, x_to_pred)

Trains simple loess regression and returns predictions

causal_curve.tmle_regressor module

Defines the Targeted Maximum Likelihood Estimation (TMLE) regressor model class

class causal_curve.tmle_regressor.TMLE_Regressor(treatment_grid_num=100, lower_grid_constraint=0.01, upper_grid_constraint=0.99, n_estimators=200, learning_rate=0.01, max_depth=3, bandwidth=0.5, random_seed=None, verbose=False)

Bases: causal_curve.tmle_core.TMLE_Core

A TMLE tool that handles continuous outcomes. Inherits from the TMLE_Core base class; see that base class’s code and docstring for more details.

Methods

point_estimate: (self, T) Calculates point estimate within the CDRC given treatment values. Can only be used when outcome is continuous.
point_estimate_interval: (self, T, ci) Calculates the prediction confidence interval associated with a point estimate within the CDRC given treatment values. Can only be used when outcome is continuous.
point_estimate(T)

Calculates a point estimate within the CDRC given treatment values. Can only be used when the outcome is continuous. Can be estimated for a single data point or run in batch for many observations. Extrapolation will produce untrustworthy results; the provided treatment values should be within the range of the training data.

Parameters:
T: Numpy array, shape (n_samples,)

A continuous treatment variable.

Returns:
array: Numpy array

Contains a set of CDRC point estimates

point_estimate_interval(T, ci=0.95)

Calculates the prediction confidence interval associated with a point estimate within the CDRC given treatment values. Can only be used when the outcome is continuous. Can be estimated for a single data point or run in batch for many observations. Extrapolation will produce untrustworthy results; the provided treatment values should be within the range of the training data.

Parameters:
T: Numpy array, shape (n_samples,)

A continuous treatment variable.

ci: float (default = 0.95)

The desired confidence interval to produce. Default value is 0.95, corresponding to 95% confidence intervals. Bounded (0, 1.0).

Returns:
array: Numpy array

Contains a set of CDRC prediction intervals ([lower bound, higher bound])

causal_curve.mediation module

Defines the Mediation test class

class causal_curve.mediation.Mediation(treatment_grid_num=10, lower_grid_constraint=0.01, upper_grid_constraint=0.99, bootstrap_draws=500, bootstrap_replicates=100, spline_order=3, n_splines=5, lambda_=0.5, max_iter=100, random_seed=None, verbose=False)

Bases: causal_curve.core.Core

Given three continuous variables (a treatment or independent variable of interest, a potential mediator, and an outcome variable of interest), Mediation provides a method to determine the average direct and indirect effect.
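For intuition, in the fully linear special case the indirect effect is the product of the treatment-to-mediator and mediator-to-outcome paths, while the direct effect is the treatment coefficient holding the mediator fixed. A deliberately simplified linear sketch of that decomposition (Mediation itself uses GAMs and bootstrapping, not ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
T = rng.normal(size=n)                                  # treatment
M = 0.8 * T + rng.normal(scale=0.5, size=n)             # mediator (a-path = 0.8)
y = 0.5 * T + 0.6 * M + rng.normal(scale=0.5, size=n)   # outcome (direct = 0.5, b-path = 0.6)

# a-path: slope of M on T; direct effect and b-path: regress y on T and M jointly
a = np.polyfit(T, M, 1)[0]
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), T, M]), y, rcond=None)
direct, b = coef[1], coef[2]
indirect = a * b  # effect transmitted through the mediator (~0.8 * 0.6 = 0.48)
```

With these simulated coefficients the recovered direct effect is close to 0.5 and the indirect effect close to 0.48; Mediation estimates analogous quantities nonparametrically across the treatment grid.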

Parameters:
treatment_grid_num: int, optional (default = 10)

Takes the treatment, and creates a quantile-based grid across its values. For instance, if the number 6 is selected, this means the algorithm will only take the 6 treatment variable values at approximately the 0, 20, 40, 60, 80, and 100th percentiles to estimate the causal dose response curve. Higher value here means the final curve will be more finely estimated, but also increases computation time. Default is usually a reasonable number.

lower_grid_constraint: float, optional (default = 0.01)

This adds an optional constraint on the lower side of the treatment grid. Sometimes data near the minimum values of the treatment are few in number and thus generate unstable estimates. By default, this clips treatment values at or below the 1st percentile. This can be as low as 0, indicating there is no lower limit to how much treatment data is considered.

upper_grid_constraint: float, optional (default = 0.99)

Just like the above parameter, but as an upper constraint. By default, this clips treatment values at or above the 99th percentile. This can be as high as 1.0, indicating there is no upper limit to how much treatment data is considered.

bootstrap_draws: int, optional (default = 500)

Bootstrapping is used as part of the mediation test. The parameter determines the number of draws from the original data to create a single bootstrap replicate.

bootstrap_replicates: int, optional (default = 100)

Bootstrapping is used as part of the mediation test. The parameter determines the number of bootstrapping runs to perform / number of new datasets to create.
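These two parameters map onto the standard percentile-bootstrap recipe: bootstrap_replicates resampled datasets, each built from bootstrap_draws draws with replacement. A generic standard-library sketch of that recipe (a hypothetical helper, not part of causal_curve):

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, draws=500, replicates=100, ci=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a statistic of `data`."""
    rng = random.Random(seed)
    # One statistic per bootstrap replicate, each from `draws` resampled points
    estimates = sorted(stat(rng.choices(data, k=draws)) for _ in range(replicates))
    alpha = (1 - ci) / 2
    return estimates[int(alpha * replicates)], estimates[int((1 - alpha) * replicates) - 1]

r = random.Random(1)
data = [r.gauss(10, 2) for _ in range(1000)]
lo, hi = bootstrap_ci(data)
print(lo < hi)  # True
```

Note the resulting interval is taken from the empirical quantiles of the replicate statistics, which is why the intervals reported by calculate_mediation need not be symmetric.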

spline_order: int, optional (default = 3)

Order of the splines to use when fitting the final GAM. Must be an integer >= 1. The default value creates cubic splines.

n_splines: int, optional (default = 5)

Number of splines to use for the mediation and outcome GAMs. Must be an integer >= 2.

lambda_: int or float, optional (default = 0.5)

Strength of smoothing penalty. Must be a positive float. Larger values enforce stronger smoothing.

max_iter: int, optional (default = 100)

Maximum number of iterations allowed for the maximum likelihood algorithm to converge.

random_seed: int, optional (default = None)

Sets the random seed.

verbose: bool, optional (default = False)

Determines whether the user will get verbose status updates.

References

Imai K., Keele L., Tingley D. A General Approach to Causal Mediation Analysis. Psychological Methods. 15(4), 2010, pp.309–334.

Examples

>>> from causal_curve import Mediation
>>> med = Mediation(treatment_grid_num = 200, random_seed = 512)
>>> med.fit(T = df['Treatment'], M = df['Mediator'], y = df['Outcome'])
>>> med_results = med.calculate_mediation(0.95)
Attributes:
grid_values: array of shape (treatment_grid_num, )

The gridded values of the treatment variable. Equally spaced.

Methods

fit: (self, T, M, y) Fits the trio of relevant variables using generalized additive models.
calculate_mediation: (self, ci) Conducts mediation analysis and calculates the average direct and indirect effects.
calculate_mediation(ci=0.95)

Conducts mediation analysis on the fitted data

Parameters:
ci: float (default = 0.95)

The desired bootstrap confidence interval to produce. Default value is 0.95, corresponding to 95% confidence intervals. Bounded (0, 1.0).

Returns:
dataframe: Pandas dataframe

Contains the estimate of the direct and indirect effects and the proportion of indirect effects across the treatment grid values. The bootstrap confidence interval that is returned might not be symmetric.

self : object
fit(T, M, y)

Fits models so that mediation analysis can be run. For now, this only accepts pandas columns.

Parameters:
T: array-like, shape (n_samples,)

A continuous treatment variable

M: array-like, shape (n_samples,)

A continuous mediation variable

y: array-like, shape (n_samples,)

A continuous outcome variable

Returns:
self : object

Module contents

causal_curve module