【python因果库实战13】因果生存分析2
加权生存分析
参见《因果推断》一书第17.4节(“边际结构模型的逆概率加权”)。
我们可以通过使用 causallib 的 WeightEstimator
(如 IPW)生成一个加权的伪人群来调整混淆因子,使得戒烟者和非戒烟者变得可互换。然后,我们可以在这个加权人群中计算生存曲线以得到无混淆效应。
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from causallib.estimation import IPW
from causallib.evaluation import evaluate
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=ConvergenceWarning)
# Fit an inverse propensity model
ipw = IPW(learner=LogisticRegression(max_iter=800))
ipw.fit(X, a)
# Evaluate
evaluation_results = evaluate(ipw, X, a, y, metrics_to_evaluate={})
f, ax = plt.subplots(figsize=(8, 8))
evaluation_results.plot_covariate_balance(kind="love", ax=ax, phase="train");
我们现在有了一个 IPW 模块,它实现了良好的特征平衡(加权后所有特征的标准均值差 SMD < 0.1)。让我们将 IPW 与生存分析结合起来,使用 causalib.survival.WeightedSurvival
模块。
from causallib.survival.weighted_survival import WeightedSurvival
# Compute adjusted survival curves
weighted_survival = WeightedSurvival(weight_model=ipw)
weighted_survival.fit(
X, a
) # fit weight model (we can actually skip this since it was already fitted above)
population_averaged_survival_curves = weighted_survival.estimate_population_outcome(X, a, t, y)
plot_survival_curves(
population_averaged_survival_curves,
labels=["non-quitters", "quitters"],
title="IPW-adjusted survival of smoke quitters vs. non-quitters in a 10 years observation period",
)
调整后的10年生存率差异减小(可能减少到了不显著的程度。这一点可以通过自助抽样等方法确定)。
这些曲线是使用内部的、非参数默认 Kaplan-Meier 估计器生成的。它们可以通过使用参数化风险模型来进一步平滑。需要注意的是,加权风险模型只基于时间条件进行建模,例如,它不考虑协变量(不像标准化,见下文)。
# Compute adjusted survival curves with a parametric hazards model
weighted_survival = WeightedSurvival(
weight_model=ipw, survival_model=LogisticRegression()
) # note the survival_model param (use a parametric hazards model)
weighted_survival.fit(X, a)
population_averaged_survival_curves = weighted_survival.estimate_population_outcome(X, a, t, y)
plot_survival_curves(
population_averaged_survival_curves,
labels=["non-quitters", "quitters"],
title="IPW-adjusted survival of smoke quitters vs. non-quitters in a 10 years observation period, parametric curves",
)
这些曲线可能过于平滑,因为它们是用线性风险模型拟合的。我们可以使用一些特征工程来构建更具表现力的替代方案:
# Compute adjusted survival curves with a parametric hazards model and feature engineering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# Init sklearn pipeline with feature transformation and logistic regression
pipeline = Pipeline(
[("transform", PolynomialFeatures(degree=2)), ("LR", LogisticRegression(max_iter=1000))]
)
weighted_survival = WeightedSurvival(
weight_model=ipw, survival_model=pipeline
) # note the survival_model param (use a parametric hazards model)
weighted_survival.fit(X, a)
population_averaged_survival_curves = weighted_survival.estimate_population_outcome(X, a, t, y)
plot_survival_curves(
population_averaged_survival_curves,
labels=["non-quitters", "quitters"],
title="IPW-adjusted survival of smoke quitters vs. non-quitters in a 10 years observation period, parametric curves ver. 2",
)
或者,我们可以插入 lifelines
包中的任何 UnivariateFitter
类,例如 PiecewiseExponentialFitter
或 WeibullFitter
:
# Compute adjusted survival curves with a lifelines PieacewiseExponentialFitter
weighted_survival = WeightedSurvival(
weight_model=ipw,
survival_model=lifelines.PiecewiseExponentialFitter(breakpoints=range(1, 120, 30)),
)
weighted_survival.fit(X, a)
population_averaged_survival_curves = weighted_survival.estimate_population_outcome(X, a, t, y)
plot_survival_curves(
population_averaged_survival_curves,
labels=["non-quitters", "quitters"],
title="IPW-adjusted survival of smoke quitters vs. non-quitters in a 10 years observation period, lifelines PiecewiseExponentialFitter",
)