【python因果推断库9】工具变量回归与使用 pymc 验证工具变量2
目录
数据:教育回报
数据:教育回报
我们使用的是1976年美国国家青年男性纵向调查(NLSYM)的数据集。这些数据来源于 R 语言的 `ivreg` 包。最初,这些数据被用来重现 Card 在1995年的著名论文《Using Geographical Variation in College Proximity to Estimate the Return to Schooling》中的部分内容,他在该文中旨在提炼教育对工资的因果影响。card1995returns
df = cp.load_data("schoolReturns")
def poly(x, p): # replicate R's poly decompostion function
X = np.transpose(np.vstack([x**k for k in range(p + 1)]))
return np.linalg.qr(X)[0][:, 1:]
df["log_wage"] = np.log(df["wage"])
df[["experience_1", "experience_2"]] = pd.DataFrame(poly(df["experience"].values, 2))
df[["age_1", "age_2"]] = poly(df["age"].values, 2)
df["nearcollege_indicator"] = np.where(
df["nearcollege"] == "yes", 1, 0
) # 4 year college
df["nearcollege2_indicator"] = np.where(
df["nearcollege2"] == "yes", 1, 0
) # 2 year college
df["nearcollege4_indicator"] = np.where(
df["nearcollege4"] == "yes", 1, 0
) # 4 year public or private college
df["south_indicator"] = np.where(df["south"] == "yes", 1, 0) # southern US
df["smsa_indicator"] = np.where(
df["smsa"] == "yes", 1, 0
) # standard metropolitan statistical area
df["ethnicity_indicator"] = np.where(
df["ethnicity"] == "afam", 1, 0
) # African-American or other
df[["wage", "education", "experience", "ethnicity", "nearcollege"]].head(5)
我们可以按子群体划分人口,并仅仅观察结果如何随着人口分层的不同水平而变化。
strata_df = (
df.assign(experience_cut=lambda x: pd.qcut(x["experience_1"], 4))
.groupby(["nearcollege", "smsa", "ethnicity", "south", "experience_cut"])[
["log_wage"]
]
.mean()
.reset_index()
.pivot(
index=["smsa", "ethnicity", "south", "experience_cut"],
columns="nearcollege",
values="log_wage",
)
.reset_index()
.assign(diff=lambda x: x["no"] - x["yes"])
.sort_values("diff", ascending=False)
)
strata_df.head(10).style.background_gradient(axis=0, subset="diff")
strata_df["diff"].mean()
-0.06568303099960848
这应该能突出显示在保持其他条件不变的情况下,当我们简单地改变工具变量状态时,我们可能期望的结果变量的变化幅度。接下来,我们想可视化教育和工资之间的关系,并理想情况下看看大学的接近程度是否以及在何处对教育程度产生了影响。
def rand_jitter(arr):
stdev = 0.01 * (max(arr) - min(arr))
return arr + np.random.randn(len(arr)) * stdev
custom_lines = [
Line2D([0], [0], color="red", lw=4),
Line2D([0], [0], color="blue", lw=4),
]
fig, axs = plt.subplots(2, 1, figsize=(10, 9))
axs = axs.flatten()
ax = axs[0]
ax1 = axs[1]
ax.set_title("Log Wage ~ Education")
col = np.where(df.nearcollege_indicator, "b", "r")
ax.scatter(
rand_jitter(df["education"]), df["log_wage"], c=col, alpha=0.3, rasterized=True
)
ax.legend(custom_lines, ["Not NearCollege", "Near College"])
ax.set_xlabel("Education")
ax.set_ylabel("Log Wage")
x = np.linspace(0, 20, 100)
b, a = np.polyfit(df["education"], df["log_wage"], 1)
y = a + b * x
ax.plot(x, y, color="k", label="Regression Line", rasterized=True)
ellipse = Ellipse(
(17.5, 6.5),
width=4,
height=2.5,
angle=90,
facecolor=None,
alpha=0.1,
color="purple",
linestyle="--",
rasterized=True,
)
ax.add_patch(ellipse)
custom_lines = [
Line2D([0], [0], color="green", lw=4),
Line2D([0], [0], color="purple", lw=4),
]
ax1.set_title("Log Wage ~ Education")
col = np.where(df.nearcollege2_indicator, "g", "purple")
ax1.plot(x, y, color="k", label="Regression Line", rasterized=True)
ellipse = Ellipse(
(16, 6.5),
width=4,
height=1.5,
angle=90,
facecolor=None,
alpha=0.1,
color="blue",
linestyle="--",
rasterized=True,
)
ax1.add_patch(ellipse)
ax1.scatter(
rand_jitter(df["education"]), df["log_wage"], c=col, alpha=0.6, rasterized=True
)
ax1.legend(custom_lines, ["Not Near 2 year College", "Near 2 year College"])
ax1.set_xlabel("Education")
ax1.set_ylabel("Log Wage");
这里至少有一些迹象表明,大学的接近程度对教育程度有一定影响,而且教育似乎推动了工资的增长。通常的方法尝试正式量化工具变量的强度。但从视觉上看,大学接近程度会诱导更高的教育水平这一点似乎很有说服力。例如,我们可以清楚地看到教育程度和工资获取之间存在明显的上升趋势关系。此外,在第一个图中,我们可以看到在接近四年制大学较差的人群中,达到最高教育成果的人数稀少,即椭圆中的红色点比蓝色点少。第二个图讲述了类似的故事,但聚焦于假设接近两年制学院时的最大教育程度。这些观察应该使我们倾向于基于大学接近程度来调整对工资增长的条件期望。