Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any dataframe.

[1]:

import os, sys
sys.path.append(os.path.abspath("../../../"))

import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS

%matplotlib inline

Getting the causal model and data

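Using dowhy's datasets module, we obtain a dataframe together with the causal model that generated it, then use the `do` operation (from $do$-calculus) to generate counterfactual samples and study their properties.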

[2]:

data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments = 0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))
# Adding noise to data. Without noise, the variance in Y|X, Z is zero, and mcmc fails.
# data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment= data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df

[2]:

	W0	v0	y
0	0.240697	True	4.928414
1	0.000084	True	5.173862
2	0.950475	True	5.002643
3	1.418750	True	4.310699
4	-0.332002	False	1.383824
...	...	...	...
995	1.875115	True	4.329813
996	0.538284	True	5.484955
997	-0.634770	False	1.436272
998	0.594890	True	5.336496
999	0.382532	True	5.720229

1000 rows × 3 columns

[3]:

# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True)\
          .groupby(treatment).mean().plot(y=outcome, kind='bar')

WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'W0']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.

[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c29d94990>

[Figure: bar chart of the mean of y grouped by v0, computed on the do-sampled dataframe]

[4]:

df.causal.do(x={treatment: 1},
              variable_types={treatment:'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              method='weighting',
              common_causes=[common_cause],
              proceed_when_unidentifiable=True)\
         .groupby(treatment).mean().plot(y=outcome, kind='bar')

WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'W0']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.

[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x1c29eae710>

[Figure: bar chart of the mean of y grouped by v0, computed on the dataframe sampled with v0 set to 1]

[5]:

cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'W0']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['v0'] on outcome ['y']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'W0']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.

[6]:

cdf_0

[6]:

	W0	v0	y	propensity_score	weight
0	-0.701821	False	-0.183813	0.827329	1.208708
1	0.566814	False	-0.378917	0.223957	4.465139
2	-0.072888	False	-0.444856	0.543380	1.840333
3	-1.369989	False	-0.646138	0.954634	1.047522
4	0.407697	False	1.122292	0.291034	3.436021
...	...	...	...	...	...
995	1.481427	False	-1.116987	0.036675	27.266477
996	0.378184	False	0.581096	0.304702	3.281893
997	1.291855	False	0.696985	0.054761	18.261136
998	1.003645	False	1.358347	0.098840	10.117335
999	0.898615	False	-1.507249	0.121578	8.225183

1000 rows × 5 columns
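The `propensity_score` and `weight` columns in this table look like reciprocals of each other, which is what inverse-propensity weighting would produce. The short check below confirms this for the first five printed rows. The values are copied from the table, and any gap is expected to come from display rounding only.

```python
# (propensity_score, weight) pairs copied from the first five rows of cdf_0 above.
rows = [
    (0.827329, 1.208708),
    (0.223957, 4.465139),
    (0.543380, 1.840333),
    (0.954634, 1.047522),
    (0.291034, 3.436021),
]

# Compare 1 / propensity_score against the printed weight for each row.
for p, wgt in rows:
    print(f"1 / {p:.6f} = {1.0 / p:.6f}   printed weight = {wgt:.6f}")

# Largest absolute gap across the five rows.
max_gap = max(abs(1.0 / p - wgt) for p, wgt in rows)
print("largest gap:", max_gap)
```

Rows with a small propensity score, such as row 995, receive a large weight (about 27), so a handful of rows can dominate the resampled dataframe.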

[7]:

cdf_1

[7]:

	W0	v0	y	propensity_score	weight
0	1.645192	True	3.027437	0.974193	1.026490
1	0.263406	True	3.968999	0.638951	1.565066
2	1.868578	True	4.441714	0.984104	1.016152
3	0.830910	True	4.575607	0.861479	1.160794
4	2.290609	True	4.205689	0.993697	1.006343
...	...	...	...	...	...
995	1.852278	True	5.378302	0.983530	1.016746
996	2.451362	True	3.075562	0.995576	1.004443
997	1.513352	True	4.964406	0.965743	1.035473
998	1.085572	True	4.875919	0.916186	1.091482
999	1.944332	True	6.514778	0.986526	1.013658

1000 rows × 5 columns
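The logs report that a `WeightingSampler` was used, and both tables carry `propensity_score` and `weight` columns. The following is a minimal numpy sketch of the general idea behind inverse-propensity weighted resampling. It is not DoWhy's implementation: the data, the coefficients, the `do_sample` helper, and the use of the true propensity instead of a fitted model are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: a confounder w drives both the treatment t and the outcome y.
w = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-w))        # true P(t=1 | w); assumed known in this sketch
t = rng.random(n) < p_treat
y = 5.0 * t + w + rng.normal(size=n)      # simulated treatment effect of 5

def do_sample(value, size):
    """Resample rows whose observed treatment equals `value`,
    with probability proportional to 1 / P(t=value | w)."""
    idx = np.flatnonzero(t == value)
    propensity = np.where(t[idx], p_treat[idx], 1.0 - p_treat[idx])
    weight = 1.0 / propensity
    chosen = rng.choice(idx, size=size, replace=True, p=weight / weight.sum())
    return y[chosen]

# The naive group difference is confounded by w; the weighted one is not.
naive = y[t].mean() - y[~t].mean()
weighted = do_sample(True, n).mean() - do_sample(False, n).mean()
print(f"naive difference:    {naive:.2f}")
print(f"weighted difference: {weighted:.2f}")
```

In this sketch the naive difference overstates the effect because treated rows tend to have larger `w`, while the weighted resample should land close to the simulated effect of 5.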

Comparing with a linear regression model

First, we estimate the effect from the causal dataframes, along with the half-width of its 95% confidence interval.

[8]:

(cdf_1['y'] - cdf_0['y']).mean()

[8]:

$\displaystyle 4.748886311602873$

[9]:

1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))

[9]:

$\displaystyle 0.08828099517898987$
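Cells [8] and [9] print a point estimate and the half-width of a normal-approximation interval (1.96 standard errors). The snippet below combines the two printed numbers into explicit interval bounds. It uses only the values shown above and does not rerun the sampler.

```python
# Values copied from the outputs of cells [8] and [9].
estimate = 4.748886311602873       # mean of cdf_1['y'] - cdf_0['y']
half_width = 0.08828099517898987   # 1.96 * std / sqrt(n)

# Normal-approximation 95% interval: estimate +/- half_width.
lower = estimate - half_width
upper = estimate + half_width
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")   # prints 95% CI: [4.661, 4.837]
```

For this particular run the interval does not contain the simulated `beta=5` passed to `linear_dataset`, which is worth keeping in mind when reading the OLS comparison.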

Next, we compare this with the estimate from OLS.

[10]:

model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()

[10]:

OLS Regression Results
Dep. Variable:	y	R-squared (uncentered):	0.951
Model:	OLS	Adj. R-squared (uncentered):	0.951
Method:	Least Squares	F-statistic:	9757.
Date:	Thu, 19 Mar 2020	Prob (F-statistic):	0.00
Time:	21:28:02	Log-Likelihood:	-1421.4
No. Observations:	1000	AIC:	2847.
Df Residuals:	998	BIC:	2857.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
x1	0.1327	0.038	3.493	0.000	0.058	0.207
x2	4.9106	0.060	82.020	0.000	4.793	5.028

Omnibus:	2.153	Durbin-Watson:	1.943
Prob(Omnibus):	0.341	Jarque-Bera (JB):	2.129
Skew:	-0.113	Prob(JB):	0.345
Kurtosis:	2.995	Cond. No.	3.34

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
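The regression above is fitted without an intercept, which is why the summary reports uncentered R-squared values. As a hedged illustration of what adding a constant term changes, here is a numpy-only sketch on hypothetical simulated data. The coefficients and variable names are assumptions, and in statsmodels the analogous step would be `statsmodels.api.add_constant` on the design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data with no true intercept: y = 5*t + 0.5*w + noise.
w = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-w))).astype(float)
y = 5.0 * t + 0.5 * w + rng.normal(size=n)

# Design matrices: columns [w, t] as in the cell above, then with a leading column of ones.
X = np.column_stack([w, t])
X_const = np.column_stack([np.ones(n), w, t])

# Ordinary least squares via numpy's least-squares solver.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("without intercept [w, t]:      ", coef)
print("with intercept [const, w, t]:  ", coef_const)
```

When the data-generating process has no intercept, as assumed here, both fits should recover a treatment coefficient near 5 and the fitted constant should be close to zero. With a nonzero baseline, omitting the constant would bias the other coefficients.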