FAQ:

  • What is the weighting do sampler, dowhy.do_samplers.weighting_sampler.WeightingSampler? Presumably a discriminative model that estimates propensity scores with logistic regression.

An Introduction to the Do-sampler

— by Adam Kelleher, compiled and translated by Heyang Gong

The “do-sampler” is a new feature in DoWhy. While most potential-outcomes-oriented estimators focus on estimating a specific contrast such as \(E[Y_1 - Y_0]\), Pearlian inference targets more fundamental causal quantities, such as the distribution of counterfactual outcomes \(P(Y^x = y)\), from which other statistics of interest can be derived.

In general, it is hard to represent a probability distribution non-parametrically. Even if you could, you wouldn't want to gloss over finite-sample problems with the data you used to generate it. With these issues in mind, we decided to represent interventional distributions by sampling from them with an object called the “do-sampler”. With these samples, we can hope to compute finite-sample statistics of our interventional data. If we bootstrap many such samples, we can even hope for good sampling distributions of those statistics.

That said, this is still an area of active research, so you should be careful about being too confident in bootstrapped error bars from do-samplers.

Note that do samplers sample from the outcome distribution, so their output will vary significantly from sample to sample. To use them to compute outcomes, we recommend generating several such samples to get an idea of the posterior variance of your statistic of interest.

Pearlian Interventions

Following the notion of an intervention in a Pearlian causal model, our do-samplers perform the following steps in sequence:

  1. Disrupt causes

  2. Make Effective

  3. Propagate and sample

In the first stage, we imagine cutting the in-edges to all of the variables we're intervening on. In the second stage, we set the values of those variables to their interventional quantities. In the third stage, we propagate those values forward through the model to compute interventional outcomes with a sampling procedure.

In practice, these steps can be implemented in many ways. They're most explicit when we build the model as a linear Bayesian network in PyMC3, which is what underlies the MCMC do sampler. In that case, we fit one Bayesian network to the data, then construct a new network representing the interventional network. The structural equations are set with the parameters fit in the initial network, and we sample from that new network to get our do sample.
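As a standalone sketch of that two-network idea (illustrative names only, not DoWhy's internal code), one could fit a linear Bayesian network in PyMC3 and then sample from a mutilated, interventional copy of it:

    import numpy as np
    import pymc3 as pm

    # Toy data: Z confounds the binary treatment D and the outcome Y.
    z = np.random.uniform(size=500)
    d = np.random.binomial(1, 1. / (1. + np.exp(-5. * z)))
    y = 2. * z + d + 0.1 * np.random.normal(size=500)

    # Observational network: fit the structural equation for Y given D and Z.
    with pm.Model():
        beta_d = pm.Normal('beta_d', 0., 10.)
        beta_z = pm.Normal('beta_z', 0., 10.)
        sigma = pm.HalfNormal('sigma', 1.)
        pm.Normal('y_obs', beta_d * d + beta_z * z, sigma, observed=y)
        trace = pm.sample(1000, tune=1000, progressbar=False)

    # Interventional network: cut the in-edge Z -> D by fixing D := 1, then
    # propagate forward with the fitted parameters; each column of y_do_1 is
    # one do-sample of Y under do(D = 1).
    n_draws = len(trace['sigma'])
    y_do_1 = (trace['beta_d'] * 1.
              + np.outer(z, trace['beta_z'])
              + np.random.normal(scale=trace['sigma'], size=(len(z), n_draws)))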

In the weighting do sampler, we think of “disrupting the causes” abstractly: we account for selection into the causal state through propensity score estimation. These scores contain the information used to block back-door paths, and so have the same statistical effect as cutting edges into the causal state. We make the treatment effective by selecting the subset of our data set with the correct value of the causal state. Finally, we generate a weighted random sample using inverse propensity weighting to get our do sample.
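As a rough, self-contained illustration of that logic (a sketch of the idea, not the WeightingSampler source), the three stages for the intervention do(D = 1) might look like this:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Toy data: Z confounds the binary treatment D and the outcome Y.
    z = np.random.uniform(size=5000)
    d = np.random.binomial(1, 1. / (1. + np.exp(-5. * z)))
    y = 2. * z + d + 0.1 * np.random.normal(size=5000)
    data = pd.DataFrame({'Z': z, 'D': d, 'Y': y})

    # Stage 1: fit a propensity model P(D = 1 | Z); its scores carry the
    # back-door-blocking information, standing in for cutting in-edges to D.
    p = LogisticRegression().fit(data[['Z']], data['D']).predict_proba(data[['Z']])[:, 1]

    # Stage 2: make the treatment effective by keeping the rows with D == 1.
    treated = data[data.D == 1]

    # Stage 3: weighted resample with inverse propensity weights 1 / P(D = 1 | Z).
    do_sample = treated.sample(n=len(data), replace=True, weights=1. / p[data.D.values == 1])
    do_sample['Y'].mean()  # approximates E[Y | do(D = 1)], about 2.0 here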

You can implement these three steps in other ways, but the formula is the same. We've abstracted them out as abstract class methods, which you should override if you'd like to create your own do sampler!
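For example, a custom sampler might look roughly like the sketch below. The hook names (disrupt_causes, make_treatment_effective, sample) and the attributes self._df and self._data follow the description later in this document; check dowhy.do_sampler.DoSampler in your DoWhy version for the exact interface, and treat the dict format of x as an assumption:

    from dowhy.do_sampler import DoSampler

    class NaiveDoSampler(DoSampler):
        """Illustrative only: a do sampler that ignores confounding."""

        def disrupt_causes(self):
            # Stage 1: nothing to fit here; a real sampler would, e.g., fit
            # a propensity or structural model to block back-door paths.
            pass

        def make_treatment_effective(self, x):
            # Stage 2: keep only rows whose treatment takes the
            # interventional value (assumes x maps treatment name -> value).
            for treatment, value in x.items():
                self._df = self._df[self._df[treatment] == value]

        def sample(self):
            # Stage 3: an unweighted bootstrap of the remaining rows; this is
            # the step a principled sampler replaces with weighting or
            # forward simulation.
            self._df = self._df.sample(len(self._data), replace=True)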

The do samplers we've implemented have three notable features: statefulness, integration, and specifying interventions.

Statefulness

The do sampler, when accessed through the high-level pandas API, is stateless by default. This makes it intuitive to work with, and you can generate different samples with repeated calls to the pandas.DataFrame.causal.do method. It can be made stateful, which is sometimes useful.

The three-stage process we described above is implemented by passing an internal pandas.DataFrame through each of the three stages, but regarding it as temporary: by default, the internal dataframe is reset before the result is returned.

It can be much more efficient to maintain state in the do sampler between generating samples. This is especially true when step 1 requires fitting an expensive model, as is the case with the MCMC do sampler, the kernel density sampler, and the weighting sampler.

Fit the model only once: instead of re-fitting the model for each sample, you'd like to fit it once and then generate many samples from the do sampler. You can do this by setting the kwarg stateful=True when you call the pandas.DataFrame.causal.do method. To reset the state of the dataframe (deleting the model as well as the internal dataframe), call the pandas.DataFrame.causal.reset method.
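A minimal sketch of this stateful pattern, assuming the df, column names, and variable types from the demo below, and assuming x accepts a dict mapping a treatment to its interventional value:

    # Fit once on the first call, then reuse the cached model for later draws.
    samples = []
    for _ in range(10):
        s = df.causal.do(x={'D': 1},
                         variable_types={'D': 'b', 'Z': 'c', 'Y': 'c'},
                         outcome='Y',
                         common_causes=['Z'],
                         proceed_when_unidentifiable=True,
                         stateful=True)
        samples.append(s['Y'].mean())

    df.causal.reset()  # drop the fitted model and the internal dataframe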

Through the lower-level API, the sampler is stateful by default. The assumption is that a “power user” working with the low-level API will want more control over the sampling process. In this case, state is carried by the internal dataframe self._df, which is a copy of the dataframe passed on instantiation. The original dataframe is kept in self._data, and is used when the user resets state.

Integration

The do-sampler is built on top of the identification abstraction used throughout DoWhy. It uses a dowhy.CausalModel to perform identification, and builds any models it needs automatically using this identification.

Specifying Interventions

There is a kwarg on the dowhy.do_sampler.DoSampler object called keep_original_treatment. While an intervention might set all units' treatment values to some specific value, it's often natural to keep them as they were, and instead remove confounding bias during effect estimation. If you'd prefer not to specify an intervention, you can set keep_original_treatment=True, and the second stage of the three-stage process will be skipped. In that case, any intervention specified at sampling time will be ignored.

If the keep_original_treatment flag is set to False (as it is by default), then you must specify an intervention when you sample from the do sampler. For details, see the demo below!

Demo

First, let's generate some data and build a causal model. Here, Z confounds our causal state, D, with the outcome, Y.

[1]:
import os, sys
sys.path.append(os.path.abspath("../../../"))

import numpy as np
import pandas as pd
import dowhy.api
[2]:
N = 5000
z = np.random.uniform(size=N)
d = np.random.binomial(1., p=1./(1. + np.exp(-5. * z)))
y = 2. * z + d + 0.1 * np.random.normal(size=N)
df = pd.DataFrame({'Z': z, 'D': d, 'Y': y})
(df[df.D == 1].mean() - df[df.D == 0].mean())['Y']
[2]:
$\displaystyle 1.634523393385388$

This is about 60% higher than the true causal effect. So let's build a causal model for this data.

[3]:
from dowhy import CausalModel

causes = ['D']
outcomes = ['Y']
common_causes = ['Z']

model = CausalModel(df,
                    causes,
                    outcomes,
                    common_causes=common_causes,
                    proceed_when_unidentifiable=True)
WARNING:dowhy.causal_model:Causal Graph not provided. DoWhy will construct a graph based on data inputs.
INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['D'] on outcome ['Y']

Now that we have a model, we can try to identify the causal effect.

[4]:
identification = model.identify_effect()
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'Z']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]

Identification works! We didn’t actually need to do this yet, since it will happen internally with the do sampler, but it can’t hurt to check that identification works before proceeding. Now, let’s build the sampler.

[5]:
from dowhy.do_samplers.weighting_sampler import WeightingSampler

sampler = WeightingSampler(df,
                           causal_model=model,
                           keep_original_treatment=True,
                           variable_types={'D': 'b', 'Z': 'c', 'Y': 'c'})
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['U', 'Z']
WARNING:dowhy.causal_identifier:If this is observed data (not from a randomized experiment), there might always be missing confounders. Causal effect cannot be identified perfectly.
INFO:dowhy.causal_identifier:Continuing by ignoring these unobserved confounders because proceed_when_unidentifiable flag is True.
INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
INFO:dowhy.do_sampler:Using WeightingSampler for do sampling.
INFO:dowhy.do_sampler:Caution: do samplers assume iid data.

Now, we can just sample from the interventional distribution! Since we set the keep_original_treatment flag to True, any treatment we pass here will be ignored. Here, we'll just pass None to acknowledge that we know we don't want to pass anything.

If you’d prefer to specify an intervention, you can just put the interventional value here instead as a list or numpy array.
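For example, a sketch of an explicit intervention D := 1 with the same data and model (per the note above, the intervention is passed as a list or numpy array):

    # With keep_original_treatment=False, stage 2 applies the intervention.
    sampler_do1 = WeightingSampler(df,
                                   causal_model=model,
                                   keep_original_treatment=False,
                                   variable_types={'D': 'b', 'Z': 'c', 'Y': 'c'})
    interventional_df_do1 = sampler_do1.do_sample(np.array([1]))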

[6]:
interventional_df = sampler.do_sample(None)
[7]:
(interventional_df[interventional_df.D == 1].mean() - interventional_df[interventional_df.D == 0].mean())['Y']
[7]:
$\displaystyle 1.0422121269660511$

Now our estimate is much closer to the true effect of 1.0!
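Since do samplers draw from the outcome distribution, any single sample is noisy; as recommended earlier, generate several samples to gauge the spread of your statistic. A short sketch reusing the sampler from the demo:

    # Repeat the draw to estimate the sampling variability of the effect.
    estimates = []
    for _ in range(20):
        s = sampler.do_sample(None)
        estimates.append((s[s.D == 1].mean() - s[s.D == 0].mean())['Y'])

    print(np.mean(estimates), np.std(estimates))  # center and spread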