Merge branch 'master' of github.com:DoubleML/doubleml-serverless into 0.0.X

MalteKurz · MalteKurz · commit aa439e6ffc9f · 2021-04-20T12:00:11.000+02:00
diff --git a/README.md b/README.md
@@ -1,13 +1,17 @@
-# DoubleML-Serverless - Distributed Double Machine Learning with a Serverless Architecture
+# DoubleML-Serverless - Distributed Double Machine Learning with a Serverless Architecture <a href="https://docs.doubleml.org"><img src="https://raw.githubusercontent.com/DoubleML/doubleml-for-py/master/doc/logo.png" align="right" width = "120" /></a>
 
 This repo contains a prototype implementation **DoubleML-Serverless** of distributed double machine learning with a serverless infrastructure
 using [AWS Lambda](https://aws.amazon.com/lambda).
-A detailed discussion of this prototype can be found in the paper "Distributed Double Machine Learning with a  Serverless Architecture" (Kurz, 2021).
+A detailed discussion of this prototype can be found in the paper ["Distributed Double Machine Learning with a Serverless Architecture" (Kurz, 2021)](https://doi.org/10.1145/3447545.3451181).
 DoubleML-Serverless is an extension for serverless cloud computing of the Python package **DoubleML**.
 DoubleML is available via PyPI [https://pypi.org/project/DoubleML](https://pypi.org/project/DoubleML) and on GitHub [https://github.com/DoubleML/doubleml-for-py](https://github.com/DoubleML/doubleml-for-py).
-Also see [https://docs.doubleml.org](https://docs.doubleml.org) for a detailed documentation and user guide for the DoubleML package.
+The Python package DoubleML was introduced in
+"DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python"
+([Bach et al., 2021](https://arxiv.org/abs/2104.03220))
+and detailed documentation and a user guide for the package are available at
+[https://docs.doubleml.org](https://docs.doubleml.org).
 
-## Getting started
+## Getting Started
 
 ### Installation of DoubleML-Serverless
 
@@ -30,7 +34,7 @@ After downloading the wheel, the package can be installed with pip (replace `XXX
 pip install -U DoubleML-Serverless-XXX-py3-none-any.whl
 ```
 
-### Deploy the corresponding serverless app to AWS Lambda using AWS SAM
+### Deploy the Corresponding Serverless App to AWS Lambda Using AWS SAM
 
 To use AWS Lambda for estimating double machine learning models, a deployment in your AWS account is necessary.
 The corresponding serverless application consists of the following components:
@@ -56,11 +60,11 @@ There are two options for deployment:
     sam deploy --guided
     ```
 
-### Estimating a partially linear regression model with double machine learning and serverless scaling using AWS Lambda
+### Estimating a Partially Linear Regression Model with Double Machine Learning and Serverless Scaling Using AWS Lambda
 
 To demonstrate the functionality of DoubleML-Serverless we revisit the Pennsylvania  Reemployment Bonus experiment
-and estimate the effect of provisioning a cash bonus on the unemployment duration as studied in Chernozhukov et al. (2018).
-This example is also discussed in the accompanying paper to the DoubleML-Serverless package (Kurz, 2021).
+and estimate the effect of provisioning a cash bonus on the unemployment duration as studied in [Chernozhukov et al. (2018)](https://doi.org/10.1111/ectj.12097).
+This example is also discussed in the accompanying paper to the DoubleML-Serverless package ([Kurz, 2021](https://doi.org/10.1145/3447545.3451181)).
 
 We first load the data using functionalities from the DoubleML package.
 ```python
@@ -112,9 +116,48 @@ dml_lambda_plr_bonus.fit_aws_lambda()
 A summary of the estimation result is available via the property `dml_lambda_plr_bonus.summary`.
 Some metrics about the estimation on AWS Lambda can be obtained via the property  `dml_lambda_plr_bonus.aws_lambda_metrics`.
 
+## Citation
+
+If you use the DoubleML-Serverless package, a citation is highly appreciated:
+
+Kurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture.
+In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21).
+Association for Computing Machinery, New York, NY, USA, 27–33.
+doi:[10.1145/3447545.3451181](https://doi.org/10.1145/3447545.3451181).
+
+BibTeX entry:
+
+```
+@inproceedings{kurz2021DoublemlServerless,
+   author = {Kurz, Malte S.},
+   title = {Distributed Double Machine Learning with a Serverless Architecture},
+   year = {2021},
+   isbn = {9781450383318},
+   publisher = {Association for Computing Machinery},
+   address = {New York, NY, USA},
+   url = {https://doi.org/10.1145/3447545.3451181},
+   doi = {10.1145/3447545.3451181},
+   abstract = {This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It allows to get fast on-demand estimations without additional cloud maintenance effort. We provide a prototype Python implementation DoubleML-Serverless for the estimation of double machine learning models with the serverless computing platform AWS Lambda and demonstrate its utility with a case study analyzing estimation times and costs.},
+   booktitle = {Companion of the ACM/SPEC International Conference on Performance Engineering},
+   pages = {27--33},
+   numpages = {7},
+   keywords = {machine learning, causal machine learning, serverless computing, distributed computing, AWS Lambda, function-as-a-service (FAAS)},
+   location = {Virtual Event, France},
+   series = {ICPE '21}
+}
+```
+
 ## References
 
-Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018),
-Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. doi:[10.1111/ectj.12097](https://doi.org/10.1111/ectj.12097).
+Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2021).
+DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python.
+arXiv:[2104.03220](https://arxiv.org/abs/2104.03220).
+
+Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018).
+Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68.
+doi:[10.1111/ectj.12097](https://doi.org/10.1111/ectj.12097).
 
-Kurz, M.S. 2020. "Distributed Double Machine Learning with a  Serverless Architecture". Unpublished Working Paper.
+Kurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture.
+In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21).
+Association for Computing Machinery, New York, NY, USA, 27–33.
+doi:[10.1145/3447545.3451181](https://doi.org/10.1145/3447545.3451181).
diff --git a/doubleml_serverless/double_ml_iivm_aws_lambda.py b/doubleml_serverless/double_ml_iivm_aws_lambda.py
@@ -19,6 +19,7 @@ def __init__(self,
                  n_folds=5,
                  n_rep=1,
                  score='ATE',
+                 subgroups=None,
                  dml_procedure='dml2',
                  trimming_rule='truncate',
                  trimming_threshold=1e-12,
@@ -32,6 +33,7 @@ def __init__(self,
                               n_folds,
                               n_rep,
                               score,
+                              subgroups,
                               dml_procedure,
                               trimming_rule,
                               trimming_threshold,
@@ -72,32 +74,49 @@ def _ml_nuisance_aws_lambda(self, cv_params):
                         self._dml_data.z_cols[0], self._dml_data.x_cols,
                         method='predict_proba')
 
-        _attach_learner(payload_ml_r0,
-                        'ml_r0', self.learner['ml_r'],
-                        self._dml_data.d_cols[0], self._dml_data.x_cols,
-                        method='predict_proba')
-
-        _attach_learner(payload_ml_r1,
-                        'ml_r1', self.learner['ml_r'],
-                        self._dml_data.d_cols[0], self._dml_data.x_cols,
-                        method='predict_proba')
-
-        all_payloads = [payload_ml_g0, payload_ml_g1, payload_ml_m, payload_ml_r0, payload_ml_r1]
-        all_smpls = [smpls_z0, smpls_z1, self.smpls, smpls_z0, smpls_z1]
+        all_payloads = [payload_ml_g0, payload_ml_g1, payload_ml_m]
+        all_smpls = [smpls_z0, smpls_z1, self.smpls]
+        send_train_ids = [True, True, False]
+        params_names = ['ml_g0', 'ml_g1', 'ml_m']
+
+        if self.subgroups['always_takers']:
+            _attach_learner(payload_ml_r0,
+                            'ml_r0', self.learner['ml_r'],
+                            self._dml_data.d_cols[0], self._dml_data.x_cols,
+                            method='predict_proba')
+            all_payloads.append(payload_ml_r0)
+            all_smpls.append(smpls_z0)
+            send_train_ids.append(True)
+            params_names.append('ml_r0')
+
+        if self.subgroups['never_takers']:
+            _attach_learner(payload_ml_r1,
+                            'ml_r1', self.learner['ml_r'],
+                            self._dml_data.d_cols[0], self._dml_data.x_cols,
+                            method='predict_proba')
+            all_payloads.append(payload_ml_r1)
+            all_smpls.append(smpls_z1)
+            send_train_ids.append(True)
+            params_names.append('ml_r1')
 
         payloads = _attach_smpls(all_payloads,
                                  all_smpls,
                                  self.n_folds,
                                  self.n_rep,
                                  self._dml_data.n_obs,
                                  cv_params['n_lambdas_cv'],
-                                 [True, True, False, True, True],
+                                 send_train_ids,
                                  cv_params['seed'])
 
-        preds = self.invoke_lambdas(payloads, self.smpls, self.params_names,
+        preds = self.invoke_lambdas(payloads, self.smpls, params_names,
                                     self._dml_data.n_obs, self.n_rep,
                                     cv_params['n_lambdas_cv'])
 
+        if not self.subgroups['always_takers']:
+            preds['ml_r0'] = np.zeros_like(preds['ml_g0'])
+        if not self.subgroups['never_takers']:
+            preds['ml_r1'] = np.ones_like(preds['ml_g1'])
+
         for i_rep in range(self.n_rep):
             # compute score elements
 
diff --git a/doubleml_serverless/tests/test_iivm.py b/doubleml_serverless/tests/test_iivm.py
@@ -51,8 +51,16 @@ def trimming_threshold(request):
     return request.param
 
 
+@pytest.fixture(scope='module',
+                params=[{'always_takers': True, 'never_takers': True},
+                        {'always_takers': False, 'never_takers': True},
+                        {'always_takers': True, 'never_takers': False}])
+def subgroups(request):
+    return request.param
+
+
 @pytest.fixture(scope="module")
-def dml_iivm_fixture(generate_data_iivm, idx, learner, score, dml_procedure, trimming_threshold):
+def dml_iivm_fixture(generate_data_iivm, idx, learner, score, dml_procedure, trimming_threshold, subgroups):
     boot_methods = ['normal']
     n_folds = 4
     n_rep_boot = 502
@@ -77,6 +85,7 @@ def dml_iivm_fixture(generate_data_iivm, idx, learner, score, dml_procedure, tri
                                                   ml_g, ml_m, ml_r,
                                                   n_folds,
                                                   score=score,
+                                                  subgroups=subgroups,
                                                   dml_procedure=dml_procedure)
 
     dml_iivm_lambda.fit_aws_lambda()
@@ -87,6 +96,7 @@ def dml_iivm_fixture(generate_data_iivm, idx, learner, score, dml_procedure, tri
                                 ml_g, ml_m, ml_r,
                                 n_folds,
                                 score=score,
+                                subgroups=subgroups,
                                 dml_procedure=dml_procedure)
 
     dml_iivm.fit()
diff --git a/requirements.txt b/requirements.txt
@@ -1,9 +1,9 @@
-DoubleML>=0.1.2
+DoubleML>=0.2.2
 joblib
 numpy
 pandas
 scipy
-sklearn
+scikit-learn==0.23.2
 statsmodels
 aiobotocore==1.1.2
 boto3==1.14.44
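
Note on the `subgroups` change in `double_ml_iivm_aws_lambda.py`: the hunk above only sends `ml_r0` / `ml_r1` to AWS Lambda when the corresponding subgroup flag is set, and otherwise fills in constant predictions (zeros for `ml_r0`, ones for `ml_r1`). The following is a minimal standalone sketch of that control flow, not the package code. The helper names `plan_nuisance_tasks` and `fill_skipped_predictions` are hypothetical, and plain lists stand in for the NumPy arrays used in the diff.

```python
# Hypothetical helpers that mirror the branching in _ml_nuisance_aws_lambda.
# Plain lists are used instead of NumPy arrays to keep the sketch dependency-free.

def plan_nuisance_tasks(subgroups):
    """Return (learner names to fit remotely, per-learner send_train_ids flags)."""
    # ml_g0, ml_g1 and ml_m are always estimated.
    names = ['ml_g0', 'ml_g1', 'ml_m']
    send_train_ids = [True, True, False]
    # ml_r0 is only estimated when always-takers are allowed.
    if subgroups['always_takers']:
        names.append('ml_r0')
        send_train_ids.append(True)
    # ml_r1 is only estimated when never-takers are allowed.
    if subgroups['never_takers']:
        names.append('ml_r1')
        send_train_ids.append(True)
    return names, send_train_ids


def fill_skipped_predictions(preds, subgroups):
    """Return a copy of preds with constants for learners that were skipped."""
    filled = dict(preds)
    if not subgroups['always_takers']:
        # Same role as np.zeros_like(preds['ml_g0']) in the diff.
        filled['ml_r0'] = [0.0] * len(preds['ml_g0'])
    if not subgroups['never_takers']:
        # Same role as np.ones_like(preds['ml_g1']) in the diff.
        filled['ml_r1'] = [1.0] * len(preds['ml_g1'])
    return filled


if __name__ == '__main__':
    subgroups = {'always_takers': False, 'never_takers': True}
    names, flags = plan_nuisance_tasks(subgroups)
    print(names, flags)
    # Made-up prediction values, for illustration only.
    preds = {'ml_g0': [0.2, 0.4], 'ml_g1': [0.5, 0.6],
             'ml_m': [0.5, 0.5], 'ml_r1': [0.9, 0.8]}
    print(fill_skipped_predictions(preds, subgroups))
```

Skipping a learner in this way drops its share of Lambda invocations, which is presumably why the payload lists are built incrementally rather than filtered after the fact.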