Reply to Lake and Baroni, 2023

This repo includes the original code from Lake and Baroni, 2023 as well as some new scripts which allow one to replicate some of our challenges to their claims.

Installation

To install, one can use the following code to create a Python environment and install the necessary packages:

python -m venv env
source env/bin/activate
python -m pip install -r requirements.txt

This should install the required packages (including PyTorch) in a couple of minutes at most. Running the different scripts should not take too long: around 2 minutes on a computer with a decent GPU, and perhaps up to 10 minutes on a standard desktop.

Finding strings

You can use the script find_weird.sh to evaluate random strings from the gold grammar and find ones with interesting behaviour (e.g. complete or partial failures of generalisation, i.e. Level 1 or 2 errors).

You can run it on different models by passing a model file from out_models; by default it runs on net-BIML-algebraic-top.pt.

./find_weird.sh
./find_weird.sh net-BIML-top.pt

Example output

Here is some abridged output from ./find_weird.sh net-BIML-top.pt:

Seed number 23
DAX 3 thrice -> DAX 3 3 3
 Acc Novel (val): 100.0 SD= 0.0 N= 1296
Seed number 24
3 2 surround 3 after 1 surround DAX -> 1 DAX 1 3 2 3 2
 Acc Novel (val): 0.0 SD= 0.0 N= 1296

We can see that the model failed to generalise to 3 2 surround 3 after 1 surround DAX (corresponding to $c_3c_2\text{ surround }c_3\text{ after }c_1\text{ surround }c_h$, where $c_h$ is the held-out colour).

Recreate tables

evaluate_script_from_gg.sh is a Bash script which creates the data for Tables 1 and 2 (Level 2 errors); evaluate_string_from_gg.sh does the same for an arbitrary query string.

./evaluate_script_from_gg.sh # Recreate our original tables
./evaluate_string_from_gg.sh "DAX 1 DAX" # Recreate our tables for a novel string, DAX 1 DAX
./evaluate_string_from_gg.sh "3 DAX thrice" net-BIML-top.pt # Recreate our tables for a novel string and for the canonical MCL model, `net-BIML-top.pt` rather than the default of `net-BIML-algebraic-top.pt` (In order to pass a model, you need to provide both a string and the model.)

Note that the script uses L&B's original formatting for strings in the gold grammar; see the table below (and the interpreter sketch that follows it):

Name              Description
1                 First colour
2                 Second colour
3                 Third colour
DAX               Fourth/held-out colour
$x$ thrice        Output $xxx$, where $x$ is a colour word
$x$ surround $y$  Output $xyx$, where $x$ and $y$ are colour words
$a$ after $b$     Output $ba$, where $a$ and $b$ are strings
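
To make these semantics concrete, here is a minimal, illustrative interpreter for the constructs in the table. It is not code from this repo; it assumes that after has the lowest precedence, that thrice and surround apply to single colour words, and that adjacent outputs are simply concatenated.

def interpret(command):
    # a after b  ->  [b] [a]   ("after" is assumed to bind loosest)
    tokens = command.split()
    if "after" in tokens:
        i = tokens.index("after")
        return interpret(" ".join(tokens[i + 1:])) + interpret(" ".join(tokens[:i]))
    out, i = [], 0
    while i < len(tokens):
        x = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt == "thrice":                 # x thrice  ->  x x x
            out += [x, x, x]; i += 2
        elif nxt == "surround":             # x surround y  ->  x y x
            out += [x, tokens[i + 2], x]; i += 3
        else:                               # bare colour word
            out.append(x); i += 1
    return out

print(interpret("DAX 3 thrice"))                         # ['DAX', '3', '3', '3']
print(interpret("3 2 surround 3 after 1 surround DAX"))  # ['1', 'DAX', '1', '3', '2', '3', '2']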

Example output

The output for ./evaluate_string_from_gg.sh "2 after 3 DAX thrice" net-BIML-top.pt shows that the model hesitates between two solutions (slightly preferring the incorrect solution to the correct one) and always chooses the wrong answer when the held-out colour is purple.

generalization
3 h 3 h 3 h 2    2649
3 h h h 2        1671
Name: count, dtype: int64
colour_h #Correctness by colour
BLUE      0.463889
GREEN     0.676389
PINK      0.520833
PURPLE    0.000000
RED       0.148611
YELLOW    0.511111
Name: correct, dtype: float64
colour_h #Standard Error of the mean
BLUE      0.018598
GREEN     0.017448
PINK      0.018631
PURPLE    0.000000
RED       0.013266
YELLOW    0.018642
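
For context, per-colour correctness and its standard error can be summarised as above with a simple pandas groupby. This is only a sketch, not the repo's evaluation code, and the DataFrame columns (colour_h, correct) are hypothetical stand-ins for whatever the script actually produces.

import pandas as pd

# Toy results table: one row per evaluated episode (hypothetical data).
df = pd.DataFrame({
    "colour_h": ["BLUE", "BLUE", "RED", "RED", "PURPLE", "PURPLE"],
    "correct":  [1, 0, 0, 0, 0, 0],
})
print(df.groupby("colour_h")["correct"].mean())  # correctness by colour
print(df.groupby("colour_h")["correct"].sem())   # standard error of the mean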

Recreate figure 1

generate_rule_figure.sh is a Bash script which creates Figure 1 from our paper (discussed in the section on Level 3 mistakes). You can supply a different model as an optional argument.

./generate_rule_figure.sh # Reproduce figure 1 from article
./generate_rule_figure.sh net-BIML-top.pt # Reproduce figure 1 from article for the `net-BIML-top.pt` model

Extending figure 1

If you are interested in trying different string generalisations, or in seeing the precision generalisation curve for "DAX repeatN" (where repeatN is a generalised version of thrice for all N), you should look at simple.py and modify the strings after _QUERY_ but before _GRAMMAR_. This lets you see precisely which strings are studied and which are queried.
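
For instance, this is the kind of query string one might add (illustrative only; the exact format simple.py expects may differ):

queries = [f"DAX repeat{n}" for n in range(2, 11)]  # "DAX repeat2", ..., "DAX repeat10" (hypothetical naming)
print(queries)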

README FROM LAKE AND BARONI: Meta-Learning for Compositionality (MLC) for modeling human behavior

Meta-Learning for Compositionality (MLC) is an optimization procedure that encourages systematicity through a series of few-shot compositional tasks. This code shows how to train and evaluate a sequence-to-sequence (seq2seq) transformer in PyTorch to implement MLC for modeling human behavior.

A separate repository here has code for applying MLC to machine learning benchmarks, including SCAN and COGS.

Note: Users can regard the acronym BIML as synonymous with MLC. The approach was renamed to MLC after the code was written.

This code accompanies the following submitted paper.

  • Lake, B. M. and Baroni, M. (submitted). Human-like systematic generalization through a meta-learning neural network.

You can email brenden AT nyu.edu if you would like a copy.

Credits

This repo borrows from the excellent PyTorch seq2seq tutorial.

Requirements

Python 3 with the following packages (install time within minutes): torch (PyTorch), sklearn (scikit-learn), numpy, matplotlib

The specific versions used for development: Python (3.7.9), PyTorch (1.10.1), sklearn (0.24.2), numpy (1.21.5), matplotlib (3.3.2)

Downloading data and pre-trained models

Meta-training data: To get the episodes used for meta-training, you should download the following zip file with the 100K meta-training episodes. Please extract data_algebraic.zip such that data_algebraic is a sub-directory of the main repo.

Pre-trained models: To get the top pre-trained models, you should download the following zip file. Please extract BIML_top_models.zip such that out_models is a sub-directory of the main repo and contains the model files net-BIML-*.pt.

Evaluating models

There are many different ways to evaluate a model after training, each of which should take less than a minute on a standard desktop. Here are a few examples.

Generating algebraic outputs on few-shot learning task

Here we find the best response from the pre-trained MLC model using greedy decoding:

python eval.py  --max --episode_type few_shot_gold --fn_out_model net-BIML-top.pt --verbose
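
The --max flag corresponds to greedy decoding: the most probable token is chosen at every step until an end-of-sequence token is emitted. The sketch below illustrates the idea only; it is not eval.py, and step_logits is a hypothetical stand-in for the trained decoder.

import torch

def greedy_decode(step_logits, sos_id, eos_id, max_len=10):
    seq = [sos_id]
    for _ in range(max_len):
        logits = step_logits(torch.tensor(seq))  # next-token logits given the prefix
        nxt = int(torch.argmax(logits))          # greedy choice: most probable token
        seq.append(nxt)
        if nxt == eos_id:
            break
    return seq

# Dummy "decoder" for illustration only: prefers token 3 until the prefix has
# three tokens, then prefers the end-of-sequence token (id 1).
def dummy_step_logits(prefix):
    logits = torch.zeros(5)
    logits[3 if len(prefix) < 3 else 1] = 5.0
    return logits

print(greedy_decode(dummy_step_logits, sos_id=0, eos_id=1))  # [0, 3, 3, 1]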

Evaluating human responses on few-shot learning task (using log-likelihood)

Here we evaluate the log-likelihood of the human data:

python eval.py  --ll --ll_nrep 100 --episode_type few_shot_human --ll_p_lapse 0.03 --fn_out_model net-BIML-top.pt
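
The --ll_p_lapse argument sets a lapse rate. Roughly, a lapse rate mixes the model's predictive probability with a uniform "lapse" distribution, so that occasional off-policy human responses are not assigned zero probability. The sketch below illustrates that idea only; it is not L&B's exact computation, and the response-space size is a hypothetical placeholder.

import math

def response_loglik(p_model_response, p_lapse, n_possible_responses):
    # With probability (1 - p_lapse) the response comes from the model;
    # with probability p_lapse it is drawn uniformly from the response space.
    p = (1.0 - p_lapse) * p_model_response + p_lapse / n_possible_responses
    return math.log(p)

# e.g. a response the model assigns probability 0.4, lapse rate 0.03,
# and a (hypothetical) space of 1000 candidate output sequences:
print(response_loglik(0.4, 0.03, 1000))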

To evaluate the log-likelihood of all models and to reproduce Table 1 in the manuscript, you can run this command for the various models. Please see the table below for how to set the arguments in each case. Note that due to system/version differences, the log-likelihood values may vary in minor ways from the paper.

--fn_out_model              --ll_p_lapse  --episode_type
net-basic-seq2seq-top.pt    0.9           human_vanilla
net-BIML-copy-top.pt        0.5           few_shot_human
net-BIML-algebraic-top.pt   0.1           few_shot_human
net-BIML-joint-top.pt       0.03          few_shot_human
net-BIML-top.pt             0.03          few_shot_human

Sampling model responses for the few-shot learning task

The models can be asked to mimic human responses on the few-shot learning task. To do so, the models sample from their distribution of possible outputs. A full set of samples from the models is available on this webpage. To reproduce the results for MLC (or other models), you can type the following to generate an HTML page.

python eval.py --episode_type few_shot_human_mult10 --sample_html --fn_out_model net-BIML-top.pt
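
Unlike the greedy decoding used above, sampling draws each token from the model's predictive distribution. A minimal illustration (not the repo's code; logits stands in for one decoder step):

import torch

logits = torch.tensor([2.0, 0.5, 1.0, 0.1])           # hypothetical next-token logits
probs = torch.softmax(logits, dim=-1)                  # convert logits to a distribution
next_token = torch.multinomial(probs, num_samples=1)   # random draw, not the argmax
print(next_token.item())                               # token 0 is likeliest but not guaranteed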

Then, after ensuring the right file name is listed under __main__ in the script analysis_few_shot.py, you can compare the human and machine mistakes:

cd html_output/few_shot_human_mult10
python analysis_few_shot.py

This should reproduce the hosted HTML file and numbers reported in the paper. Small variations may arise through version differences. Here is a snippet of the HTML and the text output.
[Image: few-shot learning task]

Human responses (item accuracy):
   DAX after 1 : 86.364
  ...
  mean overall acc.: 80.739
  mean acc. on simple queries: 85.479
  mean acc. on complex queries: 76.0
  mean acc. on len=6 queries: 72.5
  perc. of errors that are one2one:  24.39 ; 10 of 41
  perc. of errors (involving "after") that are iconic :  23.333 ; 7 of 30
Model responses (item accuracy):
   DAX after 1 : 85.909
  ...
  mean overall acc.: 82.376
  mean acc. on simple queries: 86.252
  mean acc. on complex queries: 78.5
  mean acc. on len=6 queries: 77.75
  perc. of errors that are one2one:  56.267 ; 211 of 375
  perc. of errors (involving "after") that are iconic :  13.83 ; 39 of 282

Correlation for item accuracies: r= 0.788 ; p= 0.007
Generating HTML file: human_few_shot_behavior.html
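
The item-accuracy correlation reported above is a Pearson correlation between the human and model per-item accuracies. A sketch of that computation (the numbers below are placeholders, not the actual item accuracies, and this is not the repo's analysis script):

from scipy.stats import pearsonr

human_acc = [86.4, 80.7, 85.5, 76.0, 72.5]  # placeholder per-item accuracies (humans)
model_acc = [85.9, 82.4, 86.3, 78.5, 77.8]  # placeholder per-item accuracies (model)
r, p = pearsonr(human_acc, model_acc)
print(f"r= {r:.3f} ; p= {p:.3f}")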

Sampling model responses for the open-ended task

The models can be asked to mimic human responses on the open-ended task. Again, a full set of samples from the models is available on this webpage. To reproduce the results for MLC, you can type the following command to generate an HTML page.

python eval.py --episode_type open_end_freeform --sample_iterative --fn_out_model net-BIML-open-ended-top.pt

Then, after ensuring the right file name is listed under __main__ in the script analysis_freeform.py, you can compare the human and machine mistakes:

cd html_output/open_end_freeform
python analysis_freeform.py

This should reproduce the hosted HTML file and numbers reported in the paper. Small variations may arise through version differences. Here is a snippet of the HTML and the text output.
[Image: open-ended task]

Human:
   Processing 29 human participants.
   Percent with perfect maps (consistent with 3 inductive biases): 58.621 ; N= 17 of 29
   Percent with one2one maps: 62.069 ; N= 18 of 29
   Percent with iconic concatenation: 79.31 ; N= 23 of 29
   Percent with ME maps: 93.103 ; N= 27 of 29
Model:
   Processing 100 model samples.
   Percent with perfect maps (consistent with 3 inductive biases): 65.0 ; N= 65 of 100
   Percent with one2one maps: 66.0 ; N= 66 of 100
   Percent with iconic concatenation: 85.0 ; N= 85 of 100
   Percent with ME maps: 99.0 ; N= 99 of 100
Generating HTML file: human_open_end_freeform.html
Generating HTML file: open_end_freeform_net-BIML-open-ended-top.html
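
For reference, a "one2one map" here is an assignment in which each word gets its own distinct colour. A minimal sketch of such a check (the example response is hypothetical, and this is not the repo's analysis code):

def is_one_to_one(mapping):
    # One-to-one: no two words share the same colour.
    colours = list(mapping.values())
    return len(colours) == len(set(colours))

response = {"1": "RED", "2": "BLUE", "3": "GREEN", "DAX": "PURPLE"}
print(is_one_to_one(response))  # True: all four words map to distinct colours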

The full set of evaluation arguments can be viewed by typing python eval.py -h:

optional arguments:
  -h, --help            show this help message and exit
  --fn_out_model FN_OUT_MODEL
                        *REQUIRED*. Filename for loading the model
  --dir_model DIR_MODEL
                        Directory for loading the model file
  --max_length_eval MAX_LENGTH_EVAL
                        Maximum generated sequence length
  --batch_size BATCH_SIZE
                        Number of episodes in batch
  --episode_type EPISODE_TYPE
                        What type of episodes do we want? See datasets.py for
                        options
  --dashboard           Showing loss curves during training.
  --ll                  Evaluate log-likelihood of validation (val) set
  --max                 Find best outputs for val commands (greedy decoding)
  --sample              Sample outputs for val commands
  --sample_html         Sample outputs for val commands in html format (using
                        unmap to canonical text)
  --sample_iterative    Sample outputs for val commands iteratively
  --fit_lapse           Fit the lapse rate
  --ll_nrep LL_NREP     Evaluate each episode this many times when computing
                        log-likelihood (needed for stochastic remappings)
  --ll_p_lapse LL_P_LAPSE
                        Lapse rate when evaluating log-likelihoods
  --verbose             Inspect outputs in more detail

Episode types

Please see datasets.py for the full set of options. Here are a few key episode types that can be set via --episode_type:

  • algebraic+biases : For meta-training. Corresponds to "MLC" in Table 1 and main results
  • algebraic_noise : For meta-training. Corresponds to "MLC (algebraic only)" in Table 1 and main results
  • retrieve : For meta-training. Corresponds to "MLC (copy only)" in Table 1 and main results
  • few_shot_gold : For evaluating MLC on the prescribed algebraic responses for the few-shot learning task. (test only)
  • few_shot_human : For evaluating MLC on predicting human responses for the few-shot learning task. (test only)
  • few_shot_human_mult10 : For evaluating MLC on predicting human responses for the few-shot learning task (human data up-sampled/repeated 10x). (test only)
  • open_end_freeform : For generating MLC responses on the open-ended task. Here, the models iteratively fill out responses one-by-one. (test only)

Training models from scratch

To train MLC on few-shot learning (as in the MLC model in Fig. 2 and Table 1), you can run the train command with default arguments:

python train.py --episode_type algebraic+biases --fn_out_model net-BIML.pt

which will produce a file out_models/net-BIML.pt.

The full set of training arguments can be viewed with python train.py -h:

optional arguments:
  -h, --help            show this help message and exit
  --fn_out_model FN_OUT_MODEL
                        *REQUIRED* Filename for saving model checkpoints.
                        Typically ends in .pt
  --dir_model DIR_MODEL
                        Directory for saving model files
  --episode_type EPISODE_TYPE
                        What type of episodes do we want? See datasets.py for
                        options
  --batch_size BATCH_SIZE
                        number of episodes per batch
  --nepochs NEPOCHS     number of training epochs
  --lr LR               learning rate
  --lr_end_factor LR_END_FACTOR
                        factor X for decrease learning rate linearly from
                        1.0*lr to X*lr across training
  --no_lr_warmup        Turn off learning rate warm up (by default, we use 1
                        epoch of warm up)
  --nlayers_encoder NLAYERS_ENCODER
                        number of layers for encoder
  --nlayers_decoder NLAYERS_DECODER
                        number of layers for decoder
  --emb_size EMB_SIZE   size of embedding
  --ff_mult FF_MULT     multiplier for size of the fully-connected layer in
                        transformer
  --dropout DROPOUT     dropout applied to embeddings and transformer
  --act ACT             activation function in the fully-connected layer of
                        the transformer (relu or gelu)
  --save_best           Save the "best model" according to validation loss.
  --save_best_skip SAVE_BEST_SKIP
                        Do not bother saving the "best model" for this
                        fraction of early training
  --resume              Resume training from a previous checkpoint
