Dynamic Models
Roughly ordered by priority.
- Incorporate precision and recall into evaluation metrics.
- Attention mechanism for dynamic models (completed for legacy).
- Bahdanau (additive) attention seems like the best choice.
- Update: fixed the compilation issue (reverted to the tf r1.1 branch while they figure out the issue I posted).
- Make it easier to group evaluation plots (e.g. gradients of encoder only). Right now they are only partitioned into a few groups, which can get quite large for complicated models.
- Generalize handling of the structural differences among RNNCell output types.
- Loading pretrained models should be a lot easier. Currently, the user needs to be careful to specify the hyperparameters of the pretrained model (so they match the loaded pretrained model), but that shouldn't be necessary. It should be as easy as something like
./main.py --load_pretrained reddit
- Deploy a bot on the website.
- [COMPLETED] Create a bot in the same way I currently load a pretrained model.
- [COMPLETED] Restructure (or just create new) method(s) for conversation such that the ideal interface of
response = bot(sentence)
is realized.
- [COMPLETED] Implement this within the chat files for the website.
- Freeze a couple models to protobufs for website use (see the freezing sketch after this list).
- Ideal interface/sequence for doing this:
- Train a newly created bot.
- It is saved such that, after training, the user can just do something like:
python3 main.py --unfreeze_previous
- Challenges:
- The decoder graph structure is different for chat sessions. There is no way to get around this in a static-graph computation framework like TensorFlow. The first idea is to just rebuild the structure, when freezing, so that it is appropriate for chatting. Although that's not the most desirable design approach in general, I'm a bit constrained by the static nature of TF; it might be the best (or only) way.
- The input pipeline currently handles decoding raw user input, which is hard to freeze. Under the hood, though, it just passes the raw input to a feed dict, so it should be easy to work with.
- Bidirectional encoders (not decoders) in dynamic model components.
- Need to make output shapes play nice with unidirectional encoders. Apparently not as simple as just calling concat (see the bridging sketch after this list).
- Make it easier to customize the underlying RNNCell for the encoder and decoder (separately) in a config.
- Add support for a larger variety of cells beyond the current multi-layer uni-/bi-directional LSTM/GRU cells.
- Beam search (see the decoding sketch after this list). Good values:
- Beam size: 10.
- Length penalty: 1.0.
- Fix bugs in the from-scratch candidate sampling loss function.
- Remove redundancies in data naming (data_dir, name, ckpt_dir, Constructor name). In theory, only one of these is needed to create all the others, so long as a strict naming convention is established.
- Residual skip connections.
- Look into the preprocessing techniques in the Moses repository.
- Implement PyTorch seq2seq models, since they look way easier (because of dynamic computation).
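For the protobuf-freezing item above, here is a minimal sketch of what the freezing step could look like using standard TF 1.x utilities. The checkpoint directory, output node names, and the helper name are assumptions for illustration, not the project's actual values, and it assumes the chat-appropriate decoder graph has already been built (per the challenges noted above).

```python
import tensorflow as tf

def freeze_to_protobuf(ckpt_dir, output_node_names, out_path="frozen_bot.pb"):
    """Restore the latest checkpoint in ckpt_dir and write a frozen GraphDef."""
    with tf.Session(graph=tf.Graph()) as sess:
        ckpt = tf.train.latest_checkpoint(ckpt_dir)
        saver = tf.train.import_meta_graph(ckpt + ".meta")
        saver.restore(sess, ckpt)
        # Bake the variable values into the graph as constants so the
        # resulting protobuf is self-contained (no checkpoint needed).
        frozen_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph.as_graph_def(), output_node_names)
    with tf.gfile.GFile(out_path, "wb") as f:
        f.write(frozen_graph_def.SerializeToString())

# Hypothetical usage; the output node name depends on how the decoder ops are named.
# freeze_to_protobuf("out/reddit", ["decoder/predictions"])
```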
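For the bidirectional encoder item, a minimal sketch of the shape problem and one way around it, assuming TF 1.x, GRU cells, and a dense "bridge" layer (my choices, not necessarily the project's): concatenating the forward and backward outputs doubles the feature size, so something has to map it back to what the rest of the pipeline expects from a unidirectional encoder.

```python
import tensorflow as tf

def bidirectional_encoder(inputs, lengths, state_size):
    """Encode inputs bidirectionally, then bridge back down to state_size units."""
    cell_fw = tf.contrib.rnn.GRUCell(state_size)
    cell_bw = tf.contrib.rnn.GRUCell(state_size)
    (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=lengths, dtype=tf.float32)
    # Concatenation alone doubles the feature dimension ...
    outputs = tf.concat([out_fw, out_bw], axis=2)    # [batch, time, 2 * state_size]
    state = tf.concat([state_fw, state_bw], axis=1)  # [batch, 2 * state_size]
    # ... so project back down to what a unidirectional decoder expects.
    outputs = tf.layers.dense(outputs, state_size, name="bridge_outputs")
    state = tf.layers.dense(state, state_size, name="bridge_state")
    return outputs, state
```

An alternative to the projection is to make the decoder cell 2 * state_size wide instead.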
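For the beam search item, a minimal sketch of where the suggested values (beam size 10, length penalty 1.0) plug in, using tf.contrib.seq2seq's beam search decoder. Note this API only appears in TF releases newer than the r1.1 branch mentioned above, and the function arguments (decoder cell, embedding, encoder state, output layer, start/end token ids, batch size) are stand-ins for whatever the model already defines.

```python
import tensorflow as tf

def beam_search_decode(decoder_cell, embedding, encoder_state, output_layer,
                       start_id, end_id, batch_size,
                       beam_width=10, length_penalty_weight=1.0):
    """Build a beam search decode op with the suggested beam size / length penalty."""
    # Each beam needs its own copy of the encoder state.
    tiled_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)
    decoder = tf.contrib.seq2seq.BeamSearchDecoder(
        cell=decoder_cell,
        embedding=embedding,
        start_tokens=tf.fill([batch_size], start_id),
        end_token=end_id,
        initial_state=tiled_state,
        beam_width=beam_width,
        output_layer=output_layer,
        length_penalty_weight=length_penalty_weight)
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=100)
    return outputs.predicted_ids  # [batch, time, beam_width]
```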
TODO: Write up the past goals that have already been completed and describe them.
Links:
The following seem to provide the best performance:
- Cornell:
- Optimizer: Adam.
- Learning rate: 0.002.
- State size: 512. Constrained by GPU memory capacity.
- Number of layers: 2.
- Dropout rate: 0.1.
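For reference, the same settings as a plain Python dict; the key names here are illustrative only, not the project's actual config keys.

```python
# Hypothetical config capturing the Cornell settings listed above.
best_cornell_params = {
    "optimizer": "Adam",
    "learning_rate": 0.002,
    "state_size": 512,   # constrained by GPU memory capacity
    "num_layers": 2,
    "dropout_rate": 0.1,
}
```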
Source:
I recently did a small random search and grid search over the following hyperparameters: learning rate, embed size, state size. The plots below show some of the findings. These are simply exploratory; I understand their limitations and I'm not drawing strong conclusions from them. They are meant to give a rough sense of the energy landscape in hyperparameter space. Oh, and plots make me happy. Enjoy. For all plots below, the y-axis is validation loss and the x-axis is the global (training) step. The colors distinguish between the model hyperparameters defined in the legends.
The only takeaway I saw from these two plots (after seeing the learning rate plots below) is that the learning rate, not the embed size, is overwhelmingly responsible for any patterns here. It also looks like models with certain embed sizes (like 30) were underrepresented in the sampling; we see fewer points for them than for the others. The plots below illustrate the learning rate dependence.
General conclusion: the learning rate influences the validation loss far more than the state size or embed size. This was basically known before making these plots, as it is a well-known property of such networks (Ng). It was nice to verify it for myself.
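For anyone wanting to run a similar sweep, here is a minimal sketch of the kind of random search described above over learning rate, embed size, and state size. The sampled ranges are illustrative, and train_and_validate is a hypothetical stand-in for the project's actual training entry point.

```python
import random

def sample_params():
    """Draw one hyperparameter setting; learning rate is sampled log-uniformly."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -2),
        "embed_size": random.choice([30, 64, 128, 256]),
        "state_size": random.choice([128, 256, 512]),
    }

def random_search(train_and_validate, num_trials=20):
    """train_and_validate(**params) -> validation loss (hypothetical entry point)."""
    results = []
    for _ in range(num_trials):
        params = sample_params()
        results.append((train_and_validate(**params), params))
    # Lowest validation loss first.
    return sorted(results, key=lambda r: r[0])
```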