
Dynamic Models


TODOs

Roughly ordered by priority.

  • Freeze a couple models to protobufs for website use.
    • This is much harder than it should be. Why? In one word: dynamic.
    • Note to self: hacky as it is, this could probably work by just starting a chat session as usual and then freezing the model from there. The main error at present is that the dynamic decode graph structure isn't found in the saved meta graph, so the workaround is to save the chat session itself as a meta graph and freeze that (see the freezing sketch after this list). Not the most elegant solution, but it only needs to be done once, and then the frozen model exists forever.
  • Bidirectional encoders (not decoders) in dynamic model components.
    • Need to make the output shapes play nice with unidirectional encoders. Apparently it isn't as simple as just calling concat, since concatenating the forward and backward outputs doubles the feature dimension (see the bidirectional-encoder sketch after this list).
  • Make it easier to customize the underlying RNNCell for the encoder and decoder (separately) via the config.
    • Low priority for now because, e.g., LSTMCell and GRUCell don't return the same types of state objects (LSTMCell returns a (c, h) state tuple, GRUCell a single tensor), which is . . . saddening.
  • Attention mechanism for dynamic models (completed for legacy).
    • Bahdanau (additive) attention seems like the best choice (see the attention sketch after this list).
    • Update: fixed the compilation issue (reverted to the tf r1.1 branch while they figure out the issue I posted).
  • Beam search (see the beam-search sketch after this list). Good values:
    • Beam size: 10.
    • Length penalty: 1.0.
  • Fix bugs in the from-scratch candidate sampling loss function (the sampled-softmax sketch after this list shows the built-in to check against).
  • Remove redundancies in the data naming (data_dir, name, ckpt_dir, constructor name). In theory, only one of these is needed to derive all the others, so long as a strict naming convention is established.
  • Residual skip connections.
  • Look into the preprocessing techniques in the Moses repository.
  • Implement PyTorch seq2seq models, since they look much easier to write (because the computation graph is dynamic).
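
For the freezing TODO above, here is a minimal sketch of the "save the chat session as a meta graph, then freeze it" workaround. It assumes TF 1.x; the paths and the output node name are made up and need to be replaced with whatever the saved session actually uses.

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

# Hypothetical paths and node names; substitute whatever the saved chat session uses.
META_PATH = 'out/chat_session.meta'
CKPT_PREFIX = 'out/chat_session'
OUTPUT_NODES = ['decoder/predictions']  # assumed name of the inference output op

with tf.Session() as sess:
    # Import the meta graph saved from a live chat session, then restore the weights.
    saver = tf.train.import_meta_graph(META_PATH)
    saver.restore(sess, CKPT_PREFIX)
    # Fold the variables into constants so the graph is self-contained.
    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), OUTPUT_NODES)
    with tf.gfile.GFile('out/frozen_chatbot.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())
```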
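
For the bidirectional-encoder TODO, a sketch (TF 1.x, GRU cells; all names are illustrative) of why plain concat isn't enough: concatenating the forward and backward outputs doubles the feature dimension, so the unidirectional decoder/attention code either has to expect 2 * state_size or the tensors have to be projected back down.

```python
import tensorflow as tf

def bidirectional_encode(inputs, sequence_lengths, state_size=512):
    """Returns (outputs, final_state) with the same shapes a unidirectional encoder gives."""
    cell_fw = tf.contrib.rnn.GRUCell(state_size)
    cell_bw = tf.contrib.rnn.GRUCell(state_size)
    (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs,
        sequence_length=sequence_lengths, dtype=tf.float32)

    # Naive concat gives [batch, time, 2 * state_size], which is what breaks
    # shape compatibility with components built for a unidirectional encoder.
    outputs = tf.concat([out_fw, out_bw], axis=-1)
    final_state = tf.concat([state_fw, state_bw], axis=-1)

    # One fix: project back down to state_size so the rest of the graph is unchanged.
    outputs = tf.layers.dense(outputs, state_size, name='bidi_output_proj')
    final_state = tf.layers.dense(final_state, state_size, name='bidi_state_proj')
    return outputs, final_state
```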
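
For the attention TODO, a sketch of Bahdanau (additive) attention using the tf.contrib.seq2seq class names from TF 1.2+ (earlier 1.x releases used different names for the wrapper). The encoder outputs and lengths below are zero-filled placeholders standing in for the real encoder.

```python
import tensorflow as tf

# Placeholder encoder outputs; in the real model these come from the encoder.
batch_size, max_time, state_size = 32, 20, 512
encoder_outputs = tf.zeros([batch_size, max_time, state_size])
source_lengths = tf.fill([batch_size], max_time)

# Bahdanau (additive) attention over the encoder outputs.
attention = tf.contrib.seq2seq.BahdanauAttention(
    num_units=state_size,
    memory=encoder_outputs,
    memory_sequence_length=source_lengths)

# Wrap the decoder cell so every decode step attends over the memory.
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    tf.contrib.rnn.GRUCell(state_size),
    attention,
    attention_layer_size=state_size)
```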
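
For the beam-search TODO, a sketch using the values listed above (beam width 10, length penalty 1.0) and the tf.contrib.seq2seq API as of TF 1.2+. Every size, id, and variable here is a placeholder; a real model would pass the encoder's final state (tiled once per beam) instead of the zero state.

```python
import tensorflow as tf
from tensorflow.python.layers.core import Dense

# Placeholder sizes and special-token ids.
batch_size, state_size, vocab_size = 32, 512, 40000
START_ID, END_ID = 1, 2
BEAM_WIDTH, LENGTH_PENALTY = 10, 1.0

embedding = tf.get_variable('embedding', [vocab_size, state_size])
decoder_cell = tf.contrib.rnn.GRUCell(state_size)

decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=embedding,
    start_tokens=tf.fill([batch_size], START_ID),
    end_token=END_ID,
    # A real model would pass the (tiled) encoder final state here.
    initial_state=decoder_cell.zero_state(batch_size * BEAM_WIDTH, tf.float32),
    beam_width=BEAM_WIDTH,
    output_layer=Dense(vocab_size, use_bias=False),
    length_penalty_weight=LENGTH_PENALTY)

# predicted_ids has shape [batch, max_time, beam_width].
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=50)
predicted_ids = outputs.predicted_ids
```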
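
For the candidate-sampling TODO, the built-in that the from-scratch version can be checked against is tf.nn.sampled_softmax_loss. A sketch with zero-filled placeholder tensors (all names and sizes are assumptions):

```python
import tensorflow as tf

# Placeholder shapes; in the real model these come from the decoder.
batch_size, time_steps, state_size, vocab_size = 32, 20, 512, 40000
decoder_outputs = tf.zeros([batch_size, time_steps, state_size])
targets = tf.zeros([batch_size, time_steps], dtype=tf.int64)

# Output projection stored transposed ([vocab, state]), as sampled_softmax_loss expects.
proj_w = tf.get_variable('proj_w', [vocab_size, state_size])
proj_b = tf.get_variable('proj_b', [vocab_size])

# sampled_softmax_loss wants flat [N, state_size] inputs and [N, 1] labels.
flat_outputs = tf.reshape(decoder_outputs, [-1, state_size])
flat_labels = tf.reshape(targets, [-1, 1])

losses = tf.nn.sampled_softmax_loss(
    weights=proj_w, biases=proj_b,
    labels=flat_labels, inputs=flat_outputs,
    num_sampled=512, num_classes=vocab_size)
loss = tf.reduce_mean(losses)
```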

Architecture Goals

TODO: Write up the goals that have already been completed and describe them.

Fast Weights

Links:

Hyperparameters

The following seem to provide the best performance:

  • Cornell:
    • Optimizer: Adam.
    • Learning rate: 0.002.
    • State size: 512. Constrained by GPU memory capacity.
    • Number of layers: 2.
    • Dropout rate: 0.1.
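
Written out as a config-style snippet (key names are purely illustrative, not necessarily what the repo's config files actually call them):

```python
# Hypothetical config for the Cornell setup above.
cornell_hparams = {
    'optimizer': 'Adam',
    'learning_rate': 0.002,
    'state_size': 512,   # larger values exceeded GPU memory
    'num_layers': 2,
    'dropout_prob': 0.1,
}
```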

Massive Exploration of Neural Machine Translation Architectures

Source: Britz, Goldie, Luong & Le, "Massive Exploration of Neural Machine Translation Architectures" (2017), arXiv:1703.03906.


Old Plots

Check 2: Random & Grid Search Plots

I recently did a small random search and grid search over the following hyperparameters: learning rate, embed size, and state size. The plots below show some of the findings. These are purely exploratory; I understand their limitations and I'm not drawing strong conclusions from them. They're meant to give a rough sense of the energy landscape in hyperparameter space. Oh, and plots make me happy. Enjoy. For everything below, the y-axis is validation loss and the x-axis is the global (training) step. The colors distinguish the model hyperparameters defined in the legends.

[Plots: validation loss vs. global step, grouped by state_size and by embed_size.]

The only takeaway I saw from these two plots (after seeing the learning-rate plots below) is that the learning rate, not the embed size, is overwhelmingly responsible for any patterns here. It also looks like models with certain embed sizes (like 30) were underrepresented in the sampling; we see fewer points for them than for the others. The plots below illustrate the learning-rate dependence.

[Plots: validation loss vs. global step, grouped by learning rate.]

General conclusion: the learning rate influences the validation loss far more than either the state size or the embed size. This was basically known before making these plots, as it's a well-known property of such networks (Ng). Still, it was nice to verify it for myself.
