Dynamic Models
Roughly ordered by priority.
- Incorporate BLEU score into evaluation metrics.
- Attention mechanism for dynamic models (already completed for the legacy models).
- Fix bugs in the from-scratch candidate sampling loss function.
- Remove redundant naming parameters (data_dir, name, ckpt_dir, constructor name). In theory, only one of these is needed to derive all the others, so long as a strict naming convention is established.
- Residual skip connections (see the sketch below).
TODO: Document goals that have already been completed and describe them.
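As a rough illustration of the residual skip connections item, here is a minimal sketch using TF 1.x's `tf.contrib.rnn` wrappers. The helper name and its defaults are mine, not this repo's, and it assumes stacked GRU cells where each layer's input and output share the same dimensionality:

```python
import tensorflow as tf

def build_residual_cells(state_size=512, num_layers=2, dropout=0.1):
    """Hypothetical helper: stacked GRU cells with residual skip connections."""
    cells = []
    for i in range(num_layers):
        cell = tf.contrib.rnn.GRUCell(state_size)
        cell = tf.contrib.rnn.DropoutWrapper(
            cell, output_keep_prob=1.0 - dropout)
        if i > 0:
            # ResidualWrapper adds the layer's input to its output,
            # i.e. output = input + cell(input). Input and output must
            # therefore have the same size (true here: both state_size).
            cell = tf.contrib.rnn.ResidualWrapper(cell)
        cells.append(cell)
    return tf.contrib.rnn.MultiRNNCell(cells)
```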
The following seem to provide the best performance:
- Cornell:
- Optimizer: Adam.
- Learning rate: 0.002.
- State size: 512. Constrained by GPU memory capacity.
- Number of layers: 2.
- Dropout rate: 0.1.
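For concreteness, those settings could be collected into a flat config dict like the sketch below; the key names are illustrative, not necessarily the ones this repo's constructors expect:

```python
# Best-performing Cornell settings from above (key names are assumptions).
best_cornell_params = {
    'optimizer': 'Adam',
    'learning_rate': 0.002,
    'state_size': 512,    # constrained by GPU memory capacity
    'num_layers': 2,
    'dropout_prob': 0.1,
}
```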
Source:
I recently did a small random search and grid search over the following hyperparameters: learning rate, embed size, and state size. The plots below show some of the findings. These are simply exploratory; I understand their limitations and I'm not drawing strong conclusions from them. They are meant to give a rough sense of the energy landscape in hyperparameter space. Oh, and plots make me happy. Enjoy. For all plots below, the y-axis is validation loss and the x-axis is the global (training) step. The colors distinguish the model hyperparameters defined in the legends.
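For anyone curious what such a search can look like, here is a minimal random-search sketch; the sampling ranges and value sets are assumptions for illustration (only embed size 30 is mentioned below), not the ones actually used:

```python
import numpy as np

def sample_hparams(rng):
    """Draw one hyperparameter configuration at random."""
    return {
        # Learning rate spans orders of magnitude, so sample it log-uniformly.
        'learning_rate': float(10 ** rng.uniform(-4, -1)),
        'embed_size': int(rng.choice([30, 64, 128, 256])),
        'state_size': int(rng.choice([128, 256, 512])),
    }

rng = np.random.RandomState(0)
configs = [sample_hparams(rng) for _ in range(20)]
# Each config would then be trained, logging validation loss per global step.
```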
The only takeaway I saw from these two plots (after seeing the learning rate plots below) is that the learning rate, not the embed size, is overwhelmingly responsible for any patterns here. It also looks like models with certain embed sizes (like 30) were underrepresented in the sampling; we see fewer points for them than for others. The plots below illustrate the learning rate dependence.
General conclusion: the learning rate influences the validation loss far more than the state size or embed size. This was basically known before making these plots, as it is a well-known property of such networks (Ng). It was nice to verify it for myself.