Dynamic Models
Roughly ordered by priority.
- Incorporate BLEU score into evaluation metrics.
- Attention mechanism for dynamic models (already completed for the legacy models).
- Fix bugs in the from-scratch candidate sampling loss function.
- Remove redundant naming parameters (data_dir, name, ckpt_dir, constructor name). In theory, only one of these is needed to derive all the others, so long as a strict naming convention is established.
- Residual skip connections (see the sketch below).
TODO: Document goals that have already been completed and describe them.
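As a rough illustration of the residual skip connections item, here is a minimal sketch using TF 1.x's `tf.contrib.rnn` wrappers. The helper name and its defaults are mine, not this repo's, and it assumes stacked GRU cells where each layer's input and output share the same dimensionality:

```python
import tensorflow as tf

def build_residual_cells(state_size=512, num_layers=2, dropout=0.1):
    """Hypothetical helper: stacked GRU cells with residual skip connections."""
    cells = []
    for i in range(num_layers):
        cell = tf.contrib.rnn.GRUCell(state_size)
        cell = tf.contrib.rnn.DropoutWrapper(
            cell, output_keep_prob=1.0 - dropout)
        if i > 0:
            # ResidualWrapper adds the layer's input to its output,
            # i.e. output = input + cell(input). Input and output must
            # therefore have the same size (true here: both state_size).
            cell = tf.contrib.rnn.ResidualWrapper(cell)
        cells.append(cell)
    return tf.contrib.rnn.MultiRNNCell(cells)
```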
The following seem to provide the best performance:
- Cornell:
- Optimizer: Adam.
- Learning rate: 0.002.
- State size: 512. Constrained by GPU memory capacity.
- Number of layers: 2.
- Dropout rate: 0.1.
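For concreteness, those settings could be collected into a flat config dict like the sketch below; the key names are illustrative, not necessarily the ones this repo's constructors expect:

```python
# Best-performing Cornell settings from above (key names are assumptions).
best_cornell_params = {
    'optimizer': 'Adam',
    'learning_rate': 0.002,
    'state_size': 512,    # constrained by GPU memory capacity
    'num_layers': 2,
    'dropout_prob': 0.1,
}
```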
Source:
I recently did a small random search and grid search over the following hyperparameters: learning rate, embed size, and state size. The plots below show some of the findings. These are simply exploratory; I understand their limitations and I'm not drawing strong conclusions from them. They are meant to give a rough sense of the energy landscape in hyperparameter space. Oh, and plots make me happy. Enjoy. For all plots below, the y-axis is validation loss and the x-axis is the global (training) step. The colors distinguish the model hyperparameters defined in the legends.
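For anyone curious what such a search can look like, here is a minimal random-search sketch; the sampling ranges and value sets are assumptions for illustration (only embed size 30 is mentioned below), not the ones actually used:

```python
import numpy as np

def sample_hparams(rng):
    """Draw one hyperparameter configuration at random."""
    return {
        # Learning rate spans orders of magnitude, so sample it log-uniformly.
        'learning_rate': float(10 ** rng.uniform(-4, -1)),
        'embed_size': int(rng.choice([30, 64, 128, 256])),
        'state_size': int(rng.choice([128, 256, 512])),
    }

rng = np.random.RandomState(0)
configs = [sample_hparams(rng) for _ in range(20)]
# Each config would then be trained, logging validation loss per global step.
```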
The only takeaway I saw from these two plots (after seeing the learning rate plots below) is that the learning rate, not the embed size, is overwhelmingly responsible for any patterns here. It also looks like models with certain embed sizes (like 30) were underrepresented in the sampling; we see fewer points for them than for others. The plots below illustrate the learning rate dependence.
General conclusion: the learning rate influences the validation loss far more than the state size or embed size. This was basically known before making these plots, as it is a well-known property of such networks (Ng). It was nice to verify it for myself.