FastNLP Tutorial
from fastNLP.fastnlp import FastNLP
PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"
nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)
nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")
text = ["这是最好的基于深度学习的中文分词系统。",
            "大王叫我来巡山。",
            "我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)
# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')],
#  [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')],
#  [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]

def train():
    # Load configuration with a ConfigLoader
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader("_").load_config(config_dir, {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})
    # Load data with a DataLoader
    pos_loader = POSDatasetLoader(data_path)
    train_data = pos_loader.load_lines()
    # Pre-processing: generate DataSet objects
    p = Preprocess()
    data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = p.vocab_size
    model_args["num_classes"] = p.num_classes
    
    # Define a trainer
    trainer = Trainer(
        task="seq_label",
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=False,
        use_cuda=False,
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )
    # Define a model
    model = SeqLabeling(model_args)
    # Start training
    trainer.train(model, data_train, data_dev)
    # Save model with a ModelSaver
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)

Before you start a new task, you should have the corresponding datasets in hand. Implement a dataset loader that is a subclass of DatasetLoader in dataset_loader.py.
Your dataset loader is responsible for transforming raw data into three-level Python lists. For example,
[
    [[token_1, token_2, token_3, ...], [label_1, label_2, label_3, ...]],
    ...
]

The first dimension of your Python lists must be the number of examples. You are free to design the remaining dimensions, because you are responsible for parsing them in the next section.
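As a rough illustration of the three-level layout, the sketch below parses a hypothetical file format: one `token<TAB>label` pair per line, with a blank line between examples. The function name `load_token_label_lines` and the file format are assumptions made for this example; they are not part of fastNLP's DatasetLoader API.

```python
from typing import Iterable, List


def load_token_label_lines(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group "token<TAB>label" lines into [[tokens, labels], ...]."""
    examples: List[List[List[str]]] = []
    tokens: List[str] = []
    labels: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current example, if there is one.
            if tokens:
                examples.append([tokens, labels])
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    # Flush the last example when the input does not end with a blank line.
    if tokens:
        examples.append([tokens, labels])
    return examples


sample = ["This\tlabel_1", "is\tlabel_3", "", "fast\tlabel_2", "NLP\tlabel_1"]
data = load_token_label_lines(sample)
```

Here `len(data)` is the number of examples, and each example holds one list of tokens and one list of labels.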
Preprocessor transforms the three-level lists mentioned above into DataSet object(s) by constructing Fields in the convert_to_dataset method. Currently, different structures of the three-level lists lead to different field constructions. You are free to implement your own construction method there.
for example in data:
    words, label = example[0], example[1]
    instance = Instance()
    if isinstance(words, list):
        x = TextField(words, is_target=False)
        instance.add_field("word_seq", x)
        use_word_seq = True
    else:
        raise NotImplementedError("words is a {}".format(type(words)))
    if isinstance(label, list):
        y = TextField(label, is_target=True)
        instance.add_field("label_seq", y)
        use_label_seq = True
    elif isinstance(label, str):
        y = LabelField(label, is_target=True)
        instance.add_field("label", y)
        use_label_str = True
    else:
        raise NotImplementedError("label is a {}".format(type(label)))
    data_set.append(instance)

FastNLP uses a config file to store 1) model hyper-parameters; 2) trainer settings.
The config file is a text file. It contains any number of config sections. Each section contains any number of configurations. A configuration is a key-value pair joined by =.
For example,
# test.cfg
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"
[train]
learning_rate = 0.0001
pickle_path = "./save/"
validate = true
save_dev_output = false
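The section and key-value layout above is the familiar INI style. Purely as an illustration, and not as a description of how ConfigLoader is implemented, the sketch below reads the same text with Python's standard configparser module to show how typed values might be pulled out of each section.

```python
import configparser

CONFIG_TEXT = """
# test.cfg
[model]
vocab_size = 100
num_hidden_layers = 2
use_drop_out = false
pickle_path = "./save/"
[train]
learning_rate = 0.0001
pickle_path = "./save/"
validate = true
save_dev_output = false
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

model_cfg = parser["model"]
train_cfg = parser["train"]

# Values are stored as strings; the typed getters convert them on request.
vocab_size = model_cfg.getint("vocab_size")
use_drop_out = model_cfg.getboolean("use_drop_out")
learning_rate = train_cfg.getfloat("learning_rate")
# configparser keeps the surrounding quotes, so strip them explicitly.
pickle_path = train_cfg.get("pickle_path").strip('"')
```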
Load config sections with ConfigLoader.
trainer_args = ConfigSection()
model_args = ConfigSection()
ConfigLoader().load_config("./test.cfg", {"train": trainer_args, "model": model_args})

Currently, the trainer supports only a few tasks. You can add more in data_forward. The same applies to the tester.
def data_forward(self, network, x):
    if self._task == "seq_label":
        y = network(x["word_seq"], x["word_seq_origin_len"])
    elif self._task == "text_classify":
        y = network(x["word_seq"])
    else:
        raise NotImplementedError("Unknown task type {}.".format(self._task))

trainer = Trainer(trainer_args)
model = SeqLabeling(model_args)
trainer.train(model, data_train, data_dev)

            loader                    preprocessor         Batch
raw dataset ------> 2-D list of strings ------->  DataSet -------> data_iterator ------> batch_x 
                                                                                         batch_y
pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = pos_loader.load_lines()
"""
[
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
    ...
]
"""

p = SeqLabelPreprocess()
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet

DataSet
[
    Instance(Field_1, Field_2, Field_3, ...),
    Instance(Field_1, Field_2, Field_3, ...),
    ...
]
data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]
    y = network(x)
    get_loss(y, batch_y["label_seq"])

dataset.py defines DataSet, which is a list of Instances.
instance.py defines Instance, which is a single example and contains multiple Fields.
field.py defines Field, which is the elementary data type or representation.
TextField defines a list of strings. LabelField defines a single integer or string.
You can add extra fields to support more complex data.
Each field
- has a field name
- has an is_target boolean argument to specify whether it is Y or not (X) in training
- has a to_tensor method to define how this field's data is transformed into tensors
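The relationship between fields and instances can be pictured with the minimal sketch below. The classes `SketchTextField` and `SketchInstance` are hypothetical stand-ins written for this example, not fastNLP's own Field and Instance, and `to_indices` returns a plain list of indices where a real to_tensor method would produce a tensor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SketchTextField:
    """Hypothetical stand-in for a text field: a list of strings."""
    tokens: List[str]
    is_target: bool = False

    def to_indices(self, vocab: Dict[str, int], unk: int = 1) -> List[int]:
        # Stand-in for to_tensor: map each token to its vocabulary index.
        return [vocab.get(tok, unk) for tok in self.tokens]


@dataclass
class SketchInstance:
    """One example: a mapping from field name to field."""
    fields: Dict[str, SketchTextField] = field(default_factory=dict)

    def add_field(self, name: str, value: SketchTextField) -> None:
        self.fields[name] = value

    def split_xy(self) -> Tuple[Dict[str, SketchTextField], Dict[str, SketchTextField]]:
        # is_target decides whether a field belongs to Y or to X.
        x = {n: f for n, f in self.fields.items() if not f.is_target}
        y = {n: f for n, f in self.fields.items() if f.is_target}
        return x, y


inst = SketchInstance()
inst.add_field("word_seq", SketchTextField(["This", "is"], is_target=False))
inst.add_field("label_seq", SketchTextField(["label_1", "label_3"], is_target=True))
x, y = inst.split_xy()
word_ids = x["word_seq"].to_indices({"<pad>": 0, "<unk>": 1, "This": 2})
```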
dataset.py defines a function to make DataSet from a list.
def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:

Examples:
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14
batch.py defines Batch, an iterable wrapper of DataSet.
Sampling and padding are applied inside.
Iterating over a Batch object yields two dicts, batch_x and batch_y.
The keys of each dict are field names. The values are the corresponding batch tensors.
data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
   batch_x["word_seq"]  # torch.LongTensor
   batch_y["label"]  # torch.LongTensor

Batch keeps a record of the original length of each field before padding.
It returns the original lengths under a string key created by appending "_origin_len" to the field name.
For example, batch_x["word_seq_origin_len"]  # torch.LongTensor.
Why is the original length a tensor rather than a list of ints?
Because the sequence labeling model's forward() takes an extra argument, seq_len, that represents the original lengths (the creation of sequence masks has moved into the model, which needs seq_len), and tensorboardX requires every argument passed to forward() to be a tensor.
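The idea of padding while remembering the original lengths, and then deriving a mask from those lengths, can be sketched as follows. This is an illustration under stated assumptions: it uses plain Python lists instead of tensors to stay dependency-free, and the helpers `pad_batch` and `make_mask` are hypothetical names, not functions from batch.py.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest one and record the original lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths) if lengths else 0
    padded = [s + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths


def make_mask(lengths: List[int], max_len: int) -> List[List[int]]:
    """Build a mask where 1 marks a real token and 0 marks padding."""
    return [[1 if i < n else 0 for i in range(max_len)] for n in lengths]


word_seq, word_seq_origin_len = pad_batch([[5, 8, 2], [7], [4, 9]])
mask = make_mask(word_seq_origin_len, max_len=3)
```

Because the mask is computed from the lengths alone, a model that receives the lengths can rebuild it internally, which is the arrangement the text describes.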
In the previous design, different trainers were responsible for different tasks.
After the introduction of Fields & DataSet, different tasks are represented by different DataSet structures, that is, by the way Fields are organized.
Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed. They are now just empty subclasses kept for deprecation, and will emit a warning when used.
The same goes for those in the Testers and Predictor.
However, trainers still need task information to know which fields in batch_x are network inputs. This is because
- we don't know which task will be performed when preprocessing and making the DataSet.
- not all fields in batch_x are needed as network input. Some may be unused, such as seq_len in text classification.
- in tester, different tasks require different evaluation methods and metrics.
- in predictor, different tasks require different pre-processing and post-processing.
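The second point can be pictured as a lookup from task name to the batch_x keys that task actually feeds to the network. The sketch below is an assumption-laden illustration: the table `TASK_INPUT_FIELDS` and the helper `select_network_inputs` are hypothetical and do not exist in fastNLP; only the task names and field names are taken from this tutorial.

```python
from typing import Any, Dict, List

# Hypothetical table: which batch_x keys each task passes to the network.
TASK_INPUT_FIELDS: Dict[str, List[str]] = {
    "seq_label": ["word_seq", "word_seq_origin_len"],
    "text_classify": ["word_seq"],
}


def select_network_inputs(task: str, batch_x: Dict[str, Any]) -> List[Any]:
    """Return the batch_x values a given task passes to the network, in order."""
    if task not in TASK_INPUT_FIELDS:
        raise NotImplementedError("Unknown task type {}.".format(task))
    return [batch_x[name] for name in TASK_INPUT_FIELDS[task]]


batch_x = {"word_seq": [[2, 3, 0]], "word_seq_origin_len": [2]}
seq_label_inputs = select_network_inputs("seq_label", batch_x)
classify_inputs = select_network_inputs("text_classify", batch_x)
```

With such a table, the call site could reduce to `network(*inputs)`, and text classification would simply ignore the length field.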
 
Trainer & Tester take a required argument (an error is raised if it is not provided; there is NO default value), stored as self._task, that specifies which task to perform.
if self._task == "seq_label": 
    y = network(x["word_seq"], x["word_seq_origin_len"]) 
elif self._task == "text_classify": 
    y = network(x["word_seq"])

- design a PyTorch model with a forward method.
- choose fields or create new fields to describe your data set.
- modify the preprocessor's build_dict method to build a dictionary over your data, and use the dictionary to transform the multi-level list of strings into a multi-level list of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
- modify the preprocessor's convert_to_dataset method to transform the multi-level list of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
- specify which fields you want to use as network inputs in Trainer, Tester, and Predictor. Wherever self._task appears, a modification is needed.
- run and debug.
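The dictionary-building step in the list above can be sketched as follows: collect every distinct token into a vocabulary, then map lists of strings to lists of indices. This is a minimal illustration, not the body of the preprocessor's build_dict; the helper names and the reserved `<pad>` and `<unk>` entries are assumptions made for this example.

```python
from typing import Dict, List


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign an index to every distinct token, reserving 0 and 1."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def to_indices(token_lists: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Replace each token by its index; unseen tokens map to <unk>."""
    unk = vocab["<unk>"]
    return [[vocab.get(tok, unk) for tok in tokens] for tokens in token_lists]


train_words = [["This", "is", "fast", "NLP"], ["This", "is", "it"]]
word_vocab = build_vocab(train_words)
indexed = to_indices([["This", "was", "fast"]], word_vocab)
```

The word "was" never appears in `train_words`, so it falls back to the `<unk>` index.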
 
- optimize Preprocessor: make it a callable object that takes a customized processing function as an argument
- more unit tests on core/
- eliminate self._task?
- merge kezhen's code about build_dict
 
Any questions are welcome!