FastNLP Tutorial
            loader                    preprocessor         Batch
raw dataset ------> 2-D list of strings ------->  DataSet -------> data_iterator ------> batch_x 
                                                                                         batch_y
pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
train_data = pos_loader.load_lines()
"""
[
    [["This", "is", "fast", "NLP"], ["label_1", "label_3", "label_2", "label_1"]],
    ...
]
"""
p = SeqLabelPreprocess()
data_train, data_dev = p.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
# type(data_train) == DataSet
# type(data_dev) == DataSet

DataSet
[
    Instance(Field_1, Field_2, Field_3, ...),
    Instance(Field_1, Field_2, Field_3, ...),
    ...
]
data_iterator = Batch(data_train, batch_size=16, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
    x = batch_x["word_seq"]
    y = network(x)
    get_loss(y, batch_y["label_seq"])

from fastNLP.fastnlp import FastNLP
PATH_TO_CWS_PICKLE_FILES = "/home/zyfeng/fastNLP/reproduction/chinese_word_segment/save/"
nlp = FastNLP(model_dir=PATH_TO_CWS_PICKLE_FILES)
nlp.load("cws_basic_model", config_file="cws.cfg", section_name="POS_test")
text = ["这是最好的基于深度学习的中文分词系统。",
            "大王叫我来巡山。",
            "我党多年来致力于改善人民生活水平。"]
results = nlp.run(text)
# [[('这', 'S'), ('是', 'S'), ('最', 'S'), ('好', 'S'), ('的', 'S'), ('基', 'B'), ('于', 'E'), ('深', 'B'), ('度', 'E'), ('学', 'B'), ('习', 'E'), ('的', 'S'), ('中', 'B'), ('文', 'E'), ('分', 'B'), ('词', 'E'), ('系', 'B'), ('统', 'E'), ('。', 'S')], [('大', 'B'), ('王', 'E'), ('叫', 'S'), ('我', 'S'), ('来', 'S'), ('巡', 'B'), ('山', 'E'), ('。', 'S')], [('我', 'B'), ('党', 'E'), ('多', 'S'), ('年', 'S'), ('来', 'S'), ('致', 'B'), ('力', 'E'), ('于', 'S'), ('改', 'B'), ('善', 'E'), ('人', 'B'), ('民', 'E'), ('生', 'B'), ('活', 'E'), ('水', 'B'), ('平', 'E'), ('。', 'S')]]

def train_and_test():
    # Load config section from config file
    trainer_args = ConfigSection()
    model_args = ConfigSection()
    ConfigLoader().load_config("./data/config", {
        "test_seq_label_trainer": trainer_args, "test_seq_label_model": model_args})
    # Load data with data loader
    pos_loader = POSDatasetLoader("./data/pos_tag_data.txt")
    train_data = pos_loader.load_lines()
    # Preprocessor: 2-D list of strings ----> DataSet
    preprocess = SeqLabelPreprocess()
    data_train, data_dev = preprocess.run(train_data, pickle_path=pickle_path, train_dev_split=0.5)
    model_args["vocab_size"] = preprocess.vocab_size
    model_args["num_classes"] = preprocess.num_classes
    # Define trainer
    trainer = Trainer(
        epochs=trainer_args["epochs"],
        batch_size=trainer_args["batch_size"],
        validate=trainer_args["validate"],
        use_cuda=trainer_args["use_cuda"],
        pickle_path=pickle_path,
        save_best_dev=trainer_args["save_best_dev"],
        model_name=model_name,
        optimizer=Optimizer("SGD", lr=0.01, momentum=0.9),
    )
    # Define a model
    model = SeqLabeling(model_args)
    # Start training
    trainer.train(model, data_train, data_dev)
    print("Training finished!")
    # Define Saver and save a model
    saver = ModelSaver(os.path.join(pickle_path, model_name))
    saver.save_pytorch(model)
    print("Model saved!")
    del model, trainer, pos_loader
    # Define the same model
    model = SeqLabeling(model_args)
    # Load trained weights into the model
    ModelLoader.load_pytorch(model, os.path.join(pickle_path, model_name))
    print("model loaded!")
    # Load test configuration
    tester_args = ConfigSection()
    ConfigLoader("config.cfg").load_config(config_dir, {"test_seq_label_tester": tester_args})
    # Define a tester
    tester = Tester(save_output=False,
                            save_loss=False,
                            save_best_dev=False,
                            batch_size=4,
                            use_cuda=False,
                            pickle_path=pickle_path,
                            model_name="seq_label_in_test.pkl",
                            print_every_step=1
                            )
    # Start testing
    tester.test(model, data_dev)
    print(tester.show_metrics())

dataset.py defines DataSet, which is a list of Instances.
instance.py defines Instance, which is a single example and contains multiple Fields.
field.py defines Field, which is the elementary data type or representation.
TextField defines a list of strings. LabelField defines a single integer or string.
You can add extra fields to support more complex data.
Each field
- has a field name
- has an is_target boolean argument to specify whether it is Y (a target) or X (an input) in training.
- has a to_tensor method to define how the field's data is transformed into tensors.
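The relationship between Field, Instance, and DataSet described above can be sketched roughly as follows. This is a simplified illustration, not the fastNLP source: the constructor signatures, the use of plain Python lists in place of tensors, the zero padding value, and the choice to store the field name as a key inside Instance are all assumptions made for the example.

```python
from typing import Dict, List


class Field:
    """Base class: one piece of data inside an example (hypothetical sketch)."""

    def __init__(self, is_target: bool = False):
        self.is_target = is_target  # True -> belongs to Y, False -> belongs to X

    def to_tensor(self, padding_length: int) -> List[int]:
        raise NotImplementedError


class TextField(Field):
    """A sequence of tokens, stored here as already-indexed integers."""

    def __init__(self, indices: List[int], is_target: bool = False):
        super().__init__(is_target)
        self.indices = indices

    def to_tensor(self, padding_length: int) -> List[int]:
        # Right-pad with 0 up to padding_length; a real field would build a tensor.
        return self.indices + [0] * (padding_length - len(self.indices))


class LabelField(Field):
    """A single integer label."""

    def __init__(self, label: int, is_target: bool = True):
        super().__init__(is_target)
        self.label = label

    def to_tensor(self, padding_length: int = 0) -> List[int]:
        return [self.label]


class Instance:
    """A single example: a mapping from field name to Field."""

    def __init__(self, **fields: Field):
        self.fields: Dict[str, Field] = dict(fields)


# In this sketch a DataSet is simply a list of Instances.
data_set = [
    Instance(word_seq=TextField([4, 7, 2]), label=LabelField(1)),
    Instance(word_seq=TextField([5, 3]), label=LabelField(0)),
]
```

Supporting a new kind of data would then mean adding another Field subclass with its own to_tensor.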
dataset.py defines a function to make DataSet from a list.
def create_dataset_from_lists(str_lists: list, word_vocab: dict, has_target: bool = False, label_vocab: dict = None) -> DataSet:
Example:
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_tester.py#L15
https://github.com/fastnlp/fastNLP/blob/ad044ef4c76c2c52da6e732a67ff2001e7a677d5/test/core/test_trainer.py#L14
batch.py defines Batch, an iterable wrapper of DataSet.
Sampling and padding are applied internally.
Iterating over a Batch object yields two dicts, batch_x and batch_y.
Each key is a field name, and each value is the corresponding batch tensor.
data_iterator = Batch(data_set, batch_size=8, sampler=RandomSampler(), use_cuda=False)
for batch_x, batch_y in data_iterator:
   batch_x["word_seq"]  # torch.LongTensor
   batch_y["label"]  # torch.LongTensor

Batch keeps a record of the original length of each field before padding.
It returns the original lengths under a string key formed by appending "_origin_len" to the field name.
For example,  batch_x["word_seq_origin_len"]  # torch.LongTensor.
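As a rough illustration of the padding and length bookkeeping, the sketch below builds one batch by hand. The make_batch helper, its (word_indices, label) input format, and the pad index 0 are hypothetical; plain lists stand in for the LongTensors that Batch returns.

```python
from typing import Dict, List, Tuple


def make_batch(
    examples: List[Tuple[List[int], int]], pad_index: int = 0
) -> Tuple[Dict[str, list], Dict[str, list]]:
    """Pad one batch of (word_indices, label) pairs and record original lengths."""
    max_len = max(len(words) for words, _ in examples)
    batch_x = {
        # Every sequence is right-padded to the longest one in the batch.
        "word_seq": [words + [pad_index] * (max_len - len(words)) for words, _ in examples],
        # Lengths before padding, stored under field name + "_origin_len".
        "word_seq_origin_len": [len(words) for words, _ in examples],
    }
    batch_y = {"label": [label for _, label in examples]}
    return batch_x, batch_y


batch_x, batch_y = make_batch([([4, 7, 2], 1), ([5, 3], 0)])
print(batch_x["word_seq"])             # [[4, 7, 2], [5, 3, 0]]
print(batch_x["word_seq_origin_len"])  # [3, 2]
```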
Why is the original length a tensor rather than a list of ints?
Because the sequence labeling model's forward() takes an extra argument, seq_len, that holds the original lengths (the creation of sequence masks has moved into the model, which needs seq_len), and tensorboardX requires every argument passed to forward() to be a tensor.
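To make the role of seq_len concrete, here is a minimal sketch of how a mask can be derived from the original lengths. The sequence_mask function is a hypothetical helper written with plain Python lists; an actual model would presumably do the equivalent with tensor operations inside forward().

```python
from typing import List


def sequence_mask(seq_len: List[int], max_len: int) -> List[List[int]]:
    """Return a batch_size x max_len mask: 1 for real tokens, 0 for padding."""
    return [[1 if position < length else 0 for position in range(max_len)]
            for length in seq_len]


mask = sequence_mask([3, 2], max_len=4)
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```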
In the previous design, different trainers were responsible for different tasks.
After the introduction of Fields and DataSet, different tasks are represented by different DataSet structures, that is, by how the Fields are organized.
Therefore, all methods in SeqLabelTrainer and ClassificationTrainer have been removed. These classes remain only as empty, deprecated subclasses that issue a warning when used.
The same goes for the corresponding Tester and Predictor subclasses.
However, trainers still need task information to know which fields in batch_x are network inputs. This is because
- we don't know which task will be performed when preprocessing and building the DataSet.
- not all fields in batch_x are needed as network input. Some may be unused, such as seq_len in text classification.
- in the tester, different tasks require different evaluation methods and metrics.
- in the predictor, different tasks require different pre-processing and post-processing.
 
Trainer & Tester have a required argument, self._task, that specifies which task to perform. It has no default value, and an error is raised if it is not provided.
if self._task == "seq_label": 
    y = network(x["word_seq"], x["word_seq_origin_len"]) 
elif self._task == "text_classify": 
    y = network(x["word_seq"])

- design a PyTorch model with a forward method.
- choose fields or create new fields to describe your data set.
- modify the preprocessor's build_dict method to build a dictionary over your data and use it to transform a multi-level list of strings into a multi-level list of indices. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L182
- modify the preprocessor's convert_to_dataset method to transform the multi-level list of indices into a DataSet object. https://github.com/fastnlp/fastNLP/blob/ef3c753e0db37c710f4068403c6efde4fcb9c3c4/fastNLP/core/preprocess.py#L244
- specify which fields you want to use as network inputs in Trainer, Tester, and Predictor. Modifications are needed wherever self._task appears.
- run and debug.
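The dictionary-building step in the list above might look something like the sketch below. It is not the preprocessor's actual build_dict: the function names, the reserved "<pad>" and "<unk>" tokens, and their indices are assumptions chosen for illustration.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens


def build_dict(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign an index to every distinct word, in order of first appearance."""
    word2index = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence:
            word2index.setdefault(word, len(word2index))
    return word2index


def to_index(sentences: List[List[str]], word2index: Dict[str, int]) -> List[List[int]]:
    """Map a 2-level list of strings to a 2-level list of indices."""
    return [[word2index.get(word, word2index[UNK]) for word in sentence]
            for sentence in sentences]


corpus = [["This", "is", "fast", "NLP"], ["NLP", "is", "fun"]]
vocab = build_dict(corpus)
print(to_index([["fast", "NLP", "rocks"]], vocab))  # [[4, 5, 1]]
```

Words that were never seen while building the dictionary fall back to the unknown-word index, which keeps the conversion from failing on new data.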
 
- optimize Preprocessor: make it a callable object, customized processing function as argument
- more unit tests on core/
- eliminate self._task?
- merge kezhen's code about build_dict
 
Any questions are welcome!