
Configuration

VincentFoulon80 edited this page Jun 7, 2019 · 13 revisions

Basic Configuration

You can configure the search engine by passing an array as the first parameter of the constructor:

$engine = new Engine($myConfiguration);

Here's the default configuration array:

$default = [
    "config" => [
        "var_dir" => $_SERVER['DOCUMENT_ROOT'].DIRECTORY_SEPARATOR."var",
        "index_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."index",
        "documents_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."documents",
        "cache_dir" => DIRECTORY_SEPARATOR."engine".DIRECTORY_SEPARATOR."cache",
        "fuzzy_cost" => 1
    ],
    "schemas" => [
        "example-post" => [
            "title" => [
                "_type" => "string",
                "_indexed" => true,
                "_boost" => 10
            ],
            "content" => [
                "_type" => "text",
                "_indexed" => true,
                "_boost" => 0.5
            ],
            "date" => [
                "_type" => "datetime",
                "_indexed" => true,
                "_boost" => 2
            ],
            "categories" => [
                "_type" => "list",
                "_type." => "string",
                "_indexed" => true,
                "_filterable" => true,
                "_boost" => 6
            ],
            "comments" => [
                "_type" => "list",
                "_type." => "array",
                "_array" => [
                    "author" => [
                        "_type" => "string",
                        "_indexed" => true,
                        "_filterable" => true,
                        "_boost" => 1
                    ],
                    "date" => [
                        "_type" => "datetime",
                        "_indexed" => true,
                        "_boost" => 0
                    ],
                    "message" => [
                        "_type" => "text",
                        "_indexed" => true,
                        "_boost" => 0.1
                    ]
                ]
            ]
        ]
    ],
    "types" => [
        "datetime" => [
            DateFormatTokenizer::class,
            DateSplitTokenizer::class
        ],
        "_default" => [
            LowerCaseTokenizer::class,
            WhiteSpaceTokenizer::class,
            TrimPunctuationTokenizer::class
        ]
    ]
];

Configuring the search engine

The "config" section

This section defines the engine's parameters, such as working directories.

  • var_dir: the root directory where the engine's files will be created.
  • index_dir: the subdirectory where the index will be built.
  • documents_dir: the subdirectory where every document will be stored.
  • cache_dir: the cache subdirectory. Make sure this subdirectory contains nothing other than the engine's cache files.
  • fuzzy_cost: the cost allowed by the fuzzy search's approximation function. The number represents how many characters the user can mistype; see the examples in the release notes for version 0.5.
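
As a rough illustration of the options above, the following sketch builds a "config" section for a command-line script, where $_SERVER['DOCUMENT_ROOT'] is typically empty. The directory layout and the fuzzy_cost value of 2 are assumptions chosen for the example, and the way each subdirectory is appended to var_dir is inferred from the default values rather than confirmed by this page.

```php
<?php
// Hypothetical layout: keep the engine's files in ./var next to this script.
$varDir = __DIR__ . DIRECTORY_SEPARATOR . "var";

$myConfiguration = [
    "config" => [
        "var_dir" => $varDir,
        // Subdirectories, written with a leading separator like the defaults.
        "index_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "index",
        "documents_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "documents",
        "cache_dir" => DIRECTORY_SEPARATOR . "engine" . DIRECTORY_SEPARATOR . "cache",
        // Allow up to two mistyped characters instead of the default of one.
        "fuzzy_cost" => 2
    ]
];

// Assumption: each working directory is var_dir followed by the subdirectory.
$indexPath = $myConfiguration["config"]["var_dir"] . $myConfiguration["config"]["index_dir"];
echo $indexPath, PHP_EOL;

// The array is then passed to the constructor, as shown at the top of this page:
// $engine = new Engine($myConfiguration);
```

All five keys are spelled out here so that the sketch does not depend on how the engine combines a partial array with its defaults.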

The "schemas" section

Defining a schema

This section defines every schema that you want to index in the engine. A schema can be as long and as deeply nested as you want.

The "schemas" section is an associative array with the document type name as key and the corresponding schema as value.

Inside a schema array, each field is listed with the field name as key and the field's configuration as value.

Finally, the configuration of a field is also an associative array. Here is the list of its parameters:

  • _type: the type of the field. This can be any value; you can customize the behavior of a type in the "types" section below. There are two special values, "list" and "array", described below.
  • _type.: the subtype of the field, required when the type is "list". It defines the type of the items in the list (e.g. '_type' => 'list', '_type.' => 'datetime' defines the field as a list of dates).
  • _indexed: boolean that tells the engine whether the field is included in the index. If set to false, the field is stored in the document but you will not be able to search in it.
  • _filterable: (coming soon) optional boolean (default: false) that makes the index filterable by the values of this field. Useful for searching within a specific category.
  • _boost: float value used to compute the score of a document. The more you boost a field, the more its values count toward the final score.
  • _array: special parameter, required if the type or the subtype of the field is "array". See below for more information about the special type "array".
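
To make the field parameters above more concrete, here is a minimal sketch of a "schemas" section for a made-up document type. The "product" type, its field names and the boost values are all hypothetical; only the parameter names come from this page.

```php
<?php
$myConfiguration = [
    "schemas" => [
        // Hypothetical document type name.
        "product" => [
            // Short, important field: boosted strongly.
            "name" => [
                "_type" => "string",
                "_indexed" => true,
                "_boost" => 8
            ],
            // Long free text: searchable, but counts less in the score.
            "description" => [
                "_type" => "text",
                "_indexed" => true,
                "_boost" => 0.5
            ],
            // Stored with the document but not searchable.
            "internal_note" => [
                "_type" => "text",
                "_indexed" => false,
                "_boost" => 0
            ],
            // Multivalued field: a list whose items are strings.
            "tags" => [
                "_type" => "list",
                "_type." => "string",
                "_indexed" => true,
                "_filterable" => true,
                "_boost" => 4
            ]
        ]
    ]
];

// List the fields that the engine would be asked to index.
$schema = $myConfiguration["schemas"]["product"];
$indexedFields = array_keys(array_filter($schema, function (array $field): bool {
    return $field["_indexed"] === true;
}));
echo implode(", ", $indexedFields), PHP_EOL;
```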

Special types 'list' and 'array'

These two types cannot be used in the index, so naming them in the "types" section has no effect.

list

The list type makes a field multivalued. Instead of holding one value, the field holds an array of values whose type is defined by the subtype key '_type.'.

Example:

$schema = [
    "categories" => [
        "_type" => "list",
        "_type." => "string",
        "_indexed" => true,
        "_filterable" => true,
        "_boost" => 6
    ]
];

array

The array type defines a subschema inside your field. You should use it as the subtype of a 'list'. When a field uses the array type, every parameter other than '_type' and '_type.' is ignored, and an '_array' parameter is required. This parameter contains another schema structure, which is nested inside the current schema.

You can look at the 'comments' field in the default schema above to see an example of an array.
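
As a smaller illustration of the same idea, the sketch below nests a subschema inside a list and checks a sample value against it. The "recipe" type and its fields are hypothetical, and the shape of the sample value (a list of associative arrays keyed by the subschema's field names) is an assumption, since this page does not describe the document format.

```php
<?php
$schemas = [
    // Hypothetical document type.
    "recipe" => [
        "title" => [
            "_type" => "string",
            "_indexed" => true,
            "_boost" => 10
        ],
        "steps" => [
            // A list whose items follow the subschema given in '_array'.
            "_type" => "list",
            "_type." => "array",
            "_array" => [
                "summary" => [
                    "_type" => "string",
                    "_indexed" => true,
                    "_boost" => 2
                ],
                "details" => [
                    "_type" => "text",
                    "_indexed" => true,
                    "_boost" => 0.5
                ]
            ]
        ]
    ]
];

// Assumed shape of a matching value for the "steps" field.
$steps = [
    ["summary" => "Mix", "details" => "Mix the flour and the eggs."],
    ["summary" => "Bake", "details" => "Bake for twenty minutes."]
];

// Collect any keys in the sample value that the subschema does not declare.
$subschema = $schemas["recipe"]["steps"]["_array"];
$unknownKeys = [];
foreach ($steps as $step) {
    $unknownKeys = array_merge($unknownKeys, array_keys(array_diff_key($step, $subschema)));
}
echo count($unknownKeys), PHP_EOL;
```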

The "types" section

This section lets you customize the tokenization of any type, that is, the slicing of values into small 'tokens' that help the engine find your documents easily. The special type "_default" defines the default tokenization: if a type used in your schema is not defined in this section, the "_default" type is used. There is also a special type named "datetime" that converts a value into a DateTime instance. Its tokenization is therefore different, as you can see in the default configuration above.

In this section, every type is an entry of an associative array where the key is the name of the type and the value is a list of TokenizerInterface classes. You can define your own tokenizers, or use the default ones, which you can retrieve here.

Special types

_default: fallback type used when a "_type" or "_type." is not known.
search: type used internally to tokenize search terms. If not configured, it falls back to "_default".
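
The type mappings described in this section can be sketched as follows. The "tag" type is hypothetical, and the choice of tokenizers for it is only a guess based on the class names: it skips WhiteSpaceTokenizer on the assumption that this keeps a multi-word value as a single token. The tokenizer classes are the ones named in the default configuration above. Their namespace is not given on this page, so the `use` statements that real code would need are left out.

```php
<?php
// In real code, import the tokenizer classes with `use` statements first.
// `::class` only builds the class name string, so this runs without them.
$types = [
    // Kept identical to the defaults shown above.
    "datetime" => [
        DateFormatTokenizer::class,
        DateSplitTokenizer::class
    ],
    "_default" => [
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
        TrimPunctuationTokenizer::class
    ],
    // Hypothetical custom type, usable as "_type" => "tag" in a schema.
    "tag" => [
        LowerCaseTokenizer::class,
        TrimPunctuationTokenizer::class
    ],
    // Tokenization applied to search terms; mirrors "_default" here.
    "search" => [
        LowerCaseTokenizer::class,
        WhiteSpaceTokenizer::class,
        TrimPunctuationTokenizer::class
    ]
];

$myConfiguration = ["types" => $types];
echo implode(", ", array_keys($myConfiguration["types"])), PHP_EOL;
```

The "datetime" and "_default" entries are repeated explicitly so that the sketch does not rely on how the engine treats a "types" section that omits them.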
