external-ml.md

Alterra Slot Filler as a Service help

This document describes Alterra.ai's Slot Filler as a Service input and output formats.

The slot filler is used for labelling free-form natural language text with a set of predefined labels, like marking up cities, dates, or prices. It uses machine learning (namely, an artificial neural network) to train a model based on a relatively small corpus of labeled input sentences.

To use this service, you need to provide a set of labels (along with their types) as well as the labeled training corpus.

Please keep the number of labels small (no more than 50). The training corpus must contain at least 2,000 to 4,000 diverse natural language sentences.

Both the label file and the training corpus file use JSON-based formats.

Label file

Each label has a name and a type. If the type is ENUM, you must also specify its allowed values in this file.

Label file format

The file format is JSON. Its contents should be a single list of label data structures.

Label data structure

| Field name | Type | Required? | Description |
|---|---|---|---|
| name | string | Y | Label name, matching `[a-zA-Z._-]+` |
| type | string | Y | Type of the text marked by this label (see below) |
| params | dict | N | Type-specific parameters (see below) |

We accept the following values for the type field:

| Type | Description | Examples |
|---|---|---|
| CITY | A name of a city | San Francisco |
| DATE | A date or date range | Mar 15 - 20 |
| ENUM | One of the specified values | economy (fare class) |
| ANY | Any string | |

For the ENUM type, the params dict has these fields:

| Field | Type | Description |
|---|---|---|
| values | list of strings | List of possible values |

Example of a label file

[
  {
    "name": "place.from",
    "type": "CITY"
  },
  {
    "name": "place.to",
    "type": "CITY"
  },
  {
    "name": "date.depart",
    "type": "DATE"
  },
  {
    "name": "date.return",
    "type": "DATE"
  },
  {
    "name": "class",
    "type": "ENUM",
    "params": {
      "values": ["economy", "business", "first"]
    }
  },
  {
    "name": "airline",
    "type": "ANY"
  }
]
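
Before uploading a label file, it may help to check it locally against the rules above. The following is a minimal sketch, not part of the service: the function name `validate_labels` is hypothetical, the name pattern additionally allows dots because the example labels use them (an assumption), and requiring at least one ENUM value is also an assumption.

```python
import json
import re

# Assumption: dots are allowed in names, since the example labels use them.
NAME_RE = re.compile(r"[a-zA-Z._-]+")
KNOWN_TYPES = {"CITY", "DATE", "ENUM", "ANY"}
MAX_LABELS = 50  # the document asks for no more than 50 labels


def validate_labels(labels):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(labels, list):
        return ["top-level value must be a list of labels"]
    problems = []
    if len(labels) > MAX_LABELS:
        problems.append(f"too many labels: {len(labels)} > {MAX_LABELS}")
    for i, label in enumerate(labels):
        if not isinstance(label, dict):
            problems.append(f"label #{i}: must be a dict")
            continue
        name = label.get("name")
        if not isinstance(name, str) or not NAME_RE.fullmatch(name):
            problems.append(f"label #{i}: bad or missing name {name!r}")
        ltype = label.get("type")
        if not isinstance(ltype, str) or ltype not in KNOWN_TYPES:
            problems.append(f"label #{i}: unknown type {ltype!r}")
        if ltype == "ENUM":
            params = label.get("params")
            values = params.get("values") if isinstance(params, dict) else None
            # Assumption: an ENUM needs a non-empty list of string values.
            if (not isinstance(values, list) or not values
                    or not all(isinstance(v, str) for v in values)):
                problems.append(f"label #{i}: ENUM needs params.values")
    return problems


# Note: json.loads rejects trailing commas, so the file must be strict JSON.
labels = json.loads(
    '[{"name": "class", "type": "ENUM",'
    ' "params": {"values": ["economy", "business", "first"]}}]'
)
print(validate_labels(labels))
```

An empty list means no problems were found by these particular checks; the service itself may apply others.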

Training corpus file

The format of this file is JSON-based: each line of the file is a JSON-formatted list representing one labeled sentence.

Sentence format

A sentence is a JSON list of spans. Each span is a dict with the following fields:

| Name | Required? | Description |
|---|---|---|
| text | Y | Text of the span |
| label | N | Optional label for this span |

There is an implicit space between adjacent spans, so there is no need to include it in the text.

Example:

[{ "text": "i want an" },
 { "text": "economy", "label": "class" },
 { "text": "flight from" },
 { "text": "san francisco", "label": "place.from" },
 { "text": "to" },
 { "text": "chicago", "label": "place.to" },
 { "text": "tomorrow", "label": "date.depart" },
 { "text": "on" },
 { "text": "lufthansa", "label": "airline" }]
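
The span format can be handled with a few small helpers. This is a sketch under stated assumptions: the function names (`sentence_text`, `corpus_line`, `unknown_labels`) are hypothetical and not part of the service. It rebuilds the plain sentence using the implicit space between spans, serializes a sentence as a single line for the corpus file, and reports labels that the label file does not declare.

```python
import json


def sentence_text(spans):
    """Rebuild the plain sentence: adjacent spans are joined by one space."""
    return " ".join(span["text"] for span in spans)


def corpus_line(spans):
    """Serialize one sentence as a single line of JSON for the corpus file."""
    return json.dumps(spans, ensure_ascii=False)


def unknown_labels(spans, label_names):
    """Return labels used in the sentence that are not in label_names."""
    used = {span["label"] for span in spans if "label" in span}
    return sorted(used - set(label_names))


spans = [
    {"text": "i want an"},
    {"text": "economy", "label": "class"},
    {"text": "flight from"},
    {"text": "san francisco", "label": "place.from"},
]
print(sentence_text(spans))
print(corpus_line(spans))
print(unknown_labels(spans, ["class", "place.from"]))
```

Writing one `corpus_line(...)` result per line, each followed by a newline, yields a file in the one-sentence-per-line layout described above.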

Resulting service input/output formats

For the resulting service, the input is simply an English sentence (in plain UTF-8 text), and the output is in the same format as the training corpus, with one addition: each span can have an additional field named "canonical", which contains the postprocessed value extracted from that span. The canonical formats are as follows:

| Type | Example | Description |
|---|---|---|
| CITY | London, Greater London, UK | The format is `<CityName>, <Admin1>, <CountryCode>` |
| DATE | 2016-05-31--2016-06-11 | `<DATE1>--<DATE2>` (either date may be omitted); each date is in YYYY-MM-DD format |
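
On the client side, the canonical values can be split back into their parts. The sketch below is illustrative only: `parse_city` and `parse_date_range` are hypothetical helper names, the city parser assumes that none of the three parts contains a comma, and the date parser rejects a value without `--` because this document does not say how a single date is written.

```python
from datetime import date


def parse_city(canonical):
    """Split '<CityName>, <Admin1>, <CountryCode>' into its three parts."""
    parts = [part.strip() for part in canonical.split(",")]
    if len(parts) != 3:  # assumption: no commas inside the parts themselves
        raise ValueError(f"unexpected CITY value: {canonical!r}")
    return {"city": parts[0], "admin1": parts[1], "country": parts[2]}


def parse_date_range(canonical):
    """Parse '<DATE1>--<DATE2>'; an omitted date is returned as None."""
    start, sep, end = canonical.partition("--")
    if not sep:  # assumption: a value without '--' is treated as an error
        raise ValueError(f"unexpected DATE value: {canonical!r}")
    return (
        date.fromisoformat(start) if start else None,
        date.fromisoformat(end) if end else None,
    )


print(parse_city("London, Greater London, UK"))
print(parse_date_range("2016-05-31--2016-06-11"))
print(parse_date_range("--2016-06-11"))
```

Because each date uses single hyphens only, splitting on the first `--` is unambiguous even when the first date is omitted.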