# Tokens

TokenMakers consist of Tokenizer, Indexer, Vocabulary, and Embedding modules. A `TokenMaker` is responsible for indexing text and for generating tensors through its embedding module.

## Tokenizers

- Tokenizer Design

![images](../../images/tokenizers_design.png)

```
class SentTokenizer(name, config): ...
class WordTokenizer(name, sent_tokenizer, config): ...
class SubwordTokenizer(name, word_tokenizer, config): ...
class CharTokenizer(name, word_tokenizer, config): ...
```

Each tokenizer depends on the tokenizer of the next higher text unit, and its `tokenize()` function takes the unit of the input as an argument. (* unit: unit of the input, e.g. 'text', 'sentence' and 'word')

- `tokenize()` example

```
>>> text = "Hello World.This is tokenizer example code."
>>> word_tokenizer.tokenize(text, unit="text")  # text -> sentences -> words
['Hello', 'World', '.', 'This', 'is', 'tokenizer', 'example', 'code', '.']
>>> word_tokenizer.tokenize(text, unit="sentence")  # text -> words
['Hello', 'World.This', 'is', 'tokenizer', 'example', 'code', '.']
```
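To make the unit dependency concrete, below is a minimal sketch (not the library's actual implementation) of how a word-level tokenizer can delegate to a sentence-level tokenizer when `unit="text"`. The class names mirror the design figure above; the period/whitespace splitting rules are simplified placeholders, so the output differs from the treebank-style example.

```
# Illustrative sketch only: naive splitting rules stand in for real tokenizers.
class SentTokenizer:
    def tokenize(self, text):
        # naive sentence split on '. ' for illustration
        sentences = [s.strip() for s in text.split(". ") if s.strip()]
        return [s if s.endswith(".") else s + "." for s in sentences]

class WordTokenizer:
    def __init__(self, sent_tokenizer):
        self.sent_tokenizer = sent_tokenizer  # dependency on the higher unit

    def tokenize(self, text, unit="text"):
        if unit == "text":
            sentences = self.sent_tokenizer.tokenize(text)  # text -> sentences
        else:
            sentences = [text]  # input is already a single sentence
        words = []
        for sentence in sentences:
            words.extend(sentence.split())  # naive whitespace word split
        return words

word_tokenizer = WordTokenizer(sent_tokenizer=SentTokenizer())
print(word_tokenizer.tokenize("Hello World. This is an example.", unit="text"))
# -> ['Hello', 'World.', 'This', 'is', 'an', 'example.']
```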
-> ["ĠHello", "ĠWorld"] | | Char | All | **character** | Hello World
-> ["Hello", "World"]
-> [["H", "e", "l", "l", "o"], ["W", "o", "r", "l", "d"]] | | Char | Korean | [**jamo_ko**](https://github.com/rhobot/Hangulpy) | "안녕 세상"
-> ["안녕", "세상"]
-> [["ㅇ", "ㅏ", "ㄴ", "ㄴ", "ㅕ", "ㅇ"], ["ㅅ", "ㅔ", "ㅅ", "ㅏ", "ㅇ"]] | | Subword | All (but, need vocab.txt) | [**wordpiece**](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py) | "expectancy of anyone"
-> ["expectancy", "of", "anyone"]
-> ["expect", "##ancy", "of", "anyone"] | | Word | English | [**nltk_en**](http://www.nltk.org/api/nltk.tokenize.html) | - | | Word | English | [**spacy_en**](https://spacy.io/api/tokenizer) | - | | Word | Korean | [**mecab_ko**](https://bitbucket.org/eunjeon/mecab-ko) | - | | Word | All | **bert_basic** | - | | Word | All | **space_all** | "Hello World"
-> ["Hello", "World"] | | Sent | All | [**punkt**](http://www.nltk.org/api/nltk.tokenize.html) | "Hello World. This is punkt tokenizer."
-> ["Hello World.", "This is punkt tokenizer."] | ## Token Maker * The list of pre-defined `Token Maker`: | Type | Description | Category | Notes | | ---- | ---- | --- | --- | | **char** | character -> convolution -> maxpool | `CharCNN` | - | | **cove** | Embeddings from Neural Machine Translation | `NMT` | - From [Salesforce](https://github.com/salesforce/cove) | | **feature** | Do not use embedding function, just pass feature | `Feature` | - | | **word** | word -> Embedding (+pretrained) | `Word2Vec` | - | | **frequent_word** | word token + pre-trained word embeddings fixed and only fine-tune the N most frequent | `Word2Vec` + `Fine-tune` | - | | **exact_match** | Three simple binary features, indicating whether p_i can be exactly matched to one question word in q, either in its original, lowercase or lemma form. | `Feature` | - Sparse or Embedding
- Example of tokens in [BaseConfig](#baseconfig)

```
"token": {
    "names": ["char", "glove"],
    "types": ["char", "word"],
    "tokenizer": {  # Define the tokenizer for each unit.
        "char": {
            "name": "character"
        },
        "word": {
            "name": "treebank_en",
            "split_with_regex": true
        }
    },
    "char": {  # token_name
        "vocab": {
            "start_token": "",
            "end_token": "",
            "max_vocab_size": 260
        },
        "indexer": {
            "insert_char_start": true,
            "insert_char_end": true
        },
        "embedding": {
            "embed_dim": 16,
            "kernel_sizes": [5],
            "num_filter": 100,
            "activation": "relu",
            "dropout": 0.2
        }
    },
    "glove": {  # token_name
        "indexer": {
            "lowercase": true
        },
        "embedding": {
            "embed_dim": 100,
            "pretrained_path": "",
            "trainable": false,
            "dropout": 0.2
        }
    }
}

# Tokens process
#   Text -> Indexed Features -> Tensor -> TokenEmbedder -> Model
#
# Visualization
#   - Text: Hello World
#   - Indexed Feature: {'char': [[2, 3, 4, 4, 5], [6, 7, 8, 4, 9]], 'glove': [2, 3]}
#   - Tensor: {'char': tensor, 'glove': tensor}
#   - TokenEmbedder: [char; glove] (default: concatenate)
#   - Model: use embedded_value
```
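To make the process comments above concrete, here is a toy end-to-end walk-through, assuming PyTorch. The embedders, vocabulary sizes, and the mean-pooling of character vectors are stand-ins for brevity (a real setup would use the CharCNN sketch above and pretrained GloVe weights), not the library's actual modules.

```
import torch
import torch.nn as nn

# Text -> Indexed Features ("Hello World", as in the visualization above)
indexed = {"char": [[2, 3, 4, 4, 5], [6, 7, 8, 4, 9]], "glove": [2, 3]}

# Indexed Features -> Tensor (batch dimension added)
char_tensor = torch.tensor([indexed["char"]])    # (1, 2, 5)
glove_tensor = torch.tensor([indexed["glove"]])  # (1, 2)

# Tensor -> embedded values (toy embedders; vocab sizes are placeholders)
char_embedder = nn.Embedding(260, 100)
glove_embedder = nn.Embedding(10, 100)

char_embedded = char_embedder(char_tensor).mean(dim=2)  # (1, 2, 100), mean instead of a CharCNN
glove_embedded = glove_embedder(glove_tensor)           # (1, 2, 100)

# TokenEmbedder: [char; glove] (default: concatenate)
embedded_value = torch.cat([char_embedded, glove_embedded], dim=-1)  # (1, 2, 200)
print(embedded_value.shape)  # the model consumes this embedded_value
```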