claf.tokens.indexer package

Submodules

class claf.tokens.indexer.base.TokenIndexer(tokenizer)[source]

Bases: object

Token Indexer

Indexes tokens into integer ids (e.g. 'hi' -> 4)

index(token)[source]

indexing function

set_vocab(vocab)[source]
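
The base contract is small: an indexer wraps a tokenizer, receives a vocabulary via set_vocab(), and turns tokens into integer ids via index(). A minimal, self-contained sketch of that contract (the dict vocabulary and the str.split tokenizer are stand-ins, not claf's actual Vocab or tokenizer classes):

    class ToyTokenIndexer:
        """Stand-in with the same shape as claf.tokens.indexer.base.TokenIndexer."""

        def __init__(self, tokenizer):
            self.tokenizer = tokenizer
            self.vocab = None

        def set_vocab(self, vocab):
            self.vocab = vocab

        def index(self, text):
            # indexing tokens (e.g. 'hi' -> 4)
            return [self.vocab.get(token, self.vocab["<unk>"])
                    for token in self.tokenizer(text)]

    vocab = {"<unk>": 0, "hello": 3, "hi": 4}
    indexer = ToyTokenIndexer(str.split)
    indexer.set_vocab(vocab)
    print(indexer.index("hi hello"))  # [4, 3]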
class claf.tokens.indexer.bert_indexer.BertIndexer(tokenizer, do_tokenize=True)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Bert Token Indexer

  • Property

    vocab: Vocab (claf.tokens.vocabulary)

  • Args:

    tokenizer: SubwordTokenizer

  • Kwargs:

    lowercase: convert word tokens to lowercase
    insert_start: insert start_token at the beginning of the sequence
    insert_end: append end_token at the end of the sequence

index(text)[source]

indexing function
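
The distinguishing piece is the SubwordTokenizer: words are split into WordPiece-style subwords before lookup, and the sequence can be framed with start/end tokens. A rough sketch of that flow (the tiny vocabulary, the greedy splitter, and the '[CLS]'/'[SEP]' framing are illustrative assumptions, not claf's implementation):

    VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "play": 4, "##ing": 5}

    def wordpiece(word, vocab):
        # greedy longest-match-first subword split (WordPiece-style)
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    break
                end -= 1
            if end == start:        # no matching piece: whole word is unknown
                return ["[UNK]"]
            start = end
        return pieces

    def bert_index(words, vocab, lowercase=True):
        pieces = ["[CLS]"]          # assumed start token
        for word in ([w.lower() for w in words] if lowercase else words):
            pieces.extend(wordpiece(word, vocab))
        pieces.append("[SEP]")      # assumed end token
        return [vocab[p] for p in pieces]

    print(bert_index(["Playing"], VOCAB))  # [2, 4, 5, 3]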

class claf.tokens.indexer.char_indexer.CharIndexer(tokenizer, insert_char_start=None, insert_char_end=None)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Character Token Indexer

  • Property

    vocab: Vocab (claf.tokens.vocabulary)

  • Args:

    tokenizer: CharTokenizer

  • Kwargs:

    insert_char_start: start character to insert at the front (e.g. ['h', 'i'] -> ['<s>', 'h', 'i']). Default is None.

    insert_char_end: end character to append (e.g. ['h', 'i'] -> ['h', 'i', '</s>']). Default is None.

index(text)[source]

indexing function

index_token(chars)[source]
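
Concretely, each word becomes a list of character ids, optionally framed by the start/end markers described in the Kwargs above. A small sketch with a plain dict standing in for the character vocabulary (not claf's Vocab API):

    CHAR_VOCAB = {"<s>": 1, "</s>": 2, "h": 3, "i": 4}

    def char_index(word, insert_char_start=None, insert_char_end=None):
        chars = list(word)                   # CharTokenizer stand-in
        if insert_char_start is not None:
            chars = [insert_char_start] + chars
        if insert_char_end is not None:
            chars = chars + [insert_char_end]
        return [CHAR_VOCAB[c] for c in chars]

    print(char_index("hi"))                  # [3, 4]
    print(char_index("hi", "<s>", "</s>"))   # [1, 3, 4, 2]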

This code is adapted from allenai/allennlp (https://github.com/allenai/allennlp/blob/master/allennlp/data/token_indexers/elmo_indexer.py).

class claf.tokens.indexer.elmo_indexer.ELMoIndexer(tokenizer)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here as a special case of the existing character indexers.

BOS_TOKEN = '<S>'
EOS_TOKEN = '</S>'
beginning_of_sentence_character = 256
beginning_of_sentence_characters = [258, 256, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
beginning_of_word_character = 258
end_of_sentence_character = 257
end_of_sentence_characters = [258, 257, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
end_of_word_character = 259
index(text)[source]

indexing function

index_token(word)[source]
max_word_length = 50
padding_character = 260
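
The attributes above encode a fixed scheme: a word is rendered as its UTF-8 byte values, framed by the begin/end-of-word markers (258/259) and right-padded with 260 up to max_word_length = 50, while the reserved character ids 256/257 stand for the sentence boundary tokens '<S>' and '</S>'. A sketch that reproduces those attribute values (the real indexer, following allennlp, may apply an extra id offset that is omitted here):

    MAX_WORD_LENGTH = 50
    BOW, EOW, PAD = 258, 259, 260   # begin-of-word, end-of-word, padding

    def word_to_char_ids(word):
        # UTF-8 bytes framed by BOW/EOW, right-padded to a fixed length
        ids = [PAD] * MAX_WORD_LENGTH
        body = list(word.encode("utf-8"))[:MAX_WORD_LENGTH - 2]
        ids[0] = BOW
        ids[1:1 + len(body)] = body
        ids[1 + len(body)] = EOW
        return ids

    def boundary_char_ids(char_id):
        # sentence boundary tokens use a reserved id instead of UTF-8 bytes
        ids = [PAD] * MAX_WORD_LENGTH
        ids[0], ids[1], ids[2] = BOW, char_id, EOW
        return ids

    print(boundary_char_ids(256)[:4])  # [258, 256, 259, 260] -- beginning_of_sentence_characters
    print(word_to_char_ids("hi")[:5])  # [258, 104, 105, 259, 260]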
class claf.tokens.indexer.exact_match_indexer.ExactMatchIndexer(tokenizer, lower=True, lemma=True)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Exact Match Token Indexer

  • Property

    vocab: Vocab (claf.tokens.vocabulary)

  • Args:

    tokenizer: WordTokenizer

  • Kwargs:

    lower: add lowercase match feature (0 or 1). Default is True.
    lemma: add lemma match feature (0 or 1). Default is True.

index(text, query_text)[source]

indexing function

index_token(token, query_tokens)[source]
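
The resulting features are per-token binary flags marking whether a context token also occurs in the query, in original, lowercased, and lemmatized form. A toy sketch of that idea (the identity lemmatizer is a placeholder for a real one, e.g. from spaCy):

    def exact_match_features(token, query_tokens, lower=True, lemma=True,
                             lemmatize=lambda t: t):   # placeholder lemmatizer
        feats = [int(token in query_tokens)]           # original-form match
        if lower:
            feats.append(int(token.lower() in {q.lower() for q in query_tokens}))
        if lemma:
            feats.append(int(lemmatize(token) in {lemmatize(q) for q in query_tokens}))
        return feats

    print(exact_match_features("Paris", ["paris", "is", "nice"]))  # [0, 1, 0]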
class claf.tokens.indexer.linguistic_indexer.LinguisticIndexer(tokenizer, pos_tag=None, ner=None, dep=None)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Linguistic Token Indexer

  • Property

    vocab: Vocab (claf.tokens.vocabulary)

  • Args:

    tokenizer: WordTokenizer

  • Kwargs:

    pos_tag: POS Tagging
    ner: Named Entity Recognition
    dep: Dependency Parser

index(text)[source]

indexing function
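
Each enabled analysis (POS, NER, dependency) contributes one categorical id per token, looked up in its own tag vocabulary. A sketch assuming pre-tagged input (the (word, pos) pairs and the tag vocabulary are stand-ins; a real implementation would obtain the tags from an NLP pipeline):

    POS_VOCAB = {"<unk>": 0, "NOUN": 1, "VERB": 2, "DET": 3}

    def pos_index(tagged_tokens):
        # tagged_tokens: (word, pos) pairs from some POS tagger (assumed input)
        return [POS_VOCAB.get(pos, POS_VOCAB["<unk>"]) for _, pos in tagged_tokens]

    print(pos_index([("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]))  # [3, 1, 2]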

class claf.tokens.indexer.word_indexer.WordIndexer(tokenizer, do_tokenize=True, lowercase=False, insert_start=None, insert_end=None)[source]

Bases: claf.tokens.indexer.base.TokenIndexer

Word Token Indexer

  • Property

    vocab: Vocab (claf.tokens.vocabulary)

  • Args:

    tokenizer: WordTokenizer

  • Kwargs:

    lowercase: convert word tokens to lowercase
    insert_start: insert start_token at the beginning of the sequence
    insert_end: append end_token at the end of the sequence

index(text)[source]

indexing function
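
Compared with the subword indexer, lookup here is whole-word: optionally lowercase, map each word to its vocabulary id (falling back to an unknown id), then frame the sequence with start/end tokens if requested. A compact sketch (the dict vocabulary and the '<s>'/'</s>' token names are assumptions):

    VOCAB = {"<unk>": 0, "<s>": 1, "</s>": 2, "hi": 4, "there": 5}

    def word_index(words, lowercase=False, insert_start=None, insert_end=None):
        if lowercase:
            words = [w.lower() for w in words]
        ids = [VOCAB.get(w, VOCAB["<unk>"]) for w in words]
        if insert_start is not None:
            ids = [VOCAB[insert_start]] + ids
        if insert_end is not None:
            ids = ids + [VOCAB[insert_end]]
        return ids

    print(word_index(["Hi", "there"], lowercase=True,
                     insert_start="<s>", insert_end="</s>"))  # [1, 4, 5, 2]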

Module contents

The package re-exports the indexer classes from the submodules documented above, so they can be imported directly from claf.tokens.indexer:

class claf.tokens.indexer.BertIndexer(tokenizer, do_tokenize=True)[source]
class claf.tokens.indexer.CharIndexer(tokenizer, insert_char_start=None, insert_char_end=None)[source]
class claf.tokens.indexer.ELMoIndexer(tokenizer)[source]
class claf.tokens.indexer.ExactMatchIndexer(tokenizer, lower=True, lemma=True)[source]
class claf.tokens.indexer.LinguisticIndexer(tokenizer, pos_tag=None, ner=None, dep=None)[source]
class claf.tokens.indexer.WordIndexer(tokenizer, do_tokenize=True, lowercase=False, insert_start=None, insert_end=None)[source]

Their properties, arguments, and methods are identical to the corresponding submodule entries above.