claf.tokens.tokenizer package

Submodules

class claf.tokens.tokenizer.base.Tokenizer(name, cache_name)[source]

Bases: object

Tokenizer Base Class

MAX_TO_KEEP_CACHE = 3
tokenize(text, unit='text')[source]
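
Every concrete tokenizer below is driven through this tokenize() entry point. A minimal usage sketch follows; SentTokenizer is used only because it is the simplest subclass to construct, and the sample text, the punkt data requirement, and the comment on caching are assumptions rather than documented behaviour.

    from claf.tokens.tokenizer import SentTokenizer

    # Any concrete subclass is called the same way (assumption: the punkt
    # backend relies on NLTK's punkt model being available locally).
    tokenizer = SentTokenizer("punkt")
    tokens = tokenizer.tokenize("CLaF ships several tokenizers. They share one interface.")

    # `unit` defaults to 'text'; passing it explicitly is equivalent here.
    tokens = tokenizer.tokenize("CLaF ships several tokenizers. They share one interface.", unit="text")

    # MAX_TO_KEEP_CACHE = 3 presumably bounds how many tokenization results
    # are kept in the tokenizer's cache.
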
class claf.tokens.tokenizer.char.CharTokenizer(name, word_tokenizer, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Character Tokenizer

text -> word tokens -> [char tokens]

  • Args:

    name: tokenizer name [character|decompose_ko]
    word_tokenizer: word tokenizer object
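
A hedged sketch of the character pipeline implied above (text -> word tokens -> [char tokens]); the sample text and the commented output shape are assumptions, and the punkt/treebank_en backends presumably require the corresponding NLTK data.

    from claf.tokens.tokenizer import CharTokenizer, SentTokenizer, WordTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer("treebank_en", sent_tokenizer)
    char_tokenizer = CharTokenizer("character", word_tokenizer)

    # Expected shape (assumption): one list of characters per word token,
    # e.g. [['H', 'i'], ['t', 'h', 'e', 'r', 'e'], ['!']]
    char_tokens = char_tokenizer.tokenize("Hi there!")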

class claf.tokens.tokenizer.pass_text.PassText[source]

Bases: object

Pass text through without tokenizing

tokenize(text)[source]
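
A trivial sketch: PassText presumably fills the tokenizer slot when the text should be forwarded untouched. Whether the result is the raw string or a single-element list is not documented here, so it is only hedged in the comment.

    from claf.tokens.tokenizer import PassText

    pass_text = PassText()
    result = pass_text.tokenize("leave this text as-is")
    # `result` is presumably the input passed through without tokenization.
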
class claf.tokens.tokenizer.sent.SentTokenizer(name, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Sentence Tokenizer

text -> [sent tokens]

  • Args:

    name: tokenizer name [punkt]
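
A short sketch focused on the output shape (text -> [sent tokens]); the sample text and the commented result are assumptions, and the punkt backend presumably relies on NLTK's punkt model.

    from claf.tokens.tokenizer import SentTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    sentences = sent_tokenizer.tokenize("First sentence. Second sentence.")
    # Expected shape (assumption): ['First sentence.', 'Second sentence.']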

class claf.tokens.tokenizer.subword.SubwordTokenizer(name, word_tokenizer, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Subword Tokenizer

text -> [word tokens] -> [[sub word tokens], …]

  • Args:

    name: tokenizer name [wordpiece]
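
A structural sketch of the subword chain (text -> [word tokens] -> [[sub word tokens], …]). The wordpiece backend presumably needs a vocabulary supplied through config, which is not documented on this page, so the bare constructor call and the commented output shape are assumptions.

    from claf.tokens.tokenizer import SentTokenizer, SubwordTokenizer, WordTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer("bert_basic", sent_tokenizer)
    subword_tokenizer = SubwordTokenizer("wordpiece", word_tokenizer)

    # Expected shape (assumption): one list of subword tokens per word token,
    # e.g. [['token', '##ization'], ['works']]
    subword_tokens = subword_tokenizer.tokenize("tokenization works")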

claf.tokens.tokenizer.utils.create_tokenizer_with_regex(nlp, split_regex)[source]
claf.tokens.tokenizer.utils.load_spacy_model_for_tokenizer(split_regex)[source]
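
A heavily hedged sketch of how these helpers presumably fit together, based only on their names and signatures: load_spacy_model_for_tokenizer(split_regex) is assumed to load a spaCy English model and wire it up via create_tokenizer_with_regex(nlp, split_regex). The regex argument type (compiled pattern vs. string) and the return value are assumptions.

    import re

    from claf.tokens.tokenizer.utils import load_spacy_model_for_tokenizer

    # Hypothetical extra-split pattern; the library builds its own via
    # make_split_regex_expression() (see WordTokenizer below).
    split_regex = re.compile(r"-")
    nlp = load_spacy_model_for_tokenizer(split_regex)  # presumably a spaCy pipeline
    tokens = [token.text for token in nlp("state-of-the-art")]
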
class claf.tokens.tokenizer.word.WordTokenizer(name, sent_tokenizer, config={}, split_with_regex=True)[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Word Tokenizer

  • Args:

    name: tokenizer name [treebank_en|spacy_en|mecab_ko|bert_basic]

  • Kwargs:

    flatten: return type as a flattened list
    split_with_regex: post-split action; further split tokens that the tokenizer cannot split on its own

make_split_regex_expression()[source]

Apply a small amount of extra splitting to the given tokens, in particular to avoid UNK tokens caused by contractions, quotations, or other forms of punctuation. This has not been rigorously benchmarked, but it does avoid some common UNKs observed in SQuAD/TriviaQA.
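
A sketch of word tokenization with the documented treebank_en backend and the split_with_regex behaviour described above; the sample text and the commented effect are assumptions (the treebank_en and punkt backends presumably rely on NLTK data).

    from claf.tokens.tokenizer import SentTokenizer, WordTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer(
        "treebank_en",
        sent_tokenizer,
        split_with_regex=True,  # post-split tokens the backend cannot split itself
    )
    words = word_tokenizer.tokenize("It's a state-of-the-art tokenizer.")
    # Expected effect (assumption): contractions and hyphenated forms are split
    # further, avoiding UNK tokens for words such as "state-of-the-art".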

Module contents

class claf.tokens.tokenizer.PassText[source]

Bases: object

Pass text through without tokenizing

tokenize(text)[source]
class claf.tokens.tokenizer.BPETokenizer(name, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

BPE (Byte-Pair Encoding) Tokenizer

text -> …

  • Args:

    name: tokenizer name [roberta]
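
A minimal sketch for the BPE tokenizer; the roberta backend presumably depends on pretrained vocabulary and merge files (likely supplied via config or fetched by the library), so the bare constructor call and the commented output shape are assumptions.

    from claf.tokens.tokenizer import BPETokenizer

    bpe_tokenizer = BPETokenizer("roberta")
    tokens = bpe_tokenizer.tokenize("CLaF tokenizes text with byte-pair encoding.")
    # Expected (assumption): a flat list of BPE subword tokens.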

class claf.tokens.tokenizer.CharTokenizer(name, word_tokenizer, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Character Tokenizer

text -> word tokens -> [char tokens]

  • Args:

    name: tokenizer name [character|decompose_ko]
    word_tokenizer: word tokenizer object

class claf.tokens.tokenizer.SubwordTokenizer(name, word_tokenizer, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Subword Tokenizer

text -> [word tokens] -> [[sub word tokens], …]

  • Args:

    name: tokenizer name [wordpiece]

class claf.tokens.tokenizer.WordTokenizer(name, sent_tokenizer, config={}, split_with_regex=True)[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Word Tokenizer

  • Args:

    name: tokenizer name [treebank_en|spacy_en|mecab_ko|bert_basic]

  • Kwargs:

    flatten: return type as a flattened list
    split_with_regex: post-split action; further split tokens that the tokenizer cannot split on its own

make_split_regex_expression()[source]

Apply a small amount of extra splitting to the given tokens, in particular to avoid UNK tokens caused by contractions, quotations, or other forms of punctuation. This has not been rigorously benchmarked, but it does avoid some common UNKs observed in SQuAD/TriviaQA.

class claf.tokens.tokenizer.SentTokenizer(name, config={})[source]

Bases: claf.tokens.tokenizer.base.Tokenizer

Sentence Tokenizer

text -> [sent tokens]

  • Args:

    name: tokenizer name [punkt]