claf.tokens.tokenizer package
Submodules
class claf.tokens.tokenizer.base.Tokenizer(name, cache_name)
    Bases: object

    Tokenizer Base Class

    MAX_TO_KEEP_CACHE = 3
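Every concrete tokenizer in this package derives from this base class. The sketch below is illustrative only: the constructor arguments follow the signature above, but the tokenize() entry point and anything about the internal caching are assumptions and should be checked against the claf source.

    from claf.tokens.tokenizer.base import Tokenizer

    class WhitespaceTokenizer(Tokenizer):
        """Hypothetical subclass, shown only to illustrate the base constructor."""

        def __init__(self):
            # Tokenizer(name, cache_name) as documented above
            super().__init__(name="whitespace", cache_name="whitespace")

        def tokenize(self, text):
            # assumed entry point; the real base class may expose a different hook
            return text.split()

    print(Tokenizer.MAX_TO_KEEP_CACHE)  # -> 3, as documented above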
class claf.tokens.tokenizer.char.CharTokenizer(name, word_tokenizer, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Character Tokenizer

    text -> word tokens -> [char tokens]

    Args:
        name: tokenizer name [character|decompose_ko]
        word_tokenizer: word tokenizer object
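A usage sketch following the composition implied by the constructor (text -> word tokens -> [char tokens]). The constructor arguments come from the signatures on this page; the tokenize() call and the sample output are assumptions, not taken from the claf source.

    from claf.tokens.tokenizer.sent import SentTokenizer
    from claf.tokens.tokenizer.word import WordTokenizer
    from claf.tokens.tokenizer.char import CharTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer("treebank_en", sent_tokenizer)
    char_tokenizer = CharTokenizer("character", word_tokenizer)

    # assumed call: each word token is expanded into its characters,
    # e.g. [['H', 'i'], ['t', 'h', 'e', 'r', 'e']]
    chars = char_tokenizer.tokenize("Hi there")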
class claf.tokens.tokenizer.sent.SentTokenizer(name, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Sentence Tokenizer

    text -> [sent tokens]

    Args:
        name: tokenizer name [punkt]
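A minimal sketch using the only documented name, punkt. The constructor argument follows the signature above; the tokenize() call is an assumption.

    from claf.tokens.tokenizer.sent import SentTokenizer

    sent_tokenizer = SentTokenizer("punkt")

    # assumed call: splits raw text into a list of sentence strings
    sentences = sent_tokenizer.tokenize("First sentence. Second sentence.")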
class claf.tokens.tokenizer.subword.SubwordTokenizer(name, word_tokenizer, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Subword Tokenizer

    text -> [word tokens] -> [[subword tokens], …]

    Args:
        name: tokenizer name [wordpiece]
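A sketch of the word -> subword pipeline. The wordpiece tokenizer will typically also need vocabulary settings passed through config; those keys are not documented on this page and are omitted here, and the tokenize() call is an assumption.

    from claf.tokens.tokenizer.sent import SentTokenizer
    from claf.tokens.tokenizer.word import WordTokenizer
    from claf.tokens.tokenizer.subword import SubwordTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer("bert_basic", sent_tokenizer)
    subword_tokenizer = SubwordTokenizer("wordpiece", word_tokenizer)

    # assumed call: each word token becomes a list of wordpiece subtokens
    subwords = subword_tokenizer.tokenize("unbelievable results")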
class claf.tokens.tokenizer.word.WordTokenizer(name, sent_tokenizer, config={}, split_with_regex=True)
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Word Tokenizer

    Args:
        name: tokenizer name [treebank_en|spacy_en|mecab_ko|bert_basic]

    Kwargs:
        flatten: return the tokens as a flattened list
        split_with_regex: post-tokenization step; splits tokens that the tokenizer cannot split on its own.

    make_split_regex_expression()
        Apply a small amount of extra splitting to the given tokens, in particular to avoid UNK tokens caused by contractions, quotations, or other punctuation. No rigorous tests have been run to measure how much difference this makes, but it avoids some common UNKs observed in SQuAD/TriviaQA.
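A sketch of building a word tokenizer on top of a sentence tokenizer, with the split_with_regex post-processing left at its default. Constructor arguments follow the signature above; the tokenize() call is an assumption.

    from claf.tokens.tokenizer.sent import SentTokenizer
    from claf.tokens.tokenizer.word import WordTokenizer

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer(
        "treebank_en",
        sent_tokenizer,
        split_with_regex=True,  # extra regex splitting, see make_split_regex_expression()
    )

    # assumed call
    tokens = word_tokenizer.tokenize("Don't stop believing.")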
Module contents
class claf.tokens.tokenizer.BPETokenizer(name, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    BPE (Byte-Pair Encoding) Tokenizer

    text -> …

    Args:
        name: tokenizer name [roberta]
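A sketch using the only documented name, roberta. The BPE vocabulary/merges files are presumably supplied through config; the keys are not documented here and are therefore omitted, and the tokenize() call is an assumption.

    from claf.tokens.tokenizer import BPETokenizer

    bpe_tokenizer = BPETokenizer("roberta")

    # assumed call: text -> byte-pair-encoded subtokens
    subtokens = bpe_tokenizer.tokenize("Byte-pair encoding example")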
class claf.tokens.tokenizer.CharTokenizer(name, word_tokenizer, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Character Tokenizer

    text -> word tokens -> [char tokens]

    Args:
        name: tokenizer name [character|decompose_ko]
        word_tokenizer: word tokenizer object
class claf.tokens.tokenizer.SubwordTokenizer(name, word_tokenizer, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Subword Tokenizer

    text -> [word tokens] -> [[subword tokens], …]

    Args:
        name: tokenizer name [wordpiece]
class claf.tokens.tokenizer.WordTokenizer(name, sent_tokenizer, config={}, split_with_regex=True)
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Word Tokenizer

    Args:
        name: tokenizer name [treebank_en|spacy_en|mecab_ko|bert_basic]

    Kwargs:
        flatten: return the tokens as a flattened list
        split_with_regex: post-tokenization step; splits tokens that the tokenizer cannot split on its own.

    make_split_regex_expression()
        Apply a small amount of extra splitting to the given tokens, in particular to avoid UNK tokens caused by contractions, quotations, or other punctuation. No rigorous tests have been run to measure how much difference this makes, but it avoids some common UNKs observed in SQuAD/TriviaQA.
class claf.tokens.tokenizer.SentTokenizer(name, config={})
    Bases: claf.tokens.tokenizer.base.Tokenizer

    Sentence Tokenizer

    text -> [sent tokens]

    Args:
        name: tokenizer name [punkt]
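Putting it together: the classes above are re-exported at the package level, so the full sentence -> word -> character/subword pipeline can be wired up directly from claf.tokens.tokenizer. Constructor arguments follow the documented signatures; the tokenize() calls are assumptions.

    from claf.tokens.tokenizer import (
        SentTokenizer,
        WordTokenizer,
        CharTokenizer,
        SubwordTokenizer,
    )

    sent_tokenizer = SentTokenizer("punkt")
    word_tokenizer = WordTokenizer("treebank_en", sent_tokenizer)
    char_tokenizer = CharTokenizer("character", word_tokenizer)
    subword_tokenizer = SubwordTokenizer("wordpiece", word_tokenizer)

    text = "CLaF layers its tokenizers."
    words = word_tokenizer.tokenize(text)        # assumed call
    chars = char_tokenizer.tokenize(text)        # assumed call
    subwords = subword_tokenizer.tokenize(text)  # assumed call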