claf.data package
class claf.data.collate.FeatLabelPadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])

    Bases: claf.data.collate.PadCollator

    Collator that applies padding and builds tensors, minimizing the amount
    of padding needed while producing a mini-batch. FeatLabelPadCollator
    applies padding not only to features but also to labels.

    Kwargs:
        cuda_device_id: CUDA device id to assign tensors to.
            Default is None (CPU).
        skip_keys: keys to skip when making tensors.
class claf.data.collate.PadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])

    Bases: object

    Collator that applies padding and builds tensors, minimizing the amount
    of padding needed while producing a mini-batch.

    Kwargs:
        cuda_device_id: CUDA device id to assign tensors to.
            Default is None (CPU).
        skip_keys: keys to skip when making tensors.
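The padding behavior described above can be sketched in plain Python. This is a simplified illustration, not the claf implementation: the real collator builds torch tensors and can move them to a CUDA device, while this sketch only pads nested lists and passes `skip_keys` entries through untouched.

```python
def pad_collate(features, pad_value=0, skip_keys=("text",)):
    """Pad each feature's rows to the longest row in the mini-batch.

    `features` maps a feature name to a list of token-id lists
    (one list per example in the batch).
    """
    collated = {}
    for key, rows in features.items():
        if key in skip_keys:
            # Keys like raw text are passed through, not tensorized.
            collated[key] = rows
            continue
        max_len = max(len(row) for row in rows)
        # Right-pad every row with pad_value up to the batch max length.
        collated[key] = [row + [pad_value] * (max_len - len(row)) for row in rows]
    return collated

batch = {
    "input_ids": [[5, 6, 7], [8, 9]],
    "text": ["hi there", "ok"],
}
padded = pad_collate(batch)
```

Padding to the per-batch maximum (rather than a global maximum) is what "minimizes the amount of padding needed" for each mini-batch.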
class claf.data.data_handler.CachePath

    Bases: object

    DATASET = PosixPath('/Users/Dongjun/.claf_cache/dataset')
    MACHINE = PosixPath('/Users/Dongjun/.claf_cache/machine')
    PRETRAINED_VECTOR = PosixPath('/Users/Dongjun/.claf_cache/pretrained_vector')
    ROOT = PosixPath('/Users/Dongjun/.claf_cache')
    TOKEN_COUNTER = PosixPath('/Users/Dongjun/.claf_cache/token_counter')
    VOCAB = PosixPath('/Users/Dongjun/.claf_cache/vocab')
class claf.data.data_handler.DataHandler(cache_path=PosixPath('/Users/Dongjun/.claf_cache'))

    Bases: object

    DataHandler with CachePath. Supports:
        read: from a local path or over HTTP
        dump: to .msgpack or .pkl (pickle)
        load: from a dumped file
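The dump/load cycle above can be sketched with a minimal pickle-only handler. This is a hypothetical stand-in, not the claf class: the real DataHandler also supports msgpack and reading over HTTP, and the class and method names below are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

class SimpleDataHandler:
    """Minimal sketch of a cache-backed dump/load handler (pickle only)."""

    def __init__(self, cache_path):
        self.cache_path = Path(cache_path)
        # Create the cache directory on first use, like a cache root would be.
        self.cache_path.mkdir(parents=True, exist_ok=True)

    def dump(self, name, obj):
        # Serialize an object into the cache directory.
        with open(self.cache_path / name, "wb") as f:
            pickle.dump(obj, f)

    def load(self, name):
        # Restore a previously dumped object.
        with open(self.cache_path / name, "rb") as f:
            return pickle.load(f)

# Round-trip a small vocabulary through a temporary cache directory.
handler = SimpleDataHandler(Path(tempfile.mkdtemp()) / "claf_cache")
handler.dump("vocab.pkl", {"hello": 0, "world": 1})
restored = handler.load("vocab.pkl")
```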
claf.data.utils.get_is_head_of_word(naive_tokens, sequence_tokens)

    Return a list of flags indicating whether each token is the head
    (prefix) of a naively split token.

    ex) naive_tokens: ["hello.", "how", "are", "you?"]
        sequence_tokens: ["hello", ".", "how", "are", "you", "?"]
        => [1, 0, 1, 1, 1, 0]

    Args:
        naive_tokens: a list of tokens, naively split by whitespace
        sequence_tokens: a list of tokens, split by 'word_tokenizer'

    Returns:
        is_head_of_word: a list with the same length as 'sequence_tokens';
            contains 1 if the token at that position is the head (prefix)
            of a naive_token, and 0 otherwise.
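The flag computation above can be sketched by walking the tokenizer output while tracking an offset into the current whitespace-split token. This is a simplified reconstruction from the documented example, assuming each sequence token is a contiguous slice of exactly one naive token.

```python
def get_is_head_of_word(naive_tokens, sequence_tokens):
    """Flag each sequence token: 1 if it starts a naive token, else 0."""
    flags = []
    naive_iter = iter(naive_tokens)
    current = next(naive_iter, "")
    offset = 0  # how many characters of `current` are already consumed
    for token in sequence_tokens:
        # A token is a head iff it begins at offset 0 of a naive token.
        flags.append(1 if offset == 0 else 0)
        offset += len(token)
        if offset >= len(current):
            # Finished the current naive token; advance to the next one.
            current = next(naive_iter, "")
            offset = 0
    return flags
```

Running it on the documented example reproduces `[1, 0, 1, 1, 1, 0]`.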
claf.data.utils.make_bert_input(sequence_a, sequence_b, bert_tokenizer, max_seq_length=128, data_type='train', cls_token='[CLS]', sep_token='[SEP]', input_type='bert')
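The entry above documents only the signature, so the following is a hedged sketch of the standard BERT input layout the parameters suggest: `[CLS] A [SEP] B [SEP]`, truncated to `max_seq_length`. It is not the claf implementation (which also takes a `bert_tokenizer`, `data_type`, and `input_type`); it assumes pre-tokenized inputs.

```python
def make_bert_input_sketch(seq_a_tokens, seq_b_tokens=None, max_seq_length=128,
                           cls_token="[CLS]", sep_token="[SEP]"):
    """Join one or two token sequences in the conventional BERT layout."""
    tokens = [cls_token] + list(seq_a_tokens) + [sep_token]
    if seq_b_tokens:
        tokens += list(seq_b_tokens) + [sep_token]
    # Hard truncation to the maximum sequence length.
    return tokens[:max_seq_length]

pair_input = make_bert_input_sketch(["hi"], ["he", "##llo"])
```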
claf.data.utils.make_bert_token_types(bert_inputs, SEP_token='[SEP]')

    Make segment_ids for BERT inputs.

    ex) [CLS] hi [SEP] he ##llo [SEP] => 0 0 0 1 1 1

    Args:
        bert_inputs: feature dictionary consisting of
            text: text from data_reader
            token_name: text converted to the corresponding token_type

    Kwargs:
        SEP_token: the SEP special token for BERT
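The segment-id rule in the example above can be sketched for a single token sequence: every position up to and including the first `[SEP]` belongs to segment 0, everything after it to segment 1. A simplified illustration over a token list, not the claf function (which operates on a feature dictionary):

```python
def make_segment_ids(tokens, sep_token="[SEP]"):
    """Assign BERT token_type ids: 0 through the first SEP, then 1."""
    segment_ids = []
    segment = 0
    for token in tokens:
        segment_ids.append(segment)
        if token == sep_token and segment == 0:
            # Tokens after the first [SEP] belong to the second segment.
            segment = 1
    return segment_ids

ids = make_segment_ids(["[CLS]", "hi", "[SEP]", "he", "##llo", "[SEP]"])
```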
claf.data.utils.padding_tokens(tokens, max_len=None, token_name=None, pad_value=0)

    Pad tokens according to the tokens' dimension.
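A sketch of the padding step for the two-dimensional case (a batch of token-id lists); the claf helper also dispatches on the tokens' dimension (e.g. character-level 3-D inputs), which this simplified version omits.

```python
def padding_tokens_2d(tokens, max_len=None, pad_value=0):
    """Right-pad each row of a 2-D token batch to a common length."""
    if max_len is None:
        # Default to the longest row in the batch.
        max_len = max(len(row) for row in tokens)
    return [row + [pad_value] * (max_len - len(row)) for row in tokens]

padded_rows = padding_tokens_2d([[1, 2, 3], [4]])
```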