claf.data package¶
Submodules¶
claf.data.collate module¶

class claf.data.collate.FeatLabelPadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]¶
Bases: claf.data.collate.PadCollator

Collator that applies padding and converts features to tensors, minimizing the amount of padding needed to produce a mini-batch. FeatLabelPadCollator applies padding not only to features but also to labels.

- Kwargs:
    - cuda_device_id: CUDA device id that tensors are assigned to. Default is None (CPU).
    - skip_keys: keys to skip when converting to tensors.
class claf.data.collate.PadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]¶
Bases: object

Collator that applies padding and converts features to tensors, minimizing the amount of padding needed to produce a mini-batch (see the sketch after the kwargs list).

- Kwargs:
    - cuda_device_id: CUDA device id that tensors are assigned to. Default is None (CPU).
    - skip_keys: keys to skip when converting to tensors.
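A minimal sketch of the padding strategy, using a hypothetical pad_collate helper (not the class's actual code; the real collator also handles feature dictionaries, labels, and skip_keys):

    import torch

    def pad_collate(token_batch, pad_value=0, cuda_device_id=None):
        # Pad each sequence only up to the longest sequence in this
        # mini-batch, not to a fixed global length; this is what
        # "minimizes amount of padding needed" means in practice.
        max_len = max(len(tokens) for tokens in token_batch)
        padded = [tokens + [pad_value] * (max_len - len(tokens))
                  for tokens in token_batch]
        tensor = torch.LongTensor(padded)
        if cuda_device_id is not None:
            tensor = tensor.cuda(cuda_device_id)  # move to the given device
        return tensor

    print(pad_collate([[2, 7, 9], [4, 5], [3]]))
    # tensor([[2, 7, 9],
    #         [4, 5, 0],
    #         [3, 0, 0]])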
claf.data.data_handler module¶

class claf.data.data_handler.CachePath[source]¶
Bases: object

- DATASET = PosixPath('/Users/Dongjun/.claf_cache/dataset')¶
- MACHINE = PosixPath('/Users/Dongjun/.claf_cache/machine')¶
- PRETRAINED_VECTOR = PosixPath('/Users/Dongjun/.claf_cache/pretrained_vector')¶
- ROOT = PosixPath('/Users/Dongjun/.claf_cache')¶
- TOKEN_COUNTER = PosixPath('/Users/Dongjun/.claf_cache/token_counter')¶
- VOCAB = PosixPath('/Users/Dongjun/.claf_cache/vocab')¶
class claf.data.data_handler.DataHandler(cache_path=PosixPath('/Users/Dongjun/.claf_cache'))[source]¶
Bases: object

DataHandler with CachePath. It supports:

- read (from_path, from_http)
- dump (.msgpack or .pkl (pickle)); a sketch of the extension dispatch follows this list
- load
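The dump behavior can be sketched as extension-based dispatch. This is a hypothetical stand-alone helper, not DataHandler's actual code; the real method name and signature may differ:

    import pickle
    from pathlib import Path

    import msgpack  # third-party package: pip install msgpack

    def dump(path, obj):
        # Choose the serializer from the file extension, mirroring the
        # ".msgpack or .pkl (pickle)" behavior described above.
        path = Path(path)
        if path.suffix == ".pkl":
            with path.open("wb") as f:
                pickle.dump(obj, f)
        elif path.suffix == ".msgpack":
            with path.open("wb") as f:
                f.write(msgpack.packb(obj))
        else:
            raise ValueError(f"Unsupported extension: {path.suffix}")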
claf.data.utils module¶

claf.data.utils.get_is_head_of_word(naive_tokens, sequence_tokens)[source]¶
Return a list of flags indicating whether each token is the head (prefix) of a naively split token.

- ex) naive_tokens: ["hello.", "how", "are", "you?"]
    sequence_tokens: ["hello", ".", "how", "are", "you", "?"]
    => [1, 0, 1, 1, 1, 0]
- Args:
    - naive_tokens: a list of tokens, naively split by whitespace
    - sequence_tokens: a list of tokens, split by 'word_tokenizer'
- Returns:
    - is_head_of_word: a list with the same length as 'sequence_tokens'. It has 1 if the tokenized word at that position is the head (prefix) of a naive token, and 0 otherwise.
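A hypothetical re-implementation for illustration, assuming the sequence tokens concatenate back exactly to the naive tokens (no subword markers such as '##'):

    def get_is_head_of_word(naive_tokens, sequence_tokens):
        flags = []
        naive_index, char_offset = 0, 0
        for token in sequence_tokens:
            # A token is a head iff it aligns with the start of the
            # current naive token; otherwise it continues that token.
            flags.append(1 if char_offset == 0 else 0)
            char_offset += len(token)
            if char_offset >= len(naive_tokens[naive_index]):
                naive_index += 1  # consumed the whole naive token
                char_offset = 0
        return flags

    assert get_is_head_of_word(
        ["hello.", "how", "are", "you?"],
        ["hello", ".", "how", "are", "you", "?"],
    ) == [1, 0, 1, 1, 1, 0]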
claf.data.utils.make_bert_input(sequence_a, sequence_b, bert_tokenizer, max_seq_length=128, data_type='train', cls_token='[CLS]', sep_token='[SEP]', input_type='bert')[source]¶
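The function is undocumented here; the sketch below shows the standard BERT input packing it presumably performs ([CLS] A [SEP] B [SEP], truncated to max_seq_length). The helper and its truncation policy are assumptions, not the function's actual code:

    def make_bert_input_sketch(tokens_a, tokens_b, max_seq_length=128,
                               cls_token="[CLS]", sep_token="[SEP]"):
        # Operates on already-tokenized sequences; the real function also
        # runs bert_tokenizer and handles data_type / input_type.
        tokens_a, tokens_b = list(tokens_a), list(tokens_b or [])
        # Reserve room for [CLS] and the trailing [SEP] token(s).
        budget = max_seq_length - 2 - (1 if tokens_b else 0)
        while len(tokens_a) + len(tokens_b) > budget:
            # Trim the longer sequence one token at a time.
            longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
            longer.pop()
        tokens = [cls_token] + tokens_a + [sep_token]
        if tokens_b:
            tokens += tokens_b + [sep_token]
        return tokens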
claf.data.utils.make_bert_token_types(bert_inputs, SEP_token='[SEP]')[source]¶
Make segment_ids for BERT inputs.

- ex) [CLS] hi [SEP] he ##llo [SEP] => 0 0 0 1 1 1
- Args:
    - bert_inputs: feature dictionary consisting of
        - text: text from data_reader
        - token_name: text converted to the corresponding token_type
- Kwargs:
    - SEP_token: the SEP special token for BERT
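The segment-id rule in the example above can be sketched as follows (hypothetical helper operating on a plain token list rather than the bert_inputs dictionary):

    def make_segment_ids(tokens, sep_token="[SEP]"):
        # Segment id is 0 up to and including the first [SEP],
        # and 1 for everything after it.
        segment_ids, segment = [], 0
        for token in tokens:
            segment_ids.append(segment)
            if token == sep_token:
                segment = 1
        return segment_ids

    assert make_segment_ids(
        ["[CLS]", "hi", "[SEP]", "he", "##llo", "[SEP]"]
    ) == [0, 0, 0, 1, 1, 1]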
claf.data.utils.padding_tokens(tokens, max_len=None, token_name=None, pad_value=0)[source]¶
Pad tokens according to the tokens' dimension (see the sketch below).
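A minimal sketch for the two-dimensional case (a batch of token-id lists). Handling of deeper nesting, such as character-level tokens, and of token_name is omitted, and the helper is an assumption rather than the function's actual code:

    def padding_tokens_sketch(tokens, max_len=None, pad_value=0):
        # Pad every sequence to max_len, or to the longest sequence
        # in the batch when max_len is None; truncate longer ones.
        if max_len is None:
            max_len = max(len(seq) for seq in tokens)
        return [seq[:max_len] + [pad_value] * (max_len - len(seq))
                for seq in tokens]

    assert padding_tokens_sketch([[1, 2, 3], [4]]) == [[1, 2, 3], [4, 0, 0]]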