claf.data package¶
Submodules¶
claf.data.collate module¶

class claf.data.collate.FeatLabelPadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]¶
Bases: claf.data.collate.PadCollator

Collator that applies padding and converts features to tensors, minimizing the amount of padding needed to produce a mini-batch. FeatLabelPadCollator applies padding not only to features but also to labels.

- Kwargs:
    - cuda_device_id: CUDA device id that tensors are assigned to. Default is None (CPU).
    - skip_keys: keys to skip when converting to tensors.
class claf.data.collate.PadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]¶
Bases: object

Collator that applies padding and converts features to tensors, minimizing the amount of padding needed to produce a mini-batch (see the sketch after the kwargs list).

- Kwargs:
    - cuda_device_id: CUDA device id that tensors are assigned to. Default is None (CPU).
    - skip_keys: keys to skip when converting to tensors.
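A minimal sketch of the padding strategy, using a hypothetical pad_collate helper (not the class's actual code; the real collator also handles feature dictionaries, labels, and skip_keys):

    import torch

    def pad_collate(token_batch, pad_value=0, cuda_device_id=None):
        # Pad each sequence only up to the longest sequence in this
        # mini-batch, not to a fixed global length; this is what
        # "minimizes amount of padding needed" means in practice.
        max_len = max(len(tokens) for tokens in token_batch)
        padded = [tokens + [pad_value] * (max_len - len(tokens))
                  for tokens in token_batch]
        tensor = torch.LongTensor(padded)
        if cuda_device_id is not None:
            tensor = tensor.cuda(cuda_device_id)  # move to the given device
        return tensor

    print(pad_collate([[2, 7, 9], [4, 5], [3]]))
    # tensor([[2, 7, 9],
    #         [4, 5, 0],
    #         [3, 0, 0]])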
claf.data.data_handler module¶

class claf.data.data_handler.CachePath[source]¶
Bases: object

- DATASET = PosixPath('/Users/Dongjun/.claf_cache/dataset')¶
- MACHINE = PosixPath('/Users/Dongjun/.claf_cache/machine')¶
- PRETRAINED_VECTOR = PosixPath('/Users/Dongjun/.claf_cache/pretrained_vector')¶
- ROOT = PosixPath('/Users/Dongjun/.claf_cache')¶
- TOKEN_COUNTER = PosixPath('/Users/Dongjun/.claf_cache/token_counter')¶
- VOCAB = PosixPath('/Users/Dongjun/.claf_cache/vocab')¶
class claf.data.data_handler.DataHandler(cache_path=PosixPath('/Users/Dongjun/.claf_cache'))[source]¶
Bases: object

DataHandler with CachePath. It supports:

- read (from_path, from_http)
- dump (.msgpack or .pkl (pickle)); a sketch of the extension dispatch follows this list
- load
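The dump behavior can be sketched as extension-based dispatch. This is a hypothetical stand-alone helper, not DataHandler's actual code; the real method name and signature may differ:

    import pickle
    from pathlib import Path

    import msgpack  # third-party package: pip install msgpack

    def dump(path, obj):
        # Choose the serializer from the file extension, mirroring the
        # ".msgpack or .pkl (pickle)" behavior described above.
        path = Path(path)
        if path.suffix == ".pkl":
            with path.open("wb") as f:
                pickle.dump(obj, f)
        elif path.suffix == ".msgpack":
            with path.open("wb") as f:
                f.write(msgpack.packb(obj))
        else:
            raise ValueError(f"Unsupported extension: {path.suffix}")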
claf.data.utils module¶

claf.data.utils.get_is_head_of_word(naive_tokens, sequence_tokens)[source]¶
Return a list of flags indicating whether each token is the head (prefix) of a naively split token.

- ex) naive_tokens: ["hello.", "how", "are", "you?"]
    sequence_tokens: ["hello", ".", "how", "are", "you", "?"]
    => [1, 0, 1, 1, 1, 0]
- Args:
    - naive_tokens: a list of tokens, naively split by whitespace
    - sequence_tokens: a list of tokens, split by 'word_tokenizer'
- Returns:
    - is_head_of_word: a list with the same length as 'sequence_tokens'. It has 1 if the tokenized word at that position is the head (prefix) of a naive token, and 0 otherwise.
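A hypothetical re-implementation for illustration, assuming the sequence tokens concatenate back exactly to the naive tokens (no subword markers such as '##'):

    def get_is_head_of_word(naive_tokens, sequence_tokens):
        flags = []
        naive_index, char_offset = 0, 0
        for token in sequence_tokens:
            # A token is a head iff it aligns with the start of the
            # current naive token; otherwise it continues that token.
            flags.append(1 if char_offset == 0 else 0)
            char_offset += len(token)
            if char_offset >= len(naive_tokens[naive_index]):
                naive_index += 1  # consumed the whole naive token
                char_offset = 0
        return flags

    assert get_is_head_of_word(
        ["hello.", "how", "are", "you?"],
        ["hello", ".", "how", "are", "you", "?"],
    ) == [1, 0, 1, 1, 1, 0]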
claf.data.utils.make_bert_input(sequence_a, sequence_b, bert_tokenizer, max_seq_length=128, data_type='train', cls_token='[CLS]', sep_token='[SEP]', input_type='bert')[source]¶
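The function is undocumented here; the sketch below shows the standard BERT input packing it presumably performs ([CLS] A [SEP] B [SEP], truncated to max_seq_length). The helper and its truncation policy are assumptions, not the function's actual code:

    def make_bert_input_sketch(tokens_a, tokens_b, max_seq_length=128,
                               cls_token="[CLS]", sep_token="[SEP]"):
        # Operates on already-tokenized sequences; the real function also
        # runs bert_tokenizer and handles data_type / input_type.
        tokens_a, tokens_b = list(tokens_a), list(tokens_b or [])
        # Reserve room for [CLS] and the trailing [SEP] token(s).
        budget = max_seq_length - 2 - (1 if tokens_b else 0)
        while len(tokens_a) + len(tokens_b) > budget:
            # Trim the longer sequence one token at a time.
            longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
            longer.pop()
        tokens = [cls_token] + tokens_a + [sep_token]
        if tokens_b:
            tokens += tokens_b + [sep_token]
        return tokens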
claf.data.utils.make_bert_token_types(bert_inputs, SEP_token='[SEP]')[source]¶
Make segment_ids for BERT inputs.

- ex) [CLS] hi [SEP] he ##llo [SEP] => 0 0 0 1 1 1
- Args:
    - bert_inputs: feature dictionary consisting of
        - text: text from data_reader
        - token_name: text converted to the corresponding token_type
- Kwargs:
    - SEP_token: the SEP special token for BERT
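The segment-id rule in the example above can be sketched as follows (hypothetical helper operating on a plain token list rather than the bert_inputs dictionary):

    def make_segment_ids(tokens, sep_token="[SEP]"):
        # Segment id is 0 up to and including the first [SEP],
        # and 1 for everything after it.
        segment_ids, segment = [], 0
        for token in tokens:
            segment_ids.append(segment)
            if token == sep_token:
                segment = 1
        return segment_ids

    assert make_segment_ids(
        ["[CLS]", "hi", "[SEP]", "he", "##llo", "[SEP]"]
    ) == [0, 0, 0, 1, 1, 1]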
claf.data.utils.padding_tokens(tokens, max_len=None, token_name=None, pad_value=0)[source]¶
Pad tokens according to the tokens' dimension (see the sketch below).
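A minimal sketch for the two-dimensional case (a batch of token-id lists). Handling of deeper nesting, such as character-level tokens, and of token_name is omitted, and the helper is an assumption rather than the function's actual code:

    def padding_tokens_sketch(tokens, max_len=None, pad_value=0):
        # Pad every sequence to max_len, or to the longest sequence
        # in the batch when max_len is None; truncate longer ones.
        if max_len is None:
            max_len = max(len(seq) for seq in tokens)
        return [seq[:max_len] + [pad_value] * (max_len - len(seq))
                for seq in tokens]

    assert padding_tokens_sketch([[1, 2, 3], [4]]) == [[1, 2, 3], [4, 0, 0]]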