claf.data package

Submodules

class claf.data.collate.FeatLabelPadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]

Bases: claf.data.collate.PadCollator

Collator that applies padding and converts data to tensors, minimizing the amount of padding needed to produce a mini-batch.

FeatLabelPadCollator applies padding not only to features but also to labels.

  • Kwargs:
    cuda_device_id: CUDA device id to which tensors are assigned

    Default is None (CPU)

    skip_keys: keys to skip when converting to tensors

collate(datas, apply_pad=True, apply_pad_labels=(), apply_pad_values=())[source]
class claf.data.collate.PadCollator(cuda_device_id=None, pad_value=0, skip_keys=['text'])[source]

Bases: object

Collator that applies padding and converts data to tensors, minimizing the amount of padding needed to produce a mini-batch.

  • Kwargs:
    cuda_device_id: CUDA device id to which tensors are assigned

    Default is None (CPU)

    skip_keys: keys to skip when converting to tensors

collate(datas, apply_pad=True, pad_value=0)[source]
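
A minimal usage sketch for PadCollator. The feature names and the exact layout of datas below are hypothetical (they depend on the data reader), so this only illustrates the documented constructor and collate() signature:

    from claf.data.collate import PadCollator

    collator = PadCollator(cuda_device_id=None, pad_value=0, skip_keys=["text"])

    # hypothetical mini-batch: each key maps to variable-length token id lists
    datas = {
        "token_ids": [[3, 7, 2], [5, 1]],
        "text": ["hi how are", "bye now"],
    }

    batch = collator.collate(datas, apply_pad=True, pad_value=0)
    # expected: "token_ids" padded to the longest sequence in the mini-batch and
    # converted to a tensor (on CPU, since cuda_device_id is None);
    # "text" is left as-is because it is listed in skip_keys.
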
class claf.data.data_handler.CachePath[source]

Bases: object

DATASET = PosixPath('~/.claf_cache/dataset')
MACHINE = PosixPath('~/.claf_cache/machine')
PRETRAINED_VECTOR = PosixPath('~/.claf_cache/pretrained_vector')
ROOT = PosixPath('~/.claf_cache')
TOKEN_COUNTER = PosixPath('~/.claf_cache/token_counter')
VOCAB = PosixPath('~/.claf_cache/vocab')
class claf.data.data_handler.DataHandler(cache_path=PosixPath('~/.claf_cache'))[source]

Bases: object

DataHandler with CachePath

  • read (from_path, from_http)

  • dump (.msgpack or .pkl (pickle))

  • load

cache_token_counter(data_reader_config, tokenizer_name, obj=None)[source]
convert_cache_path(path)[source]
dump(file_path, obj, encoding='utf-8')[source]
load(file_path, encoding='utf-8')[source]
read(file_path, encoding='utf-8', return_path=False)[source]
read_embedding(file_path)[source]
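
A brief usage sketch based on the method signatures above. The file paths are hypothetical, and the serialization format for dump/load is assumed to follow the file extension (.msgpack or .pkl), as noted in the class description:

    from claf.data.data_handler import CachePath, DataHandler

    handler = DataHandler(cache_path=CachePath.ROOT)

    # dump and load a Python object (hypothetical path and contents)
    handler.dump("outputs/vocab.pkl", {"<pad>": 0, "<unk>": 1})
    vocab = handler.load("outputs/vocab.pkl")

    # read a local file (from_path) or a URL (from_http) as text
    raw_text = handler.read("data/train.json", encoding="utf-8")
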
claf.data.utils.get_is_head_of_word(naive_tokens, sequence_tokens)[source]

Return a list of flags indicating whether each token is the head (prefix) of a naively split token.

ex) naive_tokens: [“hello.”, “how”, “are”, “you?”]

sequence_tokens: [“hello”, “.”, “how”, “are”, “you”, “?”]

=> [1, 0, 1, 1, 1, 0]

  • Args:

    naive_tokens: a list of tokens, naively split by whitespace

    sequence_tokens: a list of tokens, split by ‘word_tokenizer’

  • Returns:
    is_head_of_word: a list with the same length as ‘sequence_tokens’; contains 1 if the token at that position is the head (prefix) of a naive_token, and 0 otherwise.
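
Restating the example above as a usage sketch:

    from claf.data.utils import get_is_head_of_word

    naive_tokens = ["hello.", "how", "are", "you?"]
    sequence_tokens = ["hello", ".", "how", "are", "you", "?"]

    flags = get_is_head_of_word(naive_tokens, sequence_tokens)
    # [1, 0, 1, 1, 1, 0]: "hello" and "." both come from "hello.",
    # so only "hello" is flagged as the head of that naive token.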

claf.data.utils.get_sequence_a(example)[source]
claf.data.utils.get_token_dim(tokens, dim=0)[source]
claf.data.utils.get_token_type(tokens)[source]
claf.data.utils.is_lazy(tokens)[source]
claf.data.utils.make_batch(features, labels)[source]
claf.data.utils.make_bert_input(sequence_a, sequence_b, bert_tokenizer, max_seq_length=128, data_type='train', cls_token='[CLS]', sep_token='[SEP]', input_type='bert')[source]
claf.data.utils.make_bert_token_type(bert_input_text, SEP_token='[SEP]')[source]
claf.data.utils.make_bert_token_types(bert_inputs, SEP_token='[SEP]')[source]

Make segment_ids for BERT inputs.

ex) [CLS] hi [SEP] he ##llo [SEP] => 0 0 0 1 1 1

  • Args:
    bert_inputs: feature dictionary consisting of
    • text: text from data_reader

    • token_name: text converted to corresponding token_type

  • Kwargs:

    SEP_token: SEP special token for BERT
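
A sketch of the segment-id behaviour, using the single-input variant make_bert_token_type. Treating bert_input_text as a list of word-piece tokens is an assumption; the expected output follows the example above:

    from claf.data.utils import make_bert_token_type

    # assumed input layout: word-piece tokens including the special tokens
    bert_input_text = ["[CLS]", "hi", "[SEP]", "he", "##llo", "[SEP]"]

    segment_ids = make_bert_token_type(bert_input_text, SEP_token="[SEP]")
    # expected: [0, 0, 0, 1, 1, 1]; everything up to and including the first [SEP]
    # belongs to segment 0, the rest to segment 1.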

claf.data.utils.padding_tokens(tokens, max_len=None, token_name=None, pad_value=0)[source]

Pad tokens according to the dimension of the tokens.
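
A small sketch, assuming a 2-D batch of token ids; the padded values shown are illustrative since the return container is not specified here:

    from claf.data.utils import padding_tokens

    token_ids = [[3, 7, 2], [5, 1]]
    padded = padding_tokens(token_ids, max_len=4, pad_value=0)
    # expected: every sequence padded to length 4 with pad_value,
    # e.g. [[3, 7, 2, 0], [5, 1, 0, 0]]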

claf.data.utils.sanity_check_iob(naive_tokens, tag_texts)[source]

Check if the IOB tags are valid.

  • Args:

    naive_tokens: tokens split by .split()

    tag_texts: list of tags in IOB format
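
A usage sketch with a hypothetical sentence. How the function reports an invalid sequence (return value vs. raised error) is not documented here, so the second call is only indicative:

    from claf.data.utils import sanity_check_iob

    naive_tokens = "Steve Jobs founded Apple".split()

    sanity_check_iob(naive_tokens, ["B-PER", "I-PER", "O", "B-ORG"])  # valid IOB tags

    # an I- tag that does not continue a preceding B-/I- tag of the same type
    # should be flagged as invalid
    sanity_check_iob(naive_tokens, ["I-PER", "I-PER", "O", "B-ORG"])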

claf.data.utils.transpose(list_of_dict, skip_keys=[])[source]
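
No description is given for transpose; judging from the parameter name list_of_dict, it presumably turns a list of feature dicts into a dict of lists. A hedged sketch:

    from claf.data.utils import transpose

    examples = [
        {"text": "hi", "label": 1},
        {"text": "bye", "label": 0},
    ]
    transposed = transpose(examples, skip_keys=[])
    # presumably: {"text": ["hi", "bye"], "label": [1, 0]}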

Module contents