claf.tokens package¶
Subpackages¶
Submodules¶
claf.tokens.cove module¶
This code is from salesforce/cove (https://github.com/salesforce/cove/blob/master/cove/encoder.py)
-
class claf.tokens.cove.MTLSTM(word_embedding, pretrained_path=None, requires_grad=False, residual_embeddings=False)[source]¶
Bases: torch.nn.modules.module.Module

forward(inputs)[source]¶
A pretrained MT-LSTM (McCann et al., 2017). This LSTM was trained with 300d 840B GloVe vectors on the WMT 2017 machine translation dataset.
- Arguments:
- inputs (Tensor): If MTLSTM handles embedding, a Long Tensor of size (batch_size, timesteps). Otherwise, a Float Tensor of size (batch_size, timesteps, features).
- lengths (Long Tensor): (batch_size,) lengths of each sequence, for handling padding.
- hidden (Float Tensor): initial hidden state of the LSTM.
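A minimal usage sketch (not from the claf docs): it assumes MTLSTM handles the embedding itself via a torch.nn.Embedding, and the weight path and vocabulary size below are placeholders:

    import torch
    from claf.tokens.cove import MTLSTM

    # Hypothetical 300d word embedding; CoVe was trained on top of 300d 840B GloVe vectors.
    word_embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=300)

    # pretrained_path is a placeholder for a downloaded MT-LSTM weight file.
    cove = MTLSTM(word_embedding, pretrained_path="wmtlstm-weights.pth", requires_grad=False)

    token_ids = torch.randint(0, 10000, (2, 7))  # (batch_size, timesteps) Long Tensor
    cove_vectors = cove(token_ids)               # contextualized CoVe vectors for the batch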
-
claf.tokens.elmo module¶
This code is from allenai/allennlp (https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py)
-
class claf.tokens.elmo.Elmo(options_file: str, weight_file: str, num_output_representations: int, requires_grad: bool = False, do_layer_norm: bool = False, dropout: float = 0.5, vocab_to_cache: List[str] = None, module: torch.nn.modules.module.Module = None)[source]¶
Bases: torch.nn.modules.module.Module

Compute ELMo representations using a pre-trained bidirectional language model. See "Deep contextualized word representations", Peters et al. for details.

This module takes character id input and computes num_output_representations different layers of ELMo representations. Typically num_output_representations is 1 or 2. For example, in the case of the SRL model in the above paper, num_output_representations=1 where ELMo was included at the input token representation layer. In the case of the SQuAD model, num_output_representations=2 as ELMo was also included at the GRU output layer. In the implementation below, we learn separate scalar weights for each output layer, but only run the biLM once on each input sequence for efficiency.

Parameters:
- options_file : str, required. ELMo JSON options file.
- weight_file : str, required. ELMo hdf5 weight file.
- num_output_representations : int, required. The number of ELMo representation layers to output.
- requires_grad : bool, optional. If True, compute gradient of ELMo parameters for fine-tuning.
- do_layer_norm : bool, optional, (default = False). Should we apply layer normalization (passed to ScalarMix)?
- dropout : float, optional, (default = 0.5). The dropout to be applied to the ELMo representations.
- vocab_to_cache : List[str], optional, (default = None). A list of words to pre-compute and cache character convolutions for. If you use this option, Elmo expects that you pass word indices of shape (batch_size, timesteps) to forward, instead of character indices. If you use this option and pass a word which wasn't pre-cached, this will break.
- module : torch.nn.Module, optional, (default = None). If provided, then use this module instead of the pre-trained ELMo biLM. If using this option, then pass None for both options_file and weight_file. The module must provide a public attribute num_layers with the number of internal layers, and its forward method must return a dict with 'activations' and 'mask' keys (see _ElmoBiLm for an example). Note that requires_grad is also ignored with this option.
-
forward(inputs: torch.Tensor, word_inputs: torch.Tensor = None) → Dict[str, Union[torch.Tensor, List[torch.Tensor]]][source]¶

Parameters:
- inputs : torch.Tensor, required. Shape (batch_size, timesteps, 50) of character ids representing the current batch.
- word_inputs : torch.Tensor, optional. If you passed a cached vocab, you can in addition pass a tensor of shape (batch_size, timesteps), which represents word ids that have been pre-cached.

Returns a dict with keys:
- 'elmo_representations' : List[torch.Tensor]. A num_output_representations-length list of ELMo representations for the input sequence. Each representation has shape (batch_size, timesteps, embedding_dim).
- 'mask' : torch.Tensor. Shape (batch_size, timesteps) long tensor with the sequence mask.
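A hedged example of calling Elmo with character ids; the options/weights file names are placeholders for the pre-trained biLM files, and the character ids are random stand-ins for a real character indexer's output:

    import torch
    from claf.tokens.elmo import Elmo

    # Placeholder paths for the pre-trained biLM options/weights described above.
    elmo = Elmo("elmo_options.json", "elmo_weights.hdf5",
                num_output_representations=2, dropout=0.5)

    # Character ids of shape (batch_size, timesteps, 50); real ids come from a character indexer.
    character_ids = torch.randint(1, 262, (2, 7, 50))

    output = elmo(character_ids)
    representations = output["elmo_representations"]  # list of 2 tensors, each (2, 7, embedding_dim)
    mask = output["mask"]                             # (2, 7) sequence mask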
-
class claf.tokens.elmo.ElmoLstm(input_size: int, hidden_size: int, cell_size: int, num_layers: int, requires_grad: bool = False, recurrent_dropout_probability: float = 0.0, memory_cell_clip_value: Optional[float] = None, state_projection_clip_value: Optional[float] = None)[source]¶
Bases: claf.modules.encoder.lstm_cell_with_projection._EncoderBase

A stacked, bidirectional LSTM which uses LstmCellWithProjection's with highway layers between the inputs to layers. The inputs to the forward and backward directions are independent - forward and backward states are not concatenated between layers. Additionally, this LSTM maintains its own state, which is updated every time forward is called. It is dynamically resized for different batch sizes and is designed for use with non-continuous inputs (i.e. inputs which aren't formatted as a stream, such as text used for a language modelling task, which is how stateful RNNs are typically used). This is non-standard, but can be thought of as having an "end of sentence" state, which is carried across different sentences.

Parameters:
- input_size : int, required. The dimension of the inputs to the LSTM.
- hidden_size : int, required. The dimension of the outputs of the LSTM.
- cell_size : int, required. The dimension of the memory cell of the LstmCellWithProjection.
- num_layers : int, required. The number of bidirectional LSTMs to use.
- requires_grad : bool, optional. If True, compute gradient of ELMo parameters for fine-tuning.
- recurrent_dropout_probability : float, optional, (default = 0.0). The dropout probability to be used in a dropout scheme as stated in "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks".
- state_projection_clip_value : float, optional, (default = None). The magnitude with which to clip the hidden state after projecting it.
- memory_cell_clip_value : float, optional, (default = None). The magnitude with which to clip the memory cell.
-
forward(inputs: torch.Tensor, mask: torch.LongTensor) → torch.Tensor[source]¶

Parameters:
- inputs : torch.Tensor, required. A Tensor of shape (batch_size, sequence_length, hidden_size).
- mask : torch.LongTensor, required. A binary mask of shape (batch_size, sequence_length) representing the non-padded elements in each sequence in the batch.

Returns:
A torch.Tensor of shape (num_layers, batch_size, sequence_length, hidden_size), where the num_layers dimension represents the LSTM output from that layer.
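A small illustrative sketch with toy dimensions (not the pre-trained biLM sizes); it only exercises the shapes described above:

    import torch
    from claf.tokens.elmo import ElmoLstm

    # Toy dimensions for illustration; the real ELMo biLM uses much larger sizes.
    lstm = ElmoLstm(input_size=16, hidden_size=16, cell_size=32, num_layers=2)

    inputs = torch.randn(2, 5, 16)              # (batch_size, sequence_length, input dim)
    mask = torch.ones(2, 5, dtype=torch.long)   # all positions are non-padded

    # Per the docs above: (num_layers, batch_size, sequence_length, hidden_size)
    stacked_outputs = lstm(inputs, mask)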
-
claf.tokens.elmo.add_sentence_boundary_token_ids(tensor: torch.Tensor, mask: torch.Tensor, sentence_begin_token: Any, sentence_end_token: Any) → Tuple[torch.Tensor, torch.Tensor][source]¶
Add begin/end of sentence tokens to the batch of sentences. Given a batch of sentences with size (batch_size, timesteps) or (batch_size, timesteps, dim), this returns a tensor of shape (batch_size, timesteps + 2) or (batch_size, timesteps + 2, dim) respectively. Returns both the new tensor and the updated mask.

Parameters:
- tensor : torch.Tensor. A tensor of shape (batch_size, timesteps) or (batch_size, timesteps, dim).
- mask : torch.Tensor. A tensor of shape (batch_size, timesteps).
- sentence_begin_token : Any (anything that can be broadcast in torch for assignment). For 2D input, a scalar with the <S> id. For 3D input, a tensor with length dim.
- sentence_end_token : Any (anything that can be broadcast in torch for assignment). For 2D input, a scalar with the </S> id. For 3D input, a tensor with length dim.

Returns:
- tensor_with_boundary_tokens : torch.Tensor. The tensor with the appended and prepended boundary tokens. If the input was 2D, it has shape (batch_size, timesteps + 2); if the input was 3D, it has shape (batch_size, timesteps + 2, dim).
- new_mask : torch.Tensor. The new mask for the tensor, taking into account the appended tokens marking the beginning and end of the sentence.
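For example, for 2D token-id input (the <S>/</S> ids below are arbitrary placeholders):

    import torch
    from claf.tokens.elmo import add_sentence_boundary_token_ids

    token_ids = torch.tensor([[5, 6, 7],
                              [8, 9, 0]])       # (batch_size=2, timesteps=3), 0 = padding
    mask = torch.tensor([[1, 1, 1],
                         [1, 1, 0]])

    bos_id, eos_id = 1, 2                       # placeholder <S> / </S> ids

    wrapped, wrapped_mask = add_sentence_boundary_token_ids(token_ids, mask, bos_id, eos_id)
    # wrapped:      shape (2, 5), each sequence now starts with bos_id and ends with eos_id
    # wrapped_mask: shape (2, 5)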
-
claf.tokens.elmo.remove_sentence_boundaries(tensor: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Remove begin/end of sentence embeddings from the batch of sentences. Given a batch of sentences with size (batch_size, timesteps, dim), this returns a tensor of shape (batch_size, timesteps - 2, dim) after removing the beginning and end sentence markers. The sentences are assumed to be padded on the right, with the beginning of each sentence assumed to occur at index 0 (i.e., mask[:, 0] is assumed to be 1). Returns both the new tensor and the updated mask. This function is the inverse of add_sentence_boundary_token_ids.

Parameters:
- tensor : torch.Tensor. A tensor of shape (batch_size, timesteps, dim).
- mask : torch.Tensor. A tensor of shape (batch_size, timesteps).

Returns:
- tensor_without_boundary_tokens : torch.Tensor. The tensor after removing the boundary tokens, of shape (batch_size, timesteps - 2, dim).
- new_mask : torch.Tensor. The new mask for the tensor, of shape (batch_size, timesteps - 2).
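And the inverse direction on 3D embeddings (shapes only; the values are random placeholders):

    import torch
    from claf.tokens.elmo import remove_sentence_boundaries

    embeddings = torch.randn(2, 5, 4)           # (batch_size, timesteps, dim), boundaries included
    mask = torch.tensor([[1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 0]])

    inner, inner_mask = remove_sentence_boundaries(embeddings, mask)
    # inner:      shape (2, 3, 4) with the <S>/</S> positions stripped
    # inner_mask: shape (2, 3)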
claf.tokens.hangul module¶
This code is from Hangulpy.py, Copyright (C) 2012 Ryan Rho, Hyunwoo Cho.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
claf.tokens.hangul.compose(chosung, joongsung, jongsung='')[source]¶
This function returns a Hangul letter by composing the specified chosung, joongsung, and jongsung.
@param chosung
@param joongsung
@param jongsung the terminal Hangul letter. This is optional if you do not need a jongsung.
-
claf.tokens.hangul.decompose(hangul_letter)[source]¶ This function returns letters by decomposing the specified Hangul letter.
-
claf.tokens.hangul.has_approximant(letter)[source]¶ An approximant makes complex vowels, such as ones starting with y or w. In Korean there is a unique approximant eu (ㅡ) making ui (ㅢ), but ㅢ does not cause many irregularities.
-
claf.tokens.hangul.ili(word)[source]¶ Convert {가} or {이} to their correct respective particles automagically.
-
claf.tokens.hangul.is_all_hangul(phrase)[source]¶ Check whether the phrase consists entirely of Hangul letters.
@param phrase a target string
@return True if the phrase only consists of Hangul, False otherwise.
-
claf.tokens.hangul.is_hangul(phrase)[source]¶ Check whether the phrase is Hangul. This method ignores white space, punctuation, and numbers.
@param phrase a target string
@return True if the phrase is Hangul, False otherwise.
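A small sketch of these helpers; the composed/decomposed jamo follow standard Hangul decomposition, while the exact return type of decompose (tuple vs. list) is an assumption:

    from claf.tokens import hangul

    letter = hangul.compose("ㅎ", "ㅏ", "ㄴ")   # -> '한'
    jamo = hangul.decompose("한")               # -> ('ㅎ', 'ㅏ', 'ㄴ'), assumed sequence of jamo

    print(hangul.is_hangul("한글 123!"))        # True: spaces, punctuation, and numbers are ignored
    print(hangul.is_all_hangul("한글 123!"))    # False: non-Hangul characters are present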
-
claf.tokens.linguistic module¶
class claf.tokens.linguistic.NER[source]¶
Bases: object

Named Entity Recognition
Models trained on the OntoNotes 5 corpus support the following entity types: (https://spacy.io/api/annotation#section-dependency-parsing)
-
classes= ['NONE', 'PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']¶
-
-
class claf.tokens.linguistic.POSTag[source]¶
Bases: object

Universal POS tags, extended by spaCy (https://spacy.io/api/annotation#section-pos-tagging)
-
classes= ['ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']¶
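Since classes is a plain Python list in both cases, a tag label can be mapped to a feature index with list.index (illustrative usage, not a dedicated claf API):

    from claf.tokens.linguistic import NER, POSTag

    person_index = NER.classes.index("PERSON")   # 1
    noun_index = POSTag.classes.index("NOUN")    # 8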
-
-
claf.tokens.text_handler module¶
class claf.tokens.text_handler.TextHandler(token_makers, lazy_indexing=True)[source]¶
Bases: object

Text Handler
vocab and token_counter
raw_features -> indexed_features
raw_features -> tensor
- Args:
- token_makers: Dictionary consisting of
key: token_name
value: TokenMaker (claf.tokens.token_maker)
- Kwargs:
lazy_indexing: Apply Lazy Evaluation to text indexing
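A minimal construction sketch, assuming a single 'word' TokenMaker; the None values and the vocab_config dict are placeholders for real components from claf.tokens.tokenizer, claf.tokens.indexer, and claf.tokens.embedding:

    from claf.tokens.text_handler import TextHandler
    from claf.tokens.token_maker import TokenMaker

    # None placeholders stand in for real tokenizer / indexer / embedding_fn components.
    token_makers = {
        "word": TokenMaker(
            TokenMaker.WORD_TYPE,
            tokenizer=None,
            indexer=None,
            embedding_fn=None,
            vocab_config={"token_name": "word"},
        ),
    }

    text_handler = TextHandler(token_makers, lazy_indexing=True)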
-
claf.tokens.token_maker module¶
class claf.tokens.token_maker.TokenMaker(token_type, tokenizer=None, indexer=None, embedding_fn=None, vocab_config=None)[source]¶
Bases: object

Token Maker (Data Transfer Object)
A Token Maker consists of a Tokenizer, Indexer, Embedding, and Vocab.
- Kwargs:
tokenizer: Tokenizer (claf.tokens.tokenizer.base)
indexer: TokenIndexer (claf.tokens.indexer.base)
embedding_fn: wrapper function of TokenEmbedding (claf.tokens.embedding.base)
vocab_config: config dict of Vocab (claf.tokens.vocabulary)
-
BERT_TYPE= 'bert'¶
-
CHAR_TYPE= 'char'¶
-
COVE_TYPE= 'cove'¶
-
ELMO_TYPE= 'elmo'¶
-
EXACT_MATCH_TYPE= 'exact_match'¶
-
FEATURE_TYPE= 'feature'¶
-
FREQUENT_WORD_TYPE= 'frequent_word'¶
-
LINGUISTIC_TYPE= 'linguistic'¶
-
WORD_TYPE= 'word'¶
-
property
embedding_fn¶
-
property
indexer¶
-
property
tokenizer¶
-
property
vocab¶
-
property
vocab_config¶
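The *_TYPE constants above are plain strings naming the supported token types, and the properties expose whatever was passed at construction time; a quick sketch:

    from claf.tokens.token_maker import TokenMaker

    print(TokenMaker.WORD_TYPE, TokenMaker.CHAR_TYPE, TokenMaker.ELMO_TYPE)  # word char elmo

    # TokenMaker is a data-transfer object: components are optional here and are
    # exposed back through the tokenizer / indexer / embedding_fn / vocab_config properties.
    maker = TokenMaker(TokenMaker.CHAR_TYPE, vocab_config={"token_name": "char"})
    print(maker.vocab_config)   # {'token_name': 'char'}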
-
claf.tokens.vocabulary module¶
class claf.tokens.vocabulary.Vocab(token_name, pad_token=None, oov_token=None, start_token=None, end_token=None, cls_token=None, sep_token=None, min_count=None, max_vocab_size=None, frequent_count=None, pretrained_path=None, pretrained_token=None)[source]¶
Bases: object

Vocabulary Class
Vocab consists of token_to_index and index_to_token.
- Args:
token_name: Token name (Token and Vocab have a one-to-one relationship)
- Kwargs:
pad_token: padding token value (e.g. <pad>)
oov_token: out-of-vocabulary token value (e.g. <unk>)
start_token: start token value (e.g. <s>, <bos>)
end_token: end token value (e.g. </s>, <eos>)
cls_token: CLS token value for BERT (e.g. [CLS])
sep_token: SEP token value for BERT (e.g. [SEP])
min_count: minimum frequency count for a token; when min_count is defined, only tokens with a count greater than min_count are kept.
max_vocab_size: the vocabulary's maximum size; when max_vocab_size is defined, tokens are selected according to their frequency count.
frequent_count: threshold count used to get the threshold_index (e.g. with frequent_count = 1000, threshold_index is the index of the token whose frequency count is 999).
pretrained_path: pretrained vocab file path (format: A B C D …)
-
DEFAULT_OOV_INDEX= 1¶
-
DEFAULT_OOV_TOKEN= '[UNK]'¶
-
DEFAULT_PAD_INDEX= 0¶
-
DEFAULT_PAD_TOKEN= '[PAD]'¶
-
PRETRAINED_ALL= 'all'¶
-
PRETRAINED_INTERSECT= 'intersect'¶
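A construction sketch using only the keyword arguments documented above (building the vocabulary from a token counter happens elsewhere in claf, so it is not shown here):

    from claf.tokens.vocabulary import Vocab

    word_vocab = Vocab(
        "word",                 # token_name
        pad_token="[PAD]",
        oov_token="[UNK]",
        min_count=2,            # keep only tokens seen more than min_count times
        max_vocab_size=50000,
    )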
-
class claf.tokens.vocabulary.VocabDict(oov_value)[source]¶
Bases: collections.defaultdict

Vocab DefaultDict Class
- Kwargs:
oov_value: out-of-vocabulary token value (e.g. <unk>)
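A sketch of the defaultdict behavior, assuming oov_value is returned for keys that are not in the dictionary:

    from claf.tokens.vocabulary import VocabDict

    token_to_index = VocabDict(oov_value=1)
    token_to_index["hello"] = 2

    print(token_to_index["hello"])     # 2
    print(token_to_index["unseen"])    # 1 (falls back to the out-of-vocabulary value)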
Module contents¶
-
class claf.tokens.BertTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

BERT Token
Pre-training of Deep Bidirectional Transformers for Language Understanding
- example.
hello -> [‘[CLS]’, ‘he’, ‘##llo’, ‘[SEP]’] -> [1, 4, 7, 2] -> BERT -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: BertEmbedding
vocab: Vocab
-
class claf.tokens.CharTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Character Token
Character-level Convolutional Networks for Text Classification (https://arxiv.org/abs/1509.01626)
- example.
hello -> [‘h’, ‘e’, ‘l’, ‘l’, ‘o’] -> [2, 3, 4, 4, 5] -> CharCNN -> tensor
- consisting of
tokenizer: CharTokenizer
indexer: CharIndexer
embedding: CharEmbedding (CharCNN)
vocab: Vocab
-
class claf.tokens.CoveTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

CoVe Token
Learned in Translation: Contextualized Word Vectors (McCann et al., 2017) (https://github.com/salesforce/cove)
- example.
hello -> [‘hello’] -> [2] -> CoVe -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: CoveEmbedding (Machine Translation LSTM)
vocab: Vocab
-
class claf.tokens.ElmoTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

ELMo Token (Embeddings from Language Models)
Deep contextualized word representations (https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py)
Deep contextualized word representations (https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py)
- example.
hello -> [‘h’, ‘e’, ‘l’, ‘l’, ‘o’] -> [2, 3, 4, 4, 5] -> ELMo -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: ELMoEmbedding (Language Modeling BiLSTM)
vocab: Vocab
-
class claf.tokens.ExactMatchTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Exact Match Token (Sparse Feature)
Three simple binary features, indicating whether p_i can be exactly matched to one question word in q, either in its original, lowercase or lemma form.
- example.
c: i do, q: i -> [‘i’, ‘do’] -> [1, 0] -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: SparseFeature
vocab: Vocab
-
class claf.tokens.FeatureTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Feature Token
Does not use an Embedding function; just passes the indexed_feature through.
- example.
hello -> [‘hello’, ‘world’] -> [3, 5] -> tensor
- consisting of
tokenizer: Tokenizer (need to define unit)
indexer: WordIndexer
embedding: None
vocab: Vocab
-
class claf.tokens.FrequentWordTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Frequent-Tuning Word Token
Word token with pre-trained word embeddings kept fixed; only the N most frequent words are fine-tuned.
- example.
i do -> [‘i’, ‘do’] -> [1, 2] -> Embedding Matrix -> tensor (fine-tuning only ‘do’)
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: FrequentTuningWordEmbedding
vocab: Vocab
-
class claf.tokens.LinguisticTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Linguistic Token (Sparse Feature)
Sparse linguistic features of each word, such as its POS tag and named-entity type (see claf.tokens.linguistic).
- example.
i do -> [‘i’, ‘do’] -> [POS/NER feature indices] -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: SparseFeature
vocab: Vocab
-
class claf.tokens.WordTokenMaker(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶
Bases: claf.tokens.token_maker.TokenMaker

Word Token (default)
- example.
i do -> [‘i’, ‘do’] -> [1, 2] -> Embedding Matrix -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: WordEmbedding
vocab: Vocab
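To make the word-token pipeline above concrete, here is an illustrative-only sketch in plain PyTorch, standing in for claf's WordTokenizer / WordIndexer / WordEmbedding components:

    import torch

    # i do -> ['i', 'do'] -> [1, 2] -> Embedding Matrix -> tensor
    token_to_index = {"<pad>": 0, "i": 1, "do": 2}

    tokens = "i do".split()                                          # ['i', 'do']
    indices = torch.tensor([[token_to_index[t] for t in tokens]])    # [[1, 2]]

    embedding_matrix = torch.nn.Embedding(len(token_to_index), 4)    # toy 4-dim word embedding
    word_vectors = embedding_matrix(indices)                         # shape (1, 2, 4)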