claf.tokens package¶
Subpackages¶
Submodules¶
This code is from salesforce/cove (https://github.com/salesforce/cove/blob/master/cove/encoder.py)
-
class
claf.tokens.cove.
MTLSTM
(word_embedding, pretrained_path=None, requires_grad=False, residual_embeddings=False)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(inputs)[source]¶ A pretrained MT-LSTM (McCann et al. 2017). This LSTM was trained with 300d 840B GloVe on the WMT 2017 machine translation dataset.
- Arguments:
- inputs (Tensor): If MTLSTM handles embedding, a Long Tensor of size (batch_size, timesteps).
Otherwise, a Float Tensor of size (batch_size, timesteps, features).
- lengths (Long Tensor): (batch_size, lengths) lengths of each sequence, used for handling padding.
- hidden (Float Tensor): initial hidden state of the LSTM.
-
This code is from allenai/allennlp (https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py)
-
class
claf.tokens.elmo.
Elmo
(options_file: str, weight_file: str, num_output_representations: int, requires_grad: bool = False, do_layer_norm: bool = False, dropout: float = 0.5, vocab_to_cache: List[str] = None, module: torch.nn.modules.module.Module = None)[source]¶ Bases:
torch.nn.modules.module.Module
Compute ELMo representations using a pre-trained bidirectional language model. See “Deep contextualized word representations”, Peters et al. for details. This module takes character id input and computes
num_output_representations
different layers of ELMo representations. Typically num_output_representations
is 1 or 2. For example, in the case of the SRL model in the above paper, num_output_representations=1
where ELMo was included at the input token representation layer. In the case of the SQuAD model, num_output_representations=2
as ELMo was also included at the GRU output layer. In the implementation below, we learn separate scalar weights for each output layer, but only run the biLM once on each input sequence for efficiency.
Parameters
- options_file
str
, required. ELMo JSON options file
- weight_file
str
, required. ELMo hdf5 weight file
- num_output_representations:
int
, required. The number of ELMo representation layers to output.
- requires_grad:
bool
, optional If True, compute gradient of ELMo parameters for fine tuning.
- do_layer_norm
bool
, optional, (default=False). Should we apply layer normalization (passed to
ScalarMix
)?
- dropout
float
, optional, (default = 0.5). The dropout to be applied to the ELMo representations.
- vocab_to_cache
List[str]
, optional, (default = None). A list of words to pre-compute and cache character convolutions for. If you use this option, Elmo expects that you pass word indices of shape (batch_size, timesteps) to forward, instead of character indices. If you use this option and pass a word which wasn’t pre-cached, this will break.
- module
torch.nn.Module
, optional, (default = None). If provided, then use this module instead of the pre-trained ELMo biLM. If using this option, then pass
None
for both options_file
and weight_file
. The module must provide a public attribute num_layers
with the number of internal layers and its forward
method must return a dict
with activations
and mask
keys (see _ElmoBilm for an example). Note that requires_grad
is also ignored with this option.
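The idea of learning separate scalar weights per output representation while running the biLM only once can be illustrated with a small, dependency-free sketch. It assumes the usual scalar-mix form (softmax-normalised weights, scaled by a factor gamma) and ignores layer normalization and dropout; `scalar_mix` and the toy numbers are hypothetical and are not the library's ScalarMix code.

```python
import math
from typing import List, Sequence


def scalar_mix(layers: Sequence[Sequence[float]],
               weights: Sequence[float],
               gamma: float = 1.0) -> List[float]:
    """Return gamma * sum_k softmax(weights)[k] * layers[k] for one token."""
    # Softmax-normalise the scalar weights (shifted by the max for stability).
    shift = max(weights)
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(p * layer[i] for p, layer in zip(probs, layers))
            for i in range(dim)]


# One token, two biLM layers of dimension 2 (made-up activations).
bilm_layers = [[1.0, 2.0], [3.0, 4.0]]

# Two output representations: each has its own weights, the layers are shared,
# so the expensive biLM pass would only be needed once.
mixes = [scalar_mix(bilm_layers, w) for w in ([0.0, 0.0], [5.0, -5.0])]
```

With equal weights the mix is the plain average of the layers; with strongly skewed weights it approaches a single layer.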
-
forward
(inputs: torch.Tensor, word_inputs: torch.Tensor = None) → Dict[str, Union[torch.Tensor, List[torch.Tensor]]][source]¶
- inputs
torch.Tensor
, required. Shape (batch_size, timesteps, 50)
of character ids representing the current batch.
- word_inputs
torch.Tensor
, optional, (default = None). If you passed a cached vocab, you can in addition pass a tensor of shape
(batch_size, timesteps)
, which represents word ids that have been pre-cached.
Returns a dict with keys:
'elmo_representations'
: List[torch.Tensor]
A
num_output_representations
list of ELMo representations for the input sequence. Each representation has shape (batch_size, timesteps, embedding_dim)
'mask'
: torch.Tensor
Shape
(batch_size, timesteps)
long tensor with sequence mask.
-
class
claf.tokens.elmo.
ElmoLstm
(input_size: int, hidden_size: int, cell_size: int, num_layers: int, requires_grad: bool = False, recurrent_dropout_probability: float = 0.0, memory_cell_clip_value: Optional[float] = None, state_projection_clip_value: Optional[float] = None)[source]¶ Bases:
claf.modules.encoder.lstm_cell_with_projection._EncoderBase
A stacked, bidirectional LSTM which uses
LstmCellWithProjection
’s with highway layers between the inputs to layers. The inputs to the forward and backward directions are independent - forward and backward states are not concatenated between layers. Additionally, this LSTM maintains its own state, which is updated every time forward
is called. It is dynamically resized for different batch sizes and is designed for use with non-continuous inputs (i.e. inputs which aren’t formatted as a stream, such as text used for a language modelling task, which is how stateful RNNs are typically used). This is non-standard, but can be thought of as having an “end of sentence” state, which is carried across different sentences.
Parameters
- input_size
int
, required. The dimension of the inputs to the LSTM.
- hidden_size
int
, required The dimension of the outputs of the LSTM.
- cell_size
int
, required. The dimension of the memory cell of the
LstmCellWithProjection
.
- num_layers
int
, required The number of bidirectional LSTMs to use.
- requires_grad:
bool
, optional If True, compute gradient of ELMo parameters for fine tuning.
- recurrent_dropout_probability:
float
, optional (default = 0.0) The dropout probability to be used in a dropout scheme as stated in A Theoretically Grounded Application of Dropout in Recurrent Neural Networks .
- state_projection_clip_value:
float
, optional, (default = None) The magnitude with which to clip the hidden_state after projecting it.
- memory_cell_clip_value:
float
, optional, (default = None) The magnitude with which to clip the memory cell.
-
forward
(inputs: torch.Tensor, mask: torch.LongTensor) → torch.Tensor[source]¶
- inputs
torch.Tensor
, required. A Tensor of shape
(batch_size, sequence_length, hidden_size)
.
- mask
torch.LongTensor
, required. A binary mask of shape
(batch_size, sequence_length)
representing the non-padded elements in each sequence in the batch.
Returns a
torch.Tensor
of shape (num_layers, batch_size, sequence_length, hidden_size), where the num_layers dimension represents the LSTM output from that layer.
-
claf.tokens.elmo.
add_sentence_boundary_token_ids
(tensor: torch.Tensor, mask: torch.Tensor, sentence_begin_token: Any, sentence_end_token: Any) → Tuple[torch.Tensor, torch.Tensor][source]¶ Add begin/end of sentence tokens to the batch of sentences. Given a batch of sentences with size
(batch_size, timesteps)
or (batch_size, timesteps, dim)
this returns a tensor of shape (batch_size, timesteps + 2)
or (batch_size, timesteps + 2, dim)
respectively. Returns both the new tensor and updated mask.
Parameters
- tensor
torch.Tensor
A tensor of shape
(batch_size, timesteps)
or(batch_size, timesteps, dim)
- mask
torch.Tensor
A tensor of shape
(batch_size, timesteps)
- sentence_begin_token: Any (anything that can be broadcast in torch for assignment)
For 2D input, a scalar with the <S> id. For 3D input, a tensor with length dim.
- sentence_end_token: Any (anything that can be broadcast in torch for assignment)
For 2D input, a scalar with the </S> id. For 3D input, a tensor with length dim.
Returns
- tensor_with_boundary_tokens
torch.Tensor
The tensor with the appended and prepended boundary tokens. If the input was 2D, it has shape (batch_size, timesteps + 2) and if the input was 3D, it has shape (batch_size, timesteps + 2, dim).
- new_mask
torch.Tensor
The new mask for the tensor, taking into account the appended tokens marking the beginning and end of the sentence.
-
claf.tokens.elmo.
remove_sentence_boundaries
(tensor: torch.Tensor, mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Remove begin/end of sentence embeddings from the batch of sentences. Given a batch of sentences with size
(batch_size, timesteps, dim)
this returns a tensor of shape (batch_size, timesteps - 2, dim)
after removing the beginning and end sentence markers. The sentences are assumed to be padded on the right, with the beginning of each sentence assumed to occur at index 0 (i.e., mask[:, 0]
is assumed to be 1). Returns both the new tensor and updated mask. This function is the inverse of add_sentence_boundary_token_ids
.
Parameters
- tensor
torch.Tensor
A tensor of shape
(batch_size, timesteps, dim)
- mask
torch.Tensor
A tensor of shape
(batch_size, timesteps)
Returns
- tensor_without_boundary_tokens
torch.Tensor
The tensor after removing the boundary tokens of shape
(batch_size, timesteps - 2, dim)
- new_mask
torch.Tensor
The new mask for the tensor of shape
(batch_size, timesteps - 2)
.
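The shape bookkeeping of the two boundary helpers can be sketched with plain Python lists instead of tensors. This is only an illustration under stated assumptions: it handles the 2D id case for both directions (the documented removal function works on 3D embeddings), it assumes padding id 0 and right-padded rows, and `add_boundaries` / `remove_boundaries` are hypothetical names, not the library functions.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: padded positions hold id 0

Batch = List[List[int]]


def add_boundaries(batch: Batch, mask: Batch,
                   bos_id: int, eos_id: int) -> Tuple[Batch, Batch]:
    """(batch_size, timesteps) -> (batch_size, timesteps + 2)."""
    new_batch, new_mask = [], []
    for row, row_mask in zip(batch, mask):
        length = sum(row_mask)          # number of real tokens in this row
        padding = len(row) - length
        # The end marker goes right after the last real token, not at the
        # end of the padded row.
        new_batch.append([bos_id] + row[:length] + [eos_id] + [PAD_ID] * padding)
        new_mask.append([1] * (length + 2) + [0] * padding)
    return new_batch, new_mask


def remove_boundaries(batch: Batch, mask: Batch) -> Tuple[Batch, Batch]:
    """(batch_size, timesteps) -> (batch_size, timesteps - 2)."""
    new_batch, new_mask = [], []
    for row, row_mask in zip(batch, mask):
        length = sum(row_mask)          # includes the two boundary markers
        inner = row[1:max(length - 1, 1)]
        padding = len(row) - 2 - len(inner)
        new_batch.append(inner + [PAD_ID] * padding)
        new_mask.append([1] * len(inner) + [0] * padding)
    return new_batch, new_mask


ids = [[5, 6, 7], [8, 0, 0]]
ids_mask = [[1, 1, 1], [1, 0, 0]]
with_bounds, with_bounds_mask = add_boundaries(ids, ids_mask, bos_id=1, eos_id=2)
restored, restored_mask = remove_boundaries(with_bounds, with_bounds_mask)
```

The round trip returns the original batch and mask, which is what "inverse" means in the description above.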
Hangulpy.py Copyright (C) 2012 Ryan Rho, Hyunwoo Cho Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
claf.tokens.hangul.
compose
(chosung, joongsung, jongsung='')[source]¶ This function returns a Hangul letter by composing the specified chosung, joongsung, and jongsung.
@param chosung the initial consonant.
@param joongsung the medial vowel.
@param jongsung the terminal Hangul letter. This is optional if you do not need a jongsung.
-
claf.tokens.hangul.
decompose
(hangul_letter)[source]¶ This function returns letters by decomposing the specified Hangul letter.
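Composition and decomposition of Hangul syllables follow from the layout of the Unicode Hangul Syllables block, which starts at U+AC00 and is ordered by initial consonant (19), medial vowel (21) and final consonant (28, including "none"). The sketch below illustrates that arithmetic; `compose_sketch` and `decompose_sketch` are hypothetical names and may differ from how claf.tokens.hangul handles errors or return types.

```python
# Jamo tables in Unicode syllable order.
CHOSUNG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initials
JOONGSUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")        # 21 medials
JONGSUNG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

FIRST_SYLLABLE = 0xAC00  # code point of '가'
SYLLABLE_COUNT = len(CHOSUNG) * len(JOONGSUNG) * len(JONGSUNG)


def compose_sketch(chosung: str, joongsung: str, jongsung: str = "") -> str:
    """Combine initial, medial and optional final jamo into one syllable."""
    index = ((CHOSUNG.index(chosung) * len(JOONGSUNG)
              + JOONGSUNG.index(joongsung)) * len(JONGSUNG)
             + JONGSUNG.index(jongsung))
    return chr(FIRST_SYLLABLE + index)


def decompose_sketch(letter: str):
    """Split one precomposed syllable into (chosung, joongsung, jongsung)."""
    offset = ord(letter) - FIRST_SYLLABLE
    if not 0 <= offset < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {letter!r}")
    lead, rest = divmod(offset, len(JOONGSUNG) * len(JONGSUNG))
    vowel, tail = divmod(rest, len(JONGSUNG))
    return CHOSUNG[lead], JOONGSUNG[vowel], JONGSUNG[tail]
```

A missing jongsung is represented here by the empty string, matching the default in the documented signature.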
-
claf.tokens.hangul.
has_approximant
(letter)[source]¶ An approximant makes complex vowels, such as ones starting with y or w. In Korean there is a unique approximant eu (ㅡ) making ui (ㅢ), but ㅢ does not cause many irregularities.
-
claf.tokens.hangul.
ili
(word)[source]¶ Convert {가} or {이} to their correct respective particles automatically.
-
claf.tokens.hangul.
is_all_hangul
(phrase)[source]¶ Check whether the phrase contains only Hangul letters.
@param phrase a target string
@return True if the phrase only consists of Hangul. False otherwise.
-
claf.tokens.hangul.
is_hangul
(phrase)[source]¶ Check whether the phrase is Hangul. This method ignores white space, punctuation, and numbers.
@param phrase a target string
@return True if the phrase is Hangul. False otherwise.
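The difference between the strict and the lenient check can be sketched as follows. The sketch assumes that "Hangul" means precomposed syllables in U+AC00 to U+D7A3 and that the ignored characters are ASCII white space, punctuation and digits; the function names are hypothetical and the real module may draw these boundaries differently.

```python
import string

# Characters the lenient check skips (assumption: ASCII only).
_IGNORED = set(string.whitespace + string.punctuation + string.digits)


def is_all_hangul_sketch(phrase) -> bool:
    """True only if every character is a precomposed Hangul syllable."""
    # Note: an empty input yields True because all() of nothing is True.
    return all(0xAC00 <= ord(ch) <= 0xD7A3 for ch in phrase)


def is_hangul_sketch(phrase) -> bool:
    """Like the strict check, but skips white space, punctuation and digits."""
    kept = [ch for ch in phrase if ch not in _IGNORED]
    return is_all_hangul_sketch(kept)
```

For example, "한글, 123!" fails the strict check because of the comma, space, digits and exclamation mark, but passes the lenient one.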
-
class
claf.tokens.linguistic.
NER
[source]¶ Bases:
object
Named Entity Recognition
Models trained on the OntoNotes 5 corpus support the following entity types: (https://spacy.io/api/annotation#section-dependency-parsing)
-
classes
= ['NONE', 'PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']¶
-
-
class
claf.tokens.linguistic.
POSTag
[source]¶ Bases:
object
Universal POS tags, extended by spaCy (https://spacy.io/api/annotation#section-pos-tagging)
-
classes
= ['ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']¶
-
-
class
claf.tokens.text_handler.
TextHandler
(token_makers, lazy_indexing=True)[source]¶ Bases:
object
Text Handler
vocab and token_counter
raw_features -> indexed_features
raw_features -> tensor
- Args:
- token_makers: Dictionary consisting of
key: token_name
value: TokenMaker (claf.tokens.token_maker)
- Kwargs:
lazy_indexing: Apply Lazy Evaluation to text indexing
-
class
claf.tokens.token_maker.
TokenMaker
(token_type, tokenizer=None, indexer=None, embedding_fn=None, vocab_config=None)[source]¶ Bases:
object
Token Maker (Data Transfer Object)
Token Maker consists of Tokenizer, Indexer, Embedding and Vocab
- Kwargs:
tokenizer: Tokenizer (claf.tokens.tokenizer.base)
indexer: TokenIndexer (claf.tokens.indexer.base)
embedding_fn: wrapper function of TokenEmbedding (claf.tokens.embedding.base)
vocab_config: config dict of Vocab (claf.tokens.vocabulary)
-
BERT_TYPE
= 'bert'¶
-
CHAR_TYPE
= 'char'¶
-
COVE_TYPE
= 'cove'¶
-
ELMO_TYPE
= 'elmo'¶
-
EXACT_MATCH_TYPE
= 'exact_match'¶
-
FEATURE_TYPE
= 'feature'¶
-
FREQUENT_WORD_TYPE
= 'frequent_word'¶
-
LINGUISTIC_TYPE
= 'linguistic'¶
-
WORD_TYPE
= 'word'¶
-
property
embedding_fn
¶
-
property
indexer
¶
-
property
tokenizer
¶
-
property
vocab
¶
-
property
vocab_config
¶
-
class
claf.tokens.vocabulary.
Vocab
(token_name, pad_token=None, oov_token=None, start_token=None, end_token=None, cls_token=None, sep_token=None, min_count=None, max_vocab_size=None, frequent_count=None, pretrained_path=None, pretrained_token=None)[source]¶ Bases:
object
Vocabulary Class
Vocab consists of token_to_index and index_to_token.
- Args:
token_name: Token name (a Token and its Vocab have a one-to-one relationship)
- Kwargs:
pad_token: padding token value (e.g. <pad>)
oov_token: out-of-vocabulary token value (e.g. <unk>)
start_token: start token value (e.g. <s>, <bos>)
end_token: end token value (e.g. </s>, <eos>)
cls_token: CLS token value for BERT (e.g. [CLS])
sep_token: SEP token value for BERT (e.g. [SEP])
- min_count: minimum frequency count of a token.
when you define min_count, only tokens with a count bigger than min_count remain.
- max_vocab_size: maximum size of the vocabulary.
when you define max_vocab_size, tokens are selected according to frequency count.
- frequent_count: get frequent_count threshold_index.
(eg. frequent_count = 1000, threshold_index is the tokens that frequent_count is 999 index number.)
- pretrained_path: pretrained vocab file path
(format: A
B C D …)
-
DEFAULT_OOV_INDEX
= 1¶
-
DEFAULT_OOV_TOKEN
= '[UNK]'¶
-
DEFAULT_PAD_INDEX
= 0¶
-
DEFAULT_PAD_TOKEN
= '[PAD]'¶
-
PRETRAINED_ALL
= 'all'¶
-
PRETRAINED_INTERSECT
= 'intersect'¶
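How min_count and max_vocab_size shape the token_to_index and index_to_token mappings can be sketched with a frequency counter. This is a simplified illustration, not the Vocab implementation: `build_vocab_sketch` is a hypothetical helper, the strict "bigger than" comparison follows the wording above, and max_vocab_size is assumed to count only regular tokens (not the pad and OOV entries).

```python
from collections import Counter

PAD_TOKEN, OOV_TOKEN = "[PAD]", "[UNK]"   # DEFAULT_PAD_TOKEN / DEFAULT_OOV_TOKEN
PAD_INDEX, OOV_INDEX = 0, 1               # DEFAULT_PAD_INDEX / DEFAULT_OOV_INDEX


def build_vocab_sketch(tokens, min_count=None, max_vocab_size=None):
    """Return (token_to_index, index_to_token) built from a token stream."""
    # most_common() sorts by count; ties keep first-seen order.
    candidates = Counter(tokens).most_common()
    if min_count is not None:
        # Assumption: strictly greater, following the "bigger than" wording.
        candidates = [(tok, cnt) for tok, cnt in candidates if cnt > min_count]
    if max_vocab_size is not None:
        # Assumption: the limit applies to regular tokens only.
        candidates = candidates[:max_vocab_size]

    token_to_index = {PAD_TOKEN: PAD_INDEX, OOV_TOKEN: OOV_INDEX}
    for tok, _ in candidates:
        token_to_index.setdefault(tok, len(token_to_index))
    index_to_token = {idx: tok for tok, idx in token_to_index.items()}
    return token_to_index, index_to_token


corpus = "a b a c a b d".split()
by_count, _ = build_vocab_sketch(corpus, min_count=1)
by_size, by_size_reverse = build_vocab_sketch(corpus, max_vocab_size=1)
```

In this toy corpus, min_count=1 drops the tokens seen once, and max_vocab_size=1 keeps only the most frequent token.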
-
class
claf.tokens.vocabulary.
VocabDict
(oov_value)[source]¶ Bases:
collections.defaultdict
Vocab DefaultDict Class
- Kwargs:
oov_value: out-of-vocabulary token value (e.g. <unk>)
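A dictionary that falls back to a fixed out-of-vocabulary value can be sketched as a small defaultdict subclass. The version below is an assumption about the behaviour rather than the actual class: `VocabDictSketch` is a hypothetical name, and it chooses to return the fallback without storing the missing key, so lookups of unknown tokens do not grow the vocabulary.

```python
from collections import defaultdict


class VocabDictSketch(defaultdict):
    """Mapping that returns oov_value for keys that were never added."""

    def __init__(self, oov_value):
        super().__init__()
        self.oov_value = oov_value

    def __missing__(self, key):
        # Return the fallback without inserting the key (design assumption).
        return self.oov_value


token_to_index = VocabDictSketch(oov_value=1)
token_to_index["hello"] = 2
known = token_to_index["hello"]
unknown = token_to_index["never-seen"]
```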
Module contents¶
-
class
claf.tokens.
BertTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
BERT Token
Pre-training of Deep Bidirectional Transformers for Language Understanding
- example.
hello -> [‘[CLS]’, ‘he’, ‘##llo’, ‘[SEP]’] -> [1, 4, 7, 2] -> BERT -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: ELMoEmbedding (Language Modeling BiLSTM)
vocab: Vocab
-
class
claf.tokens.
CharTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Character Token
Character-level Convolutional Networks for Text Classification (https://arxiv.org/abs/1509.01626)
- example.
hello -> [‘h’, ‘e’, ‘l’, ‘l’, ‘o’] -> [2, 3, 4, 4, 5] -> CharCNN -> tensor
- consisting of
tokenizer: CharTokenizer
indexer: CharIndexer
embedding: CharEmbedding (CharCNN)
vocab: Vocab
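The tokenize-then-index part of the character pipeline in the example can be sketched in a few lines. This only covers the steps before the CharCNN; `index_chars_sketch` is a hypothetical helper that grows its mapping on the fly, whereas the real pipeline builds the vocabulary from data first. Indices 0 and 1 are assumed to be reserved for padding and out-of-vocabulary.

```python
def index_chars_sketch(word, char_to_index=None):
    """Split a word into characters and map each one to an integer id."""
    if char_to_index is None:
        # Assumption: 0 and 1 are reserved for the pad and OOV entries.
        char_to_index = {"[PAD]": 0, "[UNK]": 1}
    chars = list(word)                      # character-level tokenization
    for ch in chars:
        # New characters get the next free index, in order of appearance.
        char_to_index.setdefault(ch, len(char_to_index))
    return chars, [char_to_index[ch] for ch in chars]


chars, char_ids = index_chars_sketch("hello")
```

Repeated characters share an index, which is why the two occurrences of 'l' map to the same id in the example above.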
-
class
claf.tokens.
CoveTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
CoVe Token
Learned in Translation: Contextualized Word Vectors (McCann et. al. 2017) (https://github.com/salesforce/cove)
- example.
hello -> [‘hello’] -> [2] -> CoVe -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: CoveEmbedding (Machine Translation LSTM)
vocab: Vocab
-
class
claf.tokens.
ElmoTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
ELMo Token
Embedding from Language Modeling
Deep contextualized word representations (https://github.com/allenai/allennlp/blob/master/allennlp/modules/elmo.py)
- example.
hello -> [‘h’, ‘e’, ‘l’, ‘l’, ‘o’] -> [2, 3, 4, 4, 5] -> ELMo -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: ELMoEmbedding (Language Modeling BiLSTM)
vocab: Vocab
-
class
claf.tokens.
ExactMatchTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Exact Match Token (Sparse Feature)
Three simple binary features, indicating whether p_i can be exactly matched to one question word in q, either in its original, lowercase or lemma form.
- example.
c: i do, q: i -> [‘i’, ‘do’] -> [1, 0] -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: SparseFeature
vocab: Vocab
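The three binary features can be sketched as below. The lemmatizer here is a toy lookup table standing in for a real one, and `exact_match_features` with its return layout (one `[original, lowercase, lemma]` triple per context token) is a hypothetical illustration rather than the library's feature format.

```python
# Toy lemma table (assumption): a real pipeline would use a proper lemmatizer.
TOY_LEMMAS = {"did": "do", "does": "do", "doing": "do"}


def toy_lemma(token: str) -> str:
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def exact_match_features(context_tokens, question_tokens, lemma=toy_lemma):
    """For each context token, return [original, lowercase, lemma] match flags."""
    q_original = set(question_tokens)
    q_lower = {tok.lower() for tok in question_tokens}
    q_lemma = {lemma(tok) for tok in question_tokens}
    return [
        [int(tok in q_original), int(tok.lower() in q_lower), int(lemma(tok) in q_lemma)]
        for tok in context_tokens
    ]


# The example from the description: context "i do", question "i".
simple = exact_match_features(["i", "do"], ["i"])
# A case where the three features disagree.
mixed = exact_match_features(["She", "did"], ["do", "she"])
```

Taking only the first flag of each triple for the "i do" / "i" case gives the `[1, 0]` shown in the example.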
-
class
claf.tokens.
FeatureTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Feature Token
Does not use an embedding function; the indexed_feature is passed through as is.
- example.
hello world -> [‘hello’, ‘world’] -> [3, 5] -> tensor
- consisting of
tokenizer: Tokenizer (need to define unit)
indexer: WordIndexer
embedding: None
vocab: Vocab
-
class
claf.tokens.
FrequentWordTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Frequent-Tuning Word Token
Word token with pre-trained word embeddings kept fixed; only the N most frequent words are fine-tuned.
- example.
i do -> [‘i’, ‘do’] -> [1, 2] -> Embedding Matrix -> tensor
(fine-tuning only ‘do’)
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: FrequentTuningWordEmbedding
vocab: Vocab
-
class
claf.tokens.
LinguisticTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Exact Match Token (Sparse Feature)
Three simple binary features, indicating whether p_i can be exactly matched to one question word in q, either in its original, lowercase or lemma form.
- example.
c: i do, q: i -> [‘i’, ‘do’] -> [1, 0] -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: SparseFeature
vocab: Vocab
-
class
claf.tokens.
WordTokenMaker
(tokenizers, indexer_config, embedding_config, vocab_config)[source]¶ Bases:
claf.tokens.token_maker.TokenMaker
Word Token (default)
i do -> [‘i’, ‘do’] -> [1, 2] -> Embedding Matrix -> tensor
- consisting of
tokenizer: WordTokenizer
indexer: WordIndexer
embedding: WordEmbedding
vocab: Vocab