athena.data.text_featurizer
Text featurizer
Module Contents
Classes
Vocabulary | Vocabulary
EnglishVocabulary | English vocabulary separated by space
SentencePieceFeaturizer | SentencePieceFeaturizer using tensorflow-text api
TextTokenizer | TextTokenizer
TextFeaturizer | The main text featurizer interface
- class athena.data.text_featurizer.Vocabulary(vocab_file)
Vocabulary
- load_model(vocab_file)
Load the model.
- _default_unk_index()
- _default_unk_symbol()
- __len__()
- decode(ids)
Convert a list of ids to a sentence.
- decode_to_list(ids, ignored_id=[])
Convert a list of ids to a list of symbols.
- encode(sentence)
Convert a sentence to a list of ids, with special tokens added.
- __call__(inputs)
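The round trip described by `encode`, `decode`, and `decode_to_list` can be illustrated with a small stand-in. This is a minimal sketch, not Athena's implementation: `ToyVocabulary`, the in-memory symbol list, the character-level split, and the `<unk>` symbol are all assumptions made for illustration, and the sketch omits the special tokens that the `encode` docstring mentions. It only shows how unknown symbols might fall back to an unk index.

```python
from collections import defaultdict


class ToyVocabulary:
    """Hypothetical character-level stand-in, not athena's Vocabulary."""

    def __init__(self, symbols, unk_symbol="<unk>"):
        # Assumption: ids follow list order, with the unk symbol at id 0.
        self.itos = [unk_symbol] + list(symbols)
        self.unk_index = 0
        # Unknown symbols map to the unk index instead of raising KeyError.
        self.stoi = defaultdict(lambda: self.unk_index)
        for index, symbol in enumerate(self.itos):
            self.stoi[symbol] = index

    def __len__(self):
        return len(self.itos)

    def encode(self, sentence):
        """Convert a sentence to a list of ids, one per character."""
        return [self.stoi[char] for char in sentence]

    def decode_to_list(self, ids, ignored_id=()):
        """Convert ids to symbols, skipping any id found in ignored_id."""
        return [self.itos[i] for i in ids if i not in ignored_id]

    def decode(self, ids):
        """Convert a list of ids back to a sentence."""
        return "".join(self.decode_to_list(ids))


vocab = ToyVocabulary("abc")
ids = vocab.encode("cabz")  # "z" is not in the vocabulary
```

The sketch uses a tuple as the default for `ignored_id` to avoid sharing a mutable default between calls; the documented signature uses `ignored_id=[]`.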
- class athena.data.text_featurizer.EnglishVocabulary(vocab_file)
Bases: Vocabulary
English vocabulary separated by space
- decode(ids)
Convert a list of ids to a sentence.
- encode(sentence)
Convert a sentence to a list of ids, with special tokens added.
- decode_to_list(ids, ignored_id=[])
Convert a list of ids to a list of symbols.
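For a vocabulary whose symbols are separated by spaces, encoding and decoding operate on whole words rather than characters. The sketch below illustrates that idea under stated assumptions: `encode_words`, `decode_words`, and the small dictionaries are hypothetical helpers, not part of Athena, and no special tokens are added.

```python
def encode_words(sentence, word_to_id, unk_index=0):
    """Map each space-separated word to its id, using unk_index for unknowns."""
    return [word_to_id.get(word, unk_index) for word in sentence.split()]


def decode_words(ids, id_to_word, ignored_id=()):
    """Join the words for ids with single spaces, skipping ignored ids."""
    return " ".join(id_to_word[i] for i in ids if i not in ignored_id)


# Hypothetical three-entry vocabulary with the unk symbol at id 0.
word_to_id = {"<unk>": 0, "hello": 1, "world": 2}
id_to_word = {index: word for word, index in word_to_id.items()}

ids = encode_words("hello big world", word_to_id)  # "big" is unknown
text = decode_words(ids, id_to_word, ignored_id=[0])
```

Passing the unk index in `ignored_id` drops unknown words from the decoded sentence instead of printing the unk symbol.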
- class athena.data.text_featurizer.SentencePieceFeaturizer(spm_file)
SentencePieceFeaturizer using tensorflow-text api
- load_model(model_file)
Load the SentencePiece model.
- __len__()
- encode(sentence)
Convert a sentence to a list of ids using the SentencePiece model.
- decode(ids)
Convert a list of ids to a sentence.
- decode_to_list(ids, ignored_id=[])
- class athena.data.text_featurizer.TextTokenizer(text=None)
TextTokenizer
- load_model(text)
Load the model.
- save_vocab(vocab_file)
- load_csv(csv_file)
- __len__()
- encode(texts)
Convert a sentence to a list of ids, with special tokens added.
- decode(sequences)
Convert a list of ids to a sentence.
- decode_to_list(ids, ignored_id=[])
- class athena.data.text_featurizer.TextFeaturizer(config=None)
The main text featurizer interface
- property model_type
- Returns
the model type
- property unk_index
- Returns
the unk index
- Return type
int
- supported_model
- default_config
- load_model(model_file)
Load the model.
- delete_punct(tokens)
Delete punctuation tokens.
- __len__()
- encode(texts)
Convert a sentence to a list of ids, with special tokens added.
- decode(sequences)
Convert a list of ids to a sentence.
- decode_to_list(sequences, ignored_id=[])
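The reference does not say which tokens `delete_punct` treats as punctuation. As a rough illustration only, the sketch below assumes ASCII punctuation from Python's `string.punctuation`; `delete_punct_sketch` is a hypothetical helper, not Athena's method, and the real behavior may differ.

```python
import string


def delete_punct_sketch(tokens):
    """Drop empty tokens and tokens made up only of ASCII punctuation."""
    return [
        token for token in tokens
        if token and not all(char in string.punctuation for char in token)
    ]


tokens = ["hello", ",", "world", "!", "it's", "..."]
cleaned = delete_punct_sketch(tokens)
```

Tokens that mix letters and punctuation, such as `it's`, are kept; only tokens consisting entirely of punctuation are removed.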