athena.data.text_featurizer

Text featurizer

Module Contents

Classes

Vocabulary

Vocabulary

EnglishVocabulary

English vocabulary with words separated by spaces

SentencePieceFeaturizer

SentencePieceFeaturizer using the tensorflow-text API

TextTokenizer

TextTokenizer

TextFeaturizer

The main text featurizer interface

class athena.data.text_featurizer.Vocabulary(vocab_file)

Vocabulary

load_model(vocab_file)

load model

_default_unk_index()
_default_unk_symbol()
__len__()
decode(ids)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])

convert a list of ids to a list of symbols

encode(sentence)

convert a sentence to a list of ids, with special tokens added.

__call__(inputs)
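The methods above describe a symbol-to-id mapping with an unknown-token fallback. The following is a minimal illustrative sketch of that interface (encode/decode/decode_to_list/__len__); the class name `ToyVocabulary` and its internals are assumptions for illustration, not the athena implementation, which loads its mapping from a vocab file.

```python
# Illustrative sketch only: a minimal vocabulary with the documented
# interface. Not the athena implementation.

class ToyVocabulary:
    """Maps symbols to integer ids, with an <unk> fallback."""

    def __init__(self, symbols):
        self.unk_symbol = "<unk>"
        self.stoi = {self.unk_symbol: 0}  # symbol -> id
        for sym in symbols:
            self.stoi.setdefault(sym, len(self.stoi))
        self.itos = {i: s for s, i in self.stoi.items()}  # id -> symbol
        self.unk_index = 0

    def __len__(self):
        return len(self.stoi)

    def encode(self, sentence):
        # per-character encoding; unknown characters map to <unk>
        return [self.stoi.get(ch, self.unk_index) for ch in sentence]

    def decode(self, ids):
        # convert a list of ids back to a sentence
        return "".join(self.itos[i] for i in ids)

    def decode_to_list(self, ids, ignored_id=()):
        # convert ids to a list of symbols, skipping ignored ids
        return [self.itos[i] for i in ids if i not in ignored_id]

vocab = ToyVocabulary("abc")
ids = vocab.encode("abcx")  # 'x' is out of vocabulary
print(ids)                  # → [1, 2, 3, 0]
print(vocab.decode_to_list(ids, ignored_id=[0]))  # → ['a', 'b', 'c']
```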
class athena.data.text_featurizer.EnglishVocabulary(vocab_file)

Bases: Vocabulary

English vocabulary with words separated by spaces

decode(ids)

convert a list of ids to a sentence.

encode(sentence)

convert a sentence to a list of ids, with special tokens added.

decode_to_list(ids, ignored_id=[])

convert a list of ids to a list of symbols
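EnglishVocabulary works at the word level: encode splits the sentence on whitespace and decode joins symbols back with spaces. A hedged sketch of that behavior, where `ToyEnglishVocab` and its word-list constructor are assumptions for illustration:

```python
# Hypothetical sketch: a word-level vocabulary where encode splits on
# whitespace and decode joins with spaces. Not the athena implementation.

class ToyEnglishVocab:
    def __init__(self, words):
        self.stoi = {w: i for i, w in enumerate(words)}  # word -> id
        self.itos = {i: w for w, i in self.stoi.items()}  # id -> word

    def encode(self, sentence):
        # split on whitespace, then map each word to its id
        return [self.stoi[w] for w in sentence.split()]

    def decode(self, ids):
        # join words with a single space
        return " ".join(self.itos[i] for i in ids)

v = ToyEnglishVocab(["hello", "world"])
ids = v.encode("hello world")
print(ids)            # → [0, 1]
print(v.decode(ids))  # → 'hello world'
```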

class athena.data.text_featurizer.SentencePieceFeaturizer(spm_file)

SentencePieceFeaturizer using the tensorflow-text API

load_model(model_file)

load sentence piece model

__len__()
encode(sentence)

convert a sentence to a list of ids by sentence piece model

decode(ids)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
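The featurizer above delegates encoding to a trained SentencePiece model loaded from `spm_file`. As a rough intuition for the kind of subword segmentation such a model performs, here is a greedy longest-match sketch; the function name and the piece inventory are illustrative assumptions, and a real SentencePiece model uses a learned vocabulary and scoring rather than this simple rule:

```python
# Illustrative only: greedy longest-match subword segmentation, sketching
# the flavor of tokenization a SentencePiece model performs. The real
# featurizer loads a trained .model file instead.

def greedy_subword_encode(sentence, pieces):
    """Segment `sentence` into the longest matching pieces, left to right."""
    piece_to_id = {p: i for i, p in enumerate(pieces)}
    ids, pos = [], 0
    while pos < len(sentence):
        # try the longest remaining substring first
        for end in range(len(sentence), pos, -1):
            piece = sentence[pos:end]
            if piece in piece_to_id:
                ids.append(piece_to_id[piece])
                pos = end
                break
        else:
            pos += 1  # skip characters not covered by any piece
    return ids

pieces = ["hel", "lo", "h", "e", "l", "o"]
print(greedy_subword_encode("hello", pieces))  # → [0, 1]
```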
class athena.data.text_featurizer.TextTokenizer(text=None)

TextTokenizer

load_model(text)

load model

save_vocab(vocab_file)
load_csv(csv_file)
__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
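TextTokenizer is constructed from raw text, so its vocabulary is built from a corpus rather than loaded from a fixed vocab file. A minimal sketch of that pattern, assuming frequency-ranked word ids (the class name, the rank-based indexing, and the reserved id 0 are illustrative assumptions, not the athena implementation):

```python
# Hypothetical sketch of a tokenizer that builds its vocabulary from a
# text corpus, as TextTokenizer(text=...) suggests.

from collections import Counter

class ToyTextTokenizer:
    def __init__(self, text=None):
        self.word_index = {}
        self.index_word = {}
        if text is not None:
            self.load_model(text)

    def load_model(self, text):
        # index words by descending frequency; ids start at 1, 0 is reserved
        counts = Counter(text.split())
        for rank, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = rank
        self.index_word = {i: w for w, i in self.word_index.items()}

    def __len__(self):
        return len(self.word_index)

    def encode(self, text):
        # out-of-vocabulary words are silently dropped in this sketch
        return [self.word_index[w] for w in text.split() if w in self.word_index]

    def decode(self, ids):
        return " ".join(self.index_word[i] for i in ids)

tok = ToyTextTokenizer("the cat sat on the mat")
ids = tok.encode("the cat")
print(ids)              # → [1, 2]
print(tok.decode(ids))  # → 'the cat'
```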
class athena.data.text_featurizer.TextFeaturizer(config=None)

The main text featurizer interface

property model_type


Returns

the model type

property unk_index


Returns

the unk index

Return type

int

supported_model
default_config
load_model(model_file)

load model

delete_punct(tokens)

delete punctuation tokens

__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(sequences, ignored_id=[])
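TextFeaturizer is the front-end interface: it takes a config, validates it against `supported_model`, and exposes properties such as `model_type`. A hedged sketch of that config-driven dispatch pattern; the class name, the config keys, and the set of supported types here are assumptions for illustration, not athena's actual `supported_model` or `default_config`:

```python
# Illustrative sketch of a config-driven featurizer front end. The
# construction details and config keys are assumptions.

class ToyTextFeaturizer:
    supported_model = {"vocab", "text"}
    default_config = {"type": "text", "model": None}

    def __init__(self, config=None):
        # merge user config over the defaults
        self.config = dict(self.default_config)
        if config:
            self.config.update(config)
        if self.config["type"] not in self.supported_model:
            raise ValueError(f"unsupported model type: {self.config['type']}")

    @property
    def model_type(self):
        return self.config["type"]

feat = ToyTextFeaturizer({"type": "vocab"})
print(feat.model_type)  # → 'vocab'
```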