athena.data.text_featurizer

Text featurizer

Module Contents

Classes

Vocabulary

Vocabulary

EnglishVocabulary

English vocabulary with words separated by spaces

SentencePieceFeaturizer

SentencePieceFeaturizer using the tensorflow-text API

TextTokenizer

TextTokenizer

TextFeaturizer

The main text featurizer interface

class athena.data.text_featurizer.Vocabulary(vocab_file)

Vocabulary

load_model(vocab_file)

load model

_default_unk_index()
_default_unk_symbol()
__len__()
decode(ids)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])

convert a list of ids to a list of symbols

encode(sentence)

convert a sentence to a list of ids, with special tokens added.

__call__(inputs)
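The methods above describe a symbol-to-id mapping with an unknown-token fallback. The following is a minimal illustrative sketch of that interface (encode/decode/decode_to_list/__len__); the class name `ToyVocabulary` and its internals are assumptions for illustration, not the athena implementation, which loads its mapping from a vocab file.

```python
# Illustrative sketch only: a minimal vocabulary with the documented
# interface. Not the athena implementation.

class ToyVocabulary:
    """Maps symbols to integer ids, with an <unk> fallback."""

    def __init__(self, symbols):
        self.unk_symbol = "<unk>"
        self.stoi = {self.unk_symbol: 0}  # symbol -> id
        for sym in symbols:
            self.stoi.setdefault(sym, len(self.stoi))
        self.itos = {i: s for s, i in self.stoi.items()}  # id -> symbol
        self.unk_index = 0

    def __len__(self):
        return len(self.stoi)

    def encode(self, sentence):
        # per-character encoding; unknown characters map to <unk>
        return [self.stoi.get(ch, self.unk_index) for ch in sentence]

    def decode(self, ids):
        # convert a list of ids back to a sentence
        return "".join(self.itos[i] for i in ids)

    def decode_to_list(self, ids, ignored_id=()):
        # convert ids to a list of symbols, skipping ignored ids
        return [self.itos[i] for i in ids if i not in ignored_id]

vocab = ToyVocabulary("abc")
ids = vocab.encode("abcx")  # 'x' is out of vocabulary
print(ids)                  # → [1, 2, 3, 0]
print(vocab.decode_to_list(ids, ignored_id=[0]))  # → ['a', 'b', 'c']
```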
class athena.data.text_featurizer.EnglishVocabulary(vocab_file)

Bases: Vocabulary

English vocabulary with words separated by spaces

decode(ids)

convert a list of ids to a sentence.

encode(sentence)

convert a sentence to a list of ids, with special tokens added.

decode_to_list(ids, ignored_id=[])

convert a list of ids to a list of symbols
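EnglishVocabulary works at the word level: encode splits the sentence on whitespace and decode joins symbols back with spaces. A hedged sketch of that behavior, where `ToyEnglishVocab` and its word-list constructor are assumptions for illustration:

```python
# Hypothetical sketch: a word-level vocabulary where encode splits on
# whitespace and decode joins with spaces. Not the athena implementation.

class ToyEnglishVocab:
    def __init__(self, words):
        self.stoi = {w: i for i, w in enumerate(words)}  # word -> id
        self.itos = {i: w for w, i in self.stoi.items()}  # id -> word

    def encode(self, sentence):
        # split on whitespace, then map each word to its id
        return [self.stoi[w] for w in sentence.split()]

    def decode(self, ids):
        # join words with a single space
        return " ".join(self.itos[i] for i in ids)

v = ToyEnglishVocab(["hello", "world"])
ids = v.encode("hello world")
print(ids)            # → [0, 1]
print(v.decode(ids))  # → 'hello world'
```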

class athena.data.text_featurizer.SentencePieceFeaturizer(spm_file)

SentencePieceFeaturizer using the tensorflow-text API

load_model(model_file)

load sentence piece model

__len__()
encode(sentence)

convert a sentence to a list of ids by sentence piece model

decode(ids)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
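The featurizer above delegates encoding to a trained SentencePiece model loaded from `spm_file`. As a rough intuition for the kind of subword segmentation such a model performs, here is a greedy longest-match sketch; the function name and the piece inventory are illustrative assumptions, and a real SentencePiece model uses a learned vocabulary and scoring rather than this simple rule:

```python
# Illustrative only: greedy longest-match subword segmentation, sketching
# the flavor of tokenization a SentencePiece model performs. The real
# featurizer loads a trained .model file instead.

def greedy_subword_encode(sentence, pieces):
    """Segment `sentence` into the longest matching pieces, left to right."""
    piece_to_id = {p: i for i, p in enumerate(pieces)}
    ids, pos = [], 0
    while pos < len(sentence):
        # try the longest remaining substring first
        for end in range(len(sentence), pos, -1):
            piece = sentence[pos:end]
            if piece in piece_to_id:
                ids.append(piece_to_id[piece])
                pos = end
                break
        else:
            pos += 1  # skip characters not covered by any piece
    return ids

pieces = ["hel", "lo", "h", "e", "l", "o"]
print(greedy_subword_encode("hello", pieces))  # → [0, 1]
```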
class athena.data.text_featurizer.TextTokenizer(text=None)

TextTokenizer

load_model(text)

load model

save_vocab(vocab_file)
load_csv(csv_file)
__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
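TextTokenizer is constructed from raw text, so its vocabulary is built from a corpus rather than loaded from a fixed vocab file. A minimal sketch of that pattern, assuming frequency-ranked word ids (the class name, the rank-based indexing, and the reserved id 0 are illustrative assumptions, not the athena implementation):

```python
# Hypothetical sketch of a tokenizer that builds its vocabulary from a
# text corpus, as TextTokenizer(text=...) suggests.

from collections import Counter

class ToyTextTokenizer:
    def __init__(self, text=None):
        self.word_index = {}
        self.index_word = {}
        if text is not None:
            self.load_model(text)

    def load_model(self, text):
        # index words by descending frequency; ids start at 1, 0 is reserved
        counts = Counter(text.split())
        for rank, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = rank
        self.index_word = {i: w for w, i in self.word_index.items()}

    def __len__(self):
        return len(self.word_index)

    def encode(self, text):
        # out-of-vocabulary words are silently dropped in this sketch
        return [self.word_index[w] for w in text.split() if w in self.word_index]

    def decode(self, ids):
        return " ".join(self.index_word[i] for i in ids)

tok = ToyTextTokenizer("the cat sat on the mat")
ids = tok.encode("the cat")
print(ids)              # → [1, 2]
print(tok.decode(ids))  # → 'the cat'
```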
class athena.data.text_featurizer.TextFeaturizer(config=None)

The main text featurizer interface

property model_type


Returns

the model type

property unk_index


Returns

the unk index

Return type

int

supported_model
default_config
load_model(model_file)

load model

delete_punct(tokens)

delete punctuation tokens

__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(sequences, ignored_id=[])
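TextFeaturizer is the front-end interface: it takes a config, validates it against `supported_model`, and exposes properties such as `model_type`. A hedged sketch of that config-driven dispatch pattern; the class name, the config keys, and the set of supported types here are assumptions for illustration, not athena's actual `supported_model` or `default_config`:

```python
# Illustrative sketch of a config-driven featurizer front end. The
# construction details and config keys are assumptions.

class ToyTextFeaturizer:
    supported_model = {"vocab", "text"}
    default_config = {"type": "text", "model": None}

    def __init__(self, config=None):
        # merge user config over the defaults
        self.config = dict(self.default_config)
        if config:
            self.config.update(config)
        if self.config["type"] not in self.supported_model:
            raise ValueError(f"unsupported model type: {self.config['type']}")

    @property
    def model_type(self):
        return self.config["type"]

feat = ToyTextFeaturizer({"type": "vocab"})
print(feat.model_type)  # → 'vocab'
```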