NgramHash Class
Extracts n-grams from text and converts them to a feature vector using the hashing trick.
- Inheritance
  - nimbusml.internal.core.feature_extraction.text.extractor._ngramhash.NgramHash
  - NgramHash
Constructor
NgramHash(number_of_bits=16, ngram_length=1, skip_length=0, all_lengths=True, seed=314489979, ordered=True, maximum_number_of_inverts=0, **params)
Parameters
Name | Description |
---|---|
number_of_bits | Number of bits to hash into. Must be between 1 and 30, inclusive. |
ngram_length | N-gram length. |
skip_length | Maximum number of tokens to skip when constructing an n-gram. |
all_lengths | Whether to include all n-gram lengths up to ngram_length or only ngram_length. |
seed | Hashing seed. |
ordered | Whether the position of each source column should be included in the hash (when there are multiple source columns). |
maximum_number_of_inverts | Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing; -1 means no limit. |
params | Additional arguments sent to the compute engine. |
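The hashed feature space always contains 2**number_of_bits slots, which is why the example below (using the default number_of_bits=16) produces columns SentimentText.0 through SentimentText.65535. A minimal plain-Python illustration of this relationship (vector_length is a hypothetical helper for exposition, not part of nimbusml):

```python
# Illustrative helper: the hashed output vector has 2**number_of_bits slots.
def vector_length(number_of_bits: int) -> int:
    # Mirrors the documented constraint on number_of_bits.
    if not 1 <= number_of_bits <= 30:
        raise ValueError("number_of_bits must be between 1 and 30, inclusive")
    return 2 ** number_of_bits

print(vector_length(16))  # 65536, i.e. slot indices 0..65535
```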
Examples
###############################################################################
# NGramFeaturizer
from nimbusml import FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer
from nimbusml.feature_extraction.text.extractor import NgramHash
# data input (as a FileDataStream)
path = get_dataset('wiki_detox_train').as_filepath()
data = FileDataStream.read_csv(path, sep='\t')
print(data.head())
# Sentiment SentimentText
# 0 1 ==RUDE== Dude, you are rude upload that carl p...
# 1 1 == OK! == IM GOING TO VANDALIZE WILD ONES WIK...
# 2 1 Stop trolling, zapatancas, calling me a liar m...
# 3 1 ==You're cool== You seem like a really cool g...
# 4 1 ::::: Why are you threatening me? I'm not bein...
# transform usage
xf = NGramFeaturizer(word_feature_extractor=NgramHash(),
columns=['SentimentText'])
# fit and transform
features = xf.fit_transform(data)
# print features
print(features.head())
# Sentiment SentimentText.0 ... SentimentText.65534 SentimentText.65535
# 0 1 0.0 ... 0.0 0.0
# 1 1 0.0 ... 0.0 0.0
# 2 1 0.0 ... 0.0 0.0
# 3 1 0.0 ... 0.0 0.0
# 4 1 0.0 ... 0.0 0.0
Remarks
The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. There are two ways it can do this:
- build a dictionary of n-grams and use the id in the dictionary as the index in the bag;
- hash each n-gram and use the hash value as the index in the bag.
This class provides the text extractor that implements the second approach. In NGramFeaturizer, users specify which text extractor to use by passing it as the word_feature_extractor argument, as in the example above.
The purpose of hashing is to convert variable-length text documents into equal-length numeric feature vectors, to support dimensionality reduction and to make the lookup of feature weights faster.
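The hashing trick can be sketched in plain Python. This illustration masks Python's built-in hash() of each n-gram string down to number_of_bits bits; NimbusML's actual implementation uses a different, seeded hash function (and Python's string hashing is randomized per process), so the specific indices here are illustrative only:

```python
from collections import Counter

def hashed_ngram_counts(tokens, ngram_length=1, number_of_bits=16):
    """Count n-grams into a fixed space of 2**number_of_bits slots.

    Illustrative sketch: uses Python's hash(), not NimbusML's seeded
    hash, so slot indices differ from NimbusML's and vary across runs.
    """
    mask = (1 << number_of_bits) - 1  # keep only the low number_of_bits bits
    counts = Counter()
    for i in range(len(tokens) - ngram_length + 1):
        ngram = " ".join(tokens[i:i + ngram_length])
        counts[hash(ngram) & mask] += 1  # hash value becomes the slot index
    return counts

tokens = "you are rude you are".split()
vec = hashed_ngram_counts(tokens, ngram_length=2, number_of_bits=16)
# every slot index fits inside the 2**16-slot vector
assert all(0 <= slot < 2 ** 16 for slot in vec)
```

Note that identical n-grams always collide into the same slot, which is what makes the counts meaningful; distinct n-grams may also collide, which is the price paid for a fixed-length vector.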
The n-grams are represented as count vectors, with vector slots corresponding to their hashes. Embedding n-grams in a vector space allows their contents to be compared in an efficient manner. The slot values in the vector can be weighted by the following factors:
- term frequency - the number of occurrences of the slot in the
text.
- inverse document frequency - a ratio (the logarithm of
inverse relative slot frequency) that measures the information a
slot provides by determining how common or rare it is across the entire
corpus.
- term frequency-inverse document frequency - the product of the
term frequency and the inverse document frequency.
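The three weighting schemes above can be written out as short formulas (a plain-Python sketch of the standard definitions, not NimbusML's internal implementation):

```python
import math

def tf(slot_count, total_count):
    # term frequency: occurrences of the slot relative to the document length
    return slot_count / total_count

def idf(num_docs, docs_containing_slot):
    # inverse document frequency: log of the inverse relative slot frequency
    return math.log(num_docs / docs_containing_slot)

def tf_idf(slot_count, total_count, num_docs, docs_containing_slot):
    # product of term frequency and inverse document frequency
    return tf(slot_count, total_count) * idf(num_docs, docs_containing_slot)

# a slot appearing 3 times in a 10-term document, present in 1 of 4 documents
print(tf_idf(3, 10, 4, 1))  # 0.3 * ln(4)
```

A slot that appears in every document gets idf of log(1) = 0, so tf-idf suppresses slots that carry no discriminating information.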
Methods
Name | Description |
---|---|
get_params | Get the parameters for this operator. |
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | Default value: False |