Text Mining Online Reviews for Sentiment Analysis
Fri, 28 Oct 2016
Data Science, Machine Learning
This post aims to introduce several basic text mining techniques. Sample implementations will be explored in the
Scikit-learn library using
Anaconda Python.
Introduction
In data science and machine learning, there is often difficulty in extracting useful features from raw data. Textual data presents an interesting challenge in this regards, especially due to its abundance on the internet. Because of its complexity, natural language is often not directly suited to training a classifier or regressor model. The following section discusses several simple ways to extract useful features from raw text. The dataset containing the raw text that will be used can be
found here.
Feature Extraction
The dataset consists of sentences gathered from Imbd, Amazon, and Yelp reviews. Each sentence is associated with a sentiment score: 0 if it is a negative sentence, and 1 if it is positive. For simplicity, the three files are first combined into a single file. This can be accomplished using a linux simple command:
cat imdb_labelled.txt amazon_cells_labelled.txt yelp_labelled.txt > comb.txt.
A basic function to parse the data is shown in the following block:
#Read sentiment labeled sentences from the specified path
#path: The path to the file containing sentiment labeled text data
#return: A tuple (S, y) where S is an array of sentences and y is an
# array of target values.
def LoadData(path):
#File format is \t
#Parse accordingly
S = []
y = []
#Open file and loop over it line by line
with open(path) as f:
for l in f:
text, sent = l.split('\t')
#Strip any non-ascii characters
text = StripNonAscii(text)
#Parse sentiment score
sent = float(sent)
#Append results
S.append(text)
y.append(sent)
return (S, y)
With the data parsed, the next step is to extract numeric features from it. A simple yet effective way of accomplishing this is to make a vector of word frequencies. The concept of a frequency vector is like that of a histogram or word cloud.
Figure 1: Word Frequency Histogram
In a frequency vector, each component corresponds to the number of times a given word occurs in the corpus. A histogram where each bin contains a single word in the vocabulary is a visual representation of this concept.
Another popular diagram that is related to these concepts is the word cloud. The word cloud plots words with their font size determined by the frequency of their occurance. An example word cloud created from the above dataset is shown below in Figure 2.
Figure 2: Word Cloud of the Dataset
Computing a matrix of word frequencies can be easily accomplished with Scikit-learn using the
CountVectorizer class. The constructor takes many arguments, but useful default are provided for all but one. Some interesting arguments to notes are:
- input: A file, filename, or sequence of string-like objects.
- ngram_range: The range of ngram* sizes to include.
- stop_words: Words that will be ignored (like 'a').
- max_df: Any word occuring more frequently than this number is discarded.
- min_df: Any word occuring less frequently than this number is discarded.
- max_features: The maximum number of terms that will be maintained.
- vocabulary: Explicitly provide a list of words to count and ignore others.
*
Note: An
ngram is a sequence of contiguous words like "the phone" or "favorite movie." The use of ngrams will be explored in a later blog post.
To extract the features with our code so far, the following three lines suffice:
S, y = LoadData('/path/to/directory/comb.txt')
cv = CountVectorizer()
A = cv.fit_transform(S)
#Example use of cv
The following code prints to the screen the top 32 words among all sentences along with the number of their occurances:
V = np.sum(cv.fit_transform(S).toarray(), axis = 0)
D = list(zip(V, cv.get_feature_names(), range(V.shape[0])))
for freq, word, c in sorted(D, key = lambda t : t[0], reverse = True)[0:32]:
print('{:5d}'.format(c) + '{:5d}'.format(freq) + '\t' + word)
An inspection of Table 1 below reveals that the most commonly occuring features do not offer much useful information about the data. The goal is to assign a sentence a sentiment value, but the above words can be reasonably expected to occur both in positive and negative sentences. Their frequency is simply due to the semantics of the English language.
Number |
Frequency |
Word |
1 |
1953 |
the |
2 |
1138 |
and |
3 |
789 |
it |
4 |
754 |
is |
5 |
670 |
to |
Table 1: Top 5 Words by Frequency
There are several ways to get around this problem. The most direct approach is to compile a list of
stop words, or words to ignore. Thankfully, Scikit-learn has already implemented this. Simply specify
stop_words='english' in the CountVectorizer constructor. Table 2 below shows the updated results.
Number |
Frequency |
Word |
1 |
230 |
good |
2 |
210 |
great |
3 |
182 |
movie |
4 |
168 |
phone |
5 |
163 |
film |
Table 2: Top 5 Words by Frequency with Stop Words
The above list looks better, but it could be better; "movie", "phone", and "film" are most likely not the best words for determining the sentiment of a sentence. As seen above, Scikit-learn offers the ability to supply a custom vocabulary. Intuitively speaking, words with positive and negative connotations like "great", "horrible", and "love" ought to be of highest importance as a features.
To explore this further, consider the dimensionality transform provided by
linear discriminant analysis. By modeling positive sentiment \((1)\) and negative sentiment \((0) \) as classes, a linear transform which maximizes the between-class variance relative to the within-class variance is constructed. Since there are only two classes in this case, the transform matrix reduces the \(n\)-dimensional features to \(1\)-dimensional features and thus will be of dimension \((1, n)\). The components of largest magnitude in this matrix will thus be the directions that most greatly influence the sentiment score. Code to view the top \(m\) components is as follows:
cv = CountVectorizer(stop_words = 'english', max_features=256)
D = cv.fit_transform(S)
lda = LinearDiscriminantAnalysis()
lda.fit(D.toarray(), y)
m = 40
topmfeats = np.abs(lda.coef_[0]).argsort()[-m:][::-1]
for i, j in enumerate(topmfeats):
s = '{:4d}'.format(i) + "\t"
s += '{:16s}'.format(cv.get_feature_names()[j])
s += '{:+5.3f}'.format(lda.coef_[0][j])
print(s)
The results are shown below in Table 3.
Index |
Word |
Coefficient |
0 |
perfect |
+3.458 |
1 |
fantastic |
+3.448 |
2 |
delicious |
+3.432 |
3 |
awesome |
+3.400 |
4 |
beautiful |
+3.287 |
5 |
enjoyed |
+3.165 |
6 |
disappointing |
-3.107 |
7 |
liked |
+3.063 |
Table 3: Top Words by LDA Coefficient Magnitude
When considering the sources of the data (Imdb, Amazon, and Yelp), the above results confirm intuition. The sentiment rating is largely influence by words with strongly negative or positive connotations. Further words with positive connotations influence the result in a positive direction (towards \(1\)) while words with negative connotations influence the result in a negative direction (towards \(0\)).
Training and Results
Next, a classifier is trained and results are generated. First, the raw frequencies will be used with a stock logistic regression model. Sample code and results follow.
#Prints testing accuracy results to the screen
#C: The classifier to use
#F: The feature extractor to use
#S: The list of sentences
#y: The target vectors
def RunCVTest(C, F, S, y):
#Fix the random state for better comparison
kf = KFold(len(S), shuffle = True, random_state = 32)
for trn, tst in kf:
#Make sure to only train with the training data
#in a realistic scenario only training data is available at the
#feature extraction stage
F.fit(S[trn])
B = F.transform(S)
#Fit the classifier C
C.fit(B[trn], y[trn])
#Results for cross-validation set
r1 = C.score(B[tst], y[tst])
#Results for training data
r2 = C.score(B[trn], y[trn])
#Both results combined
r3 = C.score(B, y)
s = 'Tst: ' + '{:.4f}'.format(r1)
s += '\tTrn: ' + '{:.4f}'.format(r2)
s += '\tAll: ' + '{:.4f}'.format(r3)
print(s)
#...
#%% A first attempt
S, y = LoadData(DATA_PATH + 'comb.txt')
cv = CountVectorizer()
lr = LogisticRegression()
#Convert to numpy array for indexing ability
S = np.array(S)
y = np.array(y)
print('LogisticRegression: ')
RunCVTestHTML(lr, cv, S, y)
At this point, the results are decent. However, as can be seen from Table 4 below, there is a large discrepancy between the testing and training accuracy scores; the model appears to be over-fitting the training data. This is not overly suprising when the results from Table 1 are considered. If the features contain superfluous information, the model is likely to at least partially fit the superfluous information allowing for high accuracy on the training data but poor generalization ability.
Test |
Train |
All |
79.30% |
98.20% |
91.90% |
79.90% |
97.85% |
91.87% |
82.10% |
97.95% |
92.67% |
Table 4: Logistic Regression Performance Results
To help reduce the dimensionality of the data, prevent over-fitting, and to slightly improve the results, a custom vocabulary is used. This vocabulary is constructed by using the LDA components of largest magnitude as discussed earlier.
#%% A second attempt with custom vocabulary
S, y = LoadData(DATA_PATH + 'comb.txt')
cv = CountVectorizer(stop_words = 'english', max_features = 512)
D = cv.fit_transform(S)
lda = LinearDiscriminantAnalysis()
lda.fit(D.toarray(), y)
#Determined by exhaustively searching 1 <= m <= 512
m = 213
topmfeats = np.abs(lda.coef_[0]).argsort()[-m:][::-1]
voc = [cv.get_feature_names()[i] for i in topmfeats]
avgs = RunCVTest(LogisticRegression(), CountVectorizer(vocabulary = voc), S, y)
In the above code, only the first 213 words are preserved. Table 5 contains the updated results from the above code.
Test |
Train |
All |
80.50% |
84.40% |
83.10% |
80.10% |
83.90% |
82.63% |
82.50% |
82.75% |
82.67% |
Table 5: Logistic Regression with Custom Vocabulary Results
As can be seen, there is a modest improvement in the cross-validation performance. Performance on the training data has decreased. This is reasonable as some spurious features have been removed so the potential for over-fitting has been reduced. Finally, some other slight performance improvements can be had by grid searching through the parameters of the feature extractor and classifier.
#Determines locally optimal parameters for the LogisticRegression
#classifier using exhaustive search
#S: The list of sentences
#y: The target vectors of sentiment scores
#voc: The vocabulary to use for CountVectorizer
#ret: The locally optimal classifier
def FindBestParams(S, y, voc):
#This will take a long time to run!
params = {'penalty':('l1', 'l2'), 'intercept_scaling':np.arange(0.1,10.1,0.1), 'C':np.arange(0.1, 10.1, 0.1)}
cv = CountVectorizer(vocabulary = voc)
gscv = GridSearchCV(LogisticRegression(), params)
gscv.fit(cv.fit_transform(S), y)
return gscv
#...
gscv = FindBestParams(S, y, voc)
lr = gscv.best_estimator_
RunCVTest(lr, cv, S, y)
The final results are shown below in Table 6.
Test |
Train |
All |
80.60% |
84.80% |
83.40% |
80.80% |
84.55% |
83.30% |
84.60% |
83.10% |
83.60% |
Table 6: Tuned Results for Logistic Regression
Further improvements in the size of the data and the performance of the model can probably be had by further tuning of the parameters.
Conclusion
Vectors of word frequencies are a basic type of feature that can be extracted from textual data. Despite the simplicity of the feature, reasonable performance can be achieved. A future blog post will explore some slightly more sophisticated methods available in Scikit-learn and possibly other libraries. I hope to see you then.