130 Online Homestay UGC Data Mining Practice--Application of Integrated Model in Sentiment Analysis

This experiment loads two data sets: labeled user reviews and user-review topic sentences. We train a sentiment polarity model based on an ensemble (integrated) model using the labeled reviews, use that model to infer the sentiment polarity of the topic sentences, and finally obtain topic-level sentiment polarity through data aggregation and visualization.

Use Pandas to load the online data table and view the data dimensions and the first 5 rows of data.
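A minimal sketch of this loading step. The source does not show the file location or the column names, so the in-memory CSV and its `content` and `label` columns are hypothetical stand-ins for the real table.

```python
import io
import pandas as pd

# Hypothetical stand-in for the labeled review table; the real file and
# its column names are not shown in the source.
csv_text = """content,label
The room was clean and quiet,1
Hot water kept cutting out,0
Great location near the metro,1
The host never replied,0
Lovely view from the balcony,1
The bed was uncomfortable,0
"""

# pd.read_csv accepts a path, a URL, or a file-like object.
reviews = pd.read_csv(io.StringIO(csv_text))

print(reviews.shape)   # (rows, columns)
print(reviews.head())  # first 5 rows by default
```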

The data attributes are as shown in the table below

Load the topic sentences we extracted earlier using the topic dictionary.

The data attributes are shown in the following table

User comment word segmentation

Warm up jieba word segmentation first. On first use it has to load its dictionary and cache. As the result shows, the call returns a list of tokens.

Segment the user reviews in batches, which takes some time, and then print the first row of segmentation results from the sentiment polarity training set.

Segment the user-review topic sentences in batches and print the segmentation result for the first topic sentence.

Following the statistical model's assumption that the words in a user review are independent of one another, we treat each word as a feature. We extract features from the reviews directly with TF-IDF, feed the resulting vectors into a classification model, and use the predicted probability of the positive class as the user's sentiment polarity.

User comment vectorization

TF-IDF is a commonly used weighting technique in information retrieval and data mining. Generally speaking, the larger a word's TF-IDF value in a document, the more important that word is to the document, which makes TF-IDF well suited to quantifying the keywords in user reviews.
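The vectorization step might look like the sketch below, assuming scikit-learn's `TfidfVectorizer` (the source does not show which implementation it uses). The three short documents are hypothetical, written as already-segmented tokens joined by spaces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-segmented reviews: tokens joined with spaces.
docs = [
    "room clean quiet",
    "room dirty noisy",
    "host friendly room clean",
]

# This token_pattern keeps single-character tokens, which the default
# pattern drops; that matters for Chinese, where many words are one character.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

idx = vectorizer.vocabulary_  # maps each term to its column index
print(X.shape)
print("room :", round(X[0, idx["room"]], 3))
print("quiet:", round(X[0, idx["quiet"]], 3))
```

In the first document, "quiet" gets a larger weight than "room" because "room" appears in every document and so carries less information.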

Data set division

Split the data set into an 80% training set and a 20% test set, and check the number of samples in each after the split.
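A sketch of the 80/20 split, assuming scikit-learn's `train_test_split`. The toy features and labels are hypothetical, and `random_state` is added here only to make the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with binary labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# test_size=0.2 holds out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```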

At the beginning of this series of experiments we trained the sentiment analysis model with naive Bayes. Here we add logistic regression as a comparison model. Logistic regression is a machine learning method for binary classification. It applies a sigmoid function on top of a linear regression: the function maps the linear result to a probability in the interval (0, 1), and 0.5 is usually taken as the decision threshold, so the outputs tend toward either 0 or 1. Once user reviews have been vectorized, this method can also be used to predict user sentiment. This experiment trains directly on the labeled sentiment data and compares the sentiment analysis performance of a single model against an ensemble model.
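The sigmoid mapping and the 0.5 threshold can be illustrated in a few lines. This is a standalone sketch of the formula only, not the training code; `predict_label` is a hypothetical helper name.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Return 1 (positive) when the probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

for score in (-4.0, 0.0, 4.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```

A score of 0 sits exactly on the threshold, while scores far from 0 map to probabilities close to 0 or 1.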

Model loading

By passing in the true labels and the predicted labels, we can measure the classifier's performance directly and evaluate the trained model with the standard classification metrics. accuracy_score is the proportion of correctly predicted samples among all samples. Precision is the share of samples the model labels as positive that really are positive (in retrieval terms, the number of relevant documents retrieved divided by the total number of documents retrieved). Recall, also called sensitivity, is the share of actual positive samples that the model correctly identifies (in retrieval terms, the number of relevant documents retrieved divided by the number of all relevant documents in the collection). f1_score is the harmonic mean of precision and recall and serves as a combined metric.
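As a worked example of these four metrics, the sketch below uses the corresponding functions from `sklearn.metrics` on a made-up set of eight labels (3 true positives, 1 false negative, 1 false positive, 3 true negatives).

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (3 + 3) / 8
    "precision": precision_score(y_true, y_pred),  # 3 / (3 + 1)
    "recall": recall_score(y_true, y_pred),        # 3 / (3 + 1)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
}
print(metrics)
```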

We train and test the different models on the same data set to compare the single models, and print each model's running time for reference. Running the models one after another takes some time, so please wait patiently.
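One way to run this comparison is sketched below. It assumes scikit-learn's `MultinomialNB` and `LogisticRegression`; the source does not say which naive Bayes variant it uses. The features are synthetic non-negative counts standing in for TF-IDF vectors.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative features standing in for TF-IDF vectors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.poisson(lam=[3, 1, 1, 1], size=(50, 4)),  # class 0
    rng.poisson(lam=[1, 1, 1, 3], size=(50, 4)),  # class 1
]).astype(float)
y = np.array([0] * 50 + [1] * 50)

# Shuffle once and use the same 80/20 split for every model.
order = rng.permutation(len(y))
train, test = order[:80], order[80:]

models = {"nb": MultinomialNB(), "lr": LogisticRegression(max_iter=1000)}
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X[train], y[train])
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y[test], model.predict(X[test]))
    results[name] = {"accuracy": acc, "fit_seconds": elapsed}
    print(f"{name}: accuracy={acc:.3f}, fit time={elapsed:.4f}s")
```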

Evaluating the models with these metrics, we find that, trained on the same data, the naive Bayes and logistic regression models perform almost identically. The difference is very small, with logistic regression slightly ahead.

Stacking model training

Ensemble learning combines two or more base machine learning algorithms. It learns how best to combine the predictions of several well-performing models, with the aim of predicting better than any single model in the ensemble. The main families are Bagging, Boosting, and Stacking. Stacking is one kind of ensemble model: every trained base model predicts over the entire training set, and the predictions output by each model are merged into new features on which a further model is trained. Its main benefits are that it can reduce the risk of overfitting and improve accuracy.

Start the ensemble training of the two models. Training takes longer than for a single model, so please wait patiently.
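A hedged sketch of stacking the two models. The source does not show which stacking implementation it uses; this example assumes scikit-learn's `StackingClassifier`, which builds the new features from cross-validated predictions of the base models rather than from predictions over the full training set. The logistic regression meta-model and the synthetic data are also assumptions.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative features standing in for TF-IDF vectors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.poisson(lam=[3, 1, 1, 1], size=(60, 4)),  # class 0
    rng.poisson(lam=[1, 1, 1, 3], size=(60, 4)),  # class 1
]).astype(float)
y = np.array([0] * 60 + [1] * 60)

model_stacking = StackingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    # The meta-model learns how to weight the base models' predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
model_stacking.fit(X, y)

proba = model_stacking.predict_proba(X[:3])
print(proba.round(3))  # one row per sample: [P(negative), P(positive)]
```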

Collect the evaluation results.

Result analysis

Store the results in a DataFrame for analysis. lr denotes logistic regression, nb denotes naive Bayes, and model_stacking is the model that ensembles the two single models. Judging from the results, the ensemble model has the highest accuracy and F1 score. By combining the strengths of both models, it predicts better overall and is more robust.
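Collecting per-model metrics into one table could look like the sketch below. The predictions are made up purely to show the table layout; they are not the experiment's results.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical predictions from each model on the same 8 test labels.
# These are illustrative values, NOT the results reported in the text.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = {
    "nb": [1, 1, 0, 0, 0, 1, 1, 0],
    "lr": [1, 1, 0, 0, 0, 0, 1, 0],
    "model_stacking": [1, 1, 1, 0, 0, 0, 1, 0],
}

rows = [
    {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
]
results = pd.DataFrame(rows).set_index("model")
print(results.sort_values("f1", ascending=False))
```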

Sample test

The test samples show that the classifier handles ordinary positive and negative reviews well. But when we alter the semantics of a sentence, the sentiment model fails to recognize the change, so its robustness is limited. The TF-IDF feature extraction we use, typical of early text classification models, cannot handle semantics well: natural language depends on word order and meaning, and the relationships between words affect the sentiment polarity of the whole sentence. In later experiments we will try deep sentiment analysis models to study and address these problems.

Load the B&B topic data.

Model prediction

Write the sentiment model's inference results into a DataFrame for aggregation.
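A minimal sketch of this step. For brevity it trains a single TF-IDF plus logistic regression pipeline instead of the stacking model, and the training texts, the `topic`, `sentence`, and `sentiment` column names, and the topic sentences are all hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews (already segmented, tokens joined by spaces).
train_texts = [
    "room clean quiet", "host friendly helpful", "bed dirty broken",
    "noisy dirty room", "clean friendly host", "broken noisy shower",
]
train_labels = [1, 1, 0, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Hypothetical topic sentences to score.
topics = pd.DataFrame({
    "topic": ["facilities", "facilities", "service"],
    "sentence": ["shower broken", "room clean", "host friendly"],
})

# Column 1 of predict_proba is the probability of class 1 (positive),
# which is used as the sentiment polarity.
topics["sentiment"] = model.predict_proba(topics["sentence"])[:, 1]
print(topics)
```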

Single topic aggregation analysis

Select a topic for topic sentiment analysis.

Compute descriptive statistics for the B&B "facilities" topic. Using the topic dictionary, we obtained 4,628 user discussions of B&B "facilities". The average user sentiment polarity is 0.40, which indicates dissatisfaction overall. More than half of the B&B reviews that mention "facilities" express dissatisfaction. Chongqing B&Bs need to improve their "facilities" to raise user satisfaction.
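The aggregation for one topic can be sketched as follows. The six polarity scores and the column names are hypothetical; the real table holds 4,628 "facilities" rows. Treating scores below 0.5 as negative is an assumption that mirrors the usual 0.5 decision threshold.

```python
import pandas as pd

# Hypothetical polarity scores for illustration only.
topics = pd.DataFrame({
    "topic": ["facilities"] * 4 + ["service"] * 2,
    "sentiment": [0.2, 0.3, 0.4, 0.7, 0.9, 0.8],
})

facilities = topics.loc[topics["topic"] == "facilities", "sentiment"]
print(facilities.describe())  # count, mean, std, min, quartiles, max

# Share of sentences scored below 0.5, i.e. leaning negative.
negative_share = (facilities < 0.5).mean()
print(f"negative share: {negative_share:.2f}")
```

Checking the share below the threshold alongside the mean is useful because a low mean alone does not show how many individual reviews lean negative.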

Visualization of single-topic emotion polarity

We now visualize user sentiment under the "facilities" topic. First, load the plotting module.

To visualize the user sentiment polarity under the "facilities" topic, we use the ensemble model to predict the sentiment polarity of the topic sentence, as shown below.
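The source does not show which plotting library or chart type it uses. As one possible sketch, the code below draws a histogram of polarity scores with matplotlib; the scores and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical polarity scores for "facilities" topic sentences.
scores = [0.05, 0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.8, 0.95]

fig, ax = plt.subplots(figsize=(6, 4))
counts, edges, _ = ax.hist(scores, bins=10, range=(0.0, 1.0))
ax.axvline(0.5, linestyle="--", color="gray")  # 0.5 decision threshold
ax.set_xlabel("Predicted sentiment polarity")
ax.set_ylabel("Number of topic sentences")
ax.set_title('Sentiment polarity for the "facilities" topic')
fig.savefig("facilities_sentiment.png", dpi=100)
```

A distribution with most of its mass to the left of the dashed line corresponds to a topic where negative sentiment dominates.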