The Rise of Text Analysis and Machine Learning

First, what is text analysis?

Text analysis is a new quantitative method built on qualitative research. It can reveal the changes in and characteristics of a text, and it offers a new angle on long-standing research problems.

Text analysis is applied in many fields. In tourism, for example, it can be used to study how a destination's image is perceived; in economics, it can be used to study current insurance policy. It is used in other fields as well.

Second, the general research steps of text analysis

Text analysis commonly proceeds in five steps: data collection, word segmentation, data cleaning, feature extraction, and modeling or other subsequent analysis, as shown in the following figure:

Data collection

The first step of text analysis is to collect data. Text data is generally obtained from web platforms, media platforms, news sites, CNKI, forums, and so on.

Word segmentation

The computer splits the imported character string into individual words for subsequent analysis.
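To make the idea of segmentation concrete, here is a minimal sketch of forward maximum matching, one classic dictionary-based approach for text without spaces. It is an illustration only and is not claimed to be how SPSSAU segments text; the function name, the `max_len` parameter, and the tiny `VOCAB` dictionary are all assumptions made for this example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; real segmenters ship far larger ones.
VOCAB = {"文本", "分析", "文本分析", "方法", "机器", "学习", "机器学习"}

print(forward_max_match("文本分析和机器学习", VOCAB))
```

Because the matcher is greedy, the quality of the result depends entirely on the dictionary: any word missing from it falls apart into single characters.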

Data cleaning

Before analysis, the text must be preprocessed. Preprocessing is a very important step because it directly affects the accuracy and reliability of everything that follows. Removing punctuation and stop words is a common preprocessing operation: it strips irrelevant information from the text and makes the analysis more efficient. Together with word segmentation, it also helps extract more accurate keywords and topics. After cleaning, the theme of the text can be analyzed through the frequency and distribution of keywords, and some researchers also gauge the sentiment of the text by analyzing sentiment words.
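As a rough sketch of this cleaning step, the snippet below lowercases a sentence, strips punctuation, and drops stop words. The stop word list is a hypothetical handful of English words chosen for illustration, and only ASCII punctuation is handled; a real project would use a fuller list suited to its language.

```python
import string

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in stop_words]


print(clean_tokens("The city is developing, and the economy is growing!"))
```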

Feature extraction

After data cleaning, features can be extracted. A common method is TF-IDF (available in the text visualization section), which weighs how important a word is within a text against how widespread it is across the corpus. The higher a word's TF-IDF value, the more important the word is to the text. Other feature extraction methods exist as well.

Subsequent analysis

The text data is then used for subsequent analysis, such as visualization, topic analysis, and clustering, which are explained in the next section.

Third, how does SPSSAU work?

Text analysis demonstration: click 'Text Analysis Module' in the left-hand dashboard of the SPSSAU main system to enter the module.

After entering the text analysis module, researchers can upload their own data by pasting text or by uploading a txt/excel file (size limited to 5 MB), as shown in the figure below:

Then choose an analysis method according to your needs and run the analysis:

What can text analysis do?

Text analysis has many applications. Taking SPSSAU as an example, it supports text visualization (word cloud analysis), text sentiment analysis, text clustering, social network diagrams, LDA topic analysis, semantic analysis, and so on.

Text visualization

In the text analysis module, the most basic and important task is to display the word segmentation results, usually as a word cloud. For this, SPSSAU provides four functions: word cloud analysis, custom word cloud, word positioning, and tf-idf.

Word cloud analysis

The word cloud intuitively shows the keywords of the news content from December 2023 (41 items in total); number of households, cities, development, and construction are all key information. The top 100 high-frequency keywords are displayed by default, and this number can be changed. You can also modify the style of the word cloud and download the word cloud image.
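The data behind a word cloud is simply a ranked list of keyword frequencies. A minimal sketch, assuming the text has already been segmented into tokens (the sample tokens and the `top_keywords` helper are hypothetical):

```python
from collections import Counter


def top_keywords(tokens, limit=100):
    """Return the `limit` most frequent tokens with their counts."""
    return Counter(tokens).most_common(limit)


tokens = ["city", "development", "city", "construction", "city", "development"]
for word, count in top_keywords(tokens, limit=2):
    print(word, count)
```

The `limit` argument plays the same role as the adjustable keyword count described above.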

Custom word cloud

If the default word cloud is not satisfactory, a custom word cloud can be used instead. Researchers paste (or edit) their prepared keywords and word frequencies directly into the table, and the corresponding word cloud is generated.

Word positioning

Word positioning lets you inspect where a word occurs, looking it up either by word or by line number.

tf-idf

Tf-idf is an important indicator in text analysis that reflects how important a keyword is within the whole data set: the higher the tf-idf, the more important the keyword. It differs from word frequency, which is simply the number of occurrences, whereas tf-idf focuses on the importance of the keyword. It is defined as tf-idf = tf * idf, where:

tf = n / N, where n is the word frequency of the keyword and N is the sum of the word frequencies over the whole data. N is a fixed value, so the larger n is, the higher the tf, which means the keyword is more important.

idf = log(D / (1 + d)), where log is the logarithm, D is the number of rows in the data, and d is the number of rows in which the word appears. D is a fixed value. The larger d is, the more widely the word appears and the smaller the idf; the smaller d is, the less widely it appears and the higher the idf. The higher the idf, the more important the keyword.
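The calculation can be sketched in a few lines. In the code, n is a keyword's count, N is the total count of all keywords, D is the number of rows, and d is the number of rows containing the keyword. The natural logarithm is an assumption, since the base is not stated here, and the sample rows are hypothetical.

```python
import math
from collections import Counter


def tf_idf(rows):
    """rows: list of token lists, one per line of data."""
    counts = Counter(tok for row in rows for tok in row)         # n per keyword
    total = sum(counts.values())                                 # N
    n_rows = len(rows)                                           # D
    row_freq = Counter(tok for row in rows for tok in set(row))  # d per keyword
    scores = {}
    for word, n in counts.items():
        tf = n / total
        idf = math.log(n_rows / (1 + row_freq[word]))  # natural log assumed
        scores[word] = tf * idf
    return scores


rows = [
    ["city", "growth"],
    ["city", "policy"],
    ["city", "tourism"],
    ["insurance", "policy"],
]
for word, score in sorted(tf_idf(rows).items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.4f}")
```

In this sample "city" is the most frequent word, yet it appears in three of the four rows, so its idf is log(4 / 4) = 0 and it scores lowest. Note that with the 1 + d term, a word that appears in every row gets a slightly negative idf.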

Text sentiment analysis

Mainstream text sentiment analysis methods currently fall into three categories: sentiment-dictionary-based, machine learning, and deep learning. The dictionary-based method is the traditional approach: it uses the polarity assigned to each word in a sentiment dictionary to calculate the sentiment value of the target sentence. Although it is simple to implement, it has shortcomings. Its accuracy depends largely on the quality of the dictionary, which takes considerable manpower and resources to build, and it adapts poorly to new words.

In the text analysis module, SPSSAU provides two kinds of sentiment analysis: word-by-word and line-by-line. Word-by-word sentiment analysis scores the sentiment of the extracted keywords and displays the results visually. Line-by-line sentiment analysis scores the original data one line at a time, and the detailed sentiment scores can be downloaded.
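A dictionary-based scorer can be sketched as follows. The lexicon here is a hypothetical six-word example with polarities of +1 and -1; real sentiment dictionaries are far larger and often carry graded weights.

```python
# Hypothetical mini-lexicon: word -> polarity (+1 positive, -1 negative).
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "poor": -1, "sad": -1}


def sentiment_score(tokens, lexicon=LEXICON):
    """Sum the polarity of every token found in the lexicon."""
    return sum(lexicon.get(tok, 0) for tok in tokens)


def sentiment_label(score):
    """Map a numeric score to a coarse label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for line in ["the service was great", "the room was bad and sad", "totally meh"]:
    score = sentiment_score(line.split())
    print(line, "->", score, sentiment_label(score))
```

The last line illustrates the weakness with new words: "meh" is not in the lexicon, so it contributes nothing and the line is scored as neutral.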

Text clustering

Text clustering groups the keywords to be analyzed into clusters and displays the result visually. SPSSAU provides two text clustering methods: word clustering and line clustering.

Social network diagram

The social network diagram shows the relationships between keywords. The relationship here is given by the 'co-word matrix', that is, how often two keywords appear together; the diagram presents the information in the co-word matrix visually.

Co-word matrix: this is mainly used to express the strength of association between keywords. It is a matrix of rows and columns in which each element represents the degree of association between two keywords. The greater the value of an element, the stronger the association between the two keywords, that is, the higher their co-occurrence frequency.
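One way to build such a matrix is sketched below. It assumes that "appearing at the same time" means appearing in the same row of data, which is an assumption of this example rather than a documented SPSSAU rule; the sample rows and the `co_word_matrix` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_word_matrix(rows):
    """Count, for each keyword pair, the rows in which both appear."""
    pair_counts = Counter()
    for row in rows:
        # Each unordered pair is counted once per row, stored as (smaller, larger).
        for a, b in combinations(sorted(set(row)), 2):
            pair_counts[(a, b)] += 1
    words = sorted({w for row in rows for w in row})
    matrix = [
        [0 if r == c else pair_counts[(min(r, c), max(r, c))] for c in words]
        for r in words
    ]
    return words, matrix


rows = [
    ["city", "development"],
    ["city", "construction", "development"],
    ["construction", "policy"],
]
words, matrix = co_word_matrix(rows)
print(words)
for word, counts in zip(words, matrix):
    print(word, counts)
```

The matrix is symmetric, and each nonzero cell corresponds to an edge in the network diagram, with the cell value as the edge weight.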

Social network diagram: in text analysis, a social network diagram is mainly used to reveal the relationships between entities in the text. It helps us better understand the theme and content of the text and discover hidden information and patterns in it.

LDA topic analysis

A topic model is a statistical model for discovering the topics that appear in a collection of documents. LDA can uncover the topic information hidden in text through unsupervised learning. LDA treats topics as a condensed summary of document content and models how the documents in a large corpus are generated: each document is regarded as a mixture of many topics, and the words that make up a topic are treated as unordered. This reduces the dimensionality of the documents, greatly reduces the complexity of the problem, and preserves semantic features. The results in SPSSAU are as follows (bubble size indicates the importance of a topic, and bar length indicates the weight of a word in expressing that topic):
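For readers who want to see the mechanics, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one common inference method. It is not claimed to be SPSSAU's implementation; the hyperparameters `alpha` and `beta`, the iteration count, and the sample documents are illustrative assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]   # tokens in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


docs = [
    ["city", "construction", "development", "city"],
    ["development", "construction", "housing"],
    ["insurance", "policy", "premium", "policy"],
    ["premium", "insurance", "claim"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

Word order inside each document never enters the calculation, only counts do, which is the unordered-words assumption described above.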

New word discovery

New word discovery identifies words that the dictionary cannot recognize. It relies on two key indicators: information entropy and mutual information. The greater the information entropy, the more readily a word combines with other words; the smaller the information entropy, the harder it is for the word to combine with other words.
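Both indicators can be estimated directly from raw text. The sketch below computes the entropy of the characters that follow a candidate string, and the pointwise mutual information between two adjacent pieces. The probabilities are rough estimates based on text length, and the function names and sample string are hypothetical.

```python
import math
from collections import Counter


def right_entropy(text, candidate):
    """Shannon entropy (bits) of the characters that follow `candidate`."""
    followers = Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if end < len(text):
            followers[text[end]] += 1
        start = text.find(candidate, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in followers.values())


def pmi(text, left, right):
    """Pointwise mutual information of `left` immediately followed by `right`."""
    pair = text.count(left + right)
    if pair == 0:
        return float("-inf")
    n = len(text)
    p_pair = pair / n
    p_left = text.count(left) / n
    p_right = text.count(right) / n
    return math.log2(p_pair / (p_left * p_right))


text = "abxabyabzcd"
print(right_entropy(text, "ab"))  # followed by x, y, z: varied neighbors
print(pmi(text, "a", "b"))        # "a" and "b" always occur together
```

A candidate whose pieces stick together (high mutual information) and whose neighbors vary freely (high entropy) behaves like an independent word, which is why the two indicators are used together.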

Stop words / sentiment words

Stop words: stop words are words that appear frequently in a text but contribute little to its theme and content. Removing them can improve the efficiency and accuracy of analysis.

Sentiment words: sentiment words are words that express feelings or emotional tendencies. Identifying and analyzing them helps us better understand the emotional content of the text.