
The origin of Chinese


Chinese, as the mother tongue of a nation, is the most widely spoken branch of the world's language families today. Its origins reach back to the era of the Yellow Emperor, and it matured only at the end of the 20th century, making it a language system with one of the earliest origins and the latest maturity. It is a symbol and achievement of Eastern civilization, and an important information carrier with which human beings name and define things precisely. The system includes thousands of commonly used characters and idioms and is an indispensable part of civilized society.

Since the topic of machine translation was first proposed in the early 1950s, natural language processing has a research and development history of at least fifty years. In the early 1990s, the research goal of NLP began to shift from small-scale restricted-language processing to large-scale real-text processing. It was the 13th International Conference on Computational Linguistics, held in Helsinki in 1990, that formally made this new goal a theme of the conference. Restricted-language analysis systems with only a few hundred dictionary entries and a few dozen grammar rules are usually dubbed "toys" by people in the industry, because they cannot be of any practical value. What governments, enterprises and computer users expect are practical systems that can handle large-scale real text, such as Chinese character input, speech dictation, text-to-speech conversion (TTS), search engines, information extraction (IE), information security and machine translation (MT).

With this milestone in view, the author listed four application prospects for large-scale real-text processing in 1993: a new generation of information retrieval systems; newspapers edited to customers' requirements; information extraction, that is, transforming unstructured text into structured information bases; and automatic tagging of large-scale corpora. Fortunately, all four directions have since achieved practical or commercial results.

Although the whole world takes large-scale real-text processing as the strategic goal of NLP, this does not mean abandoning natural language analysis technologies such as machine translation, voice dialogue and telephone translation, or theoretical research based on deep understanding in limited domains. Diversity of goals and tasks is a sign of a flourishing academic field. The real question is to think clearly about where the main battlefield of NLP lies and where our main forces should be deployed.

Is Chinese difficult?

When it comes to the important applications that enterprises and computer users face in Chinese information processing, such as Chinese character input and speech recognition, there seems to be no disagreement. But as soon as the methods or technical routes for realizing these applications are discussed, differences emerge immediately. The first view holds that the essence of Chinese information processing is Chinese understanding, that is, syntactic and semantic analysis of real Chinese text. Scholars holding this view believe that the probabilistic and statistical methods used so far in Chinese information processing have reached a dead end, and that to solve Chinese information processing at the level of understanding, another way must be found, namely semantics. The reason given is that Chinese differs from Western languages: its syntax is quite flexible, and it is essentially a paratactic language.

Contrary to this view, most of the application systems mentioned above (except MT) have in fact been implemented without syntactic or semantic analysis, so they do not involve "understanding" at all. If one insists on the word, it is only "understanding" in the sense confirmed by the Turing test.

The focus of the dispute between the two sides is method, but goal and method are usually inseparable. If we agree to take large-scale real-text processing as the strategic goal of NLP, then the theories and methods for achieving that goal must change accordingly. Coincidentally, in 1992 the 4th International Conference on Theories and Methods of Machine Translation (TMI-92), held in Montreal, announced that its theme was "Empiricism and Rationalism in Machine Translation". This was an open acknowledgment that, alongside the traditional NLP technology based on linguistics and artificial intelligence (rationalism), a new approach based on corpora and statistical language models (empiricism) was rapidly emerging.

NLP's strategic goal and the corresponding corpus-based methods come from the international academic arena, and Chinese information processing is no exception. The view that Chinese text processing is so difficult that it requires a different approach lacks a convincing factual basis. Take information retrieval (IR) as an example: its task is to find documents relevant to a user's query in a large document collection. How to represent the content of documents and queries, and how to measure the relevance between them, are the two basic problems that IR technology must solve; recall and precision are the two main indexes for evaluating an IR system. Since documents and queries are both expressed in natural language, this task illustrates well that the problems faced by Chinese and Western languages are actually very similar. In general, IR systems for all languages represent the content of documents and queries by the term frequency (tf) and inverse document frequency (idf) of the words they contain, so the approach is essentially statistical.
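As a concrete illustration of the tf·idf representation just described, the sketch below scores toy documents against a query by the cosine overlap of their weight vectors. The documents, query, and the particular idf smoothing are all invented for illustration, not taken from any specific IR system:

```python
import math
from collections import Counter

# Toy collection: three "documents" and one query, each a list of tokens.
docs = [
    "chinese character input method".split(),
    "speech recognition and text to speech conversion".split(),
    "statistical machine translation of chinese text".split(),
]
query = "chinese text processing".split()

n_docs = len(docs)
df = Counter()                      # document frequency of each term
for doc in docs:
    df.update(set(doc))

def idf(term):
    # Smoothed inverse document frequency; the +1 terms avoid division
    # by zero for terms unseen in the collection (an assumed scheme).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    # Weight of each term = (relative term frequency) * idf
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf(t) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query)
scores = [(cosine(q_vec, tfidf_vector(d)), i) for i, d in enumerate(docs)]
best_score, best_doc = max(scores)   # the third document shares two query terms
```

The point of the sketch is that nothing in it is language-specific: once text is tokenized (for Chinese, after word segmentation), the same weighting and ranking applies unchanged.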

Part-of-speech tagging offers another example. Let C = c1...cn and W = w1...wn denote the part-of-speech tag sequence and the word sequence respectively. The tagging task can then be regarded as the problem of finding the tag sequence that maximizes the following conditional probability given the word sequence W:

C* = argmax_C P(C|W)

   = argmax_C P(W|C) P(C) / P(W)

   ≈ argmax_C ∏i=1..n P(wi|ci) P(ci|ci-1)

P(C|W) denotes the conditional probability of the tag sequence C given the input word sequence W. The operator argmax_C means examining the candidate tag sequences C and returning the sequence C* that maximizes the conditional probability P(C|W); this C* is taken as the tagging result for W.

The second line of the formula follows from Bayes' rule. Since the denominator P(W) is constant for a given W and does not affect the maximization, it can be dropped from the formula. The formula is then approximated in two steps. First, an independence assumption is introduced: the probability of any word wi in the sequence is assumed to depend only on the tag ci of the current word, and not on the surrounding (contextual) tags. This gives the lexical probability

P(W|C) ≈ ∏i=1..n P(wi|ci)

Second, a bigram (binary) assumption is adopted: the probability of any tag ci is assumed to depend only on the immediately preceding tag ci-1. Hence:

P(C) ≈ ∏i=1..n P(ci|ci-1)

P(ci|ci-1) is the transition probability between tags, and this model is also called the bigram (binary) model.

Both probability parameters can be estimated from a corpus annotated with part-of-speech tags:

P(wi|ci) ≈ Count(wi, ci) / Count(ci)

P(ci|ci-1) ≈ Count(ci-1, ci) / Count(ci-1)
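The two counting formulas above amount to a few lines of code. The sketch below estimates them from a tiny hand-made tagged corpus; the corpus, the tag names, and the `<s>` sentence-start pseudo-tag are all invented for illustration:

```python
from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
]

emit = Counter()    # Count(wi, ci)
tag = Counter()     # Count(ci)
trans = Counter()   # Count(ci-1, ci)

for sent in corpus:
    prev = "<s>"    # pseudo-tag marking the sentence start
    for word, t in sent:
        emit[(word, t)] += 1
        tag[t] += 1
        trans[(prev, t)] += 1
        prev = t

def p_word_given_tag(w, t):
    # P(wi|ci) ≈ Count(wi, ci) / Count(ci)
    return emit[(w, t)] / tag[t]

def p_tag_given_prev(t, prev):
    # P(ci|ci-1) ≈ Count(ci-1, ci) / Count(ci-1)
    total = sum(c for (p, _), c in trans.items() if p == prev)
    return trans[(prev, t)] / total
```

In practice these maximum-likelihood estimates are smoothed to handle word–tag pairs never seen in the training corpus, but the raw counts already define the model.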

Incidentally, automatic part-of-speech tagging of Chinese and English implemented by scholars at home and abroad with bigram or trigram tag models has reached an accuracy of about 95%.
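Searching for the maximizing tag sequence C* need not enumerate all candidate sequences; the standard dynamic-programming solution is the Viterbi algorithm. Here is a compact sketch for the bigram model above, with an invented two-tag toy model (the probability tables are made-up numbers, not estimates from a real corpus):

```python
import math

tags = ["N", "V"]
# P(ci|ci-1), with "<s>" as the sentence-start state (toy numbers)
trans = {("<s>", "N"): 0.8, ("<s>", "V"): 0.2,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
# P(wi|ci) (toy numbers)
emit = {("fish", "N"): 0.6, ("fish", "V"): 0.4,
        ("swim", "N"): 0.1, ("swim", "V"): 0.9}

def viterbi(words):
    # delta[t] = best log-probability of any tag path ending in tag t;
    # psi stores, per position, the best predecessor of each tag.
    delta = {t: math.log(trans[("<s>", t)]) + math.log(emit[(words[0], t)])
             for t in tags}
    psi = []
    for w in words[1:]:
        step, back = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] + math.log(trans[(p, t)]))
            step[t] = (delta[best_prev] + math.log(trans[(best_prev, t)])
                       + math.log(emit[(w, t)]))
            back[t] = best_prev
        delta, psi = step, psi + [back]
    # Trace back the best path from the best final tag.
    last = max(tags, key=lambda t: delta[t])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))
```

Working in log-probabilities avoids numeric underflow on long sentences; the running time is linear in sentence length and quadratic in the size of the tagset.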

Why is evaluation the only criterion?

Only with evaluation can there be recognition. The only criterion for judging the quality of a method is comparable evaluation, not "self-evaluation" designed by the designers themselves, still less human intuition or someone's "foresight". In recent years, the field of language information processing has seen many examples of evaluation driving scientific and technological progress. The expert group on intelligent computers of the national "863 Plan" has organized many national evaluations on topics such as speech recognition, Chinese character recognition (printed and handwritten), automatic word segmentation, automatic part-of-speech tagging, automatic summarization, and machine translation quality, which have played a very positive role in promoting technical progress in these fields.

Internationally, two programs related to language information processing initiated by the US Department of Defense, TIPSTER and TIDES, are known as "evaluation-driven" programs. They not only provide large-scale training and test corpora, but also provide unified scoring methods and evaluation software for research topics such as information retrieval (TREC), information extraction (MUC) and named entity recognition (MET-2), ensuring that all research groups can discuss research methods under fair and open conditions and so promoting scientific and technological progress. The multilingual evaluation activities organized by TREC, MUC and MET-2 also strongly show that methods adopted and proven effective in other languages are applicable to Chinese as well, and that the performance indexes of application systems in different languages are roughly on the same level. Of course, every language has its own individuality, but individuality should not be used to deny the commonality of languages or to make wrong judgments in the absence of facts.

To promote the development of Chinese information processing, let us take up the weapon of evaluation and study applicable technologies in a down-to-earth manner, rather than taking things for granted. It is suggested that government research departments allocate at least 10% of a project's total funds to evaluating that project. Research results without unified evaluation are, after all, not fully credible.