Optical Character Recognition (OCR)

Optical character recognition (OCR) is the recognition of optically captured characters through image processing and pattern recognition, and it is an important branch of research and application in automatic recognition technology. As a software technology, it automatically recognizes characters and enters them into the computer, and it is the main software that accompanies a scanner. It belongs to the category of non-keyboard input and requires an image input device, usually a scanner.

At present, OCR mainly refers to [1] character recognition software. Before Ziguang began bundling Chinese recognition software with its scanners in 1996, scanners and OCR software were always sold separately, and in the early days professional OCR software cost more than the scanner itself. As scanner resolution has improved, OCR software has been upgraded continually, and scanner manufacturers now sell professional OCR software together with their own scanners. The rapid development of OCR technology is closely tied to the widespread use of scanners. In the last two years, as scanners have become common and OCR technology has improved, OCR has become an indispensable aid for most scanner users. Since the first generation of OCR products appeared in the early 1960s, half a century of continuous development has brought remarkable results in research on many kinds of OCR, including handwriting recognition. Requirements for OCR products have also shifted: recognition rate is no longer the only concern, and users now expect higher recognition speed, a friendly interface, simple operation, stability, adaptability, reliability, easy upgrades, and good pre-sales and after-sales service.

The concept of OCR was first proposed by the German scientist Tauschek in 1929, and the American scientist Handel later also proposed the idea of using technology to recognize characters. Casey and Nagy were the first to study the recognition of printed Chinese characters: in 1966 they published the first paper on Chinese character recognition, in which they recognized 1,000 printed Chinese characters by template matching.

As early as the 1960s and 1970s, countries around the world began to study OCR. Early research concentrated on recognition methods, and the characters recognized were only the digits 0 to 9. Take Japan, which also uses square-block characters: research on the basic theory of OCR recognition began there around 1960, at first with digits as the target. Between 1965 and 1970, some simple products began to appear, such as systems that recognized printed postal codes on mail and helped post offices sort letters by region. This is why postal codes have long been promoted by many countries as part of a written address.

In the early 1970s, Japanese scholars began to study Chinese character recognition and did a great deal of work. Research on OCR technology in China started later: recognition of digits, English letters and symbols began in the 1970s, and work on Chinese character recognition began at the end of the 1970s. By 1986, research on Chinese character recognition had entered a substantive stage, and many research institutions successively launched Chinese OCR products. Early OCR software could not meet practical needs because of shortcomings in recognition rate, productization and other factors. Hardware was also expensive and slow, so the technology had not reached a practical level, and only a few organizations, such as information departments and news and publishing units, used OCR software. After 1986, OCR research in China made great progress, with innovations in Chinese character modeling and recognition methods and fruitful results in system development and application. Since the 1990s, with the wide use of flatbed scanners and the spread of information automation and office automation in China, OCR technology has advanced further, and its recognition accuracy and speed now meet users' requirements.

Because scanners are now so widely used, OCR software only needs to provide an interface to the scanner and use the scanner's driver software. OCR software therefore consists mainly of four parts: an image processing module, a layout segmentation module, a character recognition module and a text editing module.

1. Image processing module

The image processing module mainly provides document scanning, image scaling and image rotation. Once the manuscript has been scanned, it becomes an image file; the image processing module can then enlarge the image and remove stains and scratches. If the image is not straight, it can be rotated manually or automatically, which creates better conditions for character recognition and raises the recognition rate.

2. Layout segmentation module

The layout segmentation module mainly covers dividing the layout and adjusting that division, that is, layout understanding, character segmentation, normalization and so on. Segmentation can be automatic or manual. Its purpose is to tell the OCR software how to separate body text, tables and other elements so that they can be processed separately, and in what order they should be recognized.

3. Character recognition module

The character recognition module is the core of OCR software. It reads the input Chinese characters, but it cannot take in several lines at a glance, so the text must first be cut into lines. Chinese text is then usually recognized character by character: each character is segmented and then normalized. The module performs recognition by comparing features extracted from different samples of each Chinese character, automatically flags suspicious characters, and can use the preceding and following context to associate characters.
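The line-by-line and character-by-character cutting described above is often done with projection profiles. The following is a minimal sketch under that assumption, not the method of any particular OCR product: the page is a hypothetical list of rows with 1 for ink and 0 for background, and all function names are illustrative.

```python
from typing import List, Tuple


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) for each run of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ended at a blank position
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Cut the page into text lines using the ink count of every row."""
    return ink_runs([sum(row) for row in image])


def segment_chars(image: List[List[int]], top: int, bottom: int) -> List[Tuple[int, int]]:
    """Cut one text line into characters using the ink count of every column."""
    width = len(image[0])
    profile = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return ink_runs(profile)


# Hypothetical 6 x 7 page with two text lines of two "characters" each.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
first_line_chars = segment_chars(page, *lines[0])
second_line_chars = segment_chars(page, *lines[1])
```

Chinese characters with a left-right structure can contain internal blank columns, so a practical segmenter would also need width heuristics; this sketch does not attempt that.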

4. Text editing module

The text editing module is used to revise and edit the text recognized by OCR. Where the system suspects a recognition error, the text is displayed in a conspicuous red or blue, similar characters are offered as choices, and an editor can be selected for output.

The purpose of an OCR system is simple: to convert the image so that graphics are kept as they are, while the data in tables and the characters in the image become computer text. This reduces the storage needed for image data, lets the recognized text be reused and analyzed, and of course saves the labor and time of keyboard entry.

From image to output, the process runs through image input, image preprocessing, character feature extraction, and comparison and recognition; finally, wrongly recognized characters are corrected manually and the result is output.

1 Image input

The material to be processed by OCR must be transferred to the computer through an optical device, such as an image scanner, a fax machine or any photographic equipment. As technology has advanced, input devices such as scanners have become more refined, smaller, lighter and higher in quality, which greatly helps OCR. Higher scanner resolution makes the image clearer, and higher scanning speed makes OCR processing more efficient.

Image preprocessing: Image preprocessing is the most important module in an OCR system. It covers everything from obtaining a binary (either black or white), gray-scale or color image to isolating the image of each individual character. This includes image processing such as normalization, noise removal and image correction, and document preprocessing such as separating graphics from text and separating text lines and characters. The theory and techniques of image processing are mature, so many libraries are available commercially or online. Document preprocessing depends more on each developer's own know-how: the image must first be divided into picture, table and text regions, and even the typesetting direction, the outline and the body text of the article must be distinguished, so that the size and font of the text can be judged to match the original document.
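As one small illustration of the noise removal step mentioned above, the sketch below deletes isolated ink pixels from a binary image. It is only an assumed, simplified filter (real systems use more careful methods), and the image layout (rows of 0 and 1, with 1 for ink) and the function name are hypothetical.

```python
from typing import List


def remove_isolated_pixels(image: List[List[int]]) -> List[List[int]]:
    """Return a copy of a binary image with ink pixels that have
    no ink among their 8 neighbours set to background."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]    # never modify the input in place
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Count ink in the 3 x 3 neighbourhood, excluding the pixel itself.
            neighbours = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0          # a lone speck is treated as noise
    return cleaned


# Hypothetical scan: a 2 x 2 stroke fragment plus one speck in the corner.
noisy = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = remove_isolated_pixels(noisy)
```

At low resolutions a single pixel can be a legitimate dot or stroke tip, so a filter like this should be applied with care.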

Character feature extraction: In terms of recognition rate alone, feature extraction can be called the core of OCR. Which features are used and how they are extracted directly affect the quality of recognition, so early OCR research produced many reports on feature extraction. Features are the stakes on which recognition depends, and they fall broadly into two categories. The first is statistical features, such as the ratio of black to white points in a character region: when the character is divided into several regions, the black/white ratios of the regions together form a numerical vector in a feature space, and basic mathematics is sufficient for comparing such vectors. The second is structural features, such as the number and position of stroke endpoints and intersections obtained after thinning the character image, or features based on stroke segments, which are matched with special comparison methods. Most online handwriting input software on the market recognizes characters with this structural approach.
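The statistical feature described above can be sketched as a simple zoning computation. This is an illustrative assumption rather than a specific product's method: the glyph is a hypothetical binary grid (1 for a black point), the function name is invented, and the sketch uses the fraction of black points per region instead of the black/white ratio so that an all-black region does not cause a division by zero.

```python
from typing import List


def zoning_features(glyph: List[List[int]], zones: int = 2) -> List[float]:
    """Split a binary glyph into zones x zones regions and return the
    fraction of black points in each region, row by row."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zr in range(zones):
        top, bottom = zr * height // zones, (zr + 1) * height // zones
        for zc in range(zones):
            left, right = zc * width // zones, (zc + 1) * width // zones
            black = sum(glyph[y][x] for y in range(top, bottom) for x in range(left, right))
            area = (bottom - top) * (right - left)   # assumes the glyph is at least zones x zones
            features.append(black / area)
    return features


# Hypothetical 4 x 4 glyph shaped like the letter L.
glyph_l = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
features = zoning_features(glyph_l)   # one number per region: a 4-dimensional vector
```

Characters are normally scaled to a common size before this step, so that vectors from different samples are comparable.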

Comparison database: Once the features of an input character have been computed, whether statistical or structural, there must be a comparison database, or feature database, to compare them against. The database should cover the entire character set to be recognized, with feature groups obtained by the same feature extraction method used for the input characters.

2 Comparison and recognition

This is the module where mathematical theory comes fully into play. Different mathematical distance functions are chosen to suit different features. Well-known comparison methods include Euclidean space comparison, relaxation comparison and dynamic programming (DP), as well as methods built on neural networks and on HMMs (hidden Markov models). To make recognition results more stable, some have also proposed so-called expert systems, which exploit the differences and complementarity of the various feature comparison methods to give results with particularly high confidence.
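Of the methods listed above, Euclidean comparison is the simplest to show. The sketch below ranks the entries of a small feature database by Euclidean distance to an unknown character's feature vector. The database contents, the labels and the function names are hypothetical values chosen for illustration only.

```python
import math
from typing import Dict, List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(
    features: List[float], database: Dict[str, List[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (label, distance) pairs, closest match first."""
    scored = [(label, euclidean(features, ref)) for label, ref in database.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


# Hypothetical feature database: one 4-dimensional vector per character.
database = {
    "L": [0.5, 0.0, 0.75, 0.5],
    "T": [0.75, 0.75, 0.25, 0.25],
    "I": [0.5, 0.5, 0.5, 0.5],
}
unknown = [0.5, 0.1, 0.7, 0.5]       # features of a slightly noisy "L"
candidates = rank_candidates(unknown, database)
best_label = candidates[0][0]
```

Keeping the runner-up candidates, not just the best match, is what later allows post-processing and manual correction to offer alternatives.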

Text post-processing: Because the recognition rate of OCR cannot reach 100%, and in order to strengthen the correctness and confidence of the comparison, functions that detect errors and even help correct them have become an essential part of an OCR system. Word post-processing is one example: using each recognized character and its similar candidate characters, the most plausible word is found from the characters recognized before and after it, and the result is corrected accordingly.

Lexicon: A word dictionary built for use in word post-processing.
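Word post-processing with a lexicon can be sketched as follows. This is a simplified assumption about how such a step might work, not a description of a particular product: each character position carries a ranked candidate list (best guess first), and the function picks the lexicon word that needs the fewest demotions from the top guesses. The lexicon, the candidate lists and the function name are hypothetical.

```python
from itertools import product
from typing import List, Set


def correct_word(candidates: List[List[str]], lexicon: Set[str]) -> str:
    """Choose the lexicon word built from the candidates with the lowest
    total rank; fall back to the top guesses if no lexicon word fits."""
    best_word, best_cost = None, None
    # Try every combination of candidate ranks (exponential, fine for short words).
    for ranks in product(*(range(len(c)) for c in candidates)):
        word = "".join(c[r] for c, r in zip(candidates, ranks))
        cost = sum(ranks)                  # 0 means every top guess was kept
        if word in lexicon and (best_cost is None or cost < best_cost):
            best_word, best_cost = word, cost
    if best_word is None:
        return "".join(c[0] for c in candidates)
    return best_word


lexicon = {"clear", "dear", "scan", "text"}
# The recognizer's top guess reads "c1ear"; "l" is the second candidate for position 2.
candidates = [["c", "e"], ["1", "l", "i"], ["e", "c"], ["a", "o"], ["r", "n"]]
corrected = correct_word(candidates, lexicon)
unchanged = correct_word([["x", "k"], ["q"]], lexicon)   # no lexicon match
```

For long words or large candidate lists, a practical system would prune the search instead of enumerating every combination.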

3 Manual correction

Manual correction is the last checkpoint of OCR. Up to this point the user may only have been holding the mouse and following the pace set by the software, or simply watching; from here on, the user may have to spend real attention and time correcting, or even hunting for, possible OCR errors. Good OCR software needs not only a stable image processing and recognition core to lower the error rate; the workflow and functions of manual correction also affect the overall efficiency of OCR. Side-by-side comparison of the text image with the recognized characters, sensible placement of information on the screen, a candidate list for each recognized character, a reject function for characters that cannot be recognized, and special marking of characters flagged as doubtful by text post-processing are all designed so that the user touches the keyboard as little as possible. Of course, text that the system does not flag is not guaranteed to be correct, just as staff who type everything by keyboard also make mistakes. How much further checking is done at that point depends entirely on the user's needs.
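The reject function and the marking of doubtful characters mentioned above can be sketched as a simple decision on the ranked candidate list. The two thresholds and the status names below are hypothetical values chosen for illustration; a real system would tune them on its own data.

```python
from typing import List, Tuple


def review_status(
    candidates: List[Tuple[str, float]],
    reject_distance: float = 0.5,
    min_margin: float = 0.1,
) -> str:
    """Classify one recognized character for manual correction.

    candidates: (label, distance) pairs sorted by ascending distance.
    """
    best_distance = candidates[0][1]
    if best_distance > reject_distance:
        return "rejected"                  # nothing in the database is close enough
    if len(candidates) > 1 and candidates[1][1] - best_distance < min_margin:
        return "suspicious"                # the runner-up is almost as good
    return "accepted"


clear_case = review_status([("L", 0.11), ("I", 0.45)])
confusable = review_status([("O", 0.20), ("0", 0.22)])
unknown_shape = review_status([("X", 0.90), ("K", 1.20)])
```

Only the "rejected" and "suspicious" characters would then be highlighted for the user, which keeps keyboard work to a minimum.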

4 Result output

Some users only need a text file to reuse as part of another text, so a plain text file is enough; some want output that looks exactly like the input document, so the original layout must be reproduced; and some care about the text in tables, so the output must work with software such as Excel. Whatever the case, only the format of the output file changes. If the output must be restored to exactly the same format as the original, manual typesetting is needed after recognition, which is time-consuming and laborious.

1 Data input

Digital input of documents and materials is generally divided into:

1. Pure image mode.

2. Directory text, body image mode.

3. Full-text mode.

4. Full-text indexing mode: a mixture of text mode and image mode.

2 Recognition process

Book level: Chinese or English; simplified or traditional characters.

Layout level: vertical or horizontal text; whether there are columns.

Segmentation: line segmentation and character segmentation.

Recognition: the actual OCR step, in which image information is restored to text information.

Post-processing: manual intervention, which is mainly concentrated in the first four stages.

3 Factors that determine recognition results

1. Image quality. A resolution of 150 dpi or above is generally recommended.

2. Color. Color images are generally recognized poorly, and black-and-white images well, so black-and-white TIFF is the recommended format for OCR.

3. Font. This is the most important factor. Handwriting has a very low recognition rate.

The error rate of OCR for simplified Chinese is about three in ten thousand. If higher precision is needed, more manual intervention is required. Traditional Chinese characters are harder to recognize because traditional character sets are inconsistent (the fonts of the Republic of China period do not match the current traditional character set). With manual intervention, accuracy can exceed 90% (given clear pictures and text).

1. Setting the resolution is an important prerequisite for character recognition. Generally speaking, the more image information the scanner provides, the more easily the recognition software can reach a result. However, a higher scan resolution does not always give higher recognition accuracy. A resolution of 300 dpi or 400 dpi suits most documents. Consider the type size of the original being scanned, and do not set the scan resolution above the optical resolution of the scanner, otherwise the loss outweighs the gain. The following typical settings are for reference only.

(1) For type sizes No. 1, 2 and 3, 200 dpi is recommended.

(2) For type sizes Small No. 4 and No. 5, 300 dpi is recommended.

(3) For type sizes Small No. 5 and No. 6, 400 dpi is recommended.

(4) For type sizes No. 7 and 8, 600 dpi is recommended.
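The reason smaller type calls for a higher resolution can be shown with a little arithmetic: one typographic point is 1/72 inch, so a character's height in pixels is its point size times the dpi divided by 72. The sketch below applies that formula. The 10.5 pt and 5.5 pt sizes and the target of 40 pixels are illustrative assumptions, not values taken from the settings above.

```python
import math

POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of type of the given point size at a given dpi."""
    return point_size * dpi / POINTS_PER_INCH


def min_dpi_for_height(point_size: float, target_px: float) -> int:
    """Smallest dpi at which the type reaches target_px pixels in height."""
    return math.ceil(target_px * POINTS_PER_INCH / point_size)


body_text_300 = glyph_height_px(10.5, 300)    # assumed 10.5 pt body text at 300 dpi
body_text_150 = glyph_height_px(10.5, 150)    # the same text at 150 dpi
small_type_dpi = min_dpi_for_height(5.5, 40)  # assumed 5.5 pt type, hypothetical 40 px target
```

Halving the resolution halves the number of pixels available for each stroke, which is why very small type needs the higher settings.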

2. Adjust brightness and contrast properly when scanning so that the scanned document is cleanly black and white. This is the key to the recognition rate. The guiding principle for setting brightness and contrast is that the strokes of the Chinese characters in the scanned image should look thin but remain unbroken. Before recognition, inspect the quality of the characters in the scanned image. If the image has black spots, or the strokes are thick, dark and blurred together, the brightness value is too low: increase it and try again. If the strokes are uneven or broken, or the outlines of characters are badly incomplete, the brightness value is too high: reduce it and try again.

3. Choose your software carefully. Good OCR software that suits your needs is the basis of character recognition. As a rule, avoid the OEM software supplied with the scanner: OEM OCR software has few functions and poor results, and some versions cannot recognize Chinese at all.

Choose a separate imaging program as well. First, OCR software does not recognize every scanner. Second, and more important, images acquired through the scanning interface of an imaging program are easy to process afterwards.

4. If the text carries formatting such as bold, italics or first-line indents, some OCR software will not recognize it, and the formatting will be lost or turned into garbled characters. If you must scan formatted text, first confirm that your recognition software supports formatted text. Alternatively, turn off style recognition so that the software concentrates on finding the correct characters and ignores typeface and formatting.

5. When scanning newspapers or other semi-transparent originals, the characters on the reverse side show through the paper and mix with those on the front, which seriously hinders recognition. In this case, simply lay a sheet of black paper over the back of the original when scanning. This raises the scanning contrast, reduces the influence of the show-through from the back, and improves recognition accuracy.

6. Scanned text is usually black and white, but it often pays to set the scanning mode to grayscale. Especially when the original is of poor quality, scanning in grayscale, processing the image in the scanning software and then recognizing it gives better accuracy. Note that OCR software can determine the binarization threshold by itself, and a difference of a few percentage points in the threshold can affect recognition. The resulting image file is of course much larger than a black-and-white file. When scanning a large batch of originals, test a sample first to find the best threshold percentage.
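One well-known way for software to choose a binarization threshold by itself is Otsu's method, which picks the gray level that best separates dark and light pixels. The sketch below is a plain implementation for 8-bit gray values; whether a given OCR package uses this method is not stated here, and the sample image and names are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0-255) maximizing between-class variance
    when pixels <= t are treated as one class and pixels > t as the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight_dark, sum_dark = 0, 0
    best_t, best_variance = 0, -1.0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue                       # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                          # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Hypothetical 4 x 4 grayscale scan: dark ink in the middle, light paper around it.
gray = [
    [200, 210, 205, 220],
    [200,  20,  10, 210],
    [215,  15,  20, 205],
    [210, 200, 220, 215],
]
threshold = otsu_threshold([p for row in gray for p in row])
binary = [[1 if p <= threshold else 0 for p in row] for row in gray]   # 1 = ink
```

On a batch of similar originals, the automatically chosen value can serve as a starting point that is then adjusted by a few points while checking the recognition results.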

7. For an original that mixes pictures and text, first find out whether your recognition software can analyze pictures and text automatically. If it can, the OCR software will work out the content, position and order of the text blocks by itself during recognition, and the text parts will be recognized normally in the labeled order.

8. Selecting the scanning areas manually gives better recognition. After setting the parameters, run a preview and then select the areas to scan. Do not select a whole article as a single area: for visual effect, articles today are mostly typeset with pictures and text mixed, and scanning them as one image harms OCR recognition. Instead, divide the layout into several areas as the page requires. How should the areas be divided? Within each area the typeface and type size should be consistent, there should be no graphics or images, and every line should have the same width; where line lengths differ, subdivide further. In general, at most 10 selections can be scanned at one time. Set the recognition order of the areas sensibly for each case. Do not dismiss this step as tedious; it is an effective way to raise the recognition rate. Note that recognition areas must not intersect, and do not start recognition until everything looks right. Done this way, the recognition rate is generally above 95%. After proofreading the misrecognized characters, you can move the text into a word processor for further work.
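The two rules above (no intersecting areas, at most 10 selections) are easy to check mechanically before recognition starts. The following sketch assumes each area is a rectangle given as (left, top, right, bottom) in pixels with the right and bottom edges exclusive; that representation and the function names are illustrative, not part of any particular scanning program.

```python
from itertools import combinations
from typing import List, Tuple

Region = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive
MAX_REGIONS = 10                     # the per-scan limit mentioned in the text


def regions_overlap(a: Region, b: Region) -> bool:
    """True if the two rectangles share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_overlaps(regions: List[Region]) -> List[Tuple[int, int]]:
    """Return the index pairs of all regions that intersect."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(regions), 2)
        if regions_overlap(a, b)
    ]


def selection_is_valid(regions: List[Region]) -> bool:
    """Check both rules: no intersections and no more than MAX_REGIONS areas."""
    return len(regions) <= MAX_REGIONS and not find_overlaps(regions)


# Hypothetical selection: regions 0 and 1 only touch, region 2 cuts into both.
regions = [(0, 0, 100, 50), (0, 50, 100, 120), (90, 40, 150, 80)]
overlaps = find_overlaps(regions)
ok_after_fix = selection_is_valid(regions[:2])
```

Regions that merely share an edge are not counted as intersecting here, which matches splitting a column into stacked blocks.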

9. Place the original in the center of the scanning start line to minimize the distortion caused by the optical lens, and take care not to damage the scanner glass. If the text is tilted, or the original typesetting is irregular, the image must be straightened with a rotation tool after scanning; otherwise the OCR software will treat horizontal strokes as oblique strokes and recognition accuracy will drop sharply. However, rotating with a tool lowers image quality and makes recognition harder, so it is best to place the original as straight as possible in the first place.
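Whether a scan is tilted enough to need correction can be estimated from the image itself. A common approach, sketched here as an assumption rather than as any product's actual algorithm, tries a range of candidate angles and keeps the one whose row projection is most sharply peaked. The input is a hypothetical list of (x, y) ink pixel coordinates with y increasing downward; the angle range and step are illustrative defaults.

```python
import math
from typing import List, Tuple


def estimate_skew(ink: List[Tuple[int, int]], max_angle: float = 5.0, step: float = 0.5) -> float:
    """Return the angle in degrees (positive = lines descend to the right
    in image coordinates) that makes the text lines most nearly horizontal."""
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        rad = math.radians(angle)
        cos_a, sin_a = math.cos(rad), math.sin(rad)
        profile = {}
        for x, y in ink:
            # Row index of the pixel after undoing a tilt of `angle` degrees.
            row = round(y * cos_a - x * sin_a)
            profile[row] = profile.get(row, 0) + 1
        # Sharp, well separated text lines give a high sum of squared row counts.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic test page: four thin text lines tilted by 2 degrees.
true_angle = 2.0
slope = math.tan(math.radians(true_angle))
ink = [(x, round(y0 + x * slope)) for y0 in (10, 30, 50, 70) for x in range(200)]
estimated = estimate_skew(ink)
```

Because software rotation degrades the image, an estimate like this is perhaps most useful as a signal to reposition the original and scan again.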

10. Preview the whole page first and select the area to scan. Then use the zoom preview tool to enlarge a small part to full screen, check the contrast and darkness of the text, and adjust the threshold accordingly. The goal is text that is clear, neither too heavy (strokes clumping together) nor too light (strokes breaking); this is usually reached at a threshold of about 80. Then scan.

11. Use tools to erase stains from the image, along with illustrations and rule lines in the original layout that do not need to be recognized, so that the text image contains nothing but text. This can greatly improve the recognition rate and reduce the correction work after recognition.

12. When scanning originals with poor print quality, such as newspapers, the result will not be cleanly black and white: there will be many black specks, and the strokes of the characters will stick together. Both are serious problems for Chinese character recognition and greatly reduce its accuracy. To get a better result, adjust the tone carefully and scan repeatedly until the image is satisfactory. In addition, because newspaper is thin and mostly of low quality, the scanner lid cannot press it completely flat (a gap remains), so newspapers generally scan and recognize less well than magazines. A workaround is to lay one or two 16K-format magazines on top of the newspaper, which gives reasonably good results.