Can the original paper documents be destroyed after digital conversion?

(I) Digital processing of paper documents
There are two main methods for digitizing paper documents: direct scanning and microfilm conversion.
1. Direct scanning
In direct scanning, a scanner optically scans the original paper document. The image information is passed to a photoelectric converter, which turns it into an analog electrical signal; this signal is then converted into a digital signal and transferred to computer memory through a computer interface. Direct scanning can be carried out in two ways:
(1) Scan the paper document, run optical character recognition (OCR) software on the scan, and generate a text file. The advantages of this kind of digital file are its small size, its support for full-text retrieval by computer, and the ease of extracting and editing content when the file is used. Its disadvantages are that it cannot preserve original information such as the layout, signatures and seals of the original document; that OCR accuracy is sometimes low and errors are hard to check and correct, so digitization efficiency is very low; and that the result in effect compromises the authenticity of the original document.
(2) Scan the paper document to produce a digital image file. The advantages of an image file are that it preserves the original appearance of the content and layout, and that digitization is fast. Its disadvantages are that it cannot be searched in full text, its text cannot be edited, and it takes up a large amount of storage space.
The strengths and weaknesses of these two approaches are complementary. The advantages of both can now be combined in a single document by producing a double-layer PDF.
The production method is as follows: scan the original paper document into a digital image file, convert the image into a text file, and then place the two files, which have the same content, in a single PDF, with the image in the upper layer and the text hidden in the layer beneath it. When this file is consulted, the reader sees the original image in the upper layer and can also run full-text searches against the hidden text.
2. Microfilm conversion
Microfilm conversion applies to documents that have already been microfilmed: special scanning equipment (a microfilm scanner) converts the analog images on the microfilm into digital images. Compared with direct scanning, microfilm scanning is more economical, simpler and more efficient. However, it presupposes that the paper documents have already been microfilmed. Note that after the microfilm has been scanned, the original microfilm should be kept together with the paper documents, and neither may be destroyed without authorization. The record is thus held in a "three-set" storage arrangement. Although microfilm is not as easy to store, copy, search and disseminate as digital files, as an analog medium it offers human readability and good stability, which digital files lack, and a compactness that paper files lack. It should therefore be an important supplementary form of archival information resources.
(II) Digitization workflow for paper archives
Digitizing paper archives is a complicated process. Its basic stages are: file arrangement, file scanning, image processing, image storage, cataloging and database construction, data linking, data acceptance, data backup and results management.
1. File arrangement
Before scanning, the paper files should be arranged in accordance with records management requirements, following the steps below, and marked where necessary, to ensure the quality of digitization.
(1) Document retrieval. Paper documents are usually digitized in large batches. The documents to be digitized should first be moved from the archive repository to a temporary transit storeroom; the digitization staff then collect them from the transit storeroom for processing. For either transfer, the digitization staff must apply according to the predetermined plan; after approval, both parties hand over the files, register the transfer and complete the handover formalities.
(2) Catalog data preparation. Standardize the catalog content in accordance with the Archives Description Rules (DA/T 18-1999), including the description items, field lengths and content requirements. Then establish a catalog database for retrieving the digitized files. The database can build on the existing catalog of the paper files. Errors or irregularities in the existing catalog, for example in the title, file name, responsible person, start and end page numbers, or page count, should be corrected. If no machine-readable catalog database exists for the paper files, the catalog should be re-entered according to the description rules.
(3) Unbinding. Before unbinding, files can be labeled with barcodes one by one, so that scanned files can be tracked accurately and efficiently in later steps by reading the barcodes. Barcodes also make the future management of lending and use more convenient. Staff then check the files volume by volume and page by page, register missing content, omissions, out-of-order page numbers, and valuable or damaged documents, and hand these over to the custodial institution for proper handling. Files whose binding would interfere with scanning should be unbound. When removing the binding, take care not to damage the documents.
After the binding is removed, the original documents should be kept in sequence and held together with clips so that they do not scatter. Files that are old, in poor paper condition or difficult to unbind can be scanned on a zero-margin scanner.
(4) Separating items to be scanned from items not to be scanned. As required, separate the items to be scanned from those not to be scanned within the same file, and remove irrelevant and duplicate items.
(5) Page repair. The condition of the paper affects both the choice of scanner and the quality of the scan. Documents with serious damage, uneven creases or blurred handwriting must therefore be registered separately and treated. For example, creased pages can be flattened by ironing; contaminated paper can be gently brushed with a soft brush in a ventilated environment to remove loose dust, dirt or mold; damaged and incomplete documents must be repaired.
(6) Registration. Hand the arranged original documents over to the scanning operator, prepare and fill in the registration form for the digitization of paper documents, and record in detail the starting page number and page count of each document.
(7) Rebinding and return. After scanning is completed, files that were unbound should be rebound in accordance with storage requirements. When rebinding, keep the order of the documents unchanged, and make sure the work is safe, accurate and free of omissions. Replace severely damaged covers and boxes. The person doing the binding affixes a dedicated seal and a dedicated digitization seal to the rebound file. Once digitization and rebinding are complete, the files should be counted. When the count has been verified, return the files to the archives management department and complete the return formalities.
2. File scanning
(1) Choice of scanning equipment. Select a scanner of suitable specification for the document format (A4, A3, A0, etc.).
Large-format documents can be scanned on a wide-format scanner, microfilmed and then digitized with a film conversion device, or scanned in sections on a smaller scanner and stitched together afterwards. Documents in poor condition, documents on paper that is too thin, too soft or too thick, and documents with multi-color pages can be scanned on an ordinary flatbed scanner. A4 and A3 documents in good condition can be scanned on a high-speed scanner to improve efficiency. Files that are not suitable for unbinding can be scanned on a zero-margin scanner.
(2) Choice of scanning color mode. There are generally two kinds of color mode. The first produces a black-and-white binary image. Such an image has only two levels, black and white, with no intermediate gray. It has sharp contrast, clear text and a small file size, and suits text or line drawings with clear handwriting and lines. The second produces a continuous-tone still image, which may be grayscale or color. A grayscale image is made up of shades of gray ranging from the darkest black to the brightest white. The gray level, also called the tonal level, expresses the gradation of an image from light to dark. The more gray levels, the richer the gradation and the larger the file. Grayscale mode suits black-and-white photographs and image documents; the number of levels should be moderate, just enough not to impair image quality. In color mode, the number of colors determines the range of colors that can be represented. The more colors, the more lifelike the image and the larger the file. Here too the number of colors should be moderate; more is not always better. Color mode suits documents with red letterheads or seals on the page, and color photographs. Files that are to be preserved permanently or long term, or that will be transferred to the national archives, should generally be scanned in color mode.
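The relationship between color mode and file size described above can be illustrated with a rough calculation. The sketch below assumes 1 bit per pixel for binary images, 8 bits (256 gray levels) for grayscale and 24 bits for color, an A4 page of 8.27 × 11.69 inches, and a 200 dpi scan; these figures and the function name are illustrative assumptions, and real files will be smaller once compressed.

```python
import math

# Assumed bits per pixel for each scanning mode (illustrative values).
BITS_PER_PIXEL = {"binary": 1, "gray": 8, "color": 24}


def uncompressed_size_bytes(width_in, height_in, dpi, mode):
    """Estimate the raw, uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return math.ceil(total_bits / 8)


if __name__ == "__main__":
    # One A4 page (8.27 x 11.69 inches) scanned at 200 dpi in each mode.
    for mode in BITS_PER_PIXEL:
        size = uncompressed_size_bytes(8.27, 11.69, 200, mode)
        print(f"{mode:>6}: {size / 1_000_000:.2f} MB before compression")
```

Under these assumptions a color page needs 24 times the raw storage of a binary page, which is why the choice of mode should match what the document actually requires.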
(3) Choice of scanning resolution. In principle, the resolution should be chosen so that the scanned image is clear and complete and its usability is not impaired. For black-and-white binary, grayscale and color modes alike, a resolution of at least 200 dpi is generally recommended. In special cases, such as small, dense or unclear text, the resolution can be raised as appropriate. For documents that require OCR of Chinese characters, 300 dpi is recommended.
(4) Optical character recognition. OCR technology is now fairly mature, and most scanners come with their own OCR software, which is convenient to use. However, OCR accuracy is often unsatisfactory, which affects retrieval, and correcting the errors by hand is tedious. Improving the OCR recognition rate is therefore an important issue in digitizing archives. In practice, attention to the following points can noticeably improve it.
First, choose an appropriate scanning resolution. Too low a resolution often reduces the recognition rate, while too high a resolution makes the image file too large and slows recognition. The operator can judge whether a resolution is acceptable from the proportion of characters marked in red as suspect in the OCR output (for example, less than 3%), and decide on that basis whether to scan at this resolution for OCR.
Second, scan in black-and-white binary mode where possible. OCR software usually accepts grayscale or black-and-white binary images, but not color. If the print quality of the original is good, grayscale mode can be used; otherwise black-and-white binary mode should be used.
When scanning, the black-and-white threshold can be adjusted manually. If the character outlines in the binary image are incomplete, raise the threshold as appropriate. If the outlines are too thick, the image carries redundant information and the threshold can be lowered. A binary image adjusted in this way gives better OCR results.
Third, pay attention to skew correction before recognition. OCR tolerates a slightly tilted page, but excessive tilt lowers the recognition rate. To correct it, click the skew correction button in the scanning software, and the software will straighten the image automatically before recognition.
Fourth, pre-process the original before recognition. Remove stray marks and pictures: stray marks interfere with text recognition, and pictures cannot be recognized and disturb the OCR text segmentation. For originals laid out in columns, it is advisable to set the column regions manually, that is, to mark the text to be recognized with several selection boxes before running OCR.
Fifth, use an appropriate recognition approach. Originals that mix simplified and traditional Chinese, or Chinese and English, often have a low recognition rate. If the different scripts occur in separate blocks, image-processing software can be used to group blocks of the same script into separate files, and each file can then be recognized with the OCR setting for that script.
(5) Scan registration. Fill in the registration form for the digitization process carefully, record the number of pages scanned, and check whether the number of pages actually scanned for each document matches the page count recorded during file arrangement.
If there is any inconsistency, the specific reasons and the action taken should be noted.
3. Image processing
After scanning, the images must be processed as required to correct deviations between the scan and the original and to make the scanned files clearer and more consistent. Image processing generally includes the following:
(1) Image quality check. Check the image for skew, sharpness and distortion. Images that do not meet the quality requirements should be reprocessed. If an image is incomplete or cannot be read clearly because of an operating error, the page should be scanned again. If pages were missed during scanning, scan them promptly and insert the images in the correct place. If the order of the scanned images differs from that of the original file, adjust it promptly. Fill in the relevant forms carefully and record the quality inspection results and the action taken.
(2) Skew correction. Skewed images should be straightened until no tilt is visible. Images in the wrong orientation should be rotated to match normal reading habits.
(3) Despeckling. Remove black spots, black lines, black frames, black edges and other artifacts that affect image quality. Take care not to destroy the original information in the document while doing so.
(4) Image stitching. Where a large-format document has been scanned in sections, the resulting images should be stitched into one complete image to ensure the integrity of the digital image of the document.
(5) Cropping. Crop images scanned in color mode to remove surplus white margins, which effectively reduces the image file size and saves storage space.
The skew correction, despeckling and other operations above can be done manually by eye.
Alternatively, specially designed software can be configured in advance and left to process the images automatically. Computer processing is efficient, but it is not as flexible as manual processing. For example, if the speck-size setting is poorly chosen, the computer may remove some punctuation marks as specks. Image processing therefore needs a combination of manual and automatic work.
4. Image storage
(1) Storage format. Image files scanned in black-and-white binary mode are usually stored in TIFF (G4) format. Image files scanned in grayscale and color modes are usually stored in JPEG format. The compression ratio should be chosen to minimize storage while keeping the scanned image readable. Scanned images provided for online query can also be stored as CEB, PDF or other file formats.
(2) Image file naming. Digital files should be named after the archival file number or a unique identifier. If they are named by file number and the archives are arranged by volume, the file number should be compiled according to the Rules for Compilation of File Number (DA/T 13-1994), and it is suggested that the file category code be added as a sub-item of the category number. If the archives are arranged by item, the file number can adopt the structure "fonds number-file category code year-storage period-institution (problem) code-file number-part number".
5. Catalog database construction
(1) Choice of data format. A common data format should be chosen for the catalog database, one that can exchange data directly or indirectly through XML documents. Data can be entered through a dedicated archives management system or scanning management software, or entered into a purpose-designed catalog spreadsheet in Excel and then imported into the archives management system.
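As an illustration of exchanging catalog data through XML, the following minimal sketch converts catalog rows exported from a spreadsheet as CSV into an XML document using only the Python standard library. The column names (`file_number`, `title`, `pages`), the element names and the sample record are hypothetical and are not taken from any standard; column names must also be valid XML element names for this to work.

```python
import csv
import io
import xml.etree.ElementTree as ET


def catalog_csv_to_xml(csv_text):
    """Convert catalog rows (one per document) into an XML string."""
    root = ET.Element("catalog")  # hypothetical root element name
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            # Each CSV column becomes a child element of the record.
            ET.SubElement(record, field).text = (value or "").strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: a header line followed by one catalog entry.
SAMPLE_CSV = "file_number,title,pages\nA001-0001,Annual work summary,12\n"

if __name__ == "__main__":
    print(catalog_csv_to_xml(SAMPLE_CSV))
```

A real import would map the columns to the description items required by the applicable description rules rather than to these placeholder names.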
(2) Archival description. Establish the catalog database and enter the catalog data in accordance with the Archives Description Rules (DA/T 18-1999).
(3) Quality check of catalog data. To ensure accuracy, either "single entry with manual proofreading" or "double entry with automatic computer comparison" can be used. Whichever is chosen, it is necessary to check that the description items are complete and that their content is standardized and accurate. Data found to be unsatisfactory should be corrected or re-entered.
6. Data linking
(1) Data consolidation. Once the catalog database and image files produced during digitization have passed quality inspection, they should be uploaded promptly over the network to the data server for consolidation. Linking the catalog database to the image files by hand is slow and error-prone and should be avoided; batch linking by computer should be used wherever possible. As long as the scanned files are named after the file numbers of the paper documents, a linking program or suitable software can automatically find the related digital images and add the corresponding electronic address information, so that linking is done quickly and in batches.
(2) Data association. Working from the catalog database of the paper files, the one or more images scanned from each paper document are stored as an image file. When storing image files in the corresponding folders, check carefully that the name of each image file matches the file number in the catalog database, that the number of pages in the image file matches the page count in the catalog database, and that the total number of image files matches the total in the catalog database.
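The name and page-count checks described above lend themselves to automation. The sketch below is one possible shape for such a linking routine, assuming that each document is stored as a single image file named `<file number>.<extension>` and that page counts have already been read from the images by some other means; the function name, data layout and sample file numbers are all hypothetical.

```python
from pathlib import PurePath


def link_images(catalog, images):
    """Match image files to catalog entries by file number.

    catalog: dict mapping file number -> page count in the catalog database
    images:  dict mapping image file name -> page count found in the image
    Returns (links, problems): links maps file number -> image file name,
    problems lists every mismatch that needs manual follow-up.
    """
    links, problems = {}, []
    # Index the images by file name without extension (the assumed file number).
    by_number = {PurePath(name).stem: (name, pages) for name, pages in images.items()}
    for number, expected in catalog.items():
        if number not in by_number:
            problems.append(f"{number}: no image file")
            continue
        name, pages = by_number[number]
        if pages != expected:
            problems.append(f"{number}: {pages} pages scanned, {expected} catalogued")
            continue
        links[number] = name
    # Images whose names match no catalog entry are also reported.
    for number in sorted(by_number.keys() - catalog.keys()):
        problems.append(f"{number}: image has no catalog entry")
    return links, problems


if __name__ == "__main__":
    catalog = {"A001-0001": 3, "A001-0002": 2}
    images = {"A001-0001.tif": 3, "A001-0002.tif": 5}
    print(link_images(catalog, images))
```

Only entries that pass both checks are linked; everything else is reported rather than silently skipped, so that the reasons can be recorded on the registration form.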
The name of each image file establishes a one-to-one correspondence with the file number of the document in the catalog database, which makes automatic batch linking between the catalog database and the image files possible.
(3) Handover registration. Fill in the handover registration form for the digitization process carefully, record the number of pages after data association, and check whether the number of pages associated with each file matches the page counts recorded during file arrangement and scanning. If there is any inconsistency, the specific reasons and the action taken should be noted.
7. Data acceptance
Inspect, by sampling, the overall quality of the digitized data, including the catalog database, the image files and the data links. If the link between a catalog entry and its image file is wrong, or if either the catalog entry or the image file is incomplete, unclear or incorrect, the sampled item is marked "unqualified". When the qualified rate of the sample reaches 95% or more, the whole batch of files is regarded as "passed". Qualified rate = number of documents passing the sampling inspection / total number of documents sampled × 100%. Fill in the acceptance registration form for the digitization of paper files carefully. A "passed" conclusion takes effect only after it has been reviewed and signed.
8. Data backup
Data that are complete and have passed acceptance should be backed up promptly. To ensure data security, the backup media should be diversified; multiple backup sets can be kept by combining online and offline backup, and attention should be paid to off-site storage. The backup data should also be checked. The check mainly covers whether the backup data can be opened, whether the data are complete, and whether the number of files is correct.
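The backup checks listed above (readability, completeness and file count) could be partly automated. The following sketch compares a source directory with its backup by relative path and SHA-256 checksum, using only the Python standard library. The directory layout and function names are assumptions; a matching checksum shows that a file was copied intact, not that its content is a valid image, so spot checks by opening files remain useful.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return a list of problems; an empty list means the backup matches."""
    source, backup = Path(source_dir), Path(backup_dir)
    src_files = {p.relative_to(source) for p in source.rglob("*") if p.is_file()}
    bak_files = {p.relative_to(backup) for p in backup.rglob("*") if p.is_file()}
    # File-count check: report anything present on one side only.
    problems = [f"missing in backup: {p}" for p in sorted(src_files - bak_files)]
    problems += [f"unexpected in backup: {p}" for p in sorted(bak_files - src_files)]
    # Completeness check: files present on both sides must be byte-identical.
    for rel in sorted(src_files & bak_files):
        if sha256_of(source / rel) != sha256_of(backup / rel):
            problems.append(f"content differs: {rel}")
    return problems
```

Calling `verify_backup("data", "backup/data")` (hypothetical paths) after each backup run gives a list that can be copied into the backup management registration form.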
After the data have been backed up, the backup media should be labeled so that they are easy to find and manage. Fill in the backup management registration form for the digitization of paper files.
9. Management of digitization results
The management of the digitized results of paper archives should be strengthened to ensure their security, integrity and long-term availability. When the digitized results are made available for online retrieval and use, the electronic identification of the producing unit should be included, and a downloadable or non-downloadable data format should be adopted as the situation requires.