Hive: Partitioned table structure and data replication

Summary: Hive, Shell

Copying a table in Hive covers two cases: non-partitioned tables and partitioned tables.

For non-partitioned tables, to copy one table completely into another, just use the CREATE TABLE ... AS statement directly. The same statement can also copy selected fields, for example two fields and their values, from one table into another.
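A minimal sketch of such a copy, driven from Shell. The table name test.student, its columns id and name, and the target test.student_copy are hypothetical names chosen for illustration. The script only prints the statement when the Hive CLI is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical names: test.student (source), test.student_copy (target),
# id and name (the two fields being copied).
hql=$(cat <<'EOF'
CREATE TABLE test.student_copy AS
SELECT id, name
FROM test.student;
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the statement on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statement
fi
```

Replacing the field list with SELECT * copies the whole table.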

For partitioned tables, the CREATE TABLE ... AS statement runs without error and copies the fields and data completely, but the partitioning is lost: the new table is not a partitioned table.

Take a partitioned table that uses the dt field as its partition, and copy the whole table with CREATE TABLE ... AS.
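A sketch of that full-table copy, assuming a hypothetical source table test.orders partitioned by dt and a hypothetical target test.orders_ctas:

```shell
#!/usr/bin/env bash
# Hypothetical names: test.orders is partitioned by dt,
# test.orders_ctas is the copy produced by CREATE TABLE ... AS.
hql=$(cat <<'EOF'
CREATE TABLE test.orders_ctas AS
SELECT * FROM test.orders;
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the statement on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statement
fi
```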

Check the table structure and the amount of data in the table: both are fine.

Checking the partitions reports an error: the table is not a partitioned table. The original partition field dt still exists in the table structure, but it no longer acts as a partition field; only its data is retained.
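The checks described above could look like the sketch below, again using the hypothetical copy test.orders_ctas. The SHOW PARTITIONS statement is placed last because it is the one expected to fail on a table that is not partitioned.

```shell
#!/usr/bin/env bash
# Hypothetical name: test.orders_ctas is the copy made with CREATE TABLE ... AS.
hql=$(cat <<'EOF'
-- dt appears as an ordinary column, not as a partition column
DESC test.orders_ctas;
-- the row count should match the original table
SELECT COUNT(*) FROM test.orders_ctas;
-- expected to fail: the copy is not a partitioned table
SHOW PARTITIONS test.orders_ctas;
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the checks on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statements
fi
```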

To copy a partitioned table in full, partitions included, use the LIKE statement to copy the partition information. The specific steps are as follows.

The first step is to create an empty table that has the table structure and partition definition of the original table.
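A sketch of this first step, with hypothetical table names test.orders (original) and test.orders_copy (new table). SHOW PARTITIONS now runs without error, but lists nothing because the new table has no partitions yet.

```shell
#!/usr/bin/env bash
# Hypothetical names: test.orders (original), test.orders_copy (empty copy).
hql=$(cat <<'EOF'
-- copies the schema and partition definition, but no data
CREATE TABLE test.orders_copy LIKE test.orders;
-- succeeds because the table is partitioned; the list is still empty
SHOW PARTITIONS test.orders_copy;
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the statements on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statements
fi
```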

The next step is to use the hdfs command to copy the contents of the original table's HDFS storage path to the path of the new table. A table's storage path is a directory whose subdirectories each represent one partition. The data files sit under the partition directories, and here their names start with part. A single partition may hold several data files, typically one per task that wrote it (or one per bucket if the table is bucketed).

The copy statement uses the * wildcard to copy everything under the original table's directory to the new table's path. Afterwards, list the new table's HDFS path to view the data files.
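A sketch of the copy and the listing. The paths assume the default warehouse location /user/hive/warehouse and the hypothetical tables test.orders and test.orders_copy; the real paths depend on the cluster configuration. The wildcard is quoted so that HDFS, not the local shell, expands it.

```shell
#!/usr/bin/env bash
# Assumed paths: default warehouse directory plus hypothetical table names.
src_dir="/user/hive/warehouse/test.db/orders"
dst_dir="/user/hive/warehouse/test.db/orders_copy"
src_glob="${src_dir}/*"   # every partition directory such as dt=20201209

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -cp "$src_glob" "${dst_dir}/"   # copy all partition directories
  hdfs dfs -ls "$dst_dir"                  # view the copied directories
else
  # dry run: only show the commands
  printf 'hdfs dfs -cp %s %s/\n' "$src_glob" "$dst_dir"
  printf 'hdfs dfs -ls %s\n' "$dst_dir"
fi
```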

At this point the new table's directory in the data warehouse contains data files, but a query from the Hive client still returns no data. The table looks empty because none of the partitions exist in the new table's metadata. Hive locates data partition directory by partition directory, so if it cannot find a partition it cannot find the data stored in it.

The next step is to repair the table's partition metadata with the MSCK REPAIR TABLE command.
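A sketch of the repair followed by two checks, using the hypothetical table test.orders_copy:

```shell
#!/usr/bin/env bash
# Hypothetical name: test.orders_copy is the table created with LIKE.
hql=$(cat <<'EOF'
-- register every partition directory found under the table path
MSCK REPAIR TABLE test.orders_copy;
-- the partitions should now be listed and the data visible
SHOW PARTITIONS test.orders_copy;
SELECT COUNT(*) FROM test.orders_copy;
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the statements on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statements
fi
```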

The command's output shows that MSCK REPAIR TABLE first checks whether each partition of the table exists in the metadata, and then adds the partitions that are missing. After the repair, the table can be used normally.

MSCK REPAIR TABLE adds (repairs) all partitions quickly and automatically with a single command. In Hive, if you create a partitioned table first and then copy data into the corresponding HDFS directories as initialization, the partitions must be added to the metadata before the data can be used. When there are many partitions, adding them one by one with ALTER TABLE ADD PARTITION is extremely inconvenient. Let's run a test to see whether ALTER TABLE ADD PARTITION can also complete the full copy of a partitioned table.

The next step is to manually add a partition dt='20201209'
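A sketch of adding that single partition by hand. The table name test.orders_copy is hypothetical, and the sketch assumes the partition directory dt=20201209 has already been copied under the table's path.

```shell
#!/usr/bin/env bash
# Hypothetical name: test.orders_copy; dt='20201209' is the partition from the text.
hql=$(cat <<'EOF'
-- register one partition; its default location is <table path>/dt=20201209
ALTER TABLE test.orders_copy ADD PARTITION (dt='20201209');
-- the data of this partition should now be visible
SELECT COUNT(*) FROM test.orders_copy WHERE dt='20201209';
EOF
)

if command -v hive >/dev/null 2>&1; then
  hive -e "$hql"          # run the statements on a real cluster
else
  printf '%s\n' "$hql"    # dry run: only show the statements
fi
```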

This verifies that adding partitions manually also works. MSCK REPAIR TABLE simply automates the process by scanning the partition directories under the table's data warehouse directory (here dt='20201209' to dt='20210317'). A Shell script that adds the partitions one at a time can achieve the same thing.
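The original script is not reproduced here. The sketch below is one way such a loop could be written, assuming GNU date for the date arithmetic and the hypothetical table test.orders_copy. It calls the Hive CLI once per partition, so each partition pays the cost of starting a Hive process.

```shell
#!/usr/bin/env bash
# Hypothetical table name; the date range is the one mentioned in the text.
table="test.orders_copy"
start="20201209"
end="20210317"

count=0
d="$start"
while [ "$d" -le "$end" ]; do
  hql="ALTER TABLE ${table} ADD IF NOT EXISTS PARTITION (dt='${d}');"
  if command -v hive >/dev/null 2>&1; then
    hive -e "$hql"          # one Hive process per partition
  else
    printf '%s\n' "$hql"    # dry run: only show the statement
  fi
  count=$((count + 1))
  # advance one calendar day (GNU date syntax)
  d=$(date -u -d "${d} +1 day" +%Y%m%d)
done

echo "handled ${count} partitions"
```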

Running this Shell script achieves the same result, but it takes 15 minutes to execute and has to start and shut down the Hive process over and over.
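One possible way to avoid the repeated startups, offered as a sketch and not as the author's method: write all ALTER TABLE statements into a single file and run it in one Hive session with hive -f. The table name test.orders_copy is hypothetical, and GNU date is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical table name; the date range is the one mentioned in the text.
table="test.orders_copy"
start="20201209"
end="20210317"

hql_file=$(mktemp)   # temporary file collecting one statement per partition
d="$start"
while [ "$d" -le "$end" ]; do
  printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s');\n" \
    "$table" "$d" >> "$hql_file"
  # advance one calendar day (GNU date syntax)
  d=$(date -u -d "${d} +1 day" +%Y%m%d)
done

if command -v hive >/dev/null 2>&1; then
  hive -f "$hql_file"   # a single Hive session runs every statement
else
  cat "$hql_file"       # dry run: only show the statements
fi
```

When every partition directory already exists under the table path, MSCK REPAIR TABLE remains the shorter option.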