
Practice of RocketMQ in Distributed Message Governance and Micro-service Governance

Introduction: As the company's business has grown, so has its traffic. We found that some major production incidents were often triggered by sudden traffic surges, so controlling and protecting traffic and ensuring high availability of the system is particularly important.

Hello has developed into a comprehensive mobility platform covering two-wheel travel (Hellobike bike sharing, Hello mopeds, Hello electric bikes, and Xiaoha battery swapping) and four-wheel travel (Hello carpooling, network-wide ride hailing, and Hello taxi hailing), and has also explored local life services such as hotels and in-store businesses.


In this article, we will share Hello's experience in the governance of message traffic and microservice invocation.

Liang Yong (Lao Liang) is one of the columnists of "RocketMQ in Practice and Advanced" and participated in the manuscript review of "RocketMQ Technology Insider". He is a lecturer at the ArchSummit Global Architect Summit and the QCon Case Study Club.

He currently focuses on back-end middleware. On his WeChat official account, Guannong Laoliang, he has published more than 100 source-code practice articles covering the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He currently works at Hellobike as a senior technical expert.

Before we start, let's talk about what governance means. What follows is Lao Liang's personal understanding.

The company previously used RabbitMQ. Below are the pain points we hit with RabbitMQ, many of which were caused by flow control on the RabbitMQ cluster.

There was one such incident in which multiple businesses shared a single database. During an evening rush hour, traffic surged sharply and the database went down.

Thinking: both messages and services need sound governance measures.

Which are our key indicators and which are our secondary indicators? This is the primary question of message governance.

Design objectives

The aim is to shield the complexity of the underlying middleware (RocketMQ/Kafka) and to route messages dynamically by a unique identification, while building a message governance platform that integrates resource management, retrieval, monitoring, alerting, inspection, disaster tolerance, and visual operation and maintenance, so that the message middleware runs smoothly and healthily.

Minimalism is the ability to make complex problems simple.

Minimalist unified API

We provide a unified SDK that encapsulates the two kinds of message middleware (Kafka/RocketMQ).
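As a rough illustration of what such a unified API might look like (the interface and names below are hypothetical sketches, not Hello's actual SDK):

```java
import java.util.Map;

// Hypothetical sketch of a unified send API that hides whether the
// underlying cluster is RocketMQ or Kafka.
public interface UnifiedProducer {

    // Send by logical topic; the SDK resolves the physical cluster and
    // routes the message dynamically by the topic's unique identification.
    void send(String topic, byte[] payload, Map<String, String> headers);
}
```

The point of the single entry is that governance logic (rate limiting, size checks, version reporting, tracing) can be built into one place instead of every application.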

Automatic creation of topics and consumer groups is not suitable for the production environment; it leads to loss of control and is not conducive to lifecycle management or cluster stability. The application process needs to be controlled, but it should be as simple as possible, for example applying once to take effect in all environments and automatically generating the related alert rules.

Monitor whether clients are being used in a standardized way and find appropriate measures to govern them.

Scenario 1: Instantaneous Traffic and Cluster Flow Control

Suppose the cluster currently handles 10,000 TPS and the traffic suddenly jumps to 20,000 or more; such an excessively steep rise is likely to trigger cluster flow control. For this scenario, we need to monitor the client's sending speed and, once the speed and the steepness of its rise reach a threshold, make the sending smoother.
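One common way to smooth a steep rise on the sending side is a client-side rate limiter. The sketch below uses Guava's RateLimiter with an assumed per-client threshold; it only illustrates the idea of flattening bursts before they hit cluster flow control.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

public class SmoothedSender {
    // Hypothetical threshold: allow at most 1,000 sends per second from this client.
    private final RateLimiter limiter = RateLimiter.create(1000.0);
    private final DefaultMQProducer producer;

    public SmoothedSender(DefaultMQProducer producer) {
        this.producer = producer;
    }

    public void send(Message message) throws Exception {
        // Blocks briefly when the send rate rises too steeply,
        // spreading the burst out instead of hitting the cluster all at once.
        limiter.acquire();
        producer.send(message);
    }
}
```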

Scenario 2: Large Messages and Cluster Jitter

When a client sends large messages, for example several hundred KB or even several MB, it may cause long I/O times and cluster jitter. To govern this scenario, we need to monitor the size of the messages being sent. Through after-the-fact inspection we identify services that send large messages and push their owners to compress or restructure them, keeping messages within 10 KB.
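A simple client-side guard along these lines might check the payload size before sending and compress bodies that exceed the 10 KB target. This is only a sketch of the idea, not the inspection service itself.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class MessageSizeGuard {
    private static final int MAX_BODY_BYTES = 10 * 1024; // 10 KB target

    // Returns the body to actually send: unchanged if small enough,
    // gzip-compressed (and presumably flagged via a header) if too large.
    public static byte[] guard(byte[] body) throws Exception {
        if (body.length <= MAX_BODY_BYTES) {
            return body;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(body);
        }
        return out.toByteArray();
    }
}
```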

Scenario 3: Low Client SDK Version

As features iterate, the SDK version is upgraded, and changes may introduce risks beyond the new features themselves. Using a low version means, first, that new features cannot be supported and, second, that there may be security risks. To understand how the SDK is being used, the SDK version can be reported, and upgrades can be driven with the owners through routine inspection.

Scenario 4: Consumer Traffic Removal and Recovery

Consumer traffic removal and recovery usually have the following use cases: first, traffic needs to be removed when the application is released; second, traffic needs to be removed before troubleshooting when locating a problem. To support this, we need to listen for removal/recovery events on the client and suspend or resume consumption accordingly, as shown in the sketch below.
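RocketMQ's DefaultMQPushConsumer provides suspend() and resume(), so reacting to removal/recovery events can be sketched as follows; the event wiring itself is assumed to exist elsewhere.

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

public class ConsumerTrafficSwitch {
    private final DefaultMQPushConsumer consumer;

    public ConsumerTrafficSwitch(DefaultMQPushConsumer consumer) {
        this.consumer = consumer;
    }

    // Called when a "remove traffic" event arrives (e.g. before a release).
    public void removeTraffic() {
        consumer.suspend();   // stop consuming new messages
    }

    // Called when a "recover traffic" event arrives.
    public void recoverTraffic() {
        consumer.resume();
    }
}
```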

Scenario 5: Send/Consume Latency Detection

How long does it take to send or consume a message? By monitoring and inspecting the time consumed, we can find applications with poor performance and push targeted improvements to raise their performance.
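Measuring send latency can be as simple as wrapping the producer call with a timer and reporting the elapsed time; the metrics hook below is a hypothetical placeholder.

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class TimedSender {
    private final DefaultMQProducer producer;

    public TimedSender(DefaultMQProducer producer) {
        this.producer = producer;
    }

    public SendResult send(Message message) throws Exception {
        long start = System.nanoTime();
        try {
            return producer.send(message);
        } finally {
            long costMs = (System.nanoTime() - start) / 1_000_000;
            // Hypothetical metrics hook; replace with the real reporter.
            System.out.println("send cost ms: " + costMs);
        }
    }
}
```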

Scenario 6: Improving Troubleshooting Efficiency

When troubleshooting, it is usually necessary to retrieve information across the message's life cycle, such as what message was sent, where it is stored, and when it was consumed. This part can be connected through the msgId carried in the message. In addition, by embedding a link identifier similar to an rpcId/traceId in the message header, messages can be tied together within a single request.
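A minimal sketch of carrying such an identifier, assuming a RocketMQ message and a hypothetical header name:

```java
import org.apache.rocketmq.common.message.Message;

public class TraceDecorator {
    // Hypothetical header name used to carry the rpcId/traceId of the request.
    private static final String TRACE_ID_KEY = "X-Trace-Id";

    public static Message withTrace(Message message, String traceId) {
        // User properties travel with the message and can be read by the consumer,
        // so the message can be tied back to the originating request.
        message.putUserProperty(TRACE_ID_KEY, traceId);
        return message;
    }
}
```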

Required monitoring information

Common control measures

Monitor the resource usage of topics and consumer groups.

Scenario 1: Impact of consumption backlog on business

Some business scenarios are sensitive to consumption backlog, while others do not mind a backlog as long as consumption catches up later. For example, unlocking a bicycle must happen within seconds, whereas batch-processing scenarios related to information aggregation are not sensitive to backlog. By collecting consumption backlog metrics and sending real-time alerts to applications that hit the threshold, the owners of those applications can follow the consumption situation in real time.
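For Kafka-backed topics, one rough way to compute the backlog is end offset minus committed offset per partition. The sketch below assumes a reasonably recent Kafka client (2.4+ for the committed(Set) overload) and is only an illustration of the metric, not our collection pipeline.

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class BacklogCheck {
    // Sums (end offset - committed offset) over the given partitions.
    public static long totalLag(KafkaConsumer<?, ?> consumer, Set<TopicPartition> partitions) {
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
        Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);
        long lag = 0;
        for (TopicPartition tp : partitions) {
            OffsetAndMetadata meta = committed.get(tp);
            long committedOffset = (meta == null) ? 0 : meta.offset();
            lag += Math.max(0, endOffsets.getOrDefault(tp, 0L) - committedOffset);
        }
        return lag;
    }
}
```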

Scenario 2: Impact of Consumption/Sending Speed

Should we alert when the sending or consumption speed drops to zero? In some scenarios the speed cannot legitimately drop to zero; if it does, the business is abnormal. By collecting speed metrics, real-time alerts can be sent to applications that hit the threshold.

Scenario 3: The consumer node is disconnected.

When a consumer node goes offline, the owner of the application needs to be notified. The node information must therefore be collected so that an alert can be triggered in real time when a node disconnects.

Scenario 4: Unbalanced Sending/Consumption

Unbalanced sending or consumption often hurts performance. I remember that during one consultation, a colleague had set the message key to a constant; by default, partitions are hashed by the key, so all messages went into a single partition, and throughput could not be raised no matter what. In addition, we need to detect the consumption backlog of each partition and trigger a real-time alert when the imbalance becomes excessive.
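The pitfall and its fix are easy to see with Kafka's default partitioner, which hashes by key; the business field used as a key below is hypothetical.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyChoice {
    // Anti-pattern: a constant key hashes every record to the same partition.
    static ProducerRecord<String, String> skewed(String topic, String payload) {
        return new ProducerRecord<>(topic, "CONSTANT_KEY", payload);
    }

    // Better: key by a well-distributed business identifier (hypothetical field),
    // so the default partitioner spreads the load across partitions.
    static ProducerRecord<String, String> balanced(String topic, String orderId, String payload) {
        return new ProducerRecord<>(topic, orderId, payload);
    }
}
```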

Required monitoring information

Common control measures

What are the core metrics for measuring the health of the cluster?

Scenario 1: Cluster Health Detection

Cluster health detection answers one question: is this cluster in good shape? It is answered by detecting the number of nodes in the cluster, the heartbeat of each node, the cluster's write TPS water level, and the cluster's consumption TPS water level.
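A toy aggregation of those four signals might look like the following; all thresholds and inputs are assumptions for illustration, not our actual health-check rules.

```java
public class ClusterHealthCheck {
    // All inputs are hypothetical metric readings gathered elsewhere.
    public static boolean isHealthy(int aliveNodes, int expectedNodes,
                                    double maxHeartbeatRtMs,
                                    double writeTps, double writeTpsCapacity,
                                    double consumeTps) {
        boolean allNodesUp     = aliveNodes == expectedNodes;
        boolean heartbeatOk    = maxHeartbeatRtMs < 200;            // assumed RT threshold
        boolean writeWaterline = writeTps < 0.8 * writeTpsCapacity; // assumed 80% water level
        boolean consuming      = consumeTps > 0 || writeTps == 0;   // consumers keeping up
        return allNodesUp && heartbeatOk && writeWaterline && consuming;
    }
}
```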

Scenario 2: Cluster Stability

Cluster flow control often reflects insufficient cluster performance, and cluster jitter can also cause client send timeouts. By collecting the heartbeat time of each node in the cluster and the rate of change of the cluster's write TPS water level, we can know whether the cluster is stable.

Scenario 3: Cluster High Availability

High availability mainly targets the case where an availability zone becomes unavailable in extreme scenarios, or where some topics and consumer groups in the cluster become abnormal, and targeted measures need to be taken. For example, for MQ this can be addressed by deploying masters and slaves across availability zones within the same city, dynamically migrating topics and consumer groups to disaster-recovery clusters, and multi-active deployment.

Required monitoring information

Common control measures

If we had to pick the most important of these key indicators, which would it be? I would choose the heartbeat detection of each node in the cluster, that is, the response time (RT). Let's look at what may affect RT.

We inevitably run into pits, and when we run into one, we fill it.


RocketMQ slave and master nodes frequently showed high CPU usage with obvious spikes, and many times the slave node simply crashed.

Only the system log contained error messages:

2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.5057xx+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.5057xx+08:00 VECS0xxxx kernel: [] ? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [] ? ip_finish_output+0x192/0x380
2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: [] ? ...

Tuning various system parameters could only mitigate the problem, not eradicate it, and the spikes still exceeded 50%.

All machines in the cluster were upgraded from CentOS 6 to CentOS 7, the kernel was upgraded from 2.6 to 3.10, and the CPU spikes disappeared.

The community edition of RocketMQ supports 18 delay levels by default, and each level is consumed at precisely the configured time. We even tested specifically whether the consumption interval was accurate, and the results showed it was very accurate. Yet this accurate feature still had a problem: strangely, business colleagues reported that delayed messages on one online cluster could not be consumed.

Moving "delayOffset.json" and "consumption queue /Schedule _ Topic _ XXXX" to other directories is equivalent to deleting; Restart the proxy nodes one by one. After the restart, after verification, the delayed message function is sent and consumed normally.

What are our core services and what are our non-core services? This is the primary question of service governance.

The goal is for services to cope with sudden traffic surges and, in particular, to ensure the smooth operation of core services.

Applications are divided into four levels according to the two dimensions of user impact and business impact.

S1: Core products whose failure would leave external users unable to use them or cause large financial losses, such as the core links of the main businesses (locking/unlocking of bikes and mopeds, order issuing and taking for carpooling) and the applications on which those core links strongly depend.

S2: Does not directly affect transactions, but is related to important configuration of front-office businesses or to the management and maintenance of business back-office processing functions.

S3: Failure has little impact on users or the logic of core products and no impact on the main businesses, or involves only a small volume of new business; also important tools for internal users that do not directly affect the business, whose related management functions have little impact on front-office business.

S4: Systems for internal users that do not directly affect the business, or that are to be taken offline later.

S1 services are the company's core services and the key objects of protection; they must be guaranteed not to be accidentally impacted by non-core traffic.
