Analysis and early warning of online hotel user loss

This paper is a summary of an online hotel user churn prediction and analysis project.

Content/analysis ideas:

01: Project introduction

02: Problem Analysis

03: Data Exploration

04: Data preprocessing

05: Modeling and Analysis

06: User Portrait Analysis

I. Introduction of the Project

This project analyzes customer reservation data collected over a period of time by a hotel reservation website. It predicts the conversion outcome of customer visits with machine learning algorithms, identifies the key factors behind user churn, and builds a deeper understanding of user portraits and behavior preferences, so as to improve product design, support personalized marketing, reduce churn and enhance the user experience.

II. Problem Analysis

This is a problem-diagnosis project, and the problem to be solved is user churn. Among the officially provided fields and explanations there is a label field, which is the target variable, that is, the value we need to predict. label=1 stands for a churned customer, and label=0 stands for a customer who did not churn. This is clearly a binary classification problem.

Our goal is to maximize recall while keeping prediction accuracy high. From a business point of view, this means identifying as many of the customers who are about to churn as possible, so that retention efforts can be targeted at them, because the cost of acquiring a new user is generally much higher than the cost of retaining an existing one.
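To make the two targets concrete, here is a minimal sketch that computes recall and precision for the churn class. It reads "prediction accuracy" as precision on churned customers, which is an assumption; the function name churn_recall_precision and the toy labels are made up for illustration and are not taken from the project.

```python
def churn_recall_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the churn class (label == positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # churners caught
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # churners missed
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels (illustrative only): 1 = churned, 0 = retained.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
recall, precision = churn_recall_precision(y_true, y_pred)
print(recall, precision)
```

Recall answers "what share of real churners did we flag?", while precision answers "what share of flagged customers really churned?". A retention campaign usually tolerates some false alarms in exchange for missing fewer churners.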

III. Data Exploration

1. Data overview

The data set userlostprob_data.txt contains the access data of a hotel reservation website from May 16 to May 21, 2016.

In total, the data set has 689,945 rows and 51 columns: the sample id, the label and 49 feature variables.

To protect users' privacy, the data has been desensitized, so it differs somewhat from the actual order counts, page views and conversion rates, but this does not affect the solvability of the problem.

2. Sort out the data indicators

The data set contains many variables. To improve readability, the variable names are first mapped to the explanations given in the data dictionary, and then the indicators are sorted into groups and analyzed one by one.

Through research, it is found that the indicators can be roughly divided into three categories: one is order-related indicators, such as check-in date, order quantity, cancellation rate and so on. One is related to customer behavior, such as star preference, user preference price, etc. There is also a category of hotel-related indicators, such as average hotel rating, number of hotel ratings, average price and so on.

3. Descriptive analysis of relevant features.

3.1 Date of visit and check-in time

Both the number of check-ins and the number of visits peak on May 20th, probably because of the "520" Valentine's Day. After May 21, the number of check-ins drops noticeably, and the two smaller peaks that follow indicate that more people stay on weekends than on weekdays.

3.2 Access time period

Visits are lowest between 3 and 5 am, when most people are asleep, and highest between 9 and 10 pm.

3.3 Customer value

The two features "customer's recent 1 year value" and "customer's value" are strongly correlated, and either can be used to represent customer value. Most customers have a value in the range 0-100, but some have a value as high as 600; such high-value customers deserve focused analysis later on.

3.4 Consumption ability index

The index is roughly normally distributed, and most users' spending power is around 30. A considerable number of users have a spending power close to 100, which shows that there are many high-spending customers among those who visit and stay.

3.5 Price sensitivity index

Excluding the extreme values, the distribution is right-skewed: most customers are not very price-sensitive, so pricing need not be a major concern. Customers whose price sensitivity index is 100 can be attracted with discounts.

3.6 Average hotel price

Most people choose hotels with prices below 1000, and few choose hotels with prices above 2000. Leaving aside the big spenders, the distribution of chosen hotel prices is positively skewed, and the average price most people choose is around 300 yuan (probably express hotels).

3.7 User's annual orders

Most users place fewer than 40 orders a year. A small group of users stay in hotels frequently, and these customer relationships are worth maintaining.

3.8 Order cancellation rate

Users' one-year cancellation rates cluster at the two extremes, 100% and 0. For customers with a 100% cancellation rate, the reasons can be investigated in combination with their order quantity.

3.9 Time since the last order within a year

The longer the interval since the last order, the fewer the users, which indicates that a considerable share of users book hotels frequently. It also suggests that regular customers keep coming back to book, so the repeat-customer base is large.

3.10 Session ID

The sid is an id assigned by the server to the visitor; a value of 1 indicates a new visitor.

Returning customers account for the majority of visitors, and their booking probability is slightly higher than that of new customers.

IV. Data Preprocessing

4.1 Duplicate value processing

After removing duplicates, the dimensions of the data are unchanged, which means the data set contains no duplicate rows.
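The duplicate check can be sketched as "drop repeated rows, then compare the row count before and after". The version below uses plain Python tuples so the logic is visible; the helper name drop_duplicate_rows and the sample rows are illustrative assumptions. With pandas, the same check is typically done with DataFrame.drop_duplicates() followed by a comparison of the shapes.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative rows: (sample id, label, some feature value).
rows = [(1, 0, 3.5), (2, 1, 4.0), (3, 0, 3.5)]
deduped = drop_duplicate_rows(rows)

# An unchanged row count means there were no duplicate rows.
no_duplicates = len(deduped) == len(rows)
print(no_duplicates)
```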

4.2 Generate derived fields

From a business perspective, how far in advance a user books a hotel is likely to be an important signal, so the two date features are combined into a single new feature, the number of days booked in advance, to improve the accuracy and interpretability of the model.
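As a minimal sketch of this derived field, the snippet below subtracts the visit date from the check-in date. The field names visit_date, checkin_date and lead_days are hypothetical (the real column names are not given here), and the dates are made up.

```python
from datetime import date


def booking_lead_days(visit_date, checkin_date):
    """Days between the visit and the planned check-in (0 = same-day booking)."""
    delta = date.fromisoformat(checkin_date) - date.fromisoformat(visit_date)
    return delta.days


# Illustrative records with ISO-formatted date strings.
records = [
    {"visit_date": "2016-05-18", "checkin_date": "2016-05-20"},
    {"visit_date": "2016-05-20", "checkin_date": "2016-05-20"},
]
for record in records:
    record["lead_days"] = booking_lead_days(record["visit_date"], record["checkin_date"])

print([record["lead_days"] for record in records])
```

One numeric column replaces two date columns, which is easier for tree models to split on and easier to explain to the business side.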

4.3 Missing value processing

View missing values

51 fields in total; 44 of them have missing values.

Thinking and process of missing value processing

View the distribution of features:

Looking at the distribution of all numerical features and choosing reasonable processing methods according to the data distribution, including abnormal value and missing value processing, is helpful to deeply understand user behavior.

Choose an appropriate method for each group of fields with missing values:

Fields with a missing ratio above 80%: there is 1 such field, "Number of historical orders of users in the last 7 days", with 88% missing. Delete this field directly.

Fields that are approximately normally distributed are filled with the mean; fields with a right-skewed distribution are filled with the median.
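The rules above can be sketched as a single per-column function. This is an illustration under stated assumptions, not the project's code: None marks a missing value, the 0.8 drop threshold comes from the text, and the skewness cut-off skew_limit=1.0 used to decide between mean and median is an assumed value.

```python
from statistics import mean, median


def skewness(values):
    """Population skewness: about 0 for symmetric data, > 0 for a right tail."""
    m = mean(values)
    n = len(values)
    var = sum((v - m) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - m) ** 3 for v in values) / n / var ** 1.5


def impute_column(values, drop_ratio=0.8, skew_limit=1.0):
    """Return None if the column should be dropped, else the filled column."""
    present = [v for v in values if v is not None]
    missing_ratio = 1 - len(present) / len(values)
    if missing_ratio > drop_ratio:
        return None  # too sparse to be useful: delete the field
    # Right-skewed columns get the median, roughly symmetric ones the mean.
    fill = median(present) if skewness(present) > skew_limit else mean(present)
    return [fill if v is None else v for v in values]


mostly_missing = impute_column([None] * 9 + [3])           # 90% missing
symmetric = impute_column([28, 30, None, 32, 30])           # filled with the mean
right_skewed = impute_column([1, 1, 2, None, 2, 3, 100])    # filled with the median
print(mostly_missing, symmetric, right_skewed)
```

The median is preferred for skewed fields because a few very large values pull the mean away from where most of the data sits.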

Check missing value padding.

It can be seen that the missing value data has been filled in.

4.4 Outlier processing

Extreme value processing:

(From a business point of view, the capping method is not entirely appropriate here, because it may filter out high-value users, and it needs to be adjusted.)
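For reference, one common form of extreme-value treatment is percentile capping: values outside a chosen percentile range are clipped to the range boundaries. The sketch below is an assumption about what such a step could look like; the percentile bounds, the helper names and the sample prices are all illustrative.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def cap_extremes(values, lower_q=1, upper_q=99):
    """Clip values to the [lower_q, upper_q] percentile range; no rows are dropped."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]


# Illustrative prices with one extreme value.
prices = [120, 150, 180, 200, 220, 250, 300, 320, 350, 400, 5000]
capped = cap_extremes(prices, lower_q=10, upper_q=90)
print(capped)
```

Clipping keeps every row but flattens the top of the range, so genuinely high-value users become indistinguishable from merely above-average ones. That is the business concern noted above, and a reason to choose the bounds per field rather than globally.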

Negative value processing:

4.5 Standardization

Distance-based models require the data to be standardized in advance.
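A minimal sketch of z-score standardization, assuming that is the scaling meant here: each feature is shifted to mean 0 and scaled to standard deviation 1, so that features measured in large units do not dominate distance calculations. The function name and sample values are illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std deviation."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - m) / s for v in values]


scaled = standardize([100, 200, 300, 400, 500])
print(scaled)
```

In a train/test setting, the mean and standard deviation are normally estimated on the training set only and then reused to transform the test set.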

V. Modeling and Analysis

First, divide the training set and the test set.
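The split can be sketched as follows. Whether the original split was stratified is not stated; the version below keeps the share of churned and retained users equal in both parts, which is an assumption, as are the 20% test ratio, the seed and the function name.

```python
import random


def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) with each label's share preserved in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Illustrative labels: 30 churned users (1) and 70 retained users (0).
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```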

5.1 Logistic regression

[0.7366529216096935, 0.7016048745527705]

5.2 Decision tree

[0.8728884186420657, 0.8448881691422343]

5.3 Random forest

[0.8936581901455913, 0.9399374165108152]

5.4 Naive Bayes

[0.6224554131126394, 0.6610756921767458]

5.5 XGBoost

[0.8886143098362913, 0.9383456626294802]

5.6 Model comparison

Draw ROC curve

Naive Bayes performs worst, and logistic regression also performs poorly, which suggests that the data is not linearly separable. Random forest and XGBoost perform similarly, with AUC values above 0.9 and good classification results. The AUC of random forest is slightly higher, at about 0.94, so random forest is used to predict user churn.
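As a reminder of what the AUC values compare, AUC equals the probability that a randomly chosen churned user receives a higher predicted score than a randomly chosen retained user. The sketch below computes it directly from that definition; it is quadratic in the number of samples and meant only as an illustration, and the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = churned) and predicted churn scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]
auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why values above 0.9 are read as good separation.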

5.7 Optimization of the random forest model

Cross-validation
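The cross-validation step can be illustrated with a k-fold index generator: the data is cut into k folds, and each fold serves once as the validation set while the rest is used for training. The number of folds used in the project is not stated, so k is an assumption here, as is the function name.

```python
def kfold_indices(n_samples, k=5):
    """Return a list of (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indices(10, k=3)
for train, test in folds:
    print(len(train), test)
```

The folds here are contiguous; in practice the rows are usually shuffled first, so that any ordering in the file does not leak into the folds.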

Learning curve: set the number of classifiers (trees) to 80.

[0.9333570067179268, 0.97816699979759]

That is, according to this random forest model, the recall rate can reach 97.8%, and the prediction accuracy of lost customers can reach 93.3%.

The model can be directly used to predict user churn.

5.8 Key factors affecting customer churn

A random forest is used to analyze the factors that affect customer churn: the model's feature importance scores give a ranking of the features.

The 10 most important features:

The number of visits per year, the length of the last visit in a year, the number of app uv visits in the current city on the same check-in date yesterday, the length of the last order placed in a year, the number of app orders submitted by the current city on the same check-in date yesterday, the average price of hotels visited within 24 hours, the average business attribute index of hotels visited within 24 hours, the lowest price of hotels visited most within 24 hours, the number of hotel ratings visited most within 24 hours, and customer value.

VI. User Portrait Analysis

Next, users are divided into three categories by K-Means clustering method, and the characteristics of different categories of customers are observed.

K-means clustering
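For readers unfamiliar with the method, here is a compact sketch of K-Means (Lloyd's algorithm) in plain Python. It is not the project's code: the deterministic farthest-first initialization is a simple stand-in chosen to make the example reproducible, and the two-dimensional sample points are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k=3, max_iter=100):
    """Return (centroids, labels) after alternating assignment and update steps."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Illustrative, already scaled two-feature points forming three groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, labels = kmeans(points, k=3)
print(labels)
```

Because K-Means relies on Euclidean distance, the features should be standardized before clustering. The cluster numbers themselves are arbitrary; the profile of each cluster is read from its centroid.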

It can be seen that the three types of users have their own obvious characteristics, and the personalized marketing suggestions for different types of users are as follows:

Category 0 is the medium-value group: consumption level and customer value are relatively low, the frequency of visiting and booking is high, and the advance booking time is the longest of the three categories. These users spend a lot of time browsing before making a choice and are more cautious, so they are presumably users who travel abroad.

Suggestion: Push messages as often as possible, because such customers usually like browsing; recommend hotels with relatively affordable prices; push local tourism information, because such customers are more likely to be traveling.

Category 1 is the low-value group: consumption level and customer value are extremely low, the preferred price is low, and the frequency of visits and reservations is very low. The sid value is very low, indicating that there are many new customers.

Suggestion: Treat them as churned customers; do not spend much on marketing or run dedicated channel operations for them. Recommend promotions and low-priced hotels with large discounts. Because the share of new users is relatively large and many of them are potential customers, regular service pushes can be maintained.

Category 2 is the high-value group: high consumption level, high customer value, pursuit of high quality, and high price sensitivity. They have long login times, many visits and a short advance booking time, but many returns.

Suggestion: Provide more hotel information for these customers; recommend business chain hotels with good reputation and good value for money to attract them; push messages during the small daytime traffic peaks, such as 11:00 and 17:00 on non-working days.

Some comments:

1. Correlation analysis can be performed when selecting features, because some features may be highly correlated. Variables whose correlation with the target variable is below 0.01 can be eliminated, and for variables whose correlation with another variable is above 0.9, one of the pair can be deleted. Principal component analysis can also be used to reduce the dimension and combine indicators, which may further improve the model.
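The correlation filter described in comment 1 can be sketched as below. It uses the Pearson correlation and keeps the first feature of any highly correlated pair, which is one possible reading of the rule; the function names, feature names and sample values are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(features, target, min_target_corr=0.01, max_pair_corr=0.9):
    """features maps name -> list of values. Returns the names that are kept."""
    kept = []
    for name, values in features.items():
        if abs(pearson(values, target)) < min_target_corr:
            continue  # almost unrelated to the label
        if any(abs(pearson(values, features[k])) > max_pair_corr for k in kept):
            continue  # nearly duplicates a feature that is already kept
        kept.append(name)
    return kept


# Illustrative data: 1 = churned, 0 = retained.
target = [0, 0, 1, 1, 0, 1]
features = {
    "visits": [1, 2, 8, 9, 2, 7],
    "visits_doubled": [2, 4, 16, 18, 4, 14],  # perfectly correlated with "visits"
    "constant": [5, 5, 5, 5, 5, 5],           # no relation to the label
}
print(select_features(features, target))
```

Pearson correlation only captures linear relationships, so a feature with a low coefficient can still be useful to a tree-based model; the thresholds are best treated as a first-pass screen.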

2. For a finer classification of users, an RFM model can be used to analyze user value. However, the features in this project carry a lot of information, much of which would be lost in an RFM model.