This section consists of two parts: feature extraction and detection-rule building. In the feature extraction part, we describe how we extract features from the textual data. In the rule-building part, we explain how we build detection rules based on the extracted features. This section presents the core contribution of this paper: our approach to telecommunication fraud detection.
Feature extraction
In other researchers’ methods of telecommunication fraud detection, the extracted features are mainly calling numbers, calling times, calling types, etc. Our research detects fraudulent calls by identifying call content, so we need to extract features from more complex data, which poses a challenge to the extraction method. For text-type information, Natural Language Processing (Jackson and Moulinier 2007) is a suitable technique. In particular, Chinese word segmentation and part-of-speech tagging can handle Chinese content well, which is exactly what we need for processing Chinese textual information, so Natural Language Processing techniques are our preferred choice.
Inspired by the decision tree algorithm used in the data analysis phase, we designed a method to extract features from telecommunication-fraud-related text using Natural Language Processing techniques. The extraction process is shown in Fig. 2.
The first step in feature extraction is to segment the text (Gao et al. 2003). Unlike English, where spaces separate words, Chinese places words together without blank spaces, so the first step in Chinese Natural Language Processing is to separate the words within sentences. For example, the process divides the sentence “今天天气真好啊!” (The weather today is really good!) into “今天 天气 真 好 啊” (today / weather / really / good / sentence-final particle). The next step is part-of-speech tagging, which, as its name suggests, marks each word with its part of speech, such as noun, verb, adjective, or adverb.
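As a concrete illustration of this step, the following minimal sketch performs Chinese word segmentation and part-of-speech tagging with the open-source jieba toolkit; the paper does not name its segmentation tool, so the use of jieba here is an assumption.

```python
# Minimal sketch of Chinese word segmentation and POS tagging with jieba
# (an assumed toolkit; the tool actually used in this work is not specified).
import jieba
import jieba.posseg as pseg

sentence = "今天天气真好啊!"  # "The weather today is really good!"

# Segmentation only, e.g. ['今天', '天气', '真', '好', '啊', '!']
print(jieba.lcut(sentence))

# Segmentation plus POS tags, e.g. 今天/t (time), 天气/n (noun),
# 真/d (adverb), 好/a (adjective), 啊/y (particle)
for seg in pseg.cut(sentence):
    print(seg.word, seg.flag)
```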
The next step is keyword selection, which involves several sub-steps. The first is to remove stop words (Chen and Chen 2001). Stop words are words that carry little meaning after word segmentation, such as prepositions like “on”, “to”, “of”, etc. Various stop-word lists are available on the Internet, but no single list can be applied to all natural language studies. Therefore, we establish a new stop-word list. Our list contains 1601 words, covering the most common stop words as well as words we selected because they are commonly used on the phone, such as “Hello”, “Good”, “Hang up”, “Hold on”, and so on. After removing the stop words, we use programs to select keywords according to their part of speech: we remove prepositions, adverbs, and other less meaningful words, while retaining nouns, verbs, and other meaningful words. We then continue to filter the keyword list manually, mainly to remove uninformative words such as personal names and geographic names. After these steps, we obtain a keyword list extracted from the telecommunication-fraud-related textual data.
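A sketch of this keyword-selection step is given below; the stop-word file name and the exact part-of-speech filter are illustrative assumptions rather than the paper’s exact configuration.

```python
# Sketch of keyword selection: remove stop words, keep nouns/verbs, drop
# prepositions, adverbs, person names and place names. "stopwords.txt" is a
# placeholder file name; the tag filter is an illustrative assumption.
import jieba.posseg as pseg

# Custom stop-word list, one word per line (placeholder file name).
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = {line.strip() for line in f if line.strip()}

KEEP_PREFIXES = ("n", "v")          # nouns and verbs are retained
DROP_TAGS = {"nr", "ns", "p", "d"}  # person names, place names, prepositions, adverbs

def extract_keywords(text):
    keywords = []
    for seg in pseg.cut(text):
        if seg.word in stop_words:
            continue
        if seg.flag in DROP_TAGS or not seg.flag.startswith(KEEP_PREFIXES):
            continue
        keywords.append(seg.word)
    return keywords
```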
After obtaining the keyword list, we calculate the frequency of each keyword in the telecommunication fraud data and in normal data that is unrelated to telecommunication fraud. The normal data comes from THUCNews, a Chinese text classification dataset provided by the Tsinghua NLP Group (Sun et al. 2016). THUCNews was generated from the historical data of the Sina News RSS subscription channels from 2005 to 2011; we selected several of its subsets for our dataset. We then calculate a fraud tendency value for each keyword, called the “degree of correlation”, from the keyword’s frequencies in the telecommunication fraud data and the normal data. The distribution of keyword frequencies against the correlation value is shown in Fig. 3. This is the result of feature extraction.
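The frequency and degree-of-correlation computation can be sketched as follows; the per-corpus normalisation (count divided by the total number of keyword occurrences in that corpus) is an assumption, while the correlation value itself follows the definition used later, i.e. the fraud-data frequency minus the normal-data frequency.

```python
# Sketch of keyword frequency and degree-of-correlation computation.
# The normalisation is an assumption; the correlation value is
# F_f(w) - F_n(w), as in the detection rule given later.
from collections import Counter

def keyword_frequencies(documents, keywords):
    """documents: list of keyword lists, e.g. produced by extract_keywords()."""
    counts = Counter(w for doc in documents for w in doc if w in keywords)
    total = sum(counts.values()) or 1
    return {w: counts[w] / total for w in keywords}

def correlation_values(fraud_docs, normal_docs, keywords):
    f_fraud = keyword_frequencies(fraud_docs, keywords)
    f_normal = keyword_frequencies(normal_docs, keywords)
    return {w: f_fraud[w] - f_normal[w] for w in keywords}
```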
Figure 3a shows the relationship between keyword frequency in the fraud data and the correlation value; the points distribute roughly linearly, indicating a high correlation between the correlation value and the keyword frequency in the fraud data. Figure 3b shows the relationship between keyword frequency in the normal data and the correlation value; in contrast, this distribution is relatively scattered, showing that the correlation between the correlation value and the keyword frequency in the normal data is low. Figure 3c contrasts the distributions in Fig. 3a and b. From Fig. 3c, we find that the two distributions are clearly different, even though they contain the same number of data points along the Y-axis: keywords with a high correlation value have a low frequency in the normal data but a high frequency in the fraud data. These figures show that keywords are distributed differently in fraud data and normal data in terms of frequency and correlation value, which demonstrates that the extracted features can represent the characteristics of telecommunication fraud effectively and that our feature extraction method is appropriate for these textual data.
Building detection rules
After extracting the features, we build detection rules on the basis of the extracted features. We propose a method to detect telecommunication fraud based on the selected keywords. In the earlier data analysis, this study used words as the basic unit to vectorize the text and used the vectors as input to machine learning algorithms; likewise, keywords serve as the basic elements in the feature extraction step. We have also calculated each keyword’s frequency and its correlation value with telecommunication fraud. The next step is to use these features and values to build rules for detecting telecommunication fraud.
In the decision tree algorithm, the features that have the greatest impact on the result are selected by calculating the information gain (IG) (Blockeel et al. 2006b). In the same way, we select the keywords that most strongly distinguish telecommunication fraud data from normal data by analysing the distribution of keywords. The counterpart of the information gain in our method is the correlation value, which we calculate as the difference between a keyword’s frequency in the telecommunication fraud data and its frequency in the normal data.
Based on these values, we build rules for detecting telecommunication fraud. The decision tree algorithm builds nodes and branches from features. Unlike the decision tree algorithm, our method sums the correlation values of the keywords that appear in a text when predicting whether the text is related to telecommunication fraud; when the sum exceeds a threshold, the text is deemed to be fraud-related. The detection rule is formulated as follows:
$$ R=\sum_{w_i\in L}\left[F_f(w_i)-F_n(w_i)\right]-T_k $$
Here, i indexes the keywords, wi is a keyword detected in the text under test, Ff(wi) and Fn(wi) are the frequencies of keyword wi in the fraud data and the normal data, L is the keyword list used for fraud detection, and Tk is the threshold for classifying a text as telecommunication fraud. When the result R is greater than or equal to 0, the text is classified as telecommunication fraud data; otherwise, it is classified as non-fraud data.
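A direct implementation sketch of this rule, assuming the correlation values computed above and a threshold chosen empirically (its value is not specified here), is:

```python
# Sketch of the detection rule: R = sum over the detected keywords in L of
# [F_f(w_i) - F_n(w_i)], minus the threshold T_k; R >= 0 means fraud.
def is_fraud(text_keywords, correlation, threshold):
    """text_keywords: keywords detected in the text under test;
    correlation: {w: F_f(w) - F_n(w)} over the detection list L;
    threshold: T_k (an assumed, empirically chosen value)."""
    r = sum(correlation[w] for w in text_keywords if w in correlation) - threshold
    return r >= 0
```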