Use of subword tokenization for domain generation algorithm classification

Cybersecurity

Table 4 Examples of character and subword tokenization for different DGA types

DGA Type	Class name	URL and tokenization	Token counts and RIndex
Word-looking	pizd	animalforget.net; Character tokenization: “a”, “n”, “i”, “m”, “a”, “l”, “f”, “o”, “r”, “g”, “e”, “t”; Subword tokenization: “animal”, “for”, “get”	\(N(Tokens_{char}) = 12\); \(N(Tokens_{subword}) = 3\); RIndex = 0.75
Word-looking	rovnix	thesetobewarfarebecomes.net; Character tokenization: “t”, “h”, “e”, “s”, “e”, “t”, “o”, “b”, “e”, “w”, “a”, “r”, “f”, “a”, “r”, “e”, “b”, “e”, “c”, “o”, “m”, “e”, “s”; Subword tokenization: “these”, “to”, “be”, “war”, “fare”, “be”, “com”, “es”	\(N(Tokens_{char}) = 23\); \(N(Tokens_{subword}) = 8\); RIndex = 0.65
Random-looking	ramnit	ukphbhncsdpgo.com; Character tokenization: “u”, “k”, “p”, “h”, “b”, “h”, “n”, “c”, “s”, “d”, “p”, “g”, “o”; Subword tokenization: “uk”, “p”, “hb”, “hn”, “cs”, “dp”, “go”	\(N(Tokens_{char}) = 13\); \(N(Tokens_{subword}) = 7\); RIndex = 0.46
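
The RIndex values in Table 4 can be reproduced from the two token counts. The sketch below assumes the definition \(RIndex = 1 - N(Tokens_{subword}) / N(Tokens_{char})\). This excerpt does not state the formula, but the assumed form matches all three rows after rounding to two decimals. The function names `char_tokenize` and `r_index` are hypothetical, and the subword splits are copied from the table rather than produced by a trained tokenizer.

```python
def char_tokenize(domain):
    """Split the domain label into single characters, dropping the top-level domain."""
    label = domain.rsplit(".", 1)[0]  # "animalforget.net" -> "animalforget"
    return list(label)


def r_index(char_tokens, subword_tokens):
    """Assumed definition: 1 - N(subword tokens) / N(character tokens)."""
    if not char_tokens:
        raise ValueError("domain label must contain at least one character")
    return 1 - len(subword_tokens) / len(char_tokens)


# Subword splits copied from Table 4 (not generated by a tokenizer here).
table4_examples = {
    "animalforget.net": ["animal", "for", "get"],
    "thesetobewarfarebecomes.net": ["these", "to", "be", "war", "fare", "be", "com", "es"],
    "ukphbhncsdpgo.com": ["uk", "p", "hb", "hn", "cs", "dp", "go"],
}

for domain, subwords in table4_examples.items():
    chars = char_tokenize(domain)
    print(domain, len(chars), len(subwords), round(r_index(chars, subwords), 2))
```

Under this assumed definition, a higher RIndex means the subword tokenizer compresses the domain into fewer tokens relative to its character count, which is the pattern the word-looking examples show compared with the random-looking one.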