Table 4 Examples of character and subword tokenization for different DGA types

From: Use of subword tokenization for domain generation algorithm classification

| DGA Type | Class name | URL | RIndex |
| --- | --- | --- | --- |
| Word-looking | pizd | animalforget.net | Character tokenization: “a”, “n”, “i”, “m”, “a”, “l”, “f”, “o”, “r”, “g”, “e”, “t”; Subword tokenization: “animal”, “for”, “get”; \(N\left({Tokens}_{char}\right) = 12\); \(N\left({Tokens}_{subword}\right) = 3\); RIndex = 0.75 |
| Word-looking | rovnix | thesetobewarfarebecomes.net | Character tokenization: “t”, “h”, “e”, “s”, “e”, “t”, “o”, “b”, “e”, “w”, “a”, “r”, “f”, “a”, “r”, “e”, “b”, “e”, “c”, “o”, “m”, “e”, “s”; Subword tokenization: “these”, “to”, “be”, “war”, “fare”, “be”, “com”, “es”; \(N\left({Tokens}_{char}\right) = 23\); \(N\left({Tokens}_{subword}\right) = 8\); RIndex = 0.65 |
| Random-looking | ramnit | ukphbhncsdpgo.com | Character tokenization: “u”, “k”, “p”, “h”, “b”, “h”, “n”, “c”, “s”, “d”, “p”, “g”, “o”; Subword tokenization: “uk”, “p”, “hb”, “hn”, “cs”, “dp”, “go”; \(N\left({Tokens}_{char}\right) = 13\); \(N\left({Tokens}_{subword}\right) = 7\); RIndex = 0.46 |
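
All three RIndex values in the table are consistent with \(1 - N\left({Tokens}_{subword}\right) / N\left({Tokens}_{char}\right)\), rounded to two decimals. The formula is not stated in this table, so it is an inference from the three rows and should be treated as an assumption. A minimal Python sketch under that assumption follows. The subword tokens are copied from the table rather than produced by a tokenizer, and the helper names `char_tokens` and `r_index` are illustrative, not taken from the paper.

```python
def char_tokens(domain: str) -> list[str]:
    """Drop the top-level domain and split the remaining label into characters."""
    label = domain.rsplit(".", 1)[0]
    return list(label)


def r_index(n_char: int, n_subword: int) -> float:
    """Assumed formula (inferred from the table): 1 - N_subword / N_char."""
    return 1 - n_subword / n_char


# Subword tokens copied verbatim from the table; no tokenizer is run here.
examples = {
    "animalforget.net": ["animal", "for", "get"],
    "thesetobewarfarebecomes.net": ["these", "to", "be", "war", "fare", "be", "com", "es"],
    "ukphbhncsdpgo.com": ["uk", "p", "hb", "hn", "cs", "dp", "go"],
}

for domain, subwords in examples.items():
    chars = char_tokens(domain)
    score = round(r_index(len(chars), len(subwords)), 2)
    print(domain, len(chars), len(subwords), score)
```

Under this reading, a domain that splits into a few long subwords (word-looking) scores close to 1, while a domain that breaks into many one- or two-character pieces (random-looking) scores lower.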