We first narrow the confidence region of the target with a coarse-grained geolocation method inspired by CBG and SLG. Based on traceroute data collected from landmarks in this region, we mine the routers that occur frequently across all paths. In theory, if a router appears in more than three paths, it can be located by passive landmarks. As intermediate routers are usually closer to landmarks than vantage points, these routers are precisely located with the following algorithm.
Feature selection
Previous methods choose network latencies as geographical distance constraints. However, in the narrowed region, such constraints are loose. Therefore, we use both latency (RTT) and hop count (N) as network environment constraints. Denote the set of intermediate routers as R={Rm∣m∈[0,M]}, vantage points as V={Vk∣k∈[0,K]}, and landmarks as L={Li∣i∈[0,I]}, where M, K, and I are the numbers of routers, vantage points, and landmarks. For each pair of intermediate router Rm and path Pth(Vk,Li) that traverses it, we calculate the latency and hop count between Rm and Li as
$$ \begin{aligned} RTT(R_{m}, L_{i}) & = RTT(V_{k}, L_{i}) - RTT(V_{k}, R_{m}) \\ N(R_{m}, L_{i}) & = N(V_{k}, L_{i}) - N(V_{k}, R_{m}). \end{aligned} $$
(1)
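As a minimal sketch of Eq. 1, assume each traceroute is stored as an ordered list of (hop address, RTT in ms) pairs that ends at the landmark. This data layout, the function name, and the sample addresses are illustrative assumptions, not part of the method description.

```python
def router_to_landmark_features(path, router):
    """Apply Eq. 1 to one traceroute path from a vantage point to a landmark.

    `path` is a hypothetical layout: a list of (address, rtt_ms) tuples ordered
    by hop, whose last entry is the landmark itself.
    Returns (RTT(R_m, L_i), N(R_m, L_i)), or None if the router is not on the path.
    """
    addresses = [addr for addr, _ in path]
    if router not in addresses:
        return None
    router_idx = addresses.index(router)
    landmark_idx = len(path) - 1
    rtt = path[landmark_idx][1] - path[router_idx][1]  # RTT(V_k, L_i) - RTT(V_k, R_m)
    hops = landmark_idx - router_idx                   # N(V_k, L_i) - N(V_k, R_m)
    return rtt, hops


# Made-up sample path: three routers followed by the landmark.
path = [("10.0.0.1", 1.2), ("10.0.1.1", 4.8), ("10.0.2.1", 6.1), ("192.0.2.7", 7.5)]
features = router_to_landmark_features(path, "10.0.1.1")
```

Measured RTTs are noisy, so the difference can come out negative in practice; how such samples are treated is not specified in the text.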
Distance estimation maps measurement data to geographical distance. To find the best distance estimation algorithm in the narrowed region, we use three different ways to convert network constraints to geographical constraints.
Linear estimation
Because the network environment is bound to its geographical region, we assume that latency inflation is small. The geographical distance between two nodes is therefore roughly proportional to the propagation delay. Geolocation methods usually measure the total delay (RTT) because the propagation delay cannot be measured directly. We ignore the detailed topology between common routers and end nodes and represent it by the N term, because the other delay components (processing, queuing, transmission, etc.) are positively correlated with the number of intermediate nodes.
Denote the latency and hop count between intermediate router Rm and landmark Li as RTTmi and Nmi; the linearly estimated distance between the two nodes can then be written as:
$$ \begin{aligned} d_{mi} & = d(R_{m}, L_{i}) \\ & = f_{1}({RTT}_{mi}, N_{mi}) \\ & = \theta_{0} + \theta_{1} \cdot {RTT}_{mi} + \theta_{2} \cdot N_{mi}. \end{aligned} $$
(2)
We train the coefficients θ = (θ0,θ1,θ2) on all relative paths between landmarks. Denote landmarks as L={L1,L2,…,Ln} and vantage points as V={V1,V2,…,Vm}. For each pair of landmarks Li,Lj∈L (with the corresponding vantage point Vk∈V), we use the relative delay rRTTij and relative hop count rNij between Li and Lj:
$$ \begin{aligned} {rRTT}_{ij} & = RTT(V_{k}, L_{i}) + RTT(V_{k}, L_{j}) - 2RTT(V_{k}, R_{ij}) \\ {rN}_{ij} & = N(V_{k}, L_{i}) + N(V_{k}, L_{j}) - 2N(V_{k}, R_{ij}) \end{aligned} $$
(3)
as training data, and use the L1 distance:
$$ L = \sum_{i \neq j}{ \big\| g(L_{i}, L_{j}) - d(L_{i}, L_{j}) \big\| } $$
(4)
as the loss function, where Rij, gij, and dij denote the common router, the estimated distance, and the geographical distance between Li and Lj. The network characteristics of this region can then be fitted with existing linear regression methods (e.g., gradient descent or least-squares regression).
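The fitting step can be sketched as follows. This is an illustration under stated assumptions: it uses ordinary least squares, one of the options named above, so it minimizes squared error rather than the L1 loss of Eq. 4. The function names and the synthetic samples are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_theta(samples):
    """Fit d ~ theta0 + theta1 * rRTT + theta2 * rN by ordinary least squares.

    `samples` is a list of (rRTT_ij, rN_ij, distance_ij) tuples (Eq. 3 features
    plus the known distance between the two landmarks).
    """
    rows = [(1.0, rtt, hops) for rtt, hops, _ in samples]
    targets = [dist for _, _, dist in samples]
    # Normal equations: (X^T X) theta = X^T y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    xty = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(3)]
    return solve_linear_system(xtx, xty)


def estimate_distance(theta, rtt, hops):
    """Eq. 2: linear distance estimate from latency and hop count."""
    return theta[0] + theta[1] * rtt + theta[2] * hops


# Synthetic, made-up samples generated from d = 5 + 20 * rRTT - 1.5 * rN.
pairs = [(1.0, 4), (2.5, 6), (3.0, 9), (4.2, 7), (0.8, 3)]
samples = [(r, n, 5 + 20 * r - 1.5 * n) for r, n in pairs]
theta = fit_theta(samples)
```

On noise-free synthetic data the fit recovers the generating coefficients; with real measurements, a robust or L1-based solver may be closer to the loss in Eq. 4.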
Non-linear estimation
Noticing that hop counts between landmarks in a moderately connected Internet are usually large, we filter out paths whose hop count exceeds a threshold. The choice of threshold varies with the network environment. Another solution is statistical estimation. We still use dij as training data, (rRTTij,rNij) as training features, and L1 as the loss function. Instead of linear regression, we use a truncated normal distribution:
$$ \begin{aligned} p(d \vert RTT, N) & = \frac{1}{\Phi\left(\mu/\sigma\right)} \cdot \frac{1}{\sqrt{2\pi} \cdot \sigma} \cdot \exp{\left(-\frac{{(d - \mu)}^{2}}{2\sigma^{2}}\right)}\\ \sigma & = \sigma(d \vert RTT, N) \\ \mu & = \mu(d \vert RTT, N) \end{aligned} $$
(5)
as the kernel function to estimate geographical distance by maximum likelihood, where Φ is the cumulative distribution function of the standard normal distribution, so that Φ(μ/σ) normalizes the density over d ≥ 0. We choose the normal distribution because it is well defined. We also use the gamma distribution:
$$ \begin{aligned} p(d \vert RTT, N) & = \frac{1}{\beta^{\alpha} \cdot \Gamma(\alpha)} \cdot d^{\alpha - 1} \cdot \exp{\left(-d / \beta\right)} \\ \alpha & = f_{\alpha}(RTT, N) \\ \beta & = f_{\beta}(RTT, N) \end{aligned} $$
(6)
as the kernel function to get a more general result.
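The two densities in Eqs. 5 and 6 can be evaluated with the Python standard library alone. The sketch below takes μ, σ, α, and β as plain inputs, because the text does not specify how they are obtained from (RTT, N); the function names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_pdf(d, mu, sigma):
    """Eq. 5: normal density restricted to d >= 0 and normalized by Phi(mu / sigma)."""
    if d < 0:
        return 0.0
    density = math.exp(-((d - mu) ** 2) / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)
    return density / standard_normal_cdf(mu / sigma)


def gamma_pdf(d, alpha, beta):
    """Eq. 6: gamma density with shape alpha and scale beta, for d > 0."""
    if d <= 0:
        return 0.0
    # Work in log space so large alpha does not overflow Gamma(alpha).
    log_density = (alpha - 1.0) * math.log(d) - d / beta - alpha * math.log(beta) - math.lgamma(alpha)
    return math.exp(log_density)
```

Either function can serve as the conditional density p(d | RTT, N) once a mapping from (RTT, N) to its parameters has been fitted.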
Maximum likelihood estimation
As shown in Fig. 4, we use maximum likelihood estimation with landmarks to geolocate the target router Rm. The likelihood function depends on the distance estimation method. The main purpose of maximum likelihood estimation is to find a point \((x^{\prime }, y^{\prime })\) that maximizes the target likelihood function. Assume that we have K landmarks with geographical locations (x1,y1),(x2,y2),…,(xK,yK). When locating an intermediate router, we search for the landmarks that connect to it and denote them as (Lm1,Lm2,…,Lmk).
Linear estimation. Geographical distances can be calculated with the coefficients θ trained earlier. The maximum likelihood result satisfies the following equations:
$$ \left\{ \begin{array}{r@{\;\,\;}l} g(R_{m}, L_{m1}) & = d(R_{m}, L_{m1}) \\ g(R_{m}, L_{m2}) & = d(R_{m}, L_{m2}) \\ & \dots \\ g(R_{m}, L_{mk}) & = d(R_{m}, L_{mk}) \\ \end{array} \right. $$
(7)
The great-circle distance gij, with x denoting longitude and y denoting latitude, is written as
$$ \begin{aligned} g_{ij} & = R \cdot \arccos\bigl(\sin{y_{i}}\sin{y_{j}} + \cos{y_{i}}\cos{y_{j}}\cos(x_{i} - x_{j})\bigr)\\ & \approx R' \cdot \sqrt{{(x_{i} - x_{j})}^{2} + {(y_{i} - y_{j})}^{2}}. \end{aligned} $$
(8)
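To see how the planar approximation compares with the spherical distance, the following sketch evaluates both. It uses the standard spherical law of cosines (arccos form, with x as longitude and y as latitude in degrees). The Earth radius value and the choice of R' as kilometres per degree are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0                         # assumed mean Earth radius for R
KM_PER_DEGREE = EARTH_RADIUS_KM * math.pi / 180  # assumed value for R'


def great_circle_km(lon1, lat1, lon2, lat2):
    """Spherical law of cosines; all inputs in degrees."""
    lam1, phi1, lam2, phi2 = map(math.radians, (lon1, lat1, lon2, lat2))
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(lam1 - lam2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def planar_km(lon1, lat1, lon2, lat2, scale=KM_PER_DEGREE):
    """Second line of Eq. 8: one scale factor times the Euclidean coordinate distance."""
    return scale * math.hypot(lon1 - lon2, lat1 - lat2)


# Two made-up points 0.1 degrees of latitude apart on the same meridian.
exact = great_circle_km(116.0, 39.9, 116.0, 40.0)
approx = planar_km(116.0, 39.9, 116.0, 40.0)
```

A single scale factor treats a degree of longitude like a degree of latitude, so with raw degree coordinates the approximation degrades for east-west offsets away from the equator; projecting coordinates to a local plane first avoids this.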
With the planar approximation in Eq. 8, we can simplify Eq. 7 to
$$ \left\{ \begin{array}{r@{\;\,\;}l} {(x_{m1} - x_{m})}^{2} & + {(y_{m1} - y_{m})}^{2} = {(d_{m1} / R')}^{2} \\ & \dots \\ {(x_{mk} - x_{m})}^{2} & + {(y_{mk} - y_{m})}^{2} = {(d_{mk} / R')}^{2} \\ \end{array} \right. $$
(9)
Note that the geographical distance between two points does not precisely satisfy Eq. 9 unless they are close to each other. Our algorithm is localized, so this approximation is acceptable. Subtracting the last equation from each of the others cancels the quadratic terms in (xm, ym) and reduces Eq. 9 to a linear system
$$ \boldsymbol{AX = b}, $$
(10)
where
$$ \boldsymbol{A} = \left[{ \begin{array}{cc} 2(x_{m1} - x_{mk}) & 2(y_{m1} - y_{mk}) \\ \dots & \dots \\ 2(x_{mk-1} - x_{mk}) & 2(y_{mk-1} - y_{mk}) \\ \end{array} }\right] $$
(11)
$$ \boldsymbol{b} = \left[{ \begin{array}{c} x^{2}_{m1} - x^{2}_{mk} + y^{2}_{m1} - y^{2}_{mk} + (d^{2}_{mk} - d^{2}_{m1}) / R'^{2} \\ \dots \\ x^{2}_{mk-1} - x^{2}_{mk} + y^{2}_{mk-1} - y^{2}_{mk} + (d^{2}_{mk} - d^{2}_{mk-1}) / R'^{2} \\ \end{array} }\right] $$
(12)
and
$$ \boldsymbol{X} = \left[{ \begin{array}{c} x_{m} \\ y_{m} \\ \end{array} }\right] $$
(13)
The least-squares estimate of X is then given by
$$ {X={(A^{T} A)}^{-1} A^{T} b}. $$
(14)
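Eqs. 10 to 14 can be sketched in a few lines. The version below assumes planar landmark coordinates and distances already divided by R', uses the last landmark as the reference row, and writes out the 2×2 inverse of AᵀA explicitly. The function name and the sample coordinates are hypothetical.

```python
def locate_by_least_squares(landmarks, distances):
    """Solve A X = b (Eqs. 10-14) for the router position (x_m, y_m).

    `landmarks` is a list of (x, y) planar coordinates and `distances` the
    matching estimated distances in the same unit (that is, d / R').
    The last landmark is the reference subtracted from all the others.
    """
    if len(landmarks) < 3 or len(landmarks) != len(distances):
        raise ValueError("need at least three landmarks with one distance each")
    xk, yk = landmarks[-1]
    dk = distances[-1]
    rows, rhs = [], []
    for (xi, yi), di in zip(landmarks[:-1], distances[:-1]):
        rows.append((2.0 * (xi - xk), 2.0 * (yi - yk)))            # row of A
        rhs.append(xi**2 - xk**2 + yi**2 - yk**2 + dk**2 - di**2)  # entry of b
    # Normal equations (A^T A) X = A^T b with the 2x2 inverse written out.
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * c for a, c in rows)
    s22 = sum(c * c for _, c in rows)
    t1 = sum(a * v for (a, _), v in zip(rows, rhs))
    t2 = sum(c * v for (_, c), v in zip(rows, rhs))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("landmarks are collinear; A^T A is singular")
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Made-up example: four landmarks and exact distances to the point (1, 1).
landmarks = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
distances = [2.0**0.5, 10.0**0.5, 5.0**0.5, 13.0**0.5]
position = locate_by_least_squares(landmarks, distances)
```

With exact distances the solution coincides with the true point; with estimated distances the result is the least-squares compromise among the circles.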
Non-linear estimation. As discussed before, linear estimation loses network structure. We therefore use the log-likelihood function
$$ \mathcal{L}(x) = \sum_{i = 1}^{K}{\log{ \Bigl(P\bigl(d(x, L_{i}) \mid ({RTT}_{mi}, N_{mi}) \bigr) \Bigr) }}. $$
(15)
The target location xm is the point that maximizes the likelihood function
$$ \hat{x}_{m} = \arg\max_{x \in C} \mathcal{L}(x). $$
(16)
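One simple way to evaluate Eqs. 15 and 16 is an exhaustive search over a grid. The sketch below is an illustration only: it uses a rectangle as a stand-in for the region C, planar coordinates, and a made-up toy density in place of a fitted model from Eq. 5 or Eq. 6. All names are hypothetical.

```python
import math


def log_likelihood(point, landmarks, density):
    """Eq. 15: sum of log densities of the distances from `point` to each landmark.

    `landmarks` holds (x, y, rtt, hops) tuples; `density(d, rtt, hops)` is any
    conditional density p(d | RTT, N).
    """
    total = 0.0
    for lx, ly, rtt, hops in landmarks:
        d = math.hypot(point[0] - lx, point[1] - ly)
        p = density(d, rtt, hops)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def grid_search_mle(landmarks, density, x_range, y_range, steps=50):
    """Eq. 16 by exhaustive search over a rectangular stand-in for region C."""
    best_point, best_value = None, float("-inf")
    for i in range(steps + 1):
        x = x_range[0] + (x_range[1] - x_range[0]) * i / steps
        for j in range(steps + 1):
            y = y_range[0] + (y_range[1] - y_range[0]) * j / steps
            value = log_likelihood((x, y), landmarks, density)
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point


def toy_density(d, rtt, hops):
    """Hypothetical model: normal density centred on 10 * rtt with sigma = 1 + hops."""
    sigma = 1.0 + hops
    return math.exp(-((d - 10.0 * rtt) ** 2) / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)


# Made-up landmarks whose RTTs are consistent with a router at (10, 10).
true_x, true_y = 10.0, 10.0
landmarks = [(lx, ly, math.hypot(true_x - lx, true_y - ly) / 10.0, 2)
             for lx, ly in [(0.0, 0.0), (40.0, 0.0), (0.0, 30.0)]]
estimate = grid_search_mle(landmarks, toy_density, (0.0, 40.0), (0.0, 40.0), steps=40)
```

Grid search is easy to verify but scales with the square of the resolution; a gradient-based or coarse-to-fine optimizer would be a natural replacement for larger regions.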
Locating the target host
Previous works usually focus on geolocating the target host, although intermediate routers are usually more stable than end hosts. Once these routers are located, we can easily find the intermediate router nearest to the target, which is usually closer to it than any landmark. As shown in Fig. 5, when geolocating a reachable target T or an unreachable host U, we find the nearest router by searching the traceroute data, without any further calculation.
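The lookup can be sketched as a backward scan over the traceroute path. This assumes that "nearest" means the last already-located router on the path toward the target, which is an interpretation rather than something the text states; the data layout and names are hypothetical.

```python
def nearest_located_router(path_to_target, router_locations):
    """Return (router, location) for the located router closest to the end of the path.

    `path_to_target` is an ordered list of hop addresses from a vantage point
    toward the target; for an unreachable host it stops at the last hop that
    answered. `router_locations` maps a router address to its (x, y) estimate.
    """
    for hop in reversed(path_to_target):
        if hop in router_locations:
            return hop, router_locations[hop]
    return None


# Made-up example: the last located router before the target is 10.0.1.1.
router_locations = {"10.0.0.1": (0.0, 0.0), "10.0.1.1": (1.0, 1.0)}
path_to_target = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "198.51.100.4"]
match = nearest_located_router(path_to_target, router_locations)
```

The same function covers both cases in Fig. 5, since an unreachable host only shortens the path that is scanned.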