Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.
Authors:
(1) Diwen Xue, University of Michigan;
(2) Reethika Ramesh, University of Michigan;
(3) Arham Jain, University of Michigan;
(4) Michalis Kallitsis, Merit Network, Inc.;
(5) J. Alex Halderman, University of Michigan;
(6) Jedidiah R. Crandall, Arizona State University/Breakpointing Bad;
(7) Roya Ensafi, University of Michigan.
3 Challenges in Real-world VPN Detection
4 Adversary Model and Deployment
5 Ethics, Privacy, and Responsible Disclosure
6 Identifying Fingerprintable Features and 6.1 Opcode-based Fingerprinting
6.3 Active Server Fingerprinting
6.4 Constructing Filters and Probers
7 Fine-tuning for Deployment and 7.1 ACK Fingerprint Thresholds
7.2 Choice of Observation Window N
7.4 Server Churn for Asynchronous Probing
7.5 Probe UDP and Obfuscated OpenVPN Servers
9 Evaluation & Findings and 9.1 Results for control VPN flows
12 Acknowledgement and References
Figure 10 shows an hour-level breakdown of the evaluation statistics, excluding control flows. Overall, the Filter and the Prober each reduce the number of suspected flows by several orders of magnitude; combined, they flagged 3,638 flows as OpenVPN connections. We manually analyzed these flows to confirm our detection results.
Among the 3,638 flows, the destination servers for 469 of them respond to our Base Probe #1 with an explicit server reset, indicating the presence of a legitimate OpenVPN server not configured with HMAC protection. Of the remaining 3,169 flows, we first noticed that 2,580 are between a single IP pair. According to our logs, the client initiates a connection every 4 minutes to the server on port 1194 (assigned to OpenVPN). A reverse DNS lookup associates the client IP with the “lib-locker” subdomain under a private university in the US. Furthermore, the server runs a TLS service listening on port 443, which sends a certificate belonging to a smart locker company with subject and issuer CN as “vpn.COMPANY- .com”. Based on this evidence, we believe the captured flows correspond to the secure communications between a deployed smart locker and the infrastructure that controls it. This also suggests that the fingerprintability of OpenVPN may not only be a problem for censorship circumvention: it may also be used for reconnaissance, to identify and target IoT devices that communicate with their servers over an OpenVPN channel. Finally, we attempt to further characterize the remaining 589 flows based on circumstantial evidence about the destination endpoint.
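The observation that most of the remaining flows share one IP pair and recur on a fixed schedule can be reproduced with a simple grouping step. The following is a minimal sketch, not the authors' tooling: the record layout `(timestamp_seconds, client_ip, server_ip)`, the function name `dominant_pair`, and the sample addresses (taken from documentation ranges) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def dominant_pair(flows):
    """Return ((client, server), flow_count, median_gap_seconds) for the busiest pair.

    `flows` is an iterable of (timestamp_seconds, client_ip, server_ip) tuples.
    This record layout is hypothetical, chosen only for this example.
    """
    by_pair = defaultdict(list)
    for ts, client, server in flows:
        by_pair[(client, server)].append(ts)

    # Pick the IP pair that accounts for the most flagged flows.
    pair, stamps = max(by_pair.items(), key=lambda kv: len(kv[1]))

    # The median gap between consecutive connections reveals a fixed schedule.
    stamps.sort()
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return pair, len(stamps), (median(gaps) if gaps else None)


# Synthetic data: one pair reconnecting every 240 s, plus two unrelated flows.
flows = [(t * 240, "192.0.2.10", "198.51.100.7") for t in range(5)]
flows += [(100, "192.0.2.11", "198.51.100.8"), (900, "192.0.2.12", "198.51.100.9")]
pair, count, gap = dominant_pair(flows)
```

A single pair with a large flow count and a stable gap (240 seconds corresponds to the 4-minute interval described above) is a strong hint of an automated device rather than a human user.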
Co-location with TLS. In practice, TLS is the most common application we have seen that is co-located with an OpenVPN instance. For each of the remaining flows, we probe its destination endpoint with a TLS Client Hello and analyze the certificate and web page returned. Endpoints of 40 flows return certificates whose subject or issuer CN suggest VPN activity, such as *.vpn.ipvanish.com, *.vpn.wlvpn.com, *.virtualshield.org, and OpenVPN Web CA. In addition, 16 endpoints serve OpenVPN web interfaces over TLS.
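The certificate check described here amounts to matching the subject and issuer CN against strings that hint at a VPN service. Below is a minimal sketch of that matching step only; it assumes the two CN strings have already been extracted from the certificate returned by the endpoint. The hint list `VPN_HINTS` and the function name `cn_suggests_vpn` are illustrative assumptions, not the criteria used in the paper, where the analysis was manual.

```python
# Substrings treated as evidence of VPN activity. This list is hypothetical
# and derived only from the example CNs quoted in the text.
VPN_HINTS = ("vpn", "virtualshield")


def cn_suggests_vpn(subject_cn, issuer_cn, hints=VPN_HINTS):
    """Return True if either CN contains one of the hint substrings.

    Either argument may be None when the certificate lacks that field.
    """
    names = [name.lower() for name in (subject_cn, issuer_cn) if name]
    return any(hint in name for name in names for hint in hints)


examples = [
    ("*.vpn.ipvanish.com", "Example Issuing CA"),
    ("server.example.org", "OpenVPN Web CA"),
    ("www.example.org", "Example Issuing CA"),
]
results = [cn_suggests_vpn(subject, issuer) for subject, issuer in examples]
```

Substring matching is deliberately loose; in a manual review such as the one described, each hit would still be inspected by a person before being counted as supporting evidence.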
WHOIS, DNS PTR, ISP Name. We look up the WHOIS and DNS PTR records of the destination endpoints. The server IPs of 41 flows (11 distinct IPs) have WHOIS records that can be linked back to a VPN provider, such as protonvpn-*, PRIVADO-*, and secureconnectivity-*. In addition, 2 servers have DNS PTR records of the form *.strong.blackoakcomputers.com and fosvpncluster.fos.*.com.
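A lookup of this kind can be sketched with the standard library alone. The example below is an illustration under stated assumptions rather than the procedure used in the evaluation: `ptr_name` wraps Python's `socket.gethostbyaddr` for the reverse DNS step, and `matches_provider` compares a record name (for instance a WHOIS network name, obtained separately) against glob patterns. The pattern tuple reuses the prefixes quoted above; the helper names and the sample record names are hypothetical.

```python
import socket
from fnmatch import fnmatchcase

# Glob patterns built from the provider prefixes quoted in the text,
# lowercased so that matching is case-insensitive.
PROVIDER_PATTERNS = ("protonvpn-*", "privado-*", "secureconnectivity-*")


def ptr_name(ip):
    """Return the reverse DNS (PTR) hostname for `ip`, or None on failure."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def matches_provider(record_name, patterns=PROVIDER_PATTERNS):
    """Return True if a WHOIS or PTR name matches any provider pattern."""
    name = record_name.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical record names; no network access is needed for this step.
checks = {name: matches_provider(name)
          for name in ("PRIVADO-EXAMPLE", "protonvpn-example", "EXAMPLE-HOSTING")}
```

Calling `ptr_name` requires network access and a resolver, so it is kept separate from the pure matching function.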
IP Context Service. Several online platforms claim to offer VPN IP databases or IP context services. We found 124 flows that can be linked to a commercial VPN server IP by the lookup service hosted on spur.us. However, these services do not disclose their specific methodology, and their accuracy has not been systematically evaluated.
Our 7-day evaluation flagged 3,638 flows as “OpenVPN” out of more than 10 million flows that exceeded our observation window. Among these, we were able to find evidence that supports our detection result for 3,245 flows. The majority of the remaining 393 flows have server IPs belonging to cloud hosting services, and we are not able to classify them further. Conservatively, treating all 393 unclassified flows as false positives, we can upper-bound the false positive rate at 0.0039%, which is three orders of magnitude lower than previous ML-based approaches (1.4%-5.5%) [3, 14, 26].
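The counts reported in this section can be checked with a few lines of arithmetic. The sketch below uses only figures from the text; the one assumption is the denominator, which is set to exactly 10 million even though the text says "over 10 million". A smaller denominator can only make the computed rate larger, so the result remains an upper bound.

```python
flagged = 3638            # flows flagged by Filter and Prober combined
explicit_reset = 469      # servers answering Base Probe #1 with a reset
single_ip_pair = 2580     # flows between the smart locker and its server
unclassified = 393        # flows with no supporting evidence found

# Flows left for circumstantial analysis (TLS, WHOIS, PTR, IP context).
remaining = flagged - explicit_reset - single_ip_pair

# Flows for which some supporting evidence was found.
supported = flagged - unclassified

# Assumption: exactly 10 million analyzed flows (the text says "over").
total_flows = 10_000_000
fpr_upper_bound_percent = unclassified / total_flows * 100

print(remaining, supported, round(fpr_upper_bound_percent, 4))
```

The bound is conservative in a second sense as well: it counts every unclassified flow as a false positive, although some of them may be genuine OpenVPN connections hosted on cloud providers.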
This paper is available on arXiv under the CC BY 4.0 DEED license.