When helping customers migrate to Aviatrix, we usually follow the standard migration process, which is documented here. A few customers may need a customized migration architecture because of their existing architecture and special requirements. These customized migration architectures require additional testing to understand the potential impact on traffic flow. Most enterprise customers have a monitoring system in place, but in the lab/dev/QA phase we may not have the luxury of a full-fledged monitoring system. In this blog post, I will show you a simple method to log connectivity along the data paths you care about, without breaking a sweat or the bank.
One such scenario is deploying Azure Route Server into a customer's existing vNet that contains an ExpressRoute Gateway. Microsoft posted this warning:
Warning
When you create or delete an Azure Route Server in a virtual network that contains a virtual network gateway (ExpressRoute or VPN), expect downtime until the operation completes.
In lab testing, creating the Azure Route Server took approximately 17 minutes. Does that mean the customer needs to plan for 17 minutes of downtime? How do we monitor and measure the downtime?
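If you want to time the deployment itself, one option is to wrap the Route Server creation in the shell's time command. This is only a rough sketch: the resource group, subnet ID, and public IP name below are placeholders for your environment, and you should confirm the exact flags with az network routeserver create --help for your CLI version.
# Rough sketch: measure how long the Route Server deployment takes.
# All resource names and IDs below are placeholders.
time az network routeserver create \
  --resource-group my-er-rg \
  --name my-route-server \
  --hosted-subnet "/subscriptions/<sub-id>/resourceGroups/my-er-rg/providers/Microsoft.Network/virtualNetworks/my-er-vnet/subnets/RouteServerSubnet" \
  --public-ip-address my-route-server-pip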
On top of that, although ping (ICMP) is the most commonly used way to validate connectivity, it's possible that the initial traffic and the return traffic take different data paths. Many enterprise customers also have stateful firewalls in the data path. A stateful firewall keeps a TCP session table and only allows returning traffic that matches a session it previously permitted, but it doesn't track sessions for UDP or ICMP. So if the initial traffic and the return traffic take different paths, ICMP, DNS (UDP), and NTP would work just fine, while SSH and HTTP would fail. To understand this further, give this article a read: Finding & Fixing Asymmetric Routing Issues
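As a quick illustration, you can compare ICMP and TCP reachability to the same target side by side; if ping works but the TCP connections time out, asymmetric routing through a stateful firewall is a likely suspect. A minimal sketch, where the target IP and username are just placeholders:
# ICMP may still succeed even when the path is asymmetric
ping -c 3 10.1.13.1
# TCP (SSH and HTTP) will time out if the returning packets are dropped by a stateful firewall
ssh -o ConnectTimeout=3 ubuntu@10.1.13.1
curl -m 3 http://10.1.13.1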
When building lab environments, I often use the following Terraform modules to build Ubuntu test instances that allow ICMP, SSH, and HTTP (for the public instances):
https://registry.terraform.io/modules/jye-aviatrix/aws-linux-vm-public/aws/latest
https://registry.terraform.io/modules/jye-aviatrix/aws-linux-vm-private/aws/latest
https://registry.terraform.io/modules/jye-aviatrix/azure-linux-vm-public/azure/latest
https://registry.terraform.io/modules/jye-aviatrix/azure-linux-vm-private/azure/latest
Since testing with ICMP alone isn't good enough and HTTP isn't available on every instance, the question becomes: how do I test SSH connectivity and keep track of it?
There are utilities such as tcpping that work almost like ping but use TCP instead of ICMP. However, these utilities are usually not built into the Ubuntu image and require additional installation; when your private instance doesn't have egress access to the Internet, you can't install these packages.
There's a built-in tool in Ubuntu to test TCP connectivity: nc
Usage:
nc -zv <target_ip> <target_port>
Example:
nc -zv 10.1.13.1 22
Success:
Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
If the remote end isn't responding, nc will just sit there with a blinking cursor until it hits the distro's default timeout. Let's tell nc to time out after 1 second, just like the ping command:
nc -zvw 1 10.1.13.1 22
Now, after one second, if the connection fails:
connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Great. Now if we just add a timestamp to each connection attempt and put it in a loop, we have our simple logging script!
while sleep 1; do echo "$(date): $(nc -zvw 1 <target_ip> <target_port> 2>&1)"; done >> <log_file>
Example:
while sleep 1; do echo "$(date): $(nc -zvw 1 10.16.0.132 22 2>&1)"; done >> mgmt_to_pci_spk.txt
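Since the change window can be long and your SSH session to the test instance might drop, it can also help to run the logger detached. Here's one way using nohup (tmux or screen work just as well); the target and file name are the same placeholders as in the example above:
# Run the logger in the background so it survives an SSH disconnect
nohup bash -c 'while sleep 1; do echo "$(date): $(nc -zvw 1 10.16.0.132 22 2>&1)"; done >> mgmt_to_pci_spk.txt' &
# Follow the log live
tail -f mgmt_to_pci_spk.txt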
Example output of the testing:
Thu May 18 13:44:41 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:44:42 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:44:43 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:44:44 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:44:46 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:44:47 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:49 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:51 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:53 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:55 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:57 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:44:59 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:01 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:03 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:05 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:07 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:09 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:11 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:13 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:15 UTC 2023: nc: connect to 10.1.13.1 port 22 (tcp) timed out: Operation now in progress
Thu May 18 13:45:17 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:45:18 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:45:19 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Thu May 18 13:45:20 UTC 2023: Connection to 10.1.13.1 22 port [tcp/ssh] succeeded!
Now we know that between 13:44:47 UTC and 13:45:17 UTC we had 30 seconds of downtime caused by deploying Azure Route Server into the ExpressRoute Gateway virtual network.
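Rather than eyeballing the log, you can pull the failure window straight out of the file. A small sketch against the log file from the earlier example:
# First and last failed probes bracket the outage window
grep 'timed out' mgmt_to_pci_spk.txt | head -n 1
grep 'timed out' mgmt_to_pci_spk.txt | tail -n 1
# Rough downtime estimate: each failed probe takes about two seconds
# (one second of sleep plus the one second nc timeout)
echo "$(( $(grep -c 'timed out' mgmt_to_pci_spk.txt) * 2 )) seconds of downtime (approx.)"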
Keep in mind that Azure Route Server needs to work with the Azure fabric to program routes in both the ExpressRoute Gateway vNet and its peered spoke vNets, so you will have to test all the data paths; between the spokes, I logged around 110 seconds of downtime. The measurement also depends on the number of routes being propagated between on-prem and the cloud, so your test results may vary.
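To cover several data paths at once (on-prem to the gateway vNet, on-prem to each spoke, spoke to spoke, and so on), you can start one logger per target. A sketch with placeholder target IPs; each path gets its own log file:
# Start one background logger per data path (targets are placeholders)
for target in 10.16.0.132 10.32.0.10 10.48.0.10; do
  nohup bash -c "while sleep 1; do echo \"\$(date): \$(nc -zvw 1 $target 22 2>&1)\"; done >> path_to_${target}.txt" &
done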