Network Device Interferes with Always On Connections

Malcolm Stewart edited this page Aug 3, 2021 · 12 revisions

The Players

| IP Address | Computer Role | Listener Role |
|---|---|---|
| 172.26.25.102 | Client in Data Center A | |
| 172.26.26.71 | Always-On Listener IP Address in Data Center A | Primary |
| 172.26.121.194 | Always-On Listener IP Address in Data Center B | Secondary |
| 172.26.6.54 | Another Always-On Listener IP Address in Data Center B | Secondary |

Symptom

Several SSIS jobs copied data to various Always-On clusters by connecting to the Listener name. The primary and secondary servers were in separate data centers on separate subnets, so each Listener name had 2 IP addresses associated with it. The connection string used the MultiSubnetFailover=true keyword, which makes the driver connect to each IP address in parallel to optimize the connection speed.
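The parallel connection attempt can be illustrated with a minimal Python sketch. This is not the driver's implementation; it only shows the idea of racing one TCP connect per IP address and keeping whichever completes first. The function name `race_connect` and its parameters are hypothetical.

```python
import socket
import threading


def race_connect(addresses, port, timeout=5.0):
    """Try a TCP connect to every address at once.

    Returns (winner_ip, socket) for the first address whose handshake
    completes, or (None, None) if every attempt fails or times out.
    """
    if not addresses:
        return (None, None)

    lock = threading.Lock()
    done = threading.Event()
    state = {"winner": None, "pending": len(addresses)}

    def attempt(ip):
        try:
            sock = socket.create_connection((ip, port), timeout=timeout)
        except OSError:
            sock = None
        with lock:
            if sock is not None:
                if state["winner"] is None:
                    state["winner"] = (ip, sock)  # first success wins
                else:
                    sock.close()  # a later success lost the race
            state["pending"] -= 1
            if state["winner"] is not None or state["pending"] == 0:
                done.set()

    for ip in addresses:
        threading.Thread(target=attempt, args=(ip,), daemon=True).start()

    done.wait()
    with lock:
        return state["winner"] or (None, None)
```

In a healthy configuration, only the Listener IP address that is online on the primary answers, so it is the only address that can win the race.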

After one weekend, the jobs started failing about 50% of the time with the following error message:

Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server.

Restarting the job would normally allow the job to complete.

As a temporary workaround, the Connection Managers in the jobs were configured to connect directly to the Primary node computer name rather than the Listener name. This stabilized the jobs but there would be issues if a cluster needed to be failed over.

Data Collection

Several network traces were taken but the failure was not readily apparent.

A driver BID Trace was collected to see what decisions the driver made during the failure.

BID Trace Analysis

< BID Trace Image >

In the BID trace, we can see the TcpConnection::FInit and TcpConnection::FInitForAsync API calls that connect to both the Secondary Listener IP address (172.26.121.194, ID 543#, shown in yellow) and the Primary Listener IP address (172.26.25.71, ID 544#, shown in green).

Note: The order of IP addresses is random and depends on the order of IP addresses returned by the DNS API call to resolve the Listener name.
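To check which addresses a Listener name resolves to, and in what order the operating system hands them back, a short Python sketch such as the following can be used. The function name `resolve_listener_ips` is hypothetical, and port 1433 is only an assumed default. The order returned by the OS resolver may differ from the order the driver sees.

```python
import socket


def resolve_listener_ips(name, port=1433):
    """Return the IPv4 addresses for name, in the order the resolver returns them."""
    infos = socket.getaddrinfo(
        name, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    ordered = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in ordered:  # drop duplicates, keep first-seen order
            ordered.append(ip)
    return ordered
```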

At the TcpConnection::CheckCompletedAsyncConnect call (blue), we see that ID 543# (yellow), the secondary IP address, "won" the race, and the driver moves forward with logging in to this IP address.

This is an unexpected finding. Only the Listener IP address for the Primary node should be tied to a MAC address (network card). The other IP address(es) should be floating, i.e. they are not tied to a MAC address and therefore cannot respond to the connection request. SYN packets sent to the secondary should not be answered. But they were.

TELNET Test

Since the SYN packets were answered by an unknown device rather than by a SQL Server, we decided to eliminate SQL Server from the test entirely and settled on TELNET as the client.

We were able to reproduce the issue with TELNET. In this case, we used another secondary Listener IP address, 172.26.6.54.

The correct behavior is that TELNET cannot reach the destination IP address.
We did see this for about 1-3 minutes.
Then TELNET was able to connect. This is the undesired behavior.
We saw that for about 1-3 minutes as well.
After that, the behavior kept flipping between the two states every couple of minutes.
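The same test can be scripted so the flips are recorded with timestamps instead of being watched by hand. The following Python sketch is an illustration only: the function names are hypothetical, and the port is an assumption, since the port used in the TELNET test is not stated here.

```python
import socket
import time


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP handshake to ip:port completes within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch_for_flips(ip, port, probes, interval=10.0, timeout=3.0):
    """Probe repeatedly and return [(time, reachable)] for each change of state.

    The first probe always produces an entry, since there is no earlier state.
    """
    changes = []
    last = None
    for i in range(probes):
        reachable = can_connect(ip, port, timeout)
        if reachable != last:
            changes.append((time.strftime("%H:%M:%S"), reachable))
            last = reachable
        if i < probes - 1:
            time.sleep(interval)
    return changes
```

For example, `watch_for_flips("172.26.6.54", 1433, probes=60)` would probe the secondary Listener IP address for roughly ten minutes, assuming the SQL Server default port 1433. A floating IP address should produce a single entry with `reachable` set to False.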

We took a network trace.

< TELNET network trace >

At the top of the trace, you see the expected behavior: a SYN packet is sent, then retransmitted after 3 seconds and again after 6 more seconds, before the client determines that the connection is bad, e.g. frames 961, 1468, 2593. This behavior is what keeps the driver from choosing the secondary IP address for the connection.
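The 3-second and 6-second gaps are consistent with a retransmission timeout that doubles after each unanswered SYN. The short Python sketch below works out the send times under that assumption. The function name and the default values are illustrative and were chosen to match the trace, not taken from any configuration.

```python
def syn_send_times(initial_rto=3.0, retransmissions=2):
    """Offsets in seconds at which the first SYN and each retransmission are sent.

    Assumes the retransmission timeout doubles after every unanswered SYN.
    """
    times = [0.0]
    rto = initial_rto
    for _ in range(retransmissions):
        times.append(times[-1] + rto)
        rto *= 2
    return times


print(syn_send_times())  # [0.0, 3.0, 9.0]
```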

Starting at frame 45893 is the undesirable behavior. TELNET sends a SYN packet, and then in frame 45894, someone responds with an ACK+SYN packet and
