Commit a337ea2
Optimize motion timeout transmission (#1365)
* The primary goal is to address several issues that currently arise when many processes run concurrently,
such as excessive motion retries, congestion, retransmission storms, and network skew.
This change targets inefficient retransmission handling in
unreliable network environments. Specifically:
Fixed Timeout Thresholds: Traditional TCP-style Retransmission Timeout
(RTTVAR.RTO) calculations may be too rigid for networks with volatile
latency (e.g., satellite links, wireless networks). This leads to:
• Premature Retransmissions: Unnecessary data resends during temporary
latency spikes, wasting bandwidth.
• Delayed Recovery: Slow reaction to actual packet loss when RTO is
overly conservative.
Lack of Context Awareness: Static RTO ignores real-time network behavior
patterns, reducing throughput and responsiveness.
Solution: Dynamic Timeout Threshold Adjustment
Implements an adaptive timeout mechanism to optimize retransmission:
    if (now < (curBuf->sentTime + conn->rttvar.rto)) {
        uint32_t diff = (curBuf->sentTime + conn->rttvar.rto) - now;
        // ... (statistical tracking and threshold adjustment)
    }
Key Components:
• Statistical Tracking:
  - min/max: Tracks observed minimum/maximum residual time (time
    left until RTO expiry).
  - retrans_count/no_retrans_count: Counts retransmission vs.
    non-retransmission events.
• Weighted Threshold Calculation:
    unack_queue_ring.time_difference = (uint32_t)(
        unack_queue_ring.max * weight_no_retrans +
        unack_queue_ring.min * weight_retrans
    );
Weights derived from historical ratios of retransmissions
(weight_retrans) vs. successful deliveries (weight_no_retrans).
How It Solves the Problem:
• Temporary Latency Spike: Uses max (conservative) to avoid false
retransmits, reducing bandwidth waste (vs. traditional mistaken
retransmissions).
• Persistent Packet Loss: Prioritizes min (aggressive) via
weight_retrans, accelerating recovery (vs. slow fixed-RTO reaction).
• Stable Network: Balances weights for equilibrium throughput (vs.
static RTO limitations).
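The weighted threshold described above can be illustrated with a minimal sketch. The struct and function names below (ResidualStats, residual_stats_init, residual_stats_observe, residual_stats_threshold) are hypothetical stand-ins for the fields the commit keeps in unack_queue_ring, and deriving each weight as a simple ratio of event counts is an assumption; the commit message only says the weights come from "historical ratios".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the statistics kept in unack_queue_ring. */
typedef struct ResidualStats {
    uint32_t min;              /* smallest residual time seen so far */
    uint32_t max;              /* largest residual time seen so far */
    uint32_t retrans_count;    /* events that ended in a retransmission */
    uint32_t no_retrans_count; /* events that did not */
    uint32_t time_difference;  /* blended threshold */
} ResidualStats;

static void residual_stats_init(ResidualStats *s)
{
    s->min = UINT32_MAX;
    s->max = 0;
    s->retrans_count = 0;
    s->no_retrans_count = 0;
    s->time_difference = 0;
}

/* Record one residual-time sample (time left until RTO expiry). */
static void residual_stats_observe(ResidualStats *s, uint32_t diff,
                                   int retransmitted)
{
    if (diff < s->min)
        s->min = diff;
    if (diff > s->max)
        s->max = diff;
    if (retransmitted)
        s->retrans_count++;
    else
        s->no_retrans_count++;
}

/* Blend max and min by the observed event ratios (assumed weighting). */
static uint32_t residual_stats_threshold(ResidualStats *s)
{
    uint32_t total = s->retrans_count + s->no_retrans_count;
    assert(total > 0); /* at least one sample must have been observed */

    double weight_retrans = (double) s->retrans_count / total;
    double weight_no_retrans = 1.0 - weight_retrans;

    s->time_difference = (uint32_t) (s->max * weight_no_retrans +
                                     s->min * weight_retrans);
    return s->time_difference;
}
```

With this weighting, a history made up only of retransmissions yields a threshold equal to min (the aggressive end), and a history with no retransmissions yields max (the conservative end); mixed histories land in between.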
EstimateRTT - Dynamically estimates the Round-Trip Time (RTT) and adjusts Retransmission Timeout (RTO)
This function implements a variant of the Jacobson/Karels algorithm for RTT estimation, adapted for UDP-based
motion control connections. It updates smoothed RTT (srtt), mean deviation (mdev), and RTO values based on
newly measured RTT samples (mrtt). The RTO calculation ensures reliable data transmission over unreliable networks.
Key Components:
* srtt: Smoothed Round-Trip Time (weighted average of historical RTT samples)
* mdev: Mean Deviation (measure of RTT variability)
* rttvar: Adaptive RTT variation bound (used to clamp RTO updates)
* rto: Retransmission Timeout (dynamically adjusted based on srtt + rttvar)
Algorithm Details:
1. For the first RTT sample:
srtt = mrtt << 3 (scaled by 8 for fixed-point arithmetic)
mdev = mrtt << 1 (scaled by 2)
rttvar = max(mdev, rto_min)
2. For subsequent samples:
Delta = mrtt - (srtt >> 3) (difference between new sample and smoothed RTT)
srtt += Delta (update srtt with 1/8 weight of new sample)
Delta = abs(Delta) - (mdev >> 2)
mdev += Delta (update mdev with 1/4 weight)
3. rttvar bounds the maximum RTT variation:
If mdev > mdev_max, update mdev_max and rttvar
On new ACKs (snd_una > rtt_seq), decay rttvar toward mdev_max
4. Final RTO calculation:
rto = (srtt >> 3) + rttvar (clamped to RTO_MAX)
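The four steps above can be sketched in C as follows. This is an illustration under stated assumptions, not the function from the commit: RttState and estimate_rtt are hypothetical names, rto_min and rto_max are passed in as parameters because their values are not given here, the snd_una > rtt_seq check is reduced to a new_ack_window flag, srtt == 0 is used to mean "no sample yet", and the 1/4 decay step for rttvar is borrowed from the Linux TCP estimator rather than taken from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the estimator state described above. */
typedef struct RttState {
    uint32_t srtt;     /* smoothed RTT, stored scaled by 8 */
    uint32_t mdev;     /* mean deviation, in the fixed-point form above */
    uint32_t mdev_max; /* largest mdev seen in the current window */
    uint32_t rttvar;   /* variation bound added to the RTO */
    uint32_t rto;      /* resulting retransmission timeout */
} RttState;

static void estimate_rtt(RttState *st, uint32_t mrtt, int new_ack_window,
                         uint32_t rto_min, uint32_t rto_max)
{
    assert(st != NULL);
    if (mrtt == 0)
        mrtt = 1; /* assumption: keep srtt != 0 once a sample exists */

    if (st->srtt == 0) {
        /* Step 1: first sample. */
        st->srtt = mrtt << 3;
        st->mdev = mrtt << 1;
        st->rttvar = st->mdev > rto_min ? st->mdev : rto_min;
        st->mdev_max = st->rttvar;
    } else {
        /* Step 2: fold the new sample into srtt and mdev. */
        int32_t delta = (int32_t) mrtt - (int32_t) (st->srtt >> 3);
        st->srtt = (uint32_t) ((int32_t) st->srtt + delta);
        if (delta < 0)
            delta = -delta;
        delta -= (int32_t) (st->mdev >> 2);
        st->mdev = (uint32_t) ((int32_t) st->mdev + delta);

        /* Step 3: track the variation bound. */
        if (st->mdev > st->mdev_max) {
            st->mdev_max = st->mdev;
            if (st->mdev_max > st->rttvar)
                st->rttvar = st->mdev_max;
        }
        if (new_ack_window) {
            /* Assumed decay: move rttvar a quarter of the way down. */
            if (st->mdev_max < st->rttvar)
                st->rttvar -= (st->rttvar - st->mdev_max) >> 2;
            st->mdev_max = rto_min;
        }
    }

    /* Step 4: final RTO, clamped to the maximum. */
    uint32_t rto = (st->srtt >> 3) + st->rttvar;
    st->rto = rto > rto_max ? rto_max : rto;
}
```

Keeping srtt scaled by 8 lets the 1/8-weight update be done with one shift and one add, with no division and no floating point.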
Network Latency Filtering and RTO Optimization
This logic mitigates RTO distortion caused by non-network delays in database
execution pipelines. Key challenges addressed:
* Operator processing delays (non-I/O wait) inflate observed ACK times
* Spurious latency amplification in lossy networks triggers excessive RTO_MAX waits
* Congestion collapse from synchronized retransmissions
Core Mechanisms:
1. Valid RTT Sampling Filter:
Condition: 4 * (pkt->recv_time - pkt->send_time) > ackTime && pkt->retry_times != Gp_interconnect_min_retries_before_timeout
Rationale:
* Filters packets exceeding 2x expected round-trips (4x one-way)
* Excludes artificial retries (retry_times=Gp_interconnect_min_retries_before_timeout) to avoid sampling bias
Action: Update RTT estimation only with valid samples via EstimateRTT()
2. Randomized Backoff:
Condition: buf->nRetry > 0
Algorithm:
rto += (rto >> (4 * buf->nRetry))
Benefits:
* Exponential decay: Shifts create geometrically decreasing increments
* Connection-specific randomization: Prevents global synchronization
* Dynamic scaling: Adapts to retry depth (nRetry)
3. Timer List Management (NEW_TIMER):
Operations:
RemoveFromRTOList(&mudp, bufConn) → Detaches from monitoring
AddtoRTOList(&mudp, bufConn) → Reinserts with updated rto
Purpose: Maintains real-time ordering of expiration checks
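The sampling filter and the backoff step from mechanisms 1 and 2 can be sketched as two small helpers. The function names and parameter lists are hypothetical; the conditions mirror the expressions quoted above, with ack_time and min_retries_before_timeout standing in for ackTime and Gp_interconnect_min_retries_before_timeout. The guard against large retry counts is an addition of this sketch, since shifting a 32-bit value by 32 or more bits is undefined behavior in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an acknowledged packet should feed the RTT estimator.
 * Mirrors the condition quoted above; all parameters are assumed to use
 * the same time unit.
 */
static bool is_valid_rtt_sample(uint64_t send_time, uint64_t recv_time,
                                uint64_t ack_time, int retry_times,
                                int min_retries_before_timeout)
{
    assert(recv_time >= send_time);
    return 4 * (recv_time - send_time) > ack_time &&
           retry_times != min_retries_before_timeout;
}

/*
 * Apply the retry-dependent increment rto += rto >> (4 * n_retry).
 * The increment shrinks by a factor of 16 with each additional retry.
 */
static uint32_t backoff_rto(uint32_t rto, unsigned n_retry)
{
    if (n_retry == 0)
        return rto; /* condition nRetry > 0 not met */
    if (n_retry >= 8)
        return rto; /* shift would be >= 32 bits; increment treated as 0 */
    return rto + (rto >> (4 * n_retry));
}
```

For example, with rto = 1600 the first retry adds 100, the second adds 6, and from the third retry on the increment is 0.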
We ran multiple full-scale TPC-DS benchmarks on two setups: a single physical machine with 48 nodes, and four physical machines with 96 nodes. Each setup was tested with MTU values of 1500 and 9000.
In the single-machine environment, which has no network bottleneck, there were no significant performance differences between MTU 1500 and 9000. In the 96-node environment, single-threaded execution also showed no significant performance differences.
However, under multi-threaded execution (4 threads), SQL statements with a high percentage of data movement showed significant performance differences, ranging from 5x to 10x, especially with MTU 1500.
* Cleaning up the code
---------
Co-authored-by: zhaoxi <oracleloyal@gmail.com>
Co-authored-by: zhaoxi <zhaoxi@hashdata.cn>
1 parent 6e38a58 commit a337ea2
16 files changed
Lines changed: 1584 additions & 67 deletions
File tree
- contrib/interconnect
- test
- udp
- src
- backend
- cdb
- utils/misc
- include
- cdb
- utils
- test/regress
- expected/icudp
- sql/icudp