Paper Study - HPCC High Precision Congestion Control
**03 Feb 2020** This is a study & review of the HPCC paper. I'm not an author.
Background
- New trends in datacenter architecture / applications
- Disaggregated datacenters & heterogeneous computing are on the rise: they need low latency and high bandwidth for good application-level performance.
- High-I/O applications (AI, ML, NVMe) are on the rise: they move large amounts of data in a short time. CPU & memory are fast enough, so the network becomes the bottleneck.
- Other attempts failed to reconcile low latency, high bandwidth utilization, and high stability because their proposed RDMA architectures cannot handle control storms (low utilization) and queueing (high latency) properly.
- They believe the existing congestion control algorithm within RDMA is flawed: slow convergence (coarse-grained feedback signals force guesswork -> low bandwidth utilization), long packet queueing (queues are already long by the time congestion is detected -> high latency), and complicated parameter tuning (more parameters to set up means more chance of human error -> low reliability).
HPCC Intro
- Basic idea: make use of INT (in-network telemetry) to carry precise link load information for precise calculation/estimation.
- Challenges & Solutions:

| Challenge | Solution |
| --- | --- |
| INT is piggybacked on packets, which can themselves be delayed under congestion -> delayed CC reaction | Limit and control total inflight bytes so the sender cannot push more bytes into the network |
| Possibility of overreaction from the sender | Improved INT + a combined per-ACK and per-RTT reaction algorithm |
Experience and Motivation
- Goals
- Latency: as low as possible
- Bandwidth utilization: as high as possible
- Congestion & PFC pauses: as few as possible
- Operation complexity: as low as possible
- Trade-offs in the current RDMA CC, DCQCN (Data Center Quantized Congestion Notification)
- Uses ECN (Explicit Congestion Notification) to detect congestion risk by default
- Throughput vs. Stability
- High utilization -> high sensitivity & aggressive growth -> easier buffer overflow & flow-rate oscillation -> low stability, and vice versa.
- Scheduling Timer
- An aggressive timer favors throughput; a forgiving timer favors stability.
- Traffic oscillation when link fails
- ECN congestion signaling: a low ECN threshold favors latency vs. a high ECN threshold favors bandwidth utilization.
- Metrics for next-generation CC
- Fast convergence: utilization quickly grows to the network limit.
- Close-to-empty queues: short switch buffer queues.
- Few parameters: adapts to the environment and the traffic itself, with less human intervention.
- Fairness: fair distribution among traffic flows.
- Easy hardware deployment: compatible with NICs and switches.
- Commercialization circumstances: INT-enabled hardware and programmable NICs are growing in the commercial market.
Design
- HPCC is sender-driven; each packet is acknowledged by the receiver.
- Uses INT-enabled switches for fine-grained information: timestamp, queue length, transmitted bytes, and link bandwidth capacity.
- Idea: limit total inflight bytes in order to avoid a congested feedback signal.
- The sender limits inflight bytes with a sending window W = B × T, where B is the NIC bandwidth and T is the RTT, plus a packet pacer at rate R = W / T.
- When the total inflight bytes exceed the bandwidth-delay product B × T, congestion is guaranteed; so to avoid congestion, keep the total inflight bytes below B × T.
- The sender estimates the inflight bytes of link j as I = qlen + txRate × T, where qlen is the queue length and txRate is the link's output rate, computed from two consecutive ACKs' INT data: txRate = (ack1.txBytes - ack0.txBytes) / (ack1.ts - ack0.ts)
- Reacting to the signals: adjust the window so that I~j~ stays slightly below B~j~ × T. The normalized inflight bytes of link j are:
U~j~ = I~j~ / (B~j~ × T) = (qlen~j~ + txRate~j~ × T) / (B~j~ × T) = qlen~j~ / (B~j~ × T) + txRate~j~ / B~j~
- Sender i reacts to the most congested link (η is the target utilization and W~AI~ the additive-increase step; see Parameters below):
W~i~ = W~i~ / (max~j~(U~j~) / η) + W~AI~
- Idea: use a combined per-RTT + per-ACK reaction to overcome the trade-off between quick reaction and overreaction.
- Fast reaction without overreaction: avoid reacting repeatedly to ACKs that report the same packets and the same queue.
-
- Only adjust the window for a new set of packets: remember the first packet Q sent right after a window adjustment, and only adjust the window again upon Q's ACK. Drawback: on its own, this per-RTT reaction is too slow to handle link failures or incasts.
- Solution: keep a reference window size W^c^~i~, a runtime state updated once per RTT by assigning the current window size W~i~ to it; every ACK then reacts against this reference (a sketch combining these formulas follows below):
W~i~ = W^c^~i~ / (max~j~(U~j~) / η) + W~AI~
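To make the formulas above concrete, here is a minimal Python sketch of the per-link computation: txRate derived from two consecutive INT records, the normalized inflight bytes U~j~, and the window update driven by the most congested link. This is my own illustration, not the paper's reference code; the record layout, field names, and the 100 Gbps / 10 µs numbers are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

ETA = 0.95                  # target utilization η (paper default)
T = 10e-6                   # base RTT in seconds (assumed value for illustration)
NIC_BANDWIDTH = 100e9 / 8   # assumed 100 Gbps NIC, expressed in bytes per second
W_INIT = NIC_BANDWIDTH * T  # initial sending window W = B × T (bandwidth-delay product)

@dataclass
class IntRecord:
    """One hop's INT snapshot echoed back in an ACK (assumed field names)."""
    ts: float         # switch timestamp (seconds)
    tx_bytes: float   # cumulative bytes transmitted on the egress port
    qlen: float       # egress queue length (bytes)
    bandwidth: float  # link capacity (bytes per second)

def link_utilization(prev: IntRecord, cur: IntRecord) -> float:
    """Normalized inflight bytes of one link: U_j = qlen_j / (B_j × T) + txRate_j / B_j."""
    tx_rate = (cur.tx_bytes - prev.tx_bytes) / (cur.ts - prev.ts)
    return cur.qlen / (cur.bandwidth * T) + tx_rate / cur.bandwidth

def update_window(w_ref: float, hops: List[Tuple[IntRecord, IntRecord]], w_ai: float) -> float:
    """React to the most congested hop: W = W^c / (max_j U_j / η) + W_AI.
    The additive-increase branch and the per-ACK/per-RTT dispatch appear in the
    workflow sketch further below."""
    max_u = max(link_utilization(prev, cur) for prev, cur in hops)
    return w_ref / (max_u / ETA) + w_ai

# The pacer rate always follows the window: R = W / T.
pacing_rate = W_INIT / T    # equals NIC_BANDWIDTH for the initial window
```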
- Workflow
- A new ACK message triggers the NewAck() function.
- When the ACK sequence number is new, the sender updates the window size and refreshes the stored INT info.
- The MeasureInflight function estimates the normalized inflight bytes with Eqn (2) (Line 5 of the paper's pseudocode).
- The ComputeWind function combines multiplicative increase/decrease (MI/MD) and additive increase (AI) to balance reaction speed and fairness (sketched below).
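Below is a minimal Python sketch of this control flow, based on my reading of the workflow summarized above: a per-ACK reaction computed against the per-RTT reference window W^c^. The class and variable names (HpccSender, last_update_seq, ...) are my own shorthand, the W_AI value is an assumed placeholder, and MeasureInflight is represented by the max_u argument (the per-link maximum from the earlier sketch).

```python
ETA = 0.95        # target utilization η (paper default)
MAX_STAGE = 5     # additive-increase stages allowed before the next MI/MD step (paper default)
W_AI = 1500.0     # additive-increase step in bytes (assumed placeholder, e.g. one MTU)

class HpccSender:
    def __init__(self, w_init: float, rtt: float):
        self.w = w_init            # current sending window W (bytes)
        self.w_ref = w_init        # reference window W^c, refreshed once per RTT
        self.rtt = rtt             # base RTT T (seconds)
        self.rate = w_init / rtt   # pacer rate R = W / T
        self.inc_stage = 0         # additive-increase steps since the last MI/MD step
        self.last_update_seq = 0   # first sequence number sent after the last W^c refresh
        self.snd_nxt = 0           # next sequence number to send (maintained by the send path)

    def new_ack(self, ack_seq: int, max_u: float) -> None:
        """Per-ACK reaction (NewAck): every ACK adjusts W, but W^c is refreshed
        only when the first packet sent after the previous refresh is ACKed."""
        refresh_ref = ack_seq > self.last_update_seq
        self.w = self.compute_wind(max_u, refresh_ref)
        if refresh_ref:
            self.last_update_seq = self.snd_nxt
        self.rate = self.w / self.rtt

    def compute_wind(self, max_u: float, refresh_ref: bool) -> float:
        """ComputeWind: combine multiplicative increase/decrease (MI/MD) with
        additive increase (AI) to balance reaction speed and fairness."""
        if max_u >= ETA or self.inc_stage >= MAX_STAGE:
            w = self.w_ref / (max_u / ETA) + W_AI    # MI/MD toward η utilization
            if refresh_ref:
                self.inc_stage = 0
                self.w_ref = w
        else:
            w = self.w_ref + W_AI                    # AI step while well below η
            if refresh_ref:
                self.inc_stage += 1
                self.w_ref = w
        return w
```

Reacting per ACK against W^c^ (instead of the already-adjusted W) is what prevents compounding reactions to the same queue within one RTT.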
- Parameters
- η: target bandwidth utilization, default 95%.
- maxStage: trades stability against the speed of reclaiming free bandwidth, default 5.
- W~AI~: additive-increase step; set based on the maximum number of concurrent flows, trading convergence speed against fairness.
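Continuing the hypothetical HpccSender sketch above, the parameters map directly onto its constants and inputs; the ACK values below are made up purely to show the two branches:

```python
# Assumed example: 100 Gbps NIC (in bytes/s) and a 10 µs base RTT.
sender = HpccSender(w_init=100e9 / 8 * 10e-6, rtt=10e-6)
sender.new_ack(ack_seq=1, max_u=0.80)   # below η, maxStage not reached -> additive increase
sender.new_ack(ack_seq=2, max_u=1.20)   # above η -> multiplicative decrease toward 95% utilization
print(sender.w, sender.rate)
```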
- Advantages of HPCC vs. others
- Fewer parameters to tune.
- Accurate measurement of the actual link load using txRate.
- Fast convergence; see the Appendix for details.
Implementation
- Prototyped on NICs with FPGA programmability and switching ASICs with P4 programmability; compared against DCQCN on the same platform.
Evaluation
- Faster & better flow-rate recovery: Fig. 9a, 9b
- Faster & better congestion avoidance: Fig. 9c, 9d
- Lower network latency: Fig. 9e, 9f
- Fairness: Fig. 9g, 9h
- Significantly reduces FCT for short flows: Fig. 10a, 10c
- Steadily close-to-zero queues: Fig. 10b, 10d
- Good for short flows, as HPCC keeps near-zero queues and resolves congestion quickly: Fig. 11a, 11b, 11c, 11d
- Trade-off for long flows: because HPCC keeps 5% bandwidth headroom, long flows have a slightly higher slowdown: Fig. 11a, 11c
- Good for stability, as HPCC reduces PFC pauses to almost zero: Fig. 11b
Related Works
- RDMA CC:
- TIMELY: incast congestion, long queues.
- iWarp: high latency, vulnerability to incast, high cost.
- General CC for data center networks:
- DCTCP & TCP Bolt: high CPU overhead, high latency, long queues
- Proposals to reduce latency
- qFabric: complex logic in switches
- HULL: allocates bandwidth headroom, similar to HPCC, but requires non-trivial switch implementations (Phantom Queues) to get feedback before link saturation
- DeTail: needs a new switch architecture
- HOMA & NDP: receiver-driven, credit-based solutions
- Flow controls for RDMA: use hardware-based selective packet retransmission to prevent PFC pauses or to replace PFC
- IRN & MELO: complementary to HPCC; worse results than HPCC