Slow performance on pod to pod communication over vxlan in openshift-sdn
Environment
Red Hat OpenShift Container Platform
Red Hat CoreOS
kernel-4.18.0-193.41.1.el8_2.x86_64
iptables-nft
Issue
TCP and UDP iperf3 tests between pods on different nodes over openshift-sdn are very slow, even when the nodes are on the same hypervisor or when an iperf test between the node IPs is fast.
[ 5] 0.00-1.00 sec 5.31 MBytes 44.5 Mbits/sec 18 620 KBytes
[ 5] 1.00-2.00 sec 4.12 MBytes 34.5 Mbits/sec 0 625 KBytes
[ 5] 2.00-3.00 sec 4.02 MBytes 33.7 Mbits/sec 0 628 KBytes
[ 5] 3.00-4.00 sec 4.13 MBytes 34.6 Mbits/sec 0 640 KBytes
[ 5] 4.00-5.00 sec 4.15 MBytes 34.8 Mbits/sec 0 665 KBytes
[ 5] 5.00-6.00 sec 3.95 MBytes 33.1 Mbits/sec 7 673 KBytes
[ 5] 6.00-7.00 sec 4.03 MBytes 33.8 Mbits/sec 3 675 KBytes
The same iperf3 test after adding the iptables rules from the Resolution section, where throughput rises from tens of Mbits/sec to over 3 Gbits/sec:
[ 5] 490.00-491.00 sec 382 MBytes 3.20 Gbits/sec 4 1.02 MBytes
[ 5] 491.00-492.00 sec 403 MBytes 3.38 Gbits/sec 4 957 KBytes
[ 5] 492.00-493.00 sec 404 MBytes 3.39 Gbits/sec 12 869 KBytes
[ 5] 493.00-494.00 sec 398 MBytes 3.34 Gbits/sec 0 1.10 MBytes
[ 5] 494.00-495.00 sec 384 MBytes 3.23 Gbits/sec 11 1.02 MBytes
Conntrack shows that all entries for the vxlan port (UDP 4789) are UNREPLIED:
$ cat /proc/net/nf_conntrack | egrep udp | egrep dport=4789 | egrep UNREPLIED | wc -l
232
$ cat /proc/net/nf_conntrack | egrep udp | egrep dport=4789 | wc -l
232
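The two counts above can also be gathered in a single pass. The following is a minimal sketch, assuming the conntrack table is readable at /proc/net/nf_conntrack as in the commands above. The function name `vxlan_conntrack_summary` and its optional arguments are hypothetical and not part of any OpenShift tooling. Unlike the egrep pipeline, it matches `dport=4789` as a whole field, so a port such as 47890 is not counted.

```shell
#!/bin/sh
# Hypothetical helper: summarize conntrack entries for the VXLAN port.
#   $1: conntrack file (assumed default: /proc/net/nf_conntrack)
#   $2: VXLAN UDP port (default: 4789)
vxlan_conntrack_summary() {
    file="${1:-/proc/net/nf_conntrack}"
    port="${2:-4789}"
    awk -v port="$port" '
        {
            # An entry qualifies when it is UDP and one of its
            # fields is exactly dport=<port>.
            is_udp = 0; has_port = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "udp") is_udp = 1
                if ($i == ("dport=" port)) has_port = 1
            }
            if (!is_udp || !has_port) next
            total++
            if (index($0, "[UNREPLIED]") > 0) unreplied++
        }
        END { printf "total=%d unreplied=%d\n", total, unreplied }
    ' "$file"
}
```

Running `vxlan_conntrack_summary` as root on a node prints one line with both counts. When the two numbers are equal, every tracked VXLAN flow is UNREPLIED, as in the output above.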
Resolution
This issue for IPv4 is resolved in releases:
4.9.0
4.8.10
4.7.30
4.6.45
Workaround
Apply these iptables rules on the affected nodes:
# iptables -t raw -A OUTPUT -p udp --dport 4789 -j NOTRACK
# iptables -t raw -A PREROUTING -p udp --dport 4789 -j NOTRACK
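Because `-A` appends a rule every time it runs, repeating the commands above creates duplicate rules. The following is a minimal sketch of an idempotent variant, assuming the same raw table, chains and port as the workaround. The helper name `ensure_vxlan_notrack` and the `IPTABLES` override variable are hypothetical. It uses `iptables -C` to check for an existing rule before appending.

```shell
#!/bin/sh
# IPTABLES may be overridden, for example with a stub, for a dry run.
IPTABLES="${IPTABLES:-iptables}"

# Hypothetical helper: append the NOTRACK rule to a raw-table chain
# only if an identical rule is not already present.
#   $1: chain name (OUTPUT or PREROUTING)
#   $2: VXLAN UDP port (default: 4789)
ensure_vxlan_notrack() {
    chain="$1"
    port="${2:-4789}"
    # -C (--check) exits non-zero when no matching rule exists.
    if ! $IPTABLES -t raw -C "$chain" -p udp --dport "$port" -j NOTRACK 2>/dev/null; then
        $IPTABLES -t raw -A "$chain" -p udp --dport "$port" -j NOTRACK
    fi
}

# Usage (as root on an affected node):
#   ensure_vxlan_notrack OUTPUT
#   ensure_vxlan_notrack PREROUTING
```

The sketch does not make the rules persistent across reboots; that is outside what the workaround above describes.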
Root Cause
Unlike protocols such as DNS, VXLAN has no conversations in which a client sends a packet from ${IP1}:${PORT1} to ${IP2}:${PORT2} and expects an answer from the server from ${IP2}:${PORT2} to ${IP1}:${PORT1}. Instead, a host that wants to communicate with another host always sends packets from ${IP1}:${RANDOM_PORT} to ${IP2}:${VXLAN_PORT:-4789}, and when the other host sends a packet to the first host, it sends it from ${IP2}:${RANDOM_PORT} to ${IP1}:${VXLAN_PORT:-4789}. A packet is never sent in reply to the random port used as the client port of another packet, which is why the conntrack entries stay UNREPLIED. Connection tracking for VXLAN is therefore not required and does not make sense.
Although not required, such connection tracking can hurt performance in some scenarios, especially when the cluster has a very large number of services and therefore a high number of iptables rules. Each vxlan packet then traverses the iptables rules unnecessarily. Because the rules are evaluated sequentially and each rule requires a check, UDP packets are slowed down considerably.
On transmit, the vxlan driver makes the following calls:
vxlan_xmit_one()->udp_tunnel_xmit_skb()->iptunnel_xmit()->ip_local_out()
On the egress side, ip_local_out() calls into the netfilter routines, as do incoming vxlan packets on the ingress side. With the iptables rules from the Resolution section in place, nf_conntrack no longer tracks the vxlan packets. Because traversing the NAT rules requires nf_conntrack, the packets skip those rules, which mitigates the delay and improves bandwidth.
Diagnostic Steps
Check the number of iptables rules on the nodes:
# iptables-save | wc -l
153900
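The line count above also includes comments, chain declarations and COMMIT lines. To see which table holds most of the rules, the output can be broken down per table. The following is a minimal sketch, assuming the standard iptables-save layout in which a `*table` line opens each table and every rule is an `-A` line. The function name `count_rules_per_table` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count appended rules (-A lines) per table in
# iptables-save output read from stdin.
count_rules_per_table() {
    awk '
        /^\*/  { table = substr($0, 2) }   # "*nat" opens the nat table
        /^-A / { count[table]++ }          # one -A line per rule
        END    { for (t in count) printf "%s %d\n", t, count[t] }
    ' | sort
}

# Usage (as root on a node):
#   iptables-save | count_rules_per_table
```

Each output line has the table name followed by its rule count. Which table dominates depends on the cluster; the Root Cause section suggests that a large number of services is the usual driver.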