易陆发现 Internet Technology Forum

Redis problems and troubleshooting: Timeout receiving bulk data from MASTER if the problem persists try

Posted on 2024-5-29 06:00:01

Timeout receiving bulk data from MASTER if the problem persists try to set the 'repl-timeout'
Symptoms:

Cluster layout: 1 master and 2 slaves. bgsave is not enabled on the master but is enabled on the slaves. All Redis instances can still be reached and accept operations, but the master keeps starting bgsave and one slave has stopped running bgsave.

Master log error: # Connection with slave XXXX lost.

Slave log error: # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.

Summary:

The two parameters repl-backlog-size and repl-timeout bound each synchronization between slave and master: repl-backlog-size caps how much recent write data the master retains for a partial resync, and repl-timeout caps how long a transfer may go without progress. When a synchronization exceeds these limits, the errors above are reported.

Symptoms:

After restarting the slave, the master reported: Client id=1317049445 addr=10.10.3.112:7412 fd=39 name= age=394 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=4360 omem=76118609 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

Note this phrase: psync scheduled to be closed ASAP for overcoming of output buffer limits. It looks like the psync client is about to be closed because it exceeded the output buffer limits.

So I looked up client-output-buffer-limit and found that it is a protection mechanism in Redis. The configuration format is:

client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>

The parameters mean the following:

class: the client type, one of Normal, Slaves and Pub/Sub
Normal: ordinary clients. The default limit is 0, i.e. unlimited.
Pub/Sub: publish/subscribe clients. Default hard limit 32M, soft limit 8M/60s.
Slaves: the replication clients of slaves. Default hard limit 256M, soft limit 64M/60s.
hard limit: the hard limit on the buffer size.
soft limit: the soft limit on the buffer size.
soft seconds: how long the buffer size has stayed at or above the soft limit.

client-output-buffer-limit restricts the size of the allocated buffer so that memory cannot grow without bound. Redis protects itself as follows:

If the client buffer reaches the soft limit and stays there for soft seconds, the server disconnects the client immediately.
If the client buffer reaches the hard limit, the server also disconnects the client immediately.
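
The two rules above can be sketched as a small state machine. This is only an illustration of the behaviour described in this post, not Redis's own code; the names OutputBufferLimit, LimitTracker and should_close are made up for the example, and times are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputBufferLimit:
    """Limits for one client class; a value of 0 disables that limit."""
    hard_bytes: int
    soft_bytes: int
    soft_seconds: int


class LimitTracker:
    """Tracks one client's output buffer against its limits (hypothetical helper)."""

    def __init__(self, limit: OutputBufferLimit) -> None:
        self.limit = limit
        # Moment the buffer first reached the soft limit, while it stays there.
        self.soft_since: Optional[float] = None

    def should_close(self, now: float, buffer_bytes: int) -> bool:
        lim = self.limit
        # Hard limit: close as soon as it is reached.
        if lim.hard_bytes and buffer_bytes >= lim.hard_bytes:
            return True
        # Soft limit: close only if it has been reached continuously
        # for longer than soft_seconds.
        if lim.soft_bytes and buffer_bytes >= lim.soft_bytes:
            if self.soft_since is None:
                self.soft_since = now
                return False
            return now - self.soft_since > lim.soft_seconds
        # Dropped back under the soft limit: reset the timer.
        self.soft_since = None
        return False


MB = 1024 * 1024
tracker = LimitTracker(OutputBufferLimit(256 * MB, 64 * MB, 60))
first = tracker.should_close(now=0, buffer_bytes=70 * MB)    # timer starts
later = tracker.should_close(now=61, buffer_bytes=70 * MB)   # over for 61 s
print(first, later)
```

With the default slave limits, a buffer that sits at 70 MB is tolerated at first and closed once it has stayed above 64 MB for more than 60 seconds, which is why a client can be dropped well below the 256 MB hard limit.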
Now look at this setting on our slave; it is simply the default configuration:

# The client output buffer limits can be used to force disconnection of clients
# that are not reading data from the server fast enough for some reason (a
# common reason is that a Pub/Sub client can't consume messages as fast as the
# publisher can produce them).
#
# The limit can be set differently for the three different classes of clients:
#
# normal -> normal clients
# slave -> slave clients and MONITOR clients
# pubsub -> clients subscribed to at least one pubsub channel or pattern
#
# The syntax of every client-output-buffer-limit directive is the following:
#
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).
# So for instance if the hard limit is 32 megabytes and the soft limit is
# 16 megabytes / 10 seconds, the client will get disconnected immediately
# if the size of the output buffers reach 32 megabytes, but will also get
# disconnected if the client reaches 16 megabytes and continuously overcomes
# the limit for 10 seconds.
#
# By default normal clients are not limited because they don't receive data
# without asking (in a push way), but just after a request, so only
# asynchronous clients may create a scenario where data is requested faster
# than it can read.
#
# Both the hard or the soft limit can be disabled by setting them to zero.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
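
To compare these directives with the raw byte counts that CONFIG GET prints (as in the output quoted later in this thread), the size suffixes have to be expanded. Below is a rough sketch; parse_size and parse_limit_line are hypothetical helper names, and the unit table follows the convention documented at the top of redis.conf, where "mb" means 1024*1024 bytes and a bare "m" means 1,000,000 bytes.

```python
from typing import Tuple

# Size suffixes as documented in redis.conf: "k"/"m"/"g" are powers of 1000,
# "kb"/"mb"/"gb" are powers of 1024. Suffixes are case insensitive.
_UNITS = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1024,
    "m": 1000 ** 2, "mb": 1024 ** 2,
    "g": 1000 ** 3, "gb": 1024 ** 3,
}


def parse_size(text: str) -> int:
    """Convert a size such as '256mb' or '0' to a number of bytes."""
    s = text.strip().lower()
    digits = s.rstrip("kmgb")          # what is left is the numeric part
    return int(digits) * _UNITS[s[len(digits):]]


def parse_limit_line(line: str) -> Tuple[str, int, int, int]:
    """Split one client-output-buffer-limit directive into its four fields."""
    directive, cls, hard, soft, seconds = line.split()
    if directive != "client-output-buffer-limit":
        raise ValueError("not a client-output-buffer-limit line")
    return cls, parse_size(hard), parse_size(soft), int(seconds)


slave_limit = parse_limit_line("client-output-buffer-limit slave 256mb 64mb 60")
print(slave_limit)
```

The slave defaults expand to 268435456 and 67108864 bytes, which are the same figures that show up in the CONFIG GET output further down.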

The Redis replication buffer is in fact just one kind of client buffer. It holds every data update performed on the master during the following three periods:

the time the master spends in rdb bgsave producing the snapshot
the time the master spends sending the rdb to the slave over the network
the time the slave spends loading the rdb file to restore the data into memory

As you can see, this is exactly the same as what the replication backlog holds!
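
A back-of-the-envelope way to use the three periods above: the buffer has to absorb every write that arrives while they run, so its peak size is roughly the write rate multiplied by their sum. The sketch below assumes a constant write rate, which real workloads rarely have, and every number in it is invented for illustration; none comes from the cluster in this post.

```python
def replication_buffer_estimate(write_bytes_per_s: float,
                                bgsave_s: float,
                                transfer_s: float,
                                load_s: float) -> float:
    """Rough peak size of the replication buffer for one full sync."""
    return write_bytes_per_s * (bgsave_s + transfer_s + load_s)


MB = 1024 * 1024
HARD_LIMIT = 256 * MB          # default hard limit for the slave class

# Hypothetical workload: 1 MB/s of writes, 120 s bgsave,
# 60 s network transfer, 90 s rdb load on the slave.
needed = replication_buffer_estimate(1 * MB, 120, 60, 90)
print(needed, needed > HARD_LIMIT)
```

Under these made-up figures the estimate is 270 MB, already above the 256 MB default, so the full sync would be cut off before it could finish.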
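
The replication buffer is configured by client-output-buffer-limit slave. When this value is too small, the master-slave replication connection gets dropped:

When the master-slave replication connection is dropped, the server releases the data structures associated with the connection. The data in the replication buffer is lost as well, and master and slave start the replication process over again.
There is an even more serious problem: the dropped replication connection leads to an endless loop of rdb bgsave and rdb retransmission between master and slave.

So it does look like the server (here the master) actively closes the client (slave) connection because of the buffer size. Our volume of data changes was too large and exceeded client-output-buffer-limit, so the replication connection was dropped. The slave then requested psync, but because repl-backlog-size was too small the psync failed and a full sync was needed. A full sync requires Discarding previously cached master state and loading the RDB file into memory again, and this loading process is blocking, which made the slave intermittently unavailable. After switching over to the master, the master does the whole synchronization in a forked child process, so the parent process keeps serving requests. Every symptom is now clearly explained.

Fix: change the client-output-buffer-limit setting to "client-output-buffer-limit slave 0 0 0" and restart the slave; the problem was resolved.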
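
If a restart is inconvenient, the same setting can in principle be applied at runtime with CONFIG SET on the server that is closing the connection (the master, judging by the logs above). The sketch below is an assumption-laden illustration: set_slave_buffer_limit and RecordingClient are hypothetical names, and the function only relies on the client object having a config_set(name, value) method, which redis-py's redis.Redis provides.

```python
def set_slave_buffer_limit(client, hard: str = "0", soft: str = "0",
                           seconds: int = 0) -> str:
    """Send CONFIG SET client-output-buffer-limit for the slave class.

    The defaults reproduce the "slave 0 0 0" value used in this post,
    which removes the limit entirely.
    """
    value = f"slave {hard} {soft} {seconds}"
    client.config_set("client-output-buffer-limit", value)
    return value


class RecordingClient:
    """Stand-in so the sketch runs without a Redis server."""

    def __init__(self) -> None:
        self.calls = []

    def config_set(self, name: str, value: str) -> bool:
        self.calls.append((name, value))
        return True


demo = RecordingClient()
unlimited = set_slave_buffer_limit(demo)
bounded = set_slave_buffer_limit(demo, "512mb", "128mb", 120)
print(unlimited)
print(bounded)
```

Setting everything to 0 also switches off the memory protection described earlier, so a larger but still finite limit, like the second call, may be the more cautious choice. A value set with CONFIG SET is lost on restart unless it is also written to redis.conf.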
OP | Posted on 2024-5-29 06:00:03
Answer
Redis replication timeout sets the maximum amount of time the master waits for a ping response from its slave before dropping the connection. The same value also applies on the slave side, both to the bulk transfer during a sync and to data or pings expected from the master. This configuration is critical to prevent potential data loss or stale reads in case of network issues or unresponsive slaves.

The repl-timeout setting in the Redis configuration file controls this value, and it's specified in seconds. By default, it's set to 60 seconds. You can change it as follows:

# In your redis.conf file
repl-timeout 120
This sets the replication timeout to 120 seconds. Save and exit the file, then restart the Redis server for the changes to take effect.

You can also modify this setting at runtime using the CONFIG SET command:

# Via Redis CLI
redis-cli CONFIG SET repl-timeout 120
Be cautious when adjusting this value. Setting it too low may lead to unnecessary disconnections due to transient network issues, while setting it too high may delay the detection of genuine problems with slave nodes.
OP | Posted on 2024-5-29 06:00:04
Here's the output from the master's log:
>
> [9470] 13 Sep 22:24:04.789 * Slave ask for synchronization
> [9470] 13 Sep 22:24:04.789 * Starting BGSAVE for SYNC
> [9470] 13 Sep 22:24:09.454 * Background saving started by pid 18435
> [9470] 13 Sep 22:26:37.105 # Client addr=10.108.61.163:44422 fd=157 age=153
> idle=153 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=8361
> oll=3975 omem=100347928 events=r cmd=sync scheduled to be closed ASAP for
> overcoming of output buffer limits.
> [18435] 13 Sep 22:29:06.199 * DB saved on disk
> [9470] 13 Sep 22:29:07.138 * Background saving terminated with success
>
> The slave's view:
> [23037] 13 Sep 21:42:07.799 * The server is now ready to accept connections
> at /tmp/redis.sock
> [23037] 13 Sep 21:42:07.799 * Connecting to MASTER...
> [23037] 13 Sep 21:42:07.799 * MASTER <-> SLAVE sync started
> [23037] 13 Sep 21:42:07.800 * Non blocking connect for SYNC fired the event.
> [23037] 13 Sep 21:45:43.167 # Timeout receiving bulk data from MASTER...
> [23037] 13 Sep 21:45:43.167 * Connecting to MASTER...
> etc.
>
> And the configuration on master:
> redis-cli -p 6380 config get client-output-buffer-limit
> 1) "client-output-buffer-limit"
> 2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
>
> So there was only 100mb in the output buffer, but it was shut down even
> though the limit was 268mb. Any ideas what might be going on? Apologies if
> this is a repost.
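
One plausible reading of the question above is that the soft limit, not the hard one, closed the connection. The sketch below plugs the figures from the quoted log and CONFIG GET output into that check. diagnose_close is a hypothetical helper, and using the client's age as a stand-in for "time spent above the soft limit" is an assumption: the log does not say how long the buffer had actually been over 64 MB.

```python
def diagnose_close(omem: int, age_s: int, hard: int, soft: int,
                   soft_seconds: int) -> str:
    """Guess which limit could explain a closed replication client."""
    if hard and omem >= hard:
        return "hard limit reached"
    # age_s is only an upper bound on the time spent above the soft limit.
    if soft and omem >= soft and age_s > soft_seconds:
        return "soft limit may have been exceeded for longer than soft_seconds"
    return "within limits"


# omem and age from the master log; limits from the CONFIG GET output.
result = diagnose_close(omem=100347928, age_s=153,
                        hard=268435456, soft=67108864, soft_seconds=60)
print(result)
```

The buffer is well under the 268435456-byte hard limit but above the 67108864-byte soft limit, and the connection had been open for 153 seconds, so there was room for the 60-second soft window to expire.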
OP | Posted on 2024-5-29 06:00:05
I adopted pretty much all the recommendations, including the ssh
compression idea, from you two and so:

root@redis3:~# redis-cli config get * | grep -A2 repl
repl-ping-slave-period
10
repl-timeout
1800
repl-backlog-size
1048576000
repl-backlog-ttl
3600

I eventually also changed this, from no to yes (although I know it may
increase the bandwidth requirements):
repl-disable-tcp-nodelay
yes

But the result is actually worse.... by that I mean the master kicks
off the slave much faster.

Master:
[3355] 30 May 02:56:28.875 * Slave asks for synchronization
[3355] 30 May 02:56:28.875 * Full resync requested by slave.
[3355] 30 May 02:56:28.875 * Starting BGSAVE for SYNC
[3355] 30 May 02:56:35.376 * Background saving started by pid 16330
[3355] 30 May 02:56:43.733 # Client addr=127.0.0.1:49630 fd=185 name=
age=15 idle=15 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0
obl=16372 oll=2896 omem=268753696 events=r cmd=psync scheduled to be
closed ASAP for overcoming of output buffer limits.
[16330] 30 May 02:59:44.675 * DB saved on disk
[16330] 30 May 02:59:46.216 * RDB: 2397 MB of memory used by copy-on-write
[3355] 30 May 02:59:48.806 * Background saving terminated with success
...

Slave:
[19460] 30 May 02:56:27.975 * DB loaded from disk: 210.107 seconds
[19460] 30 May 02:56:27.976 * The server is now ready to accept
connections on port 6379
[19460] 30 May 02:56:28.869 * Connecting to MASTER localhost:6280
[19460] 30 May 02:56:28.869 * MASTER <-> SLAVE sync started
[19460] 30 May 02:56:28.870 * Non blocking connect for SYNC fired the event.
[19460] 30 May 02:56:28.873 * Master replied to PING, replication can
continue...
[19460] 30 May 02:56:28.875 * Partial resynchronization not possible
(no cached master)
[19460] 30 May 02:56:28.877 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:68385253408
...

So, 3 questions please:

1. Do I need to make all of the same configuration changes on the slave too?

2. Can the "Partial resynchronization not possible (no cached master)"
be overcome somehow? I have full backups via BGSAVE of the master each
hour (yes, I stopped those while attempting the slaving stuff) so I
could use the latest one to load it manually in the slave and then
attach the slave to the master - last time I tried that however, the
slave still requested a full resync....
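
The master's client line in this post can be read mechanically. The sketch below splits it into key=value fields and compares omem with the default 256mb hard limit for slaves. That limit is an assumption here, since this poster's client-output-buffer-limit value is not shown, and parse_client_info is a hypothetical helper name.

```python
import re
from typing import Dict


def parse_client_info(line: str) -> Dict[str, str]:
    """Extract the key=value fields of a Redis client description."""
    # Keys are word characters or dashes; a value may be empty (name=).
    return dict(re.findall(r"([\w-]+)=(\S*)", line))


LOG_LINE = (
    "[3355] 30 May 02:56:43.733 # Client addr=127.0.0.1:49630 fd=185 name= "
    "age=15 idle=15 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 "
    "obl=16372 oll=2896 omem=268753696 events=r cmd=psync scheduled to be "
    "closed ASAP for overcoming of output buffer limits."
)

ASSUMED_HARD_LIMIT = 256 * 1024 * 1024   # default for the slave class

fields = parse_client_info(LOG_LINE)
omem = int(fields["omem"])
age = int(fields["age"])
overshoot = omem - ASSUMED_HARD_LIMIT
print(fields["cmd"], omem, age, overshoot)
```

If the default applies, omem is 318240 bytes past the hard limit only 15 seconds into the connection, which would mean the cut-off here comes from the output buffer limit rather than from repl-timeout or repl-backlog-size.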