Redis problem and troubleshooting approach: Timeout receiving bulk data from MASTER if the problem persists try

admin · posted 2024-5-29 06:00:01

Timeout receiving bulk data from MASTER if the problem persists try to set the 'repl-timeout'

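Symptoms:

Cluster layout: 1 master, 2 slaves. bgsave is disabled on the master and enabled on the slaves. All Redis instances remain reachable and accept operations, but the master keeps starting bgsave and one slave has stopped running bgsave.

Master log error: # Connection with slave XXXX lost.

Slave log error: # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.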

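Summary:

The two parameters repl-backlog-size and repl-timeout bound what a slave and master can get through in one sync: repl-backlog-size is the size of the backlog buffer available for partial resynchronization, and repl-timeout is the time limit (in seconds) for the bulk transfer and for master/slave pings. If a sync exceeds these limits, the errors above are the result.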

Symptoms:

After restarting the slave, the master logged: Client id=1317049445 addr=10.10.3.112:7412 fd=39 name= age=394 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=4360 omem=76118609 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

Note this part: psync scheduled to be closed ASAP for overcoming of output buffer limits. It looks like the psync client is going to be closed because it exceeded the output buffer limits.

So I looked into client-output-buffer-limit and found that it is a Redis protection mechanism. The configuration format is:

client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
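The parameters mean the following:

class: the client type, one of Normal, Slaves and Pub/Sub
Normal: ordinary clients. The default limit is 0, meaning no limit.
Pub/Sub: publish/subscribe clients. Default hard limit 32M, soft limit 8M/60s.
Slaves: the replication clients of slaves. Default hard limit 256M, soft limit 64M/60s.
hard limit: the hard limit on the buffer size.
soft limit: the soft limit on the buffer size.
soft seconds: how long the buffer size may stay at or above the soft limit.
client-output-buffer-limit caps the size of the buffers that get allocated, to keep memory allocation from growing without bound. Redis protects itself as follows:

If the client buffer reaches the soft limit and stays there for soft seconds, the connection to the client is closed immediately
If the client buffer reaches the hard limit, the server also closes the connection to the client immediately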
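
The two rules above can be sketched as a small Python function. This is only an illustration of the behaviour described here, not Redis's actual implementation; the function name, its parameters and the way time is passed in are assumptions made for the example.

```python
from typing import Optional, Tuple

MB = 1024 * 1024


def check_output_buffer(
    used_bytes: int,
    hard_limit: int,
    soft_limit: int,
    soft_seconds: int,
    now: float,
    soft_since: Optional[float],
) -> Tuple[bool, Optional[float]]:
    """Return (should_close, new_soft_since) for one client.

    soft_since is the time at which the buffer first reached the soft
    limit, or None if it is currently below it. A limit of 0 means
    "no limit", as in redis.conf.
    """
    if hard_limit and used_bytes >= hard_limit:
        return True, soft_since            # hard limit: close at once
    if soft_limit and used_bytes >= soft_limit:
        if soft_since is None:
            return False, now              # just crossed the soft limit
        if now - soft_since >= soft_seconds:
            return True, soft_since        # over the soft limit too long
        return False, soft_since
    return False, None                     # below the soft limit: reset
```

With the default slave limits (256mb 64mb 60), a buffer that sits at 70 MB is tolerated at first and only leads to a disconnect once it has stayed there for 60 seconds, while a buffer of 300 MB is rejected on the first check.
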
Now look at this setting on our slave, which is in fact just the default configuration:

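# The client output buffer limits can be used to force disconnection of clients

# that, for some reason, are not reading data from the server fast enough (a common reason is that a Pub/Sub client cannot consume messages as fast as the publisher produces them).

# The limit can be set differently for three classes of clients:

# normal -> normal clients

# slave -> slave clients and MONITOR clients

# pubsub -> clients subscribed to at least one pubsub channel or pattern

# The syntax of every client-output-buffer-limit directive is:

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>

# A client is disconnected immediately once the hard limit is reached, or if the soft limit is reached and remains reached for the specified number of seconds (continuously).

# For example, if the hard limit is 32 megabytes and the soft limit is 16 megabytes / 10 seconds, the client is disconnected immediately

# if the output buffer reaches 32 megabytes, and is also disconnected if it reaches 16 megabytes and stays over the limit for 10 continuous seconds.

# By default normal clients are not limited, because they do not receive data without asking for it (in a push fashion) but only after a request,

# so only asynchronous clients may create a scenario where data is requested faster than it can be read.

# Set both the hard and the soft limit to 0 to disable the feature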

client-output-buffer-limit normal 0 0 0

client-output-buffer-limit slave 256mb 64mb 60

client-output-buffer-limit pubsub 32mb 8mb 60
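
To relate the master's log line quoted earlier to these defaults, the fields of the client description can be pulled apart and compared against the slave limits. The following is a rough sketch in Python; the helper name parse_client_fields is made up for this example, and the limits are the default values shown above, assumed to be what the master was running with.

```python
import re

MB = 1024 * 1024

# Client description copied from the master's log line in this post.
LOG_LINE = (
    "Client id=1317049445 addr=10.10.3.112:7412 fd=39 name= age=394 idle=0 "
    "flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 "
    "oll=4360 omem=76118609 events=rw cmd=psync"
)


def parse_client_fields(line: str) -> dict:
    """Extract the key=value pairs from a Redis client description."""
    return dict(re.findall(r"(\S+?)=(\S*)", line))


fields = parse_client_fields(LOG_LINE)
omem = int(fields["omem"])   # bytes used by the output buffer
age = int(fields["age"])     # connection age in seconds

# Default "slave" class limits: 256mb hard, 64mb soft, 60 seconds.
hard, soft, soft_seconds = 256 * MB, 64 * MB, 60

print(f"omem = {omem / MB:.1f} MB")
print("at or over hard limit:", omem >= hard)
print("at or over soft limit:", omem >= soft)
print("connection older than soft seconds:", age > soft_seconds)
```

Run as is, this reports an omem of about 72.6 MB: above the 64mb soft limit but below the 256mb hard limit, on a connection whose age (394 s) is well past the 60 s window. That is consistent with the soft limit, rather than the hard limit, being what closed the connection, although the log line alone does not prove it.
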

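The Redis replication buffer is really just one kind of client buffer. It holds every data-modifying operation executed on the master during the following three periods:

the time the master spends in rdb bgsave producing the snapshot
the network transfer time for the master to send the rdb to the slave
the time the slave spends loading the rdb file to restore the data into memory
Note that this is the same stream of write commands that the replication backlog holds, but the two are separate buffers: the replication buffer is a per-slave client output buffer, while the replication backlog is a single fixed-size circular buffer (repl-backlog-size) used for partial resynchronization.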
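
The three periods listed above suggest a back-of-the-envelope estimate of how much data the master has to buffer for one slave during a full sync: the write rate multiplied by the total duration. The sketch below is in Python; the function name and all of the workload numbers are hypothetical and only illustrate the arithmetic.

```python
MB = 1024 * 1024


def estimate_replication_buffer_bytes(
    write_bytes_per_sec: float,
    bgsave_seconds: float,
    transfer_seconds: float,
    load_seconds: float,
) -> float:
    """Rough estimate of the data buffered for one slave during a full sync."""
    total_seconds = bgsave_seconds + transfer_seconds + load_seconds
    return write_bytes_per_sec * total_seconds


# Hypothetical workload: 1 MB/s of writes, 180 s bgsave,
# 120 s network transfer, 200 s for the slave to load the rdb.
needed = estimate_replication_buffer_bytes(1 * MB, 180, 120, 200)
fits_default_hard_limit = needed <= 256 * MB

print(f"estimated buffer: {needed / MB:.0f} MB")
print("fits under the default 256mb hard limit:", fits_default_hard_limit)
```

With these made-up numbers the estimate is 500 MB, which would not fit under the default slave hard limit of 256mb. Either the limit has to grow or the sync has to get faster.
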

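The replication buffer is sized by client-output-buffer-limit slave. When this value is too small, the master-slave replication link gets disconnected:

When the master-slave replication connection drops, the server frees the data structures associated with the connection, so the data in the replication buffer is lost and master and slave start the replication process over.
There is an even more serious problem: the dropped replication connection leads to an endless loop of rdb bgsave and rdb retransmission between master and slave.
So it does look like the server (here, the master) actively closes the client (slave) connection because of the buffer size. Our volume of data changes was too large and exceeded client-output-buffer-limit, so the replication connection was dropped. The slave then requested a psync, but because repl-backlog-size was too small the psync failed and a full sync was required. A full sync means "Discarding previously cached master state" and reloading the RDB file into memory, and that load is blocking, which is why the slave was intermittently unavailable. After switching to the master this does not occur, because the master's entire sync work is done by a forked child process and so does not stop the parent process from serving requests. This explains every symptom we saw.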

Changed the configuration to "client-output-buffer-limit slave 0 0 0" and restarted the slave. Problem solved.

admin · posted 2024-5-29 06:00:03

Answer
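The Redis replication timeout sets how long either side of a replication link waits before giving up on the other. It covers the bulk transfer I/O during SYNC as seen from the slave, the master timing out as seen from the slave (no data or pings received), and the slave timing out as seen from the master (no REPLCONF ACK received). Setting it sensibly helps avoid stale reads when the network fails or a slave stops responding.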

The repl-timeout setting in the Redis configuration file controls this value, and it's specified in seconds. By default, it's set to 60 seconds. You can change it as follows:

# In your redis.conf file
repl-timeout 120
This sets the replication timeout to 120 seconds. Save and exit the file, then restart the Redis server for the changes to take effect.

You can also modify this setting at runtime using the CONFIG SET command:

# Via Redis CLI
redis-cli CONFIG SET repl-timeout 120
Be cautious when adjusting this value. Setting it too low may lead to unnecessary disconnections due to transient network issues, while setting it too high may delay the detection of genuine problems with slave nodes.

admin · posted 2024-5-29 06:00:04

Here's the output from the master's log:
>
> [9470] 13 Sep 22:24:04.789 * Slave ask for synchronization
> [9470] 13 Sep 22:24:04.789 * Starting BGSAVE for SYNC
> [9470] 13 Sep 22:24:09.454 * Background saving started by pid 18435
> [9470] 13 Sep 22:26:37.105 # Client addr=10.108.61.163:44422 fd=157 age=153
> idle=153 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=8361
> oll=3975 omem=100347928 events=r cmd=sync scheduled to be closed ASAP for
> overcoming of output buffer limits.
> [18435] 13 Sep 22:29:06.199 * DB saved on disk
> [9470] 13 Sep 22:29:07.138 * Background saving terminated with success
>
> The slave's view:
> [23037] 13 Sep 21:42:07.799 * The server is now ready to accept connections
> at /tmp/redis.sock
> [23037] 13 Sep 21:42:07.799 * Connecting to MASTER...
> [23037] 13 Sep 21:42:07.799 * MASTER <-> SLAVE sync started
> [23037] 13 Sep 21:42:07.800 * Non blocking connect for SYNC fired the event.
> [23037] 13 Sep 21:45:43.167 # Timeout receiving bulk data from MASTER...
> [23037] 13 Sep 21:45:43.167 * Connecting to MASTER...
> etc.
>
> And the configuration on master:
> redis-cli -p 6380 config get client-output-buffer-limit
> 1) "client-output-buffer-limit"
> 2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
>
> So there was only 100mb in the output buffer, but it was shut down even
> though the limit was 268mb. Any ideas what might be going on? Apologies if
> this is a repost.

admin · posted 2024-5-29 06:00:05

I adopted pretty much all the recommendations, including the ssh
compression idea, from you two and so:

root@redis3:~# redis-cli config get * | grep -A2 repl
repl-ping-slave-period
10
repl-timeout
1800
repl-backlog-size
1048576000
repl-backlog-ttl
3600

I eventually also changed this, from no to yes (although I know it may
increase the bandwidth requirements):
repl-disable-tcp-nodelay
yes

But the result is actually worse.... by that I mean the master kicks
off the slave much faster.

Master:
[3355] 30 May 02:56:28.875 * Slave asks for synchronization
[3355] 30 May 02:56:28.875 * Full resync requested by slave.
[3355] 30 May 02:56:28.875 * Starting BGSAVE for SYNC
[3355] 30 May 02:56:35.376 * Background saving started by pid 16330
[3355] 30 May 02:56:43.733 # Client addr=127.0.0.1:49630 fd=185 name=
age=15 idle=15 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0
obl=16372 oll=2896 omem=268753696 events=r cmd=psync scheduled to be
closed ASAP for overcoming of output buffer limits.
[16330] 30 May 02:59:44.675 * DB saved on disk
[16330] 30 May 02:59:46.216 * RDB: 2397 MB of memory used by copy-on-write
[3355] 30 May 02:59:48.806 * Background saving terminated with success
...

Slave:
[19460] 30 May 02:56:27.975 * DB loaded from disk: 210.107 seconds
[19460] 30 May 02:56:27.976 * The server is now ready to accept
connections on port 6379
[19460] 30 May 02:56:28.869 * Connecting to MASTER localhost:6280
[19460] 30 May 02:56:28.869 * MASTER <-> SLAVE sync started
[19460] 30 May 02:56:28.870 * Non blocking connect for SYNC fired the event.
[19460] 30 May 02:56:28.873 * Master replied to PING, replication can
continue...
[19460] 30 May 02:56:28.875 * Partial resynchronization not possible
(no cached master)
[19460] 30 May 02:56:28.877 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:68385253408
...

So, 3 questions please:

1. Do I need to apply all of the same configuration on the slave too?

2. Can the "Partial resynchronization not possible (no cached master)"
be overcome somehow? I have full backups via BGSAVE of the master each
hour (yes, I stopped those while attempting the slaving stuff) so I
could use the latest one to load it manually in the slave and then
attach the slave to the master - last time I tried that however, the
slave still requested a full resync....
