
Posted by admin on 2026-3-28 10:23:06

Repairing a failed mon node in a Ceph distributed storage cluster and rejoining it to the cluster

Stop the faulty Ceph mon service:
# systemctl stop ceph-mon@host11

Delete the faulty mon's data:
# rm -rf /var/lib/ceph/mon/ceph-host11/*
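Deleting the old store with rm -rf leaves nothing to fall back on if the later copy fails. A more cautious variant, sketched here and not what the author ran, moves the directory aside and recreates an empty one. The function name backup_mon_dir and the .bak.<timestamp> suffix are hypothetical; the sketch assumes it runs as root on host11 with the mon already stopped.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): move a mon data directory
# aside instead of deleting it, then recreate an empty directory for the copy.
# Assumes root on the broken mon host and that ceph-mon@<id> is already stopped.
backup_mon_dir() {
  local mon_dir=$1                 # e.g. /var/lib/ceph/mon/ceph-host11
  local owner=${2:-ceph:ceph}      # owner for the recreated, empty directory
  local backup="${mon_dir}.bak.$(date +%Y%m%d%H%M%S)"

  if [ ! -d "$mon_dir" ]; then
    echo "no such directory: $mon_dir" >&2
    return 1
  fi
  mv "$mon_dir" "$backup" || return 1   # keep the old store for later inspection
  mkdir -p "$mon_dir" || return 1
  chown "$owner" "$mon_dir" || return 1
  echo "$backup"                        # report where the old data went
}
```

For example, backup_mon_dir /var/lib/ceph/mon/ceph-host11 prints the backup path; that directory can be deleted once the mon is back in quorum.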

Try to sync the store from a healthy mon (host10) with rsync:
# rsync -avz root@host10:/var/lib/ceph/mon/ceph-host10/ /var/lib/ceph/mon/ceph-host11/
-bash: rsync: command not found
# dnf install -y rsync
-bash: dnf: command not found
Error: rsync is not installed and dnf is not available to install it, so rsync cannot be used.
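"dnf: command not found" usually points to an older distribution; on CentOS 7-style hosts the package manager is yum. A minimal sketch that picks whichever tool is present is below; the helper names are hypothetical. Note that rsync has to be installed on both hosts, because rsync over ssh starts a second rsync process on the remote side (host10).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first package manager found on this host.
detect_pkg_manager() {
  local pm
  for pm in dnf yum apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  echo "no known package manager found" >&2
  return 1
}

# Hypothetical helper: install rsync locally with the detected tool.
# Run it on the source host (host10) as well, since rsync is needed there too.
install_rsync() {
  local pm
  pm=$(detect_pkg_manager) || return 1
  "$pm" install -y rsync
}
```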

Copy with scp instead:
# scp root@host10:/var/lib/ceph/mon/ceph-host10/ /var/lib/ceph/mon/ceph-host11/
scp: /var/lib/ceph/mon/ceph-host10: not a regular file

Copying a directory requires scp -r:
# scp -r root@host10:/var/lib/ceph/mon/ceph-host10/ /var/lib/ceph/mon/ceph-host11/
kv_backend          100%     8     9.3KB/s   00:00
LOCK                100%     0     0.0KB/s   00:00
CURRENT             100%    17    24.8KB/s   00:00
IDENTITY            100%    37     2.7KB/s   00:00
OPTIONS-9023316     100%  4943     5.9MB/s   00:00
MANIFEST-9024281    100% 4822KB   49.2MB/s   00:00
OPTIONS-9024284     100%  4943     6.8MB/s   00:00
9106767.log         100%   14MB   51.4MB/s   00:00
9106769.sst         100%   57MB   55.0MB/s   00:01
keyring             100%    77    43.2KB/s   00:00
done                100%     0     0.0KB/s   00:00
systemd             100%     0     0.0KB/s   00:00
min_mon_release     100%     3     0.2KB/s   00:00
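An alternative that needs neither rsync nor a follow-up mv is to stream the directory contents through tar over ssh. Archiving "." copies what is inside ceph-host10/ rather than the directory itself, so no nested directory appears on host11. This is a sketch, not the author's command; the function name copy_mon_store is hypothetical, and it assumes root ssh access to host10 and paths without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the *contents* of a healthy mon's data directory
# into the directory of the mon being rebuilt, without rsync.
# Archiving "." avoids the nested ceph-host10/ that scp -r creates when the
# target directory already exists. Assumes paths contain no spaces.
copy_mon_store() {
  local src_host=$1   # e.g. root@host10
  local src_dir=$2    # e.g. /var/lib/ceph/mon/ceph-host10
  local dst_dir=$3    # e.g. /var/lib/ceph/mon/ceph-host11
  ssh "$src_host" tar -C "$src_dir" -cf - . | tar -C "$dst_dir" -xpf -
  local status=("${PIPESTATUS[@]}")
  # succeed only if both the remote tar and the local tar succeeded
  [ "${status[0]}" -eq 0 ] && [ "${status[1]}" -eq 0 ]
}
```

Usage: copy_mon_store root@host10 /var/lib/ceph/mon/ceph-host10 /var/lib/ceph/mon/ceph-host11. Running chown -R ceph:ceph on the target afterwards, as the author does below, is still a sensible final step.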

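Because /var/lib/ceph/mon/ceph-host11/ already existed, scp -r created a nested ceph-host10/ directory inside it. From inside that nested directory, move the files up one level:
# ls
done  keyring  kv_backend  min_mon_release  store.db  systemd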
# mv * ..
# ls
# cd ..
# ls
ceph-host10  done  keyring  kv_backend  min_mon_release  store.db  systemd

After removing the now-empty ceph-host10 directory:
# ls
done  keyring  kv_backend  min_mon_release  store.db  systemd
# ll
total 12
-rw-r--r-- 1 root root   0 Mar 28 07:20 done
-rw------- 1 root root  77 Mar 28 07:20 keyring
-rw------- 1 root root   8 Mar 28 07:20 kv_backend
-rw------- 1 root root   3 Mar 28 07:20 min_mon_release
drwxr-xr-x 2 root root 157 Mar 28 07:20 store.db
-rw-r--r-- 1 root root   0 Mar 28 07:20 systemd
# cd ..
# ls
ceph-host11
# ll
total 0
drwxr-xr-x 3 ceph ceph 105 Mar 28 07:20 ceph-host11
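The copied files are owned by root, but ceph-mon runs with --setuser ceph --setgroup ceph, so change the ownership:
# chown -R ceph:ceph ceph-host11/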
# cd ceph-host11/
# ls
done  keyring  kv_backend  min_mon_release  store.db  systemd
# ll
total 12
-rw-r--r-- 1 ceph ceph   0 Mar 28 07:20 done
-rw------- 1 ceph ceph  77 Mar 28 07:20 keyring
-rw------- 1 ceph ceph   8 Mar 28 07:20 kv_backend
-rw------- 1 ceph ceph   3 Mar 28 07:20 min_mon_release
drwxr-xr-x 2 ceph ceph 157 Mar 28 07:20 store.db
-rw-r--r-- 1 ceph ceph   0 Mar 28 07:20 systemd

Start the mon service:
# systemctl start ceph-mon@host11.service
Job for ceph-mon@host11.service failed because start of the service was attempted too often. See "systemctl status ceph-mon@host11.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed ceph-mon@host11.service" followed by "systemctl start ceph-mon@host11.service" again.

As the message suggests, clear the unit's failed state, which also resets systemd's start-rate counter:
# systemctl reset-failed ceph-mon@host11.service

Start it again:
# systemctl start ceph-mon@host11.service
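The reset-failed, start and status steps above can be wrapped into one function that also waits for systemd to report the unit active. This is a hedged sketch; start_mon and its retry count are hypothetical, and only standard systemctl subcommands are used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clear the failed/start-limit state of a mon unit,
# start it, and poll until systemd reports it active.
start_mon() {
  local id=$1            # mon id, e.g. host11
  local tries=${2:-10}   # number of one-second polls
  local unit="ceph-mon@${id}.service"
  local i

  systemctl reset-failed "$unit"   # clears "start ... attempted too often"
  systemctl start "$unit" || return 1
  for ((i = 0; i < tries; i++)); do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit is active"
      return 0
    fi
    sleep 1
  done
  echo "$unit did not become active; check: journalctl -u $unit" >&2
  return 1
}
```

Usage: start_mon host11. Being active in systemd only means the process is running; quorum membership still has to be checked with ceph -s.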
Check the status:
# systemctl status ceph-mon@host11.service
● ceph-mon@host11.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2026-03-28 07:22:00 CST; 11s ago
Main PID: 68995 (ceph-mon)
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@host11.service
      └─68995 /usr/bin/ceph-mon -f --cluster ceph --id host11 --setuser ceph --setgroup ceph

Mar 28 07:22:00 host11 systemd: Started Ceph cluster monitor daemon.
Mar 28 07:22:06 host11 ceph-mon: 2026-03-28 07:22:06.757 7fc8eef35700 -1 mon.host11@2(electing) e3 failed to get devid for : udev_device_new_from_sub...iled on ''
Mar 28 07:22:06 host11 ceph-mon: 2026-03-28 07:22:06.793 7fc8eef35700 -1 mon.host11@2(electing) e3 failed to get devid for : udev_device_new_from_sub...iled on ''
Hint: Some lines were ellipsized, use -l to show in full.
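The two "failed to get devid" errors were logged while mon.host11 was still electing; they did not keep it out of quorum. Check the cluster status: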
# ceph -s
  cluster:
    id:     9d22e36a-2bdd-4d2d-8394-ead777
    health: HEALTH_WARN
            3 nearfull osd(s)
            5 pool(s) nearfull
            5 daemons have recently crashed

  services:
    mon: 3 daemons, quorum host09,host10,host11 (age 22s)
    mgr: host09(active, since 6w), standbys: host11, host10
    osd: 40 osds: 40 up (since 6w), 40 in (since 6w)

  data:
    pools:   16 pools, 3072 pgs
    objects: 7.15M objects, 27 TiB
    usage:   65 TiB used, 83 TiB / 147 TiB avail
    pgs:     3069 active+clean
             3    active+clean+scrubbing+deep

  io:
    client:   16 MiB/s rd, 27 MiB/s wr, 339 op/s rd, 380 op/s wr

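The mon problem is resolved: host11 has rejoined the quorum (host09, host10, host11). The remaining HEALTH_WARN items (nearfull OSDs and pools, recently crashed daemons) are not addressed in this post.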
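To check quorum membership from a script instead of reading ceph -s, the quorum_names list in ceph quorum_status can be inspected. A minimal sketch, assuming jq is not installed and that --format json prints compact JSON with a "quorum_names" array; a real JSON parser would be more robust. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given mon id appears in quorum_names.
# Extracts the array with grep, assuming compact JSON output.
mon_in_quorum() {
  local id=$1
  local names
  names=$(ceph quorum_status --format json 2>/dev/null |
    grep -o '"quorum_names":\[[^]]*]') || return 1
  case "$names" in
    *"\"$id\""*) echo "mon.$id is in quorum: $names"; return 0 ;;
    *) echo "mon.$id is NOT in quorum: $names" >&2; return 1 ;;
  esac
}
```

Usage: mon_in_quorum host11, run on any node with an admin keyring. The quotes in the pattern stop an id such as host1 from matching host11.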


易陆发现互联网技术论坛's Archiver
