admin posted on 2021-7-25 20:37:36

HEALTH_WARN 1 failed cephadm daemon(s)

# ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s)
CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon alertmanager.controller on controller is in error state
    daemon grafana.controller on controller is in error state
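
To see why the daemons are failing, the orchestrator status and the per-daemon logs are the usual starting point. A minimal sketch, run on the host the daemons live on (controller here), using the daemon names from the warning above (add --fsid to cephadm logs if more than one cluster is present on the host):

# ceph orch ps --refresh
# cephadm logs --name alertmanager.controller
# cephadm logs --name grafana.controller

ceph orch ps shows the error state and the container image in use; cephadm logs wraps journalctl for that daemon's systemd unit.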


admin posted on 2021-7-25 20:52:05

After investigation, the cause appears to be a Ceph cluster that was previously installed at the system level and never cleaned up completely. I don't yet know how to remove it entirely under the new version; still testing.
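
For reference, cephadm does have a wholesale teardown command for exactly this case; a sketch, assuming the leftover cluster's fsid has been identified with cephadm ls and that nothing on it is still needed (the fsid below is a placeholder):

# cephadm ls | grep fsid
# cephadm rm-cluster --force --fsid <fsid-of-old-cluster>

This has to be run on every host that still carries remnants of the old cluster.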

admin posted on 2021-7-25 20:59:12

# ceph status
cluster:
    id:   4c1f752a-ed1a-11eb-8ce5-0025908471d6
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            clock skew detected on mon.compute01

services:
    mon: 2 daemons, quorum controller,compute01 (age 3h)
    mgr: compute01.getqhn(active, since 3h), standbys: controller.kxfttd
    osd: 3 osds: 3 up (since 3h), 3 in (since 3h)

data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   3.0 GiB used, 1.2 TiB / 1.2 TiB avail
    pgs:   1 active+clean
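
As an aside, the clock skew warning on mon.compute01 is a separate issue from the failed daemons. A minimal check, assuming chrony is the time service on both nodes:

# chronyc tracking
# chronyc sources -v

If the offset is large, chronyc makestep (or simply restarting chronyd) usually clears the warning once the monitors re-check.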


# systemctl status ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6
ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@crash.compute01.service          ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@osd.0.service
ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@mgr.compute01.bunbzp.service   ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target
ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@node-exporter.compute01.service
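
The units above belong to the old fsid (1e87bca4-…), not to the running cluster (4c1f752a-…). One way to enumerate everything systemd still knows about for that fsid, as a sketch:

# systemctl list-units --all 'ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6*'
# systemctl list-unit-files 'ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6*'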


# systemctl disable \
    ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@crash.compute01.service \
    ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@osd.0.service \
    ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@mgr.compute01.bunbzp.service \
    ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target \
    ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@node-exporter.compute01.service
Removed /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target.wants/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@crash.compute01.service.
Removed /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target.wants/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@mgr.compute01.bunbzp.service.
Removed /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target.wants/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@node-exporter.compute01.service.
Removed /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target.wants/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@osd.0.service.
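
Disabling only removes the target's wants links; the unit instances (and their containers) may still be loaded or sitting in a failed state. It is probably worth also stopping the old target and clearing the failed state, for example:

# systemctl stop ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target
# systemctl reset-failed 'ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@*'
# systemctl daemon-reload

cephadm's unit template marks the daemon instances PartOf= the per-cluster target, so stopping the target should stop them as well.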


# cd /var/lib/ceph
# rm -rf 1e87bca4-e7ce-11eb-aa90-0025908471d6/
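
Besides /var/lib/ceph/<fsid>, a cephadm install usually leaves a few more per-fsid artifacts behind. A hedged list of the standard locations for this old fsid (verify each path exists and really belongs to the old cluster before deleting):

# rm -f /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6@.service
# rm -f /etc/systemd/system/ceph-1e87bca4-e7ce-11eb-aa90-0025908471d6.target
# rm -rf /var/log/ceph/1e87bca4-e7ce-11eb-aa90-0025908471d6
# rm -rf /var/run/ceph/1e87bca4-e7ce-11eb-aa90-0025908471d6
# systemctl daemon-reload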


# ceph status
cluster:
    id:   4c1f752a-ed1a-11eb-8ce5-0025908471d6
    health: HEALTH_OK

services:
    mon: 2 daemons, quorum controller,compute01 (age 84s)
    mgr: compute01.getqhn(active, since 30s)
    osd: 3 osds: 3 up (since 14s), 3 in (since 4h)

data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   3.0 GiB used, 1.2 TiB / 1.2 TiB avail
    pgs:   1 active+clean

Surprisingly, the problem was resolved.

admin posted on 2023-8-19 09:50:20

Quoting admin, posted on 2021-7-25 20:59:
# ceph status
cluster:
    id:   4c1f752a-ed1a-11eb-8ce5-0025908471d6

The likely cause is that a different cluster_id was generated, leaving the two installs with inconsistent, mismatched state. If this problem also shows up on a clean system, the root cause still needs to be tracked down.
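
A quick way to spot this kind of mismatch is to compare the fsid the live cluster reports against what is actually on disk and in cephadm's inventory; a minimal sketch:

# ceph fsid
# ls /var/lib/ceph/
# cephadm ls | grep fsid

Any directory or daemon whose fsid differs from the output of ceph fsid belongs to a leftover cluster and is a candidate for cleanup.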