|
|
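This article describes how to recover a ceph cluster when none of the mon services can start, or when the operating systems of all the servers hosting mon nodes fail to boot. Such an extreme case is very unlikely, and recovery is only possible if the mon nodes' configuration files and metadata were backed up beforehand; without that backup there is nothing to restore from. My environment was deployed with kolla, and the steps below only work as written on a kolla deployment. On a bare-metal deployment the specific operations differ, but the general approach and principles are the same.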
Backing up the configuration files and metadata

With a kolla deployment, the default metadata storage path is:

[root@node01 mon]# cd /var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.11/
[root@node01 ceph-172.21.196.11]# ls
keyring  store.db
Back up this directory in full:

[root@node01 ~]# cp -r /var/lib/docker/volumes/ceph_mon/_data/ /root/ceph-mon-bak/
There are also the configuration files, which hold the keys. Their default path is:

[root@node01 ~]# cd /etc/kolla/ceph-mon/
[root@node01 ceph-mon]# ls
ceph.client.admin.keyring  ceph.client.mon.keyring  ceph.client.radosgw.keyring  ceph.conf  ceph.monmap  config.json
Likewise, back up this directory in full:

[root@node01 ceph-mon]# cp -r /etc/kolla/ceph-mon /root/ceph-mon-bak/
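
The two copy steps above can be wrapped into one helper. This is only a minimal sketch: the function name backup_mon and its arguments are made up for illustration, the default paths are the ones shown in this article, and it uses cp -a rather than the article's cp -r so that ownership and permissions are preserved. Copying store.db while the mon is running may capture an inconsistent snapshot, so stopping the mon first is the safer assumption.

```shell
# backup_mon DEST [DATA_DIR] [CONF_DIR]   (hypothetical helper, not part of kolla)
# Copies the mon data volume and the kolla config directory into DEST,
# producing DEST/_data and DEST/ceph-mon, the layout used later in this article.
backup_mon() {
    local dest="$1"
    local data_dir="${2:-/var/lib/docker/volumes/ceph_mon/_data}"
    local conf_dir="${3:-/etc/kolla/ceph-mon}"

    [ -n "$dest" ] || { echo "usage: backup_mon DEST [DATA_DIR] [CONF_DIR]" >&2; return 2; }
    # Refuse to reuse an existing backup so that cp cannot nest directories.
    [ ! -e "$dest" ] || { echo "refusing to overwrite $dest" >&2; return 1; }

    mkdir -p "$dest" || return 1
    # -a keeps ownership, permissions and timestamps, unlike plain -r.
    command cp -a "$data_dir" "$dest/_data" || return 1
    command cp -a "$conf_dir" "$dest/ceph-mon" || return 1
}
```

A call such as backup_mon /root/ceph-mon-bak would reproduce the backup directory used in the rest of this article, provided that directory does not exist yet.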

All the relevant processes here run inside docker. Before deleting the mon service and its data, check the overall state of the cluster:

[root@node01 kolla]# docker exec -ti ceph_mon /bin/bash
(ceph-mon)[root@node01 /]# ceph -s
    cluster 84ff3941-2337-40ca-bd76-fb3be71b0fdd
     health HEALTH_OK
     monmap e4: 2 mons at {172.21.196.11=172.21.196.11:6789/0,172.21.196.12=172.21.196.12:6789/0}
            election epoch 44, quorum 0,1 172.21.196.11,172.21.196.12
     osdmap e491: 15 osds: 15 up, 15 in
            flags sortbitwise,require_jewel_osds
      pgmap v679746: 7096 pgs, 25 pools, 184 GB data, 50069 objects
            555 GB used, 39643 GB / 40199 GB avail
                7096 active+clean

There are 2 mon nodes and 15 OSDs. Now delete both the mon data and the mon container:

[root@node01 kolla]# docker rm -f ceph_mon
[root@node01 kolla]# rm -rf /var/lib/docker/volumes/ceph_mon/
[root@node01 kolla]# rm -rf /etc/kolla/ceph-mon/
Perform the same operations on the other node. After the deletion, modify the kolla configuration file and run the deploy command. The deployment fails partway through, while ceph is creating the keyrings, with the following error:

TASK [ceph : Fetching Ceph keyrings] *******************************************

fatal: [controller01]: FAILED! => {"failed": true, "msg": "The conditional check '{{ (ceph_files_json.stdout | from_json).changed }}' failed. The error was: No JSON object could be decoded"

The cause is that the volumes kolla created were not removed when I deleted the container and the configuration files; they still exist under /var/lib/docker/volumes. When kolla deploys again, these leftover volumes prevent ceph_mon from starting, so the Ceph keyrings cannot be fetched and the error above appears. The fix is to remove the ceph volumes listed by docker volume ls, after which the deployment succeeds.

[root@node02 kolla]# docker volume ls
DRIVER              VOLUME NAME
local               ceph_mon
local               ceph_mon_config
local               kolla_logs
local               libvirtd
local               nova_compute
local               nova_libvirt_qemu
local               openvswitch_db
[root@node02 kolla]# docker volume rm ceph_mon ceph_mon_config
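
Before redeploying, it may help to confirm that no mon volumes are left behind. The sketch below is an illustration only: stale_mon_volumes is a made-up name, and it assumes the two volume names shown in the listing above. It reads volume names on standard input, so it can be fed from docker volume ls -q.

```shell
# stale_mon_volumes   (hypothetical helper)
# Reads volume names on stdin and prints only the ceph mon volumes
# that this article removes before redeploying.
stale_mon_volumes() {
    # grep exits non-zero when nothing matches; treat that as "none left".
    grep -E '^ceph_mon(_config)?$' || true
}

# Example usage on a docker host (prints nothing once the volumes are gone):
#   docker volume ls -q | stale_mon_volumes
```

An empty result suggests the leftover volumes are gone and the deploy command can be run again.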

The deployment now succeeds. There are 2 mon nodes, and I added them back one at a time. After adding the first mon node, check the cluster status:

(ceph-mon)[root@node02 /]# ceph -s
    cluster 84ff3941-2337-40ca-bd76-fb3be71b0fdd
     health HEALTH_ERR
            no osds
     monmap e1: 1 mons at {172.21.196.12=172.21.196.12:6789/0}
            election epoch 3, quorum 0 172.21.196.12
     osdmap e2: 0 osds: 0 up, 0 in
            flags sortbitwise,require_jewel_osds
      pgmap v3: 320 pgs, 2 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 320 creating
This is effectively a brand-new ceph cluster. Now copy the previously backed-up data back, overwriting the new data:

[root@node02 ~]# cp -r ceph-mon-bak/_data/* /var/lib/docker/volumes/ceph_mon/_data/
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.12/store.db/LOCK’? y
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.12/store.db/CURRENT’? y
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.12/keyring’? y
[root@node02 ~]# cp -r ceph-mon-bak/ceph-mon/* /etc/kolla/ceph-mon/
cp: overwrite ‘/etc/kolla/ceph-mon/ceph.client.admin.keyring’? y
cp: overwrite ‘/etc/kolla/ceph-mon/ceph.client.mon.keyring’? y
cp: overwrite ‘/etc/kolla/ceph-mon/ceph.client.radosgw.keyring’? y
cp: overwrite ‘/etc/kolla/ceph-mon/ceph.conf’? y
cp: overwrite ‘/etc/kolla/ceph-mon/ceph.monmap’? y
cp: overwrite ‘/etc/kolla/ceph-mon/config.json’? y
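
The restore can be scripted in the same spirit. The following is a rough sketch rather than a kolla feature: restore_mon is a hypothetical name, the default paths are those used in this article, and the backup is assumed to have the _data and ceph-mon subdirectories created earlier. The overwrite prompts in the transcript above most likely come from an interactive cp -i alias, which the sketch bypasses with command cp.

```shell
# restore_mon BACKUP [DATA_DIR] [CONF_DIR]   (hypothetical helper)
# Copies BACKUP/_data and BACKUP/ceph-mon over the freshly deployed
# mon volume and kolla config directory.
restore_mon() {
    local bak="$1"
    local data_dir="${2:-/var/lib/docker/volumes/ceph_mon/_data}"
    local conf_dir="${3:-/etc/kolla/ceph-mon}"

    if [ ! -d "$bak/_data" ] || [ ! -d "$bak/ceph-mon" ]; then
        echo "backup at '$bak' is missing _data or ceph-mon" >&2
        return 1
    fi

    mkdir -p "$data_dir" "$conf_dir" || return 1
    # "src/." copies the directory contents, including dotfiles.
    # "command cp" skips any "cp -i" alias, so existing files are
    # overwritten without prompting.
    command cp -a "$bak/_data/." "$data_dir/" || return 1
    command cp -a "$bak/ceph-mon/." "$conf_dir/" || return 1
}
```

On each mon node this would be called as restore_mon /root/ceph-mon-bak, followed by a restart of the mon service as described next.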
After restarting the mon service, check the cluster status:

(ceph-mon)[root@node02 /]# ceph -s
Error connecting to cluster: TimedOut
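
Only one mon node has been started at this point, so the cluster state is still abnormal. The restored monmap lists two mons, and a single mon cannot form a majority quorum on its own, which is why the command times out. The other mon node has to be restored as well before the cluster recovers.
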
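
The timeout follows from the majority rule for monitor quorum: more than half of the mons in the monmap must be up. A small arithmetic sketch makes this concrete; the function names quorum_needed and has_quorum are invented for illustration.

```shell
# quorum_needed TOTAL   (hypothetical helper)
# Prints the smallest number of mons that is a strict majority of TOTAL.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL UP   (hypothetical helper)
# Succeeds when UP mons are enough to form a quorum out of TOTAL.
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}
```

With the two-mon monmap from the backup, quorum_needed 2 gives 2, so has_quorum 2 1 fails, which matches the behaviour seen above.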
Restore the configuration files and data on the second node:

[root@node01 ceph-mon]# cp -r /root/ceph-mon-bak/ceph-mon/* .
cp: overwrite ‘./ceph.client.admin.keyring’? y
cp: overwrite ‘./ceph.client.mon.keyring’? y
cp: overwrite ‘./ceph.client.radosgw.keyring’? y
cp: overwrite ‘./ceph.conf’? y
cp: overwrite ‘./ceph.monmap’? y
cp: overwrite ‘./config.json’? y
[root@node01 ceph-mon]#
[root@node01 ceph-mon]# cp -r /root/ceph-mon-bak/_data/
bootstrap-mds/ bootstrap-osd/ bootstrap-rgw/ mds/ mon/ osd/ radosgw/ tmp/
[root@node01 ceph-mon]# cp -r /root/ceph-mon-bak/_data/* /var/lib/docker/volumes/ceph_mon/_data/
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.11/store.db/LOCK’? y
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.11/store.db/CURRENT’? y
cp: overwrite ‘/var/lib/docker/volumes/ceph_mon/_data/mon/ceph-172.21.196.11/keyring’? y
Check the cluster status again:

[root@node01 ceph-mon]# docker exec -ti ceph_mon bash
(ceph-mon)[root@node01 /]# ceph -s
    cluster 84ff3941-2337-40ca-bd76-fb3be71b0fdd
     health HEALTH_OK
     monmap e4: 2 mons at {172.21.196.11=172.21.196.11:6789/0,172.21.196.12=172.21.196.12:6789/0}
            election epoch 44, quorum 0,1 172.21.196.11,172.21.196.12
     osdmap e491: 15 osds: 15 up, 15 in
            flags sortbitwise,require_jewel_osds
      pgmap v679711: 7096 pgs, 25 pools, 184 GB data, 50069 objects
            555 GB used, 39643 GB / 40199 GB avail
                7096 active+clean
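
Since the mons can take a moment to re-form quorum after both are restored, the final check could be polled instead of run once. This is a sketch under assumptions: wait_health_ok is a made-up helper, and it simply looks for a line starting with HEALTH_OK in the output of whatever command it is given.

```shell
# wait_health_ok TRIES CMD [ARGS...]   (hypothetical helper)
# Runs CMD up to TRIES times, one second apart, and succeeds as soon as
# its output contains a line beginning with HEALTH_OK.
wait_health_ok() {
    local tries="$1"
    shift
    local i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" 2>/dev/null | grep -q '^HEALTH_OK'; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example usage on a mon host, using the container name from this article:
#   wait_health_ok 30 docker exec ceph_mon ceph health
```

A non-zero return after all tries would mean the cluster never reported HEALTH_OK and the mon logs need a closer look.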
|
|