Current Ceph and CentOS versions:
[root@ceph1 ceph]# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
[root@ceph1 ceph]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

1. Error: configuration file contents differ between nodes
I ran ceph-deploy mon create-initial to gather the keys. It generated several keyrings in the current directory (~/etc/ceph/ in my case), but it also reported the error below. The error means that the configuration files on the two failed nodes differ from the one on the current node. It suggests using the --overwrite-conf option to overwrite them.
[root@ceph1 ceph]# ceph-deploy mon create-initial
...
[ceph2][DEBUG ] remote hostname: ceph2
[ceph2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.mon][ERROR ] RuntimeError: config file /etc/ceph/ceph.conf exists with different content; use --overwrite-conf to overwrite
[ceph_deploy][ERROR ] GenericError: Failed to create 2 monitors
...

The command I used is below (I configured three nodes here, ceph1 to ceph3):
[root@ceph1 ceph]# ceph-deploy --overwrite-conf mon create ceph{3,1,2}
...
[ceph2][DEBUG ] remote hostname: ceph2
[ceph2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph2][DEBUG ] create the mon path if it does not exist
[ceph2][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-ceph2/done
...
After that, the monitors are created successfully, and you can go on to initialize the disks.
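Before reaching for --overwrite-conf, it may be worth checking which nodes actually differ, because overwriting discards any intentional local edits on those nodes. The sketch below is one way to do that. conf_drift is a hypothetical helper, not part of ceph-deploy. It assumes you have already copied each node's /etc/ceph/ceph.conf to the local machine, for example with scp.

```shell
# conf_drift: hypothetical helper, not part of ceph-deploy.
# Compares a reference ceph.conf with copies fetched from other nodes and
# reports, per copy, whether its content is identical.
# A copy that is missing or unreadable is also reported as "differs".
# Usage: conf_drift REFERENCE_CONF COPY...
conf_drift() {
    local ref=$1
    shift
    for copy in "$@"; do
        if cmp -s "$ref" "$copy"; then
            echo "$copy: same"
        else
            echo "$copy: differs"
        fi
    done
}

# Example (assumes passwordless ssh/scp to the other nodes):
#   scp ceph2:/etc/ceph/ceph.conf /tmp/ceph2.conf
#   scp ceph3:/etc/ceph/ceph.conf /tmp/ceph3.conf
#   conf_drift /etc/ceph/ceph.conf /tmp/ceph2.conf /tmp/ceph3.conf
#   diff /etc/ceph/ceph.conf /tmp/ceph2.conf   # inspect a copy that differs
```

If the differing copies are merely stale, ceph-deploy's config push subcommand (ceph-deploy --overwrite-conf config push ceph1 ceph2 ceph3) should push the local file to all nodes in one step. Check ceph-deploy --help on your version before relying on it.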
2. Warning: too few PGs per OSD (21 < min 30)
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     8e2248e4-3bb0-4b62-ba93-f597b1a3bd40
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph2(active), standbys: ceph1, ceph3
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   4 pools, 32 pgs
    objects: 219 objects, 1.1 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     32 active+clean


The cluster status above shows 21 PGs per OSD, which is below the minimum of 30. There are 32 PGs in total, and I had configured 2 replicas. With 3 OSDs, each OSD therefore holds on average 32 × 2 ÷ 3 ≈ 21.3, i.e. 21 PGs. That is below the minimum of 30 and triggers the warning above.
If you store and operate on data while the cluster is in this state, you may find that the cluster hangs, stops responding to I/O, and that large numbers of OSDs go down.
Solution: increase the number of PGs.
Each of my pools has 8 PGs, so I need to add two more pools. That brings the PGs per OSD to 48 × 2 ÷ 3 = 32, above the minimum of 30.
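The arithmetic above can be checked with a small shell sketch. pg_per_osd and min_total_pgs are hypothetical helpers. The formula (total PGs × replica count ÷ number of OSDs) follows the reasoning in this section. It approximates how the warning is derived and is not taken from Ceph's source code.

```shell
# pg_per_osd: hypothetical helper.
# Average PG replicas per OSD = total_pgs * replica_count / osd_count,
# using integer division. Multiply before dividing: 32 / 3 * 2 would give 20.
# Usage: pg_per_osd TOTAL_PGS REPLICAS OSDS
pg_per_osd() {
    echo $(( $1 * $2 / $3 ))
}

# min_total_pgs: hypothetical helper.
# Smallest total PG count whose per-OSD average is at least MIN,
# i.e. the ceiling of MIN * OSDS / REPLICAS.
# Usage: min_total_pgs MIN OSDS REPLICAS
min_total_pgs() {
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Examples with the numbers from this section:
#   pg_per_osd 32 2 3     # before adding pools
#   pg_per_osd 48 2 3     # after adding two 8-PG pools
#   min_total_pgs 30 3 2  # PGs needed to clear the "min 30" threshold
```

With these numbers, min_total_pgs 30 3 2 gives 45. The 32 existing PGs therefore need two more 8-PG pools (48 in total), which matches the fix above. On Mimic, another option is to raise pg_num (and then pgp_num) of an existing pool with ceph osd pool set <pool> pg_num <n>. That avoids creating empty pools, but this section does not test it.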
[root@ceph1 ceph]# ceph osd pool create mytest 8
pool 'mytest' created
[root@ceph1 ceph]# ceph osd pool create mytest1 8
pool 'mytest1' created
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     8e2248e4-3bb0-4b62-ba93-f597b1a3bd40
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph2(active), standbys: ceph1, ceph3
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   6 pools, 48 pgs
    objects: 219 objects, 1.1 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     48 active+clean

The cluster health is back to normal.
3. Cluster status HEALTH_WARN: application not enabled on 1 pool(s)
The cluster status may now show HEALTH_WARN application not enabled on 1 pool(s):
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

[root@ceph1 ceph]# ceph health detail
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
    application not enabled on pool 'mytest'
    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.

Running ceph health detail shows that the newly added pool mytest has not been tagged with an application. I had previously added an RGW instance, so, following the hint, I tag mytest with rgw:
[root@ceph1 ceph]# ceph osd pool application enable mytest rgw
enabled application 'rgw' on pool 'mytest'

Checking the cluster status again shows that it is back to normal:
[root@ceph1 ceph]# ceph health
HEALTH_OK
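When several pools are flagged at once, you can pull the pool names out of the ceph health detail output instead of copying them by hand. pools_missing_app is a hypothetical helper. Its sed pattern assumes the exact message format shown above (Mimic 13.2).

```shell
# pools_missing_app: hypothetical helper.
# Reads `ceph health detail` output on stdin and prints the name of every
# pool reported as "application not enabled on pool '<name>'".
pools_missing_app() {
    sed -n "s/.*application not enabled on pool '\([^']*\)'.*/\1/p"
}

# Example: tag every flagged pool with rgw (only if they all belong to RGW):
#   ceph health detail | pools_missing_app | while read -r pool; do
#       ceph osd pool application enable "$pool" rgw
#   done
```

Tagging every pool with rgw, as in the usage comment, is only correct if all of them really serve RGW. Use cephfs or rbd for pools that belong to those applications.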
4. Error when deleting a pool
Take deleting the mytest pool as an example. Running ceph osd pool rm mytest fails. The message says the pool name must be given twice, followed by the --yes-i-really-really-mean-it flag.
[root@ceph1 ceph]# ceph osd pool rm mytest
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool mytest. If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.

Repeating the pool name and adding the flag as instructed still fails:
[root@ceph1 ceph]# ceph osd pool rm mytest mytest --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the
mon_allow_pool_delete config option to true before you can destroy a pool

The error says pool deletion is disabled. Before deleting, you must first add the mon_allow_pool_delete option to ceph.conf and set it to true. So I logged in to each node and edited its configuration file:
[root@ceph1 ceph]# vi ceph.conf
[root@ceph1 ceph]# systemctl restart ceph-mon.target

Add the following option at the bottom of ceph.conf, set to true. Save and exit, then restart the service with systemctl restart ceph-mon.target.
[mon]
mon allow pool delete = true

Do the same on the other nodes.
[root@ceph2 ceph]# vi ceph.conf
[root@ceph2 ceph]# systemctl restart ceph-mon.target
[root@ceph3 ceph]# vi ceph.conf
[root@ceph3 ceph]# systemctl restart ceph-mon.target
Delete the pool again, and mytest is removed successfully.
[root@ceph1 ceph]# ceph osd pool rm mytest mytest --yes-i-really-really-mean-it
pool 'mytest' removed
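Editing ceph.conf by hand on every node is easy to get wrong. The sketch below makes the change idempotently. enable_pool_delete is a hypothetical helper. It assumes GNU sed (as shipped with CentOS 7) and a standard INI-style ceph.conf. The monitors still need to be restarted afterwards, as described above.

```shell
# enable_pool_delete: hypothetical helper, assumes GNU sed (sed -i, one-line "a").
# Ensures CONF sets "mon allow pool delete = true". An existing setting
# (spelled with spaces, underscores or dashes) is rewritten to true;
# otherwise the line is added under an existing [mon] section or a new one.
# Usage: enable_pool_delete CONF
enable_pool_delete() {
    local conf=$1
    if grep -q '^[[:space:]]*mon[ _-]allow[ _-]pool[ _-]delete' "$conf"; then
        # rewrite the existing line in place, whatever its old value was
        sed -i 's/^[[:space:]]*mon[ _-]allow[ _-]pool[ _-]delete.*/mon allow pool delete = true/' "$conf"
    elif grep -q '^\[mon\]' "$conf"; then
        # add the option directly after the existing [mon] header
        sed -i '/^\[mon\]/a mon allow pool delete = true' "$conf"
    else
        # no [mon] section yet: append one at the end of the file
        printf '\n[mon]\nmon allow pool delete = true\n' >> "$conf"
    fi
}

# Example, run on every monitor node:
#   enable_pool_delete /etc/ceph/ceph.conf && systemctl restart ceph-mon.target
```

On Mimic, the same option can probably also be set without editing files. One route is the centralized configuration database (ceph config set mon mon_allow_pool_delete true). Another is a runtime change with ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'. Neither was tested in this article.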
5. Troubleshooting nodes after cluster nodes go down and come back
I shut down and restarted each of the three nodes in the Ceph cluster, then checked the cluster status:
[root@ceph1 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            324/702 objects misplaced (46.154%)
            Reduced data availability: 126 pgs inactive
            Degraded data redundancy: 144/702 objects degraded (20.513%), 3 pgs degraded, 126 pgs undersized

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:creating}
    osd: 3 osds: 3 up, 3 in; 162 remapped pgs

  data:
    pools:   8 pools, 288 pgs
    objects: 234 objects, 2.8 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     43.750% pgs not active
             144/702 objects degraded (20.513%)
             324/702 objects misplaced (46.154%)
             162 active+clean+remapped
             123 undersized+peered
             3   undersized+degraded+peered

Check the health details:
[root@ceph1 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 324/702 objects misplaced (46.154%); Reduced data availability: 126 pgs inactive; Degraded data redundancy: 144/702 objects degraded (20.513%), 3 pgs degraded, 126 pgs undersized
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph1(mds.0): 9 slow metadata IOs are blocked > 30 secs, oldest blocked for 42075 secs
OBJECT_MISPLACED 324/702 objects misplaced (46.154%)
PG_AVAILABILITY Reduced data availability: 126 pgs inactive
    pg 8.28 is stuck inactive for 42240.369934, current state undersized+peered, last acting [0]
    pg 8.2a is stuck inactive for 45566.934835, current state undersized+peered, last acting [0]
    pg 8.2d is stuck inactive for 42240.371314, current state undersized+peered, last acting [0]
    pg 8.2f is stuck inactive for 45566.913284, current state undersized+peered, last acting [0]
    pg 8.32 is stuck inactive for 42240.354304, current state undersized+peered, last acting [0]
    ....
    pg 8.28 is stuck undersized for 42065.616897, current state undersized+peered, last acting [0]
    pg 8.2a is stuck undersized for 42065.613246, current state undersized+peered, last acting [0]
    pg 8.2d is stuck undersized for 42065.951760, current state undersized+peered, last acting [0]
    pg 8.2f is stuck undersized for 42065.610464, current state undersized+peered, last acting [0]
    pg 8.32 is stuck undersized for 42065.959081, current state undersized+peered, last acting [0]
    ....

During data recovery, PGs in the inactive and undersized states appear. These states are abnormal.
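To see at a glance how many PGs are stuck and why, you can summarize the ceph health detail lines. stuck_pg_summary is a hypothetical helper. It assumes the "pg X is stuck REASON for N, current state S, ..." line format shown above.

```shell
# stuck_pg_summary: hypothetical helper.
# Reads `ceph health detail` output on stdin and counts the "is stuck" lines,
# grouped by stuck reason (field 5, e.g. inactive) and current state.
# Output lines look like: "<count> <reason> <state>", largest count first.
stuck_pg_summary() {
    awk '/ is stuck / {
        reason = $5                        # e.g. inactive, undersized
        state = $0
        sub(/.*current state /, "", state) # drop everything up to the state
        sub(/,.*/, "", state)              # drop ", last acting [...]"
        count[reason " " state]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# Example:
#   ceph health detail | stuck_pg_summary
```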
Solution:
① Handle the inactive PGs:
Restarting the OSD service is enough:
[root@ceph1 ~]# systemctl restart ceph-osd.target
Checking the cluster status again shows that the inactive PGs have recovered. Only the undersized PGs remain.
[root@ceph1 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            1 filesystem is degraded
            241/723 objects misplaced (33.333%)
            Degraded data redundancy: 59 pgs undersized

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:rejoin}
    osd: 3 osds: 3 up, 3 in; 229 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   8 pools, 288 pgs
    objects: 241 objects, 3.4 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     241/723 objects misplaced (33.333%)
             224 active+clean+remapped
             59  active+undersized
             5   active+clean

  io:
    client:   1.2 KiB/s rd, 1 op/s rd, 0 op/s wr

② Handle the undersized PGs:

Make it a habit to check the health details first when something goes wrong. A careful look shows that although the configured replica count is 3, the 12.x PGs each have only two copies, stored on two of OSDs 0 to 2.
[root@ceph1 ~]# ceph health detail
HEALTH_WARN 241/723 objects misplaced (33.333%); Degraded data redundancy: 59 pgs undersized
OBJECT_MISPLACED 241/723 objects misplaced (33.333%)
PG_DEGRADED Degraded data redundancy: 59 pgs undersized
    pg 12.8 is stuck undersized for 1910.001993, current state active+undersized, last acting [2,0]
    pg 12.9 is stuck undersized for 1909.989334, current state active+undersized, last acting [2,0]
    pg 12.a is stuck undersized for 1909.995807, current state active+undersized, last acting [0,2]
    pg 12.b is stuck undersized for 1910.009596, current state active+undersized, last acting [1,0]
    pg 12.c is stuck undersized for 1910.010185, current state active+undersized, last acting [0,2]
    pg 12.d is stuck undersized for 1910.001526, current state active+undersized, last acting [1,0]
    pg 12.e is stuck undersized for 1909.984982, current state active+undersized, last acting [2,0]
    pg 12.f is stuck undersized for 1910.010640, current state active+undersized, last acting [2,0]

The OSD tree shows what changed after ceph2 and ceph3 went down and came back. osd.1 and osd.2 no longer sit under the ceph2 and ceph3 hosts. Both now appear under a host named centos7evcloud, while ceph2 and ceph3 have a weight of 0.
[root@ceph1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       0.24239 root default
-9       0.16159     host centos7evcloud
 1   hdd 0.08080         osd.1               up  1.00000 1.00000
 2   hdd 0.08080         osd.2               up  1.00000 1.00000
-3       0.08080     host ceph1
 0   hdd 0.08080         osd.0               up  1.00000 1.00000
-5             0     host ceph2
-7             0     host ceph3
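The host bucket name centos7evcloud suggests that ceph2 and ceph3 came back from the reboot with a different hostname (an assumption, not confirmed here). With osd crush update on start enabled (the default), an OSD places itself at startup under the host bucket named after the machine's current hostname. That would explain why the OSDs moved. It would also explain why a later restart moved them back, presumably once the hostname was correct again. The sketch below uses a hypothetical helper, osd_hosts, to list which host each OSD sits under and to flag host buckets with no OSDs. It is based on the ceph osd tree layout shown above.

```shell
# osd_hosts: hypothetical helper.
# Reads `ceph osd tree` output on stdin, prints "osd.N -> HOST" for every OSD
# (HOST = the most recent "host" bucket above it), then reports host buckets
# that contain no OSDs.
osd_hosts() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "host") { host = $(i + 1); n[host] += 0; order[++h] = host }
        for (i = 1; i <= NF; i++)
            if ($i ~ /^osd\.[0-9]+$/) { print $i " -> " host; n[host]++ }
    }
    END { for (j = 1; j <= h; j++) if (n[order[j]] == 0) print "empty host: " order[j] }'
}

# Example:
#   ceph osd tree | osd_hosts
# If the hostname theory holds, checking `hostname -s` on ceph2/ceph3 and, if
# needed, making the name persistent (e.g. hostnamectl set-hostname ceph2)
# before restarting the OSD should keep it under the right host bucket.
```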

Check the status of the osd.1 and osd.2 services on their nodes.
Solution:
Log in to ceph2 and ceph3 and restart osd.1 and osd.2 respectively. This maps the two OSDs back under the ceph2 and ceph3 hosts.
[root@ceph1 ~]# ssh ceph2
[root@ceph2 ~]# systemctl restart ceph-osd@1.service
[root@ceph2 ~]# ssh ceph3
[root@ceph3 ~]# systemctl restart ceph-osd@2.service
Finally, the OSD tree shows that the two OSDs are back under ceph2 and ceph3.
[root@ceph3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       0.24239 root default
-9             0     host centos7evcloud
-3       0.08080     host ceph1
 0   hdd 0.08080         osd.0               up  1.00000 1.00000
-5       0.08080     host ceph2
 1   hdd 0.08080         osd.1               up  1.00000 1.00000
-7       0.08080     host ceph3
 2   hdd 0.08080         osd.2               up  1.00000 1.00000

The cluster status finally shows the long-awaited HEALTH_OK.
[root@ceph3 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:active}
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   8 pools, 288 pgs
    objects: 241 objects, 3.6 KiB
    usage:   3.1 GiB used, 245 GiB / 248 GiB avail
    pgs:     288 active+clean

6. Error when remounting CephFS after unmounting it
The mount command is:
mount -t ceph 10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/ /mnt/mycephfs/ -o name=admin,secret=AQBAI/JbROMoMRAAbgRshBRLLq953AVowLgJPw==
After unmounting CephFS, remounting it fails with: mount error(2): No such file or directory
Note: first check that the /mnt/mycephfs/ directory exists and is accessible. Mine did, yet the mount still failed with No such file or directory. To my surprise, restarting the OSD service fixed it, and CephFS mounted normally again.
[root@ceph1 ~]# systemctl restart ceph-osd.target
[root@ceph1 ~]# mount -t ceph 10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/ /mnt/mycephfs/ -o name=admin,secret=AQBAI/JbROMoMRAAbgRshBRLLq953AVowLgJPw==

The mount succeeded:
[root@ceph1 ~]# df -h
Filesystem                                            Size  Used Avail Use% Mounted on
/dev/vda2                                              48G  7.5G   41G  16% /
devtmpfs                                              1.9G     0  1.9G   0% /dev
tmpfs                                                 2.0G  8.0K  2.0G   1% /dev/shm
tmpfs                                                 2.0G   17M  2.0G   1% /run
tmpfs                                                 2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs                                                 2.0G   24K  2.0G   1% /var/lib/ceph/osd/ceph-0
tmpfs                                                 396M     0  396M   0% /run/user/0
10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/  249G  3.1G  246G   2% /mnt/mycephfs
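This is unrelated to the error itself, but the mount command above puts the admin key on the command line, where it ends up in shell history and the process list. mount.ceph also accepts a secretfile= option pointing to a file that contains only the key. The sketch below uses a hypothetical helper, keyring_key, to extract that key from a keyring file. ceph auth get-key client.admin is the built-in way to print the same value.

```shell
# keyring_key: hypothetical helper.
# Prints the "key = ..." value of one entity from a Ceph keyring file,
# e.g. /etc/ceph/ceph.client.admin.keyring.
# Usage: keyring_key KEYRING [ENTITY]   (ENTITY defaults to client.admin)
keyring_key() {
    awk -v want="[${2:-client.admin}]" '
        /^\[/ { in_section = ($1 == want) }          # entering a [section]
        in_section && $1 == "key" { print $3; exit } # "key = <value>"
    ' "$1"
}

# Example:
#   keyring_key /etc/ceph/ceph.client.admin.keyring > /etc/ceph/admin.secret
#   chmod 600 /etc/ceph/admin.secret
#   mount -t ceph 10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/ /mnt/mycephfs/ \
#       -o name=admin,secretfile=/etc/ceph/admin.secret
```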
(More to be added as I run into them...)
=========================================================================
Summary:
When the cluster status shows an error or warning, ceph health detail usually shows the handling suggestions Ceph provides. Following them resolves most of the problems a cluster runs into.
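To support the approach in this summary, you can list the health check codes from ceph health detail and look each one up in the Ceph health-check documentation. health_codes is a hypothetical helper. It assumes that codes are upper-case words at the start of a line, as in the outputs above. For scripting, JSON output (ceph health detail --format json) is likely more robust than parsing text.

```shell
# health_codes: hypothetical helper.
# Reads `ceph health detail` output on stdin and prints each health check
# code (an upper-case word such as PG_DEGRADED at the start of a line),
# skipping the overall HEALTH_OK/HEALTH_WARN/HEALTH_ERR summary line.
health_codes() {
    awk '$1 ~ /^[A-Z][A-Z0-9_]+$/ && $1 !~ /^HEALTH_/ { print $1 }'
}

# Example:
#   ceph health detail | health_codes
```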