Notes on Fixing Ceph Cluster Errors

Posted on 2021-7-26 14:43:14
0. Current Ceph and CentOS versions:
[root@ceph1 ceph]# ceph -v
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
[root@ceph1 ceph]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

1. Configuration file content differs between nodes
Running ceph-deploy mon create-initial to gather the keys generates several key files in the current directory (in my case ~/etc/ceph/), but it fails with the error below. The error means that the configuration file on the two nodes that failed differs from the one on the current node, and it suggests using the --overwrite-conf option to overwrite the mismatched files.
[root@ceph1 ceph]# ceph-deploy mon create-initial
...
[ceph2][DEBUG ] remote hostname: ceph2
[ceph2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.mon][ERROR ] RuntimeError: config file /etc/ceph/ceph.conf exists with different content; use --overwrite-conf to overwrite
[ceph_deploy][ERROR ] GenericError: Failed to create 2 monitors
...

Run the following command (I configured three nodes here, ceph1 to ceph3):
[root@ceph1 ceph]# ceph-deploy --overwrite-conf mon create ceph{3,1,2}
...
[ceph2][DEBUG ] remote hostname: ceph2
[ceph2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph2][DEBUG ] create the mon path if it does not exist
[ceph2][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-ceph2/done
...

After that the configuration succeeds, and you can go on to initialize the disks.
2. too few PGs per OSD (21 < min 30) warning
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     8e2248e4-3bb0-4b62-ba93-f597b1a3bd40
    health: HEALTH_WARN
            too few PGs per OSD (21 < min 30)

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph2(active), standbys: ceph1, ceph3
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   4 pools, 32 pgs
    objects: 219  objects, 1.1 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     32 active+clean

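The cluster status above shows that each OSD holds 21 PGs, fewer than the minimum of 30. There are 32 PGs in total, and because I had configured 2 replicas, with 3 OSDs each OSD gets about 32 × 2 ÷ 3 ≈ 21 PGs, which is below the configured minimum of 30 and produces the warning above.
If data is stored and operated on while the cluster is in this state, the cluster can hang and stop responding to I/O, and large numbers of OSDs can go down.
Solution: increase the number of PGs.
Each of my pools has 8 PGs, so I need to add two more pools to bring the PG count per OSD to 48 × 2 ÷ 3 = 32, above the minimum of 30.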
[root@ceph1 ceph]# ceph osd pool create mytest 8
pool 'mytest' created
[root@ceph1 ceph]# ceph osd pool create mytest1 8
pool 'mytest1' created
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     8e2248e4-3bb0-4b62-ba93-f597b1a3bd40
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph2(active), standbys: ceph1, ceph3
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   6 pools, 48 pgs
    objects: 219  objects, 1.1 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     48 active+clean

The cluster health now shows as normal.
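The arithmetic behind this warning can be sketched in shell. This is only a rough sketch: it assumes every pool uses the same replica count and that PGs spread evenly across OSDs, and the function name pgs_per_osd is hypothetical, not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): estimate PG replicas per OSD.
# Assumes all pools share one replica count and PGs spread evenly over OSDs.
# Usage: pgs_per_osd TOTAL_PGS REPLICAS OSDS
pgs_per_osd() {
    # Shell integer division rounds down, matching 32 * 2 / 3 = 21.
    echo $(( $1 * $2 / $3 ))
}

pgs_per_osd 32 2 3   # 4 pools x 8 PGs, 2 replicas, 3 OSDs
pgs_per_osd 48 2 3   # after adding two more 8-PG pools
```

Raising pg_num on an existing pool is another way to reach the threshold: ceph osd pool set <pool-name> pg_num <n>, then the same for pgp_num. Whether that suits a given cluster is a sizing decision these notes do not cover.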
3. HEALTH_WARN: application not enabled on 1 pool(s)
If the cluster status at this point is HEALTH_WARN application not enabled on 1 pool(s):
[root@ceph1 ceph]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

[root@ceph1 ceph]# ceph health detail
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
    application not enabled on pool 'mytest'
    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.

Running ceph health detail shows that the newly added pool mytest has not been tagged with an application. Since an RGW instance was added earlier, tag mytest with rgw as the hint suggests:
[root@ceph1 ceph]# ceph osd pool application enable mytest rgw
enabled application 'rgw' on pool 'mytest'

Checking the cluster status again shows that it is back to normal:
[root@ceph1 ceph]# ceph health
HEALTH_OK
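When several pools are affected, the pool names can be pulled out of the health detail text instead of being read by eye. The following is a minimal sketch; the helper name untagged_pools is hypothetical, and it assumes the message wording shown above (other Ceph releases may phrase it differently).

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` text on stdin and print the
# name of every pool reported as having no application enabled.
untagged_pools() {
    sed -n "s/^ *application not enabled on pool '\(.*\)'\$/\1/p"
}

# Sample input copied from the output above; on a live cluster one would
# pipe in the real command instead: ceph health detail | untagged_pools
sample="POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
    application not enabled on pool 'mytest'"

printf '%s\n' "$sample" | untagged_pools
```

Each printed name can then be passed to ceph osd pool application enable with the application (cephfs, rbd, rgw, or a custom name) that matches how the pool is actually used.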

4. Error when deleting a pool
Taking deletion of the mytest pool as an example, running ceph osd pool rm mytest fails. The message says to write the pool name a second time after the first and then append the --yes-i-really-really-mean-it option.
[root@ceph1 ceph]# ceph osd pool rm mytest
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool mytest.  If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*, followed by --yes-i-really-really-mean-it.

Repeating the pool name and adding the option as instructed still produces an error:
[root@ceph1 ceph]# ceph osd pool rm mytest mytest --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool

The error message says that pool deletion is disabled: before deleting, first add the mon_allow_pool_delete option to the ceph.conf configuration file and set it to true. So log in to each node in turn and edit its configuration file, as follows:
[root@ceph1 ceph]# vi ceph.conf
[root@ceph1 ceph]# systemctl restart ceph-mon.target

Add the following parameter at the bottom of ceph.conf and set it to true; after saving and exiting, restart the service with systemctl restart ceph-mon.target.
[mon]
mon allow pool delete = true

Do the same on the remaining nodes.
[root@ceph2 ceph]# vi ceph.conf
[root@ceph2 ceph]# systemctl restart ceph-mon.target
[root@ceph3 ceph]# vi ceph.conf
[root@ceph3 ceph]# systemctl restart ceph-mon.target

Deleting again now succeeds in removing the mytest pool.
[root@ceph1 ceph]# ceph osd pool rm mytest mytest --yes-i-really-really-mean-it
pool 'mytest' removed
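Editing ceph.conf by hand on every node is easy to get slightly different each time, so the edit can be wrapped in a small helper. This is a sketch under assumptions: the function name allow_pool_delete is hypothetical, it appends a new [mon] section at the bottom just as described above (a file that already has a [mon] section may be better edited by hand), and it leaves any existing value of the option untouched.

```shell
#!/bin/sh
# Hypothetical helper: add the setting to a ceph.conf-style file only if it is
# not already there, so running it twice does not duplicate the lines.
# An existing line for this option (even one set to false) is left unchanged.
# Usage: allow_pool_delete /path/to/ceph.conf
allow_pool_delete() {
    conf=$1
    # Ceph accepts both "mon allow pool delete" and "mon_allow_pool_delete".
    if grep -Eq '^[[:space:]]*mon[ _]allow[ _]pool[ _]delete' "$conf"; then
        return 0
    fi
    printf '[mon]\nmon allow pool delete = true\n' >> "$conf"
}

# Demonstration on a scratch file rather than the real /etc/ceph/ceph.conf.
# The fsid value is a placeholder, not a real cluster id.
demo=$(mktemp)
printf '[global]\nfsid = example\n' > "$demo"
allow_pool_delete "$demo"
allow_pool_delete "$demo"   # second call is a no-op
cat "$demo"
```

The monitors still need the restart shown above to pick up the file change. A runtime alternative that is commonly used, though not tried in these notes, is ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'.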

5. Troubleshooting node recovery after cluster nodes go down
After I shut down and restarted each of the three nodes in the Ceph cluster, the cluster status was as follows:
[root@ceph1 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            324/702 objects misplaced (46.154%)
            Reduced data availability: 126 pgs inactive
            Degraded data redundancy: 144/702 objects degraded (20.513%), 3 pgs degraded, 126 pgs undersized

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:creating}
    osd: 3 osds: 3 up, 3 in; 162 remapped pgs

  data:
    pools:   8 pools, 288 pgs
    objects: 234  objects, 2.8 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     43.750% pgs not active
             144/702 objects degraded (20.513%)
             324/702 objects misplaced (46.154%)
             162 active+clean+remapped
             123 undersized+peered
             3   undersized+degraded+peered

Check the details:
[root@ceph1 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; 324/702 objects misplaced (46.154%); Reduced data availability: 126 pgs inactive; Degraded data redundancy: 144/702 objects degraded (20.513%), 3 pgs degraded, 126 pgs undersized
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph1(mds.0): 9 slow metadata IOs are blocked > 30 secs, oldest blocked for 42075 secs
OBJECT_MISPLACED 324/702 objects misplaced (46.154%)
PG_AVAILABILITY Reduced data availability: 126 pgs inactive
    pg 8.28 is stuck inactive for 42240.369934, current state undersized+peered, last acting [0]
    pg 8.2a is stuck inactive for 45566.934835, current state undersized+peered, last acting [0]
    pg 8.2d is stuck inactive for 42240.371314, current state undersized+peered, last acting [0]
    pg 8.2f is stuck inactive for 45566.913284, current state undersized+peered, last acting [0]
    pg 8.32 is stuck inactive for 42240.354304, current state undersized+peered, last acting [0]
    ....
    pg 8.28 is stuck undersized for 42065.616897, current state undersized+peered, last acting [0]
    pg 8.2a is stuck undersized for 42065.613246, current state undersized+peered, last acting [0]
    pg 8.2d is stuck undersized for 42065.951760, current state undersized+peered, last acting [0]
    pg 8.2f is stuck undersized for 42065.610464, current state undersized+peered, last acting [0]
    pg 8.32 is stuck undersized for 42065.959081, current state undersized+peered, last acting [0]
    ....

During data recovery, PGs in the inactive and undersized states appeared, which is abnormal.
Solution:
① Handle the inactive PGs:
Restarting the OSD service is enough.
[root@ceph1 ~]# systemctl restart ceph-osd.target
Checking the cluster status again shows that the inactive PGs have recovered, leaving only the undersized PGs.
[root@ceph1 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_WARN
            1 filesystem is degraded
            241/723 objects misplaced (33.333%)
            Degraded data redundancy: 59 pgs undersized

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:rejoin}
    osd: 3 osds: 3 up, 3 in; 229 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   8 pools, 288 pgs
    objects: 241  objects, 3.4 KiB
    usage:   3.0 GiB used, 245 GiB / 248 GiB avail
    pgs:     241/723 objects misplaced (33.333%)
             224 active+clean+remapped
             59  active+undersized
             5   active+clean

  io:
    client:   1.2 KiB/s rd, 1 op/s rd, 0 op/s wr

② Handle the undersized PGs:

When something goes wrong, make a habit of checking the health details first. A careful look shows that although the configured replica count is 3, each PG 12.x has only two copies, stored on two of the OSDs 0 to 2.
[root@ceph1 ~]# ceph health detail
HEALTH_WARN 241/723 objects misplaced (33.333%); Degraded data redundancy: 59 pgs undersized
OBJECT_MISPLACED 241/723 objects misplaced (33.333%)
PG_DEGRADED Degraded data redundancy: 59 pgs undersized
    pg 12.8 is stuck undersized for 1910.001993, current state active+undersized, last acting [2,0]
    pg 12.9 is stuck undersized for 1909.989334, current state active+undersized, last acting [2,0]
    pg 12.a is stuck undersized for 1909.995807, current state active+undersized, last acting [0,2]
    pg 12.b is stuck undersized for 1910.009596, current state active+undersized, last acting [1,0]
    pg 12.c is stuck undersized for 1910.010185, current state active+undersized, last acting [0,2]
    pg 12.d is stuck undersized for 1910.001526, current state active+undersized, last acting [1,0]
    pg 12.e is stuck undersized for 1909.984982, current state active+undersized, last acting [2,0]
    pg 12.f is stuck undersized for 1910.010640, current state active+undersized, last acting [2,0]

Looking further at the cluster's OSD tree shows that after ceph2 and ceph3 went down and came back, the osd.1 and osd.2 processes are no longer under ceph2 and ceph3.
[root@ceph1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       0.24239 root default
-9       0.16159     host centos7evcloud
 1   hdd 0.08080         osd.1               up  1.00000 1.00000
 2   hdd 0.08080         osd.2               up  1.00000 1.00000
-3       0.08080     host ceph1
 0   hdd 0.08080         osd.0               up  1.00000 1.00000
-5             0     host ceph2
-7             0     host ceph3

Check the status of the osd.1 and osd.2 services individually.

Solution:
Log in to ceph2 and ceph3 and restart the osd.1 and osd.2 services there, so that the two OSDs are mapped back under the ceph2 and ceph3 hosts.
[root@ceph1 ~]# ssh ceph2
[root@ceph2 ~]# systemctl restart ceph-osd@1.service
[root@ceph2 ~]# ssh ceph3
[root@ceph3 ~]# systemctl restart ceph-osd@2.service

Finally, the OSD tree shows that the two services are mapped back under ceph2 and ceph3.
[root@ceph3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       0.24239 root default
-9             0     host centos7evcloud
-3       0.08080     host ceph1
 0   hdd 0.08080         osd.0               up  1.00000 1.00000
-5       0.08080     host ceph2
 1   hdd 0.08080         osd.1               up  1.00000 1.00000
-7       0.08080     host ceph3
 2   hdd 0.08080         osd.2               up  1.00000 1.00000

The cluster status also shows the long-awaited HEALTH_OK.
[root@ceph3 ~]# ceph -s
  cluster:
    id:     13430f9a-ce0d-4d17-a215-272890f47f28
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph2,ceph1,ceph3
    mgr: ceph1(active), standbys: ceph2, ceph3
    mds: cephfs-1/1/1 up  {0=ceph1=up:active}
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   8 pools, 288 pgs
    objects: 241  objects, 3.6 KiB
    usage:   3.1 GiB used, 245 GiB / 248 GiB avail
    pgs:     288 active+clean
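Spotting an OSD under the wrong host in the tree output gets harder as the cluster grows, so the pairing can be extracted mechanically. The sketch below uses a hypothetical helper name, osd_hosts, and assumes the column layout shown in this section, with every OSD row carrying a device class. A possible explanation for the move, which these notes do not confirm, is that an OSD by default updates its CRUSH location from the node's hostname when it starts (the osd crush update on start option), so a node that boots under a different hostname would register its OSDs under a new host bucket.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd tree` text on stdin and print each OSD
# together with the host bucket it currently sits under.
osd_hosts() {
    awk '$3 == "host" { host = $4 }          # remember the latest host bucket
         $4 ~ /^osd\./ { print $4, host }'   # OSD rows: ID CLASS WEIGHT NAME ...
}

# Sample copied from the first tree output in this section; on a live cluster:
#   ceph osd tree | osd_hosts
sample="ID CLASS WEIGHT  TYPE NAME               STATUS REWEIGHT PRI-AFF
-1       0.24239 root default
-9       0.16159     host centos7evcloud
 1   hdd 0.08080         osd.1               up  1.00000 1.00000
 2   hdd 0.08080         osd.2               up  1.00000 1.00000
-3       0.08080     host ceph1
 0   hdd 0.08080         osd.0               up  1.00000 1.00000
-5             0     host ceph2
-7             0     host ceph3"

printf '%s\n' "$sample" | osd_hosts
```

For this sample the helper pairs osd.1 and osd.2 with centos7evcloud and osd.0 with ceph1, which is the misplacement described above.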

6. Error when remounting CephFS after unmounting
The mount command is:
mount -t ceph 10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/ /mnt/mycephfs/ -o name=admin,secret=AQBAI/JbROMoMRAAbgRshBRLLq953AVowLgJPw==

Remounting CephFS after unmounting it fails with: mount error(2): No such file or directory
Note: first check that the /mnt/mycephfs/ directory exists and is accessible. Mine did exist, yet No such file or directory was still reported. Restarting the OSD service unexpectedly fixed it, and CephFS could then be mounted normally.
[root@ceph1 ~]# systemctl restart ceph-osd.target
[root@ceph1 ~]# mount -t ceph 10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/ /mnt/mycephfs/ -o name=admin,secret=AQBAI/JbROMoMRAAbgRshBRLLq953AVowLgJPw==

The mount succeeded!
[root@ceph1 ~]# df -h
Filesystem                                            Size  Used Avail Use% Mounted on
/dev/vda2                                              48G  7.5G   41G  16% /
devtmpfs                                              1.9G     0  1.9G   0% /dev
tmpfs                                                 2.0G  8.0K  2.0G   1% /dev/shm
tmpfs                                                 2.0G   17M  2.0G   1% /run
tmpfs                                                 2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs                                                 2.0G   24K  2.0G   1% /var/lib/ceph/osd/ceph-0
tmpfs                                                 396M     0  396M   0% /run/user/0
10.0.86.246:6789,10.0.86.221:6789,10.0.86.253:6789:/  249G  3.1G  246G   2% /mnt/mycephfs
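These notes do not pin down why the remount failed. One thing that may be worth checking before retrying (an assumption, not something verified here) is whether the MDS has reached up:active, since section 5 shows it sitting in creating and rejoin while the PGs were unhealthy. A minimal sketch, with the hypothetical helper name mds_active:

```shell
#!/bin/sh
# Hypothetical helper: read the mds line of `ceph -s` (or `ceph mds stat`)
# on stdin and succeed only if an MDS rank is reported as up:active.
mds_active() {
    grep -q 'up:active'
}

# The two samples are copied from the status outputs in section 5.
echo 'mds: cephfs-1/1/1 up  {0=ceph1=up:creating}' | mds_active \
    && echo "active" || echo "not active yet"
echo 'mds: cephfs-1/1/1 up  {0=ceph1=up:active}' | mds_active \
    && echo "active" || echo "not active yet"
```

Separately, mount -t ceph also accepts secretfile=<path> in place of secret=<key>, which keeps the key out of the shell history and process list.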

More to be added as I go...
=========================================================================
Summary:
When the cluster status shows an error or warning, the ceph health detail command usually shows the handling advice given by the system. Following that advice generally resolves most problems a cluster runs into.
