Degraded data redundancy: 1 pg undersized (abnormal ceph status)

Posted on 2021-6-9 15:00:16
[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_WARN
            Degraded data redundancy: 1 pg undersized

  services:
    mon: 1 daemons, quorum controller (age 87m)
    mgr: controller.horbtx(active, since 87m)
    osd: 6 osds: 6 up (since 6m), 6 in (since 6m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

Resolution:

[root@controller ~]# vim /etc/ceph/ceph.conf

  osd_class_update_on_start = false
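
The edit above assumes the OSD daemons read /etc/ceph/ceph.conf on this host. As a possible alternative, releases with the centralized configuration database (Mimic and later) can store the same option with `ceph config set`. The sketch below wraps that call and reads the value back. The helper name `set_osd_option` and the `CEPH` override variable are assumptions added for illustration and are not part of the original post.

```shell
# CEPH can be pointed at a stub for dry runs; it defaults to the real CLI.
CEPH=${CEPH:-ceph}

# set_osd_option NAME VALUE  (hypothetical helper)
# Stores NAME=VALUE for all OSDs in the monitor configuration database,
# then prints the value read back so the change can be confirmed.
set_osd_option() {
    local name="$1" value="$2"
    $CEPH config set osd "$name" "$value" || return 1
    $CEPH config get osd "$name"
}

# Usage on a real cluster:
#   set_osd_option osd_class_update_on_start false
```

As the option name suggests, it is evaluated when an OSD starts, so restarting the OSDs is still needed either way.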

[root@controller ~]# ceph health detail
HEALTH_WARN Degraded data redundancy: 1 pg undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    pg 1.0 is stuck undersized for 86m, current state active+undersized+remapped, last acting [1,0]

After changing the configuration, the OSD services must be restarted:

[root@controller ~]# systemctl restart ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.0.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.3.service
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.1.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.4.service
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.2.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.5.service
[root@controller ~]# systemctl restart ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.*
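
Restarting all six units with one glob stops them at about the same time; the status taken right after the restart in this transcript reports 4 osds down. A gentler option is to restart the OSDs one at a time. The sketch below only builds the unit names in the form shown in the listing above. The helper name `osd_units` is hypothetical, and the 10-second pause in the usage comment is an arbitrary assumption.

```shell
# osd_units FSID ID...  (hypothetical helper)
# Prints one systemd unit name per OSD id, in the
# ceph-<fsid>@osd.<id>.service form used on this host.
osd_units() {
    local fsid="$1" id
    shift
    for id in "$@"; do
        printf 'ceph-%s@osd.%s.service\n' "$fsid" "$id"
    done
}

# Usage on the host (one OSD at a time, pausing between restarts):
#   for unit in $(osd_units a4bb5236-c8ca-11eb-a67b-000c29ad02de 0 1 2 3 4 5); do
#       systemctl restart "$unit"
#       sleep 10
#   done
```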
[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_WARN
            4 osds down
            Degraded data redundancy: 1 pg undersized

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 88m)
    osd: 6 osds: 2 up (since 0.641904s), 6 in (since 8m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 stale+active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 6s), 6 in (since 8m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 8s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 9s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 10s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph health detail
HEALTH_OK
[root@controller ~]# ceph health detail
HEALTH_OK
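
The transcript above reruns `ceph -s` by hand until the warning clears. A small polling loop is one way to automate that wait. This is a minimal sketch: the helper name `wait_for_health_ok`, the `HEALTH_CMD` override variable, and the default of 30 polls two seconds apart are all assumptions made for illustration.

```shell
# Command that prints the one-line health summary; override it for testing.
HEALTH_CMD=${HEALTH_CMD:-"ceph health"}

# wait_for_health_ok [TRIES] [DELAY_SECONDS]  (hypothetical helper)
# Polls the health summary until it starts with HEALTH_OK.
# Returns 0 on success, 1 if TRIES polls pass without HEALTH_OK.
wait_for_health_ok() {
    local tries="${1:-30}" delay="${2:-2}" i=1 status
    while [ "$i" -le "$tries" ]; do
        status=$($HEALTH_CMD)
        case $status in
            HEALTH_OK*) echo "healthy after $i poll(s)"; return 0 ;;
        esac
        echo "poll $i: $status"
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage on a real cluster:
#   wait_for_health_ok 30 2
```

In the outputs above the `pgs:` line still reads active+undersized+remapped while health is already HEALTH_OK, so it may be worth checking the PG state separately (for example with `ceph pg stat`) before treating the problem as solved.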

Original poster | Posted on 2021-6-9 15:00:17
3.1.1 Description
Degraded: as described earlier, each PG has three replicas, stored on different OSDs. When nothing has failed, the PG is in the active+clean state. If an OSD that holds one of the PG's replicas goes down, for example osd.4, the PG becomes degraded.
3.1.2 Fault simulation
a. Stop osd.1
$ systemctl stop ceph-osd@1
b. Check the PG status
$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)
c. Check the cluster health
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is active+undersized+degraded, acting [0,2]
    pg 1.1 is active+undersized+degraded, acting [2,0]
d. Client I/O
# Write an object
$ bin/rados -p test_pool put myobject ceph.conf

# Read the object into a file
$ bin/rados -p test_pool get myobject ceph.conf.old

# List the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul 3 19:57 ceph.conf.old
Fault summary:
To simulate the fault (with size = 3, min_size = 2), we manually stopped osd.1 and then checked the PG status, which is now active+undersized+degraded. When an OSD that holds a PG goes down, the PG enters the undersized+degraded state. The trailing [0,2] means two replicas are still alive, on osd.0 and osd.2, and clients can still read and write normally at this point.
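
The relationship between surviving replicas, size, and min_size can be written down as a small decision function. This is a simplified model rather than Ceph's actual state machine: the helper name `pg_state` is hypothetical, the PG is assumed to hold objects (an empty PG can be undersized without being degraded), and the last label is invented for this sketch. The third branch corresponds to the Peered case in section 3.2.

```shell
# pg_state LIVE SIZE MIN_SIZE  (hypothetical helper, simplified model)
# Prints the PG state expected for LIVE surviving replicas in a pool
# with the given size and min_size, assuming the PG holds objects.
pg_state() {
    local live="$1" size="$2" min_size="$3"
    if [ "$live" -ge "$size" ]; then
        echo "active+clean"
    elif [ "$live" -ge "$min_size" ]; then
        echo "active+undersized+degraded"    # still serves client I/O
    elif [ "$live" -ge 1 ]; then
        echo "undersized+degraded+peered"    # client I/O blocks
    else
        echo "no surviving replica"          # label chosen for this sketch
    fi
}

pg_state 3 3 2   # all replicas up
pg_state 2 3 2   # osd.1 stopped, as in the simulation above
```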
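3.1.3 Summary
Degraded means that after a failure, such as an OSD going down, Ceph marks all PGs on that OSD as Degraded.
A degraded cluster can still read and write data normally. A degraded PG is a minor issue, not a serious problem.
Undersized means that the number of surviving PG replicas (2 here) is lower than the configured replica count (3). The PG is marked this way to show that it has too few surviving replicas; this is not a serious problem either.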
3.2 Peered
3.2.1 Description
Peering has completed, but the PG's current Acting Set is smaller than the minimum replica count (min_size) configured for the pool.
3.2.2 Fault simulation
a. Stop two replicas, osd.1 and osd.0
$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0

b. Check the cluster health

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]
    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]
    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]
    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is undersized+degraded+peered, acting [2]
    pg 1.1 is undersized+degraded+peered, acting [2]
c. Client I/O (hangs)

# Read the object into a file; the I/O hangs
$ bin/rados -p test_pool get myobject ceph.conf.bak
Fault summary:

The PG now survives only on osd.2, and it has gained one more state: peered. The literal English meaning of "peer" is to look closely; here it is better understood as negotiation: the OSDs have agreed on the PG's state, but the PG cannot become active.
Reading the file at this point hangs indefinitely. The read cannot complete because min_size=2: when fewer than 2 replicas are alive (only 1 here), the PG does not serve external I/O requests.
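
The hang follows from the acting set ([2], a single OSD) being smaller than min_size (2). A small filter over the `ceph health detail` text makes that comparison easier to see. This is a sketch that assumes the line format shown in the output above; the helper name `acting_sizes` is hypothetical.

```shell
# acting_sizes  (hypothetical helper)
# Reads `ceph health detail` text on stdin and prints, for every PG line
# that reports an acting set, the PG id and the number of OSDs in that set.
acting_sizes() {
    awk '$1 == "pg" && /acting \[/ {
        set = $0
        sub(/.*acting \[/, "", set)    # drop everything up to the opening bracket
        sub(/\].*/, "", set)           # drop the closing bracket and the rest
        n = (set == "") ? 0 : split(set, osds, ",")
        print $2, n
    }'
}

# Usage on a real cluster:
#   bin/ceph health detail | acting_sizes
#   bin/ceph osd pool get test_pool min_size
```

Any PG whose count is below the pool's min_size is expected to stay inactive until another replica comes back or min_size is lowered.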
d. Setting min_size=1 resolves the hung I/O

# Set min_size = 1
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1
e. Check the cluster health

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
    pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
    pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
    pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]
f. Client I/O

# Files read back from the object
$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.bak
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.bak.1
Fault summary:

The Peered state is gone from the PG status, and client file I/O works normally again.
With min_size=1, the pool can serve external I/O requests as long as one replica is still alive.
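
With min_size=1, writes are accepted while only one copy exists, so losing that last OSD can lose data; the setting is usually treated as a temporary recovery measure. One way to keep it temporary is to restore the previous value right after the operation that needed it. The sketch below does that. The helper name `with_min_size` and the `CEPH` override variable are assumptions, and it relies on `ceph osd pool get POOL min_size` printing a line of the form `min_size: 2`.

```shell
# CEPH can be pointed at a stub for dry runs; it defaults to the real CLI.
CEPH=${CEPH:-ceph}

# with_min_size POOL TEMP_MIN_SIZE COMMAND...  (hypothetical helper)
# Lowers the pool's min_size, runs COMMAND, then restores the old value.
# Returns COMMAND's exit status.
with_min_size() {
    local pool="$1" temp="$2" old rc
    shift 2
    old=$($CEPH osd pool get "$pool" min_size | awk '{print $2}')
    [ -n "$old" ] || return 1
    $CEPH osd pool set "$pool" min_size "$temp" || return 1
    "$@"
    rc=$?
    $CEPH osd pool set "$pool" min_size "$old"
    return "$rc"
}

# Usage, mirroring the read in step c:
#   CEPH=bin/ceph with_min_size test_pool 1 bin/rados -p test_pool get myobject ceph.conf.bak
```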