
Degraded data redundancy: 1 pg undersized (abnormal ceph status)

Posted by an administrator on 2021-6-9 15:00:16
[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_WARN
            Degraded data redundancy: 1 pg undersized

  services:
    mon: 1 daemons, quorum controller (age 87m)
    mgr: controller.horbtx(active, since 87m)
    osd: 6 osds: 6 up (since 6m), 6 in (since 6m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

Resolution:

[root@controller ~]# vim /etc/ceph/ceph.conf

  osd_class_update_on_start = false

[root@controller ~]# ceph health detail
HEALTH_WARN Degraded data redundancy: 1 pg undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    pg 1.0 is stuck undersized for 86m, current state active+undersized+remapped, last acting [1,0]

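The PG line in the `ceph health detail` output packs three things into one string: the PG id, its state flags joined by `+`, and the acting set. Below is a minimal Python sketch for pulling them apart. The function name `parse_pg_line` and the regular expression are illustrative assumptions based only on the line formats quoted in this thread, not on a documented output grammar.

```python
import re

# Matches both formats that appear in this thread:
#   pg 1.0 is stuck undersized for 86m, current state active+undersized+remapped, last acting [1,0]
#   pg 1.0 is active+undersized+degraded, acting [0,2]
_PG_LINE = re.compile(r"pg (\S+) is (.+), (?:last )?acting \[([\d,]*)\]")


def parse_pg_line(line):
    """Return the PG id, state flags and acting set from one health-detail line."""
    match = _PG_LINE.search(line)
    if match is None:
        return None
    pgid, middle, acting = match.groups()
    # The state is always the last whitespace-separated token before ", acting".
    state = middle.split()[-1]
    return {
        "pgid": pgid,
        "states": state.split("+"),
        "acting": [int(osd) for osd in acting.split(",") if osd],
    }


line = "pg 1.0 is stuck undersized for 86m, current state active+undersized+remapped, last acting [1,0]"
info = parse_pg_line(line)
print(info["pgid"], info["states"], len(info["acting"]))
```

For the line above, the acting set holds two OSDs (osd.1 and osd.0). If the pool's replica count is 3 (an assumption, the transcript does not show the pool size), that is one replica short, which matches the `undersized` flag.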
After changing the configuration, restart the OSD services:

[root@controller ~]# systemctl restart ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.0.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.3.service
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.1.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.4.service
ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.2.service  ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.5.service
[root@controller ~]# systemctl restart ceph-a4bb5236-c8ca-11eb-a67b-000c29ad02de@osd.*
[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_WARN
            4 osds down
            Degraded data redundancy: 1 pg undersized

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 88m)
    osd: 6 osds: 2 up (since 0.641904s), 6 in (since 8m)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 stale+active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 6s), 6 in (since 8m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 8s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 9s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph -s
  cluster:
    id:     a4bb5236-c8ca-11eb-a67b-000c29ad02de
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller (age 89m)
    mgr: controller.horbtx(active, since 89m)
    osd: 6 osds: 6 up (since 10s), 6 in (since 9m); 1 remapped pgs

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 114 GiB / 120 GiB avail
    pgs:     1 active+undersized+remapped

[root@controller ~]# ceph health detail
HEALTH_OK
[root@controller ~]# ceph health detail
HEALTH_OK
[root@controller ~]#

Original poster, posted on 2021-6-9 15:00:17
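3.1.1 Description
Degraded: as described earlier, each PG has three replicas, each stored on a different OSD. When nothing has failed, the PG is in the active+clean state. If an OSD holding one of the PG's replicas (osd.4, for example) goes down, the PG becomes degraded.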
3.1.2 Fault simulation
a. Stop osd.1
$ systemctl stop ceph-osd@1
b. Check the PG status
$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)
c. Check the cluster health status
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
   osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
   pg 1.0 is active+undersized+degraded, acting [0,2]
   pg 1.1 is active+undersized+degraded, acting [2,0]
d. Client I/O
# Write an object
$ bin/rados -p test_pool put myobject ceph.conf

# Read the object into a file
$ bin/rados -p test_pool get myobject ceph.conf.old

# List the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul 3 19:57 ceph.conf.old
Fault summary:
To simulate the fault (with size = 3 and min_size = 2), we manually stopped osd.1 and then checked the PG status. The PG is now active+undersized+degraded: when an OSD that holds a PG goes down, the PG enters the undersized+degraded state. The trailing [0,2] means that two replicas are still alive, on osd.0 and osd.2, and clients can still read and write normally.
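The "12/36 objects degraded (33.333%)" figure can be checked by hand. The sketch below assumes that Ceph counts object copies here, so that a pool with 12 objects and size = 3 has 36 copies in total, and that each down OSD costs every object one copy. The object count of 12 is inferred from the output, not stated in the post, and the function name `degraded_copies` is hypothetical.

```python
def degraded_copies(num_objects, size, replicas_down):
    """Return (degraded copies, total copies, percentage degraded)."""
    total = num_objects * size              # every object is stored `size` times
    degraded = num_objects * replicas_down  # one copy lost per object per down replica
    percent = 100.0 * degraded / total
    return degraded, total, percent


degraded, total, percent = degraded_copies(num_objects=12, size=3, replicas_down=1)
print(f"{degraded}/{total} objects degraded ({percent:.3f}%)")
# prints: 12/36 objects degraded (33.333%)
```

Under the same assumptions, 13 objects with two replicas down gives 26/39 (66.667%), which is the figure that shows up in the health output of section 3.2.2.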
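3.1.3 Summary
Degraded means that after a failure, such as an OSD going down, Ceph marks all PGs on that OSD as Degraded.
A degraded cluster can still read and write data normally. A degraded PG is a minor ailment, not a serious problem.
Undersized means that the number of surviving PG replicas (2 here) is lower than the configured replica count (3). The flag indicates that there are too few live replicas, which is also not a serious problem.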
3.2 Peered
3.2.1 Description
Peering has completed, but the PG's current Acting Set is smaller than the minimum replica count (min_size) configured for the pool.
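Taken together with the Degraded and Undersized descriptions in 3.1, the definition above amounts to a small decision rule on three numbers. The following Python sketch is a simplified model for a replicated pool, not Ceph's actual state machine: the function name `pg_state` is hypothetical, and real PGs can carry many more flags (remapped, stale, and so on).

```python
def pg_state(size, min_size, alive):
    """Return a simplified PG state string.

    size     -- replica count configured for the pool
    min_size -- minimum number of live replicas required to serve I/O
    alive    -- number of OSDs currently in the acting set (1..size)
    """
    if not 1 <= alive <= size:
        raise ValueError("alive must be between 1 and size")
    if alive == size:
        return "active+clean"                # all replicas present
    if alive >= min_size:
        return "active+undersized+degraded"  # replicas missing, I/O still served
    return "undersized+degraded+peered"      # below min_size, PG is not active


for alive in (3, 2, 1):
    print(alive, pg_state(size=3, min_size=2, alive=alive))
```

With size = 3 and min_size = 2, two live replicas keep the PG active, while a single live replica leaves it peered without the active flag.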
3.2.2 Fault simulation
a. Stop two replicas, osd.1 and osd.0
$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0
2 Q/ z7 C' ^- W. o4 _7 B/ r/ Ab. 查看集群健康状态
7 Z. |% A& d1 U+ U# k# V
% i4 D# ^. I& p1 G9 C $ bin/ceph health detail 2 P8 G$ w) V7 i7 V
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s) 6 E. e; I! s$ b
OSD_DOWN 1 osds down     
' X: o1 ]3 \3 z' _    osd.0 (root=default,host=ceph-xx-cc00) is down ; t/ g3 p2 ~* j5 s2 a
PG_AVAILABILITY Reduced data availability: 4 pgs inactive     
$ n  H  B  ]8 v& ]2 K) k    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]     
: H6 I' A$ k6 G7 J    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]     
: M3 ?2 s  \( i$ {) a% K: ?% {/ y    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]     
2 m2 j/ D) m# C; p6 r+ \    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
4 J: V; ]+ R. b, n7 J! OPG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded     
& w& @$ g0 c& N    pg 1.0 is undersized+degraded+peered, acting [2]     
8 V$ a3 X; C- r    pg 1.1 is undersized+degraded+peered, acting [2] 9 W' m' Y- F3 |6 |0 |/ v3 x  C
c. Client I/O (hangs)

# Read the object into a file; the I/O hangs
$ bin/rados -p test_pool get myobject  ceph.conf.bak
Fault summary:

The PG now survives only on osd.2, and it has gained an additional state: peered. The English verb "peer" means to look closely; here it can be read as negotiating or searching.
Reading the file at this point makes the command hang indefinitely. The content cannot be read because min_size is set to 2: when fewer than 2 replicas are alive (here only 1), the PG does not serve external I/O requests.
d. Setting min_size=1 resolves the hung I/O

# Set min_size = 1
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1
e. Check the cluster health status

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
   osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
   pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
   pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
   pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]
f. Client I/O

# Read the object into a file
$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.bak
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.bak.1
Fault summary:

As shown, the Peered state is gone from the PGs, and client file I/O works normally again.
With min_size=1, the PG serves external I/O requests as long as at least one replica in the cluster is alive.
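The effect of min_size in both experiments comes down to one comparison: a PG keeps serving I/O while the number of live replicas is at least min_size. A rough Python sketch of that rule follows, using the size = 3 pool from this walkthrough. The helper names `tolerated_failures` and `serves_io` are hypothetical, and the model assumes each stopped OSD removes exactly one replica of the PG.

```python
def tolerated_failures(size, min_size):
    """Number of replicas that can be lost while I/O is still served."""
    return size - min_size


def serves_io(size, min_size, osds_down):
    """True if enough replicas remain alive to satisfy min_size."""
    return (size - osds_down) >= min_size


for min_size in (2, 1):
    row = [serves_io(3, min_size, down) for down in (0, 1, 2)]
    print(f"min_size={min_size}: tolerates {tolerated_failures(3, min_size)} "
          f"failure(s); I/O with 0/1/2 OSDs down: {row}")
```

With min_size = 2 the third entry is False, which corresponds to the hung read in step c. With min_size = 1 it becomes True, which corresponds to the successful read in step f.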