|
|
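OSD 1 now knows that these objects exist, but no live OSD holds a copy of them. In this case, I/O to those objects blocks, and the cluster hopes that the failed OSD comes back soon. Blocking is assumed to be preferable to returning an I/O error to the user.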

Suggested fixes:
1. Start the stopped OSDs
First, identify which objects are unfound:

# ceph pg 1.d list_missing [starting offset, in json]

{ "offset": { "oid": "",
     "key": "",
     "snapid": 0,
     "hash": 0,
     "max": 0},
 "num_missing": 0,
 "num_unfound": 0,
 "objects": [
    { "oid": "object 1",
      "key": "",
      "hash": 0,
      "max": 0 },
    ...
 ],
 "more": 0}
If there are too many objects to list in a single query, the more field will be true and you can query for more.
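
The output described above can be checked with a short Python sketch. This is not part of the original post: `summarize_missing` is a hypothetical helper that takes the JSON text printed by the list_missing command and reports the counts, the object names, and whether another query is needed. The field names are taken from the sample output above; real output may nest these fields differently depending on the Ceph release.

```python
import json


def summarize_missing(raw_json):
    """Summarize the JSON printed by `ceph pg <pgid> list_missing`.

    Hypothetical helper; field names follow the sample output in this post.
    """
    data = json.loads(raw_json)
    return {
        "num_missing": data.get("num_missing", 0),
        "num_unfound": data.get("num_unfound", 0),
        # Each entry in "objects" carries an "oid" (a plain string in the sample).
        "oids": [obj.get("oid") for obj in data.get("objects", [])],
        # The sample shows "more": 0; any truthy value means "query again".
        "more": bool(data.get("more", 0)),
    }


# Sample modelled on the output above, with the "..." placeholder removed.
sample_missing = """
{ "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0},
  "num_missing": 0,
  "num_unfound": 0,
  "objects": [ { "oid": "object 1", "key": "", "hash": 0, "max": 0 } ],
  "more": 0}
"""
summary = summarize_missing(sample_missing)
print(summary)
```

When `more` comes back true, the natural next step would be to run the command again with a starting offset, as the optional argument in the command line above suggests.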
Second, you can identify which OSDs have been probed or might contain the data:

# ceph pg 1.d query

"recovery_state": [
     { "name": "Started\/Primary\/Active",
       "enter_time": "2022-08-01 15:15:46.713212",
       "might_have_unfound": [
             { "osd": 1,
               "status": "osd is down"}]},

So the OSD that is down here is osd.1; start it.
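
Reading the might_have_unfound list by eye works for one PG; for several, a small script may help. The following Python sketch is an illustration only: `osds_that_might_have_unfound` is a hypothetical helper, and the sample wraps the fragment shown above in an outer JSON object so that it parses on its own.

```python
import json


def osds_that_might_have_unfound(raw_json):
    """Return (osd_id, status) pairs from `ceph pg <pgid> query` output."""
    data = json.loads(raw_json)
    result = []
    for state in data.get("recovery_state", []):
        # Only some recovery states carry a "might_have_unfound" list.
        for entry in state.get("might_have_unfound", []):
            result.append((entry["osd"], entry["status"]))
    return result


# Raw string so the JSON escape "\/" reaches the parser unchanged.
sample_query = r"""
{ "recovery_state": [
    { "name": "Started\/Primary\/Active",
      "enter_time": "2022-08-01 15:15:46.713212",
      "might_have_unfound": [
        { "osd": 1,
          "status": "osd is down"}]}]}
"""
candidates = osds_that_might_have_unfound(sample_query)
# OSDs reported as down are the ones worth trying to start first.
down_osds = [osd for osd, status in candidates if status == "osd is down"]
print(candidates)
print(down_osds)
```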

2. If recovery is still not possible, you may have no choice but to give up on the lost objects. Run the following command to roll back or delete them:

ceph pg {pgname} mark_unfound_lost revert|delete

revert option: roll back to the previous version of the object
delete option: remove the object entirely
Use this operation with care, because it may confuse applications that expect the object to exist.
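
Because this operation discards data, it can be worth validating the arguments before anything is run. A minimal Python sketch, assuming PG names of the form pool.hex as in this post (1.d, 3.39); `mark_unfound_lost_cmd` is a hypothetical helper that only builds the argument list and does not execute it.

```python
import re

VALID_MODES = ("revert", "delete")
# PG ids in this post look like "<pool>.<hex>", e.g. 1.d or 3.39.
PGID_RE = re.compile(r"\d+\.[0-9a-f]+")


def mark_unfound_lost_cmd(pgid, mode):
    """Build the argv for `ceph pg {pgname} mark_unfound_lost revert|delete`."""
    if not PGID_RE.fullmatch(pgid):
        raise ValueError(f"unexpected PG id: {pgid!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["ceph", "pg", pgid, "mark_unfound_lost", mode]


# Print the command for review instead of running it.
print(" ".join(mark_unfound_lost_cmd("1.d", "revert")))
```

On a host with the ceph CLI, the returned list could be passed to `subprocess.run`; the sketch deliberately stops at printing so that the command is reviewed first.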
List the missing objects of a PG:

ceph pg {pgname} list_missing

Example:

[root@node2 ~]# ceph health detail | grep unfound
HEALTH_ERR 50 pgs backfill_wait; 3 pgs backfilling; 60 pgs degraded; 1 pgs inconsistent; 1 pgs recovering; 19 pgs recovery_wait; 60 pgs stuck degraded; 73 pgs stuck unclean; 41 pgs stuck undersized; 41 pgs undersized; recovery 126807/1654284 objects degraded (7.665%); recovery 186892/1654284 objects misplaced (11.297%); recovery 1/551428 unfound (0.000%); 1 scrub errors
pg 3.39 is active+recovering+undersized+degraded+remapped, acting [14,2], 1 unfound
recovery 1/551428 unfound (0.000%)

[root@node2 ~]# ceph pg 1.d mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking
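
The per-PG line in the health output above can also be picked out programmatically. The Python sketch below is an assumption-laden illustration: `pgs_with_unfound` is a hypothetical helper, and the regular expression is derived only from the single "pg 3.39 is ..." line shown in this example, so other Ceph releases may format the line differently.

```python
import re

# Matches lines such as:
#   pg 3.39 is active+recovering+..., acting [14,2], 1 unfound
PG_UNFOUND_RE = re.compile(
    r"^pg (?P<pgid>\S+) is (?P<state>\S+), "
    r"acting \[(?P<acting>[\d,]*)\], (?P<count>\d+) unfound$"
)


def pgs_with_unfound(health_detail_text):
    """Return (pgid, unfound_count, acting_osds) for each matching line."""
    found = []
    for line in health_detail_text.splitlines():
        m = PG_UNFOUND_RE.match(line.strip())
        if m:
            acting = [int(x) for x in m.group("acting").split(",") if x]
            found.append((m.group("pgid"), int(m.group("count")), acting))
    return found


sample_health = """\
pg 3.39 is active+recovering+undersized+degraded+remapped, acting [14,2], 1 unfound
recovery 1/551428 unfound (0.000%)
"""
unfound_pgs = pgs_with_unfound(sample_health)
print(unfound_pgs)
```

The summary line beginning with "recovery" does not match the pattern, so only PGs that report unfound objects are returned.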
|
|