|
|
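OSD 1 now knows that these objects exist, but no live OSD holds a copy of them. In this case, I/O to those objects blocks while the cluster waits for the failed OSD to come back; blocking is assumed to be preferable to returning an I/O error to the user.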

Suggested fix:
1. Start the stopped OSD
First, identify which objects are unfound:

# ceph pg 1.d list_missing [starting offset, in json]

{ "offset": { "oid": "",
     "key": "",
     "snapid": 0,
     "hash": 0,
     "max": 0},
 "num_missing": 0,
 "num_unfound": 0,
 "objects": [
    { "oid": "object 1",
      "key": "",
      "hash": 0,
      "max": 0 },
    ...
 ],
 "more": 0}

If there are too many objects to list in a single result, the more field will be true and you can query for the rest.
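
To make the role of the more field concrete, here is a minimal Python sketch that summarizes a list_missing result. It assumes the flat JSON layout shown in the sample above; the function name summarize_missing and the next_offset idea (reusing the last listed object as the starting offset of a follow-up query) are illustrative assumptions, not something the text confirms.

```python
import json


def summarize_missing(raw: str) -> dict:
    """Summarize the JSON printed by `ceph pg <pgid> list_missing`.

    Hypothetical helper: it only parses text, it never calls ceph.
    """
    report = json.loads(raw)
    objects = report.get("objects", [])
    # The sample prints `more` as 0 while the prose says "true",
    # so treat any truthy value as "there are more results".
    more = bool(report.get("more"))
    return {
        "num_missing": report.get("num_missing", 0),
        "num_unfound": report.get("num_unfound", 0),
        "oids": [obj.get("oid") for obj in objects],
        "more": more,
        # Assumption: the last listed object can serve as the starting
        # offset (in JSON) of the next query. Verify on your release.
        "next_offset": json.dumps(objects[-1]) if more and objects else None,
    }
```

If more comes back true, the next_offset string is a candidate for the [starting offset, in json] argument of the next list_missing call; check the result against your Ceph release before relying on it.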
Second, find out which OSDs have been probed or might contain the data:

# ceph pg 1.d query

"recovery_state": [
     { "name": "Started\/Primary\/Active",
       "enter_time": "2022-08-01 15:15:46.713212",
       "might_have_unfound": [
         { "osd": 1,
           "status": "osd is down"}]},

So the stopped OSD is osd.1; start it so the unfound objects can be recovered.
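
The check above can be sketched in Python: scan the recovery_state section of the query output and collect every might_have_unfound entry. This is only a sketch; it assumes the output has been captured as a complete JSON document with a top-level recovery_state key, and the function names are hypothetical.

```python
import json


def osds_that_might_have_unfound(query_json: str) -> list:
    """Return (osd, status) pairs from `ceph pg <pgid> query` output."""
    report = json.loads(query_json)
    found = []
    for state in report.get("recovery_state", []):
        # Only some recovery states carry a might_have_unfound list.
        for entry in state.get("might_have_unfound", []):
            found.append((entry["osd"], entry["status"]))
    return found


def down_osds(query_json: str) -> list:
    """Return the OSDs whose status is reported as "osd is down"."""
    return [
        osd
        for osd, status in osds_that_might_have_unfound(query_json)
        if status == "osd is down"
    ]
```

For the excerpt shown above, down_osds would point at OSD 1, which is the daemon to bring back up.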

2. If the objects still cannot be recovered, you may have no choice but to give up on them. Run the following command to roll back or delete the lost objects:

ceph pg {pgname} mark_unfound_lost revert|delete

revert option: roll back to the previous version of the object
delete option: remove the object entirely
Use this operation with care, because it may confuse applications that expect the object to exist.
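
Because the operation is destructive, it can help to validate the arguments before anything is run. The following Python sketch only builds the argument list for the command shown above and rejects anything other than revert or delete; the helper name and the PG ID pattern (pool number, a dot, then a hexadecimal value, as in 1.d or 3.39) are assumptions made for illustration.

```python
import re

# Assumed PG ID shape: "<pool number>.<hex value>", e.g. "1.d" or "3.39".
PG_ID = re.compile(r"\d+\.[0-9a-f]+")
MODES = ("revert", "delete")


def mark_unfound_lost_argv(pgid: str, mode: str) -> list:
    """Build (but do not run) the mark_unfound_lost command line."""
    if not PG_ID.fullmatch(pgid):
        raise ValueError(f"unexpected PG ID: {pgid!r}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    return ["ceph", "pg", pgid, "mark_unfound_lost", mode]
```

The returned list can be handed to subprocess.run once an operator has confirmed that the objects really are lost.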
List the missing objects of a given PG:

ceph pg {pgname} list_missing
Example:

[root@node2 ~]# ceph health detail | grep unfound
HEALTH_ERR 50 pgs backfill_wait; 3 pgs backfilling; 60 pgs degraded; 1 pgs inconsistent; 1 pgs recovering; 19 pgs recovery_wait; 60 pgs stuck degraded; 73 pgs stuck unclean; 41 pgs stuck undersized; 41 pgs undersized; recovery 126807/1654284 objects degraded (7.665%); recovery 186892/1654284 objects misplaced (11.297%); recovery 1/551428 unfound (0.000%); 1 scrub errors
pg 3.39 is active+recovering+undersized+degraded+remapped, acting [14,2], 1 unfound
recovery 1/551428 unfound (0.000%)

[root@node2 ~]# ceph pg 1.d mark_unfound_lost delete
pg has 1 objects unfound and apparently lost marking
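
The health output in this example can also be scanned programmatically to find which PGs report unfound objects. The Python sketch below assumes the per-PG line format seen above ("pg <id> is <states>, acting [...], N unfound"); that format may differ between Ceph releases, and the function name is hypothetical.

```python
import re

# Matches lines such as:
#   pg 3.39 is active+recovering+undersized+degraded+remapped, acting [14,2], 1 unfound
UNFOUND_PG = re.compile(
    r"pg (\S+) is (\S+), acting \[([\d,]*)\], (\d+) unfound"
)


def pgs_with_unfound(health_detail: str) -> list:
    """Extract PGs that report unfound objects from `ceph health detail` text."""
    results = []
    for line in health_detail.splitlines():
        match = UNFOUND_PG.fullmatch(line.strip())
        if not match:
            continue  # summary lines (HEALTH_ERR, recovery ...) are skipped
        pgid, states, acting, count = match.groups()
        results.append({
            "pgid": pgid,
            "states": states.split("+"),
            "acting": [int(osd) for osd in acting.split(",") if osd],
            "unfound": int(count),
        })
    return results
```

Each returned pgid is a candidate for the list_missing and query commands shown earlier.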
|
|