Original poster | Posted on 2022-8-11 15:39:59
Create and initialize the file system

ceph osd pool create metadata_pool 64 64 replicated cache_rule
(cache_rule is the CRUSH rule for the SSDs)
ceph osd pool create data_pool 256 256
ceph fs new cephfs metadata_pool data_pool
systemctl start ceph-mds.target

Mount the client and write test data

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs
mkdir -p /mnt/cephfs/a1/a2/a3/a4
echo hello1 > /mnt/cephfs/1
echo hello2 > /mnt/cephfs/a1/2
echo hello3 > /mnt/cephfs/a1/a2/3
echo hello4 > /mnt/cephfs/a1/a2/a3/4
umount -l /mnt/cephfs

Simulate the failure

Delete every object in metadata_pool, then restart the MDS.

for i in $(rados -p metadata_pool ls); do rados -p metadata_pool rm "$i"; done
systemctl restart ceph-mds.target

Recovery steps

Allow multiple file systems:
ceph fs flag set enable_multiple true --yes-i-really-mean-it

Create a new metadata pool. This leaves the original metadata pool untouched, so the original metadata is not damaged any further:
ceph osd pool create recovery 8

Associate the old data pool (data_pool) with the new metadata pool (recovery) by creating a new file system, recovery-fs:
ceph fs new recovery-fs recovery data_pool --allow-dangerous-metadata-overlay

Initialize the new file system:
cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery

Reset the new file system:
ceph fs reset recovery-fs --yes-i-really-mean-it
cephfs-table-tool recovery-fs:all reset session
cephfs-table-tool recovery-fs:all reset snap
cephfs-table-tool recovery-fs:all reset inode

Run the recovery:
cephfs-data-scan scan_extents --force-pool --alternate-pool recovery --filesystem cephfs data_pool
cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem cephfs --force-corrupt --force-init data_pool
cephfs-data-scan scan_links --filesystem recovery-fs
systemctl start ceph-mds.target

Wait until the MDS is active before continuing:
ceph daemon mds.mon0 scrub_path / recursive repair
ceph fs set-default recovery-fs

Mount the client and verify

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs
ls -la /mnt/cephfs

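In the actual test, the lost+found folder did not appear at all.

Analysis

Reading the cephfs-data-scan source shows that the tool is pointed at the file system's data pool and scans the backtrace of each object that holds file contents. The backtrace records the file name of the inode and its directory structure, and the tool uses it to build the inode objects for every directory level in the new metadata pool. Data that was still in the journal and had not been flushed to the data pool has no backtrace there, so the directory structure of that inode cannot be determined. In that case a lost+found folder is created under the root directory, and the data that was never flushed to the data pool is placed in it. This folder corresponds to the metadata object 4.00000000.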
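
The placement rule described above can be sketched as a small toy model. This is not code from cephfs-data-scan: `DataObject`, `placement_path`, and the choice to name orphaned entries by their hexadecimal inode number are assumptions made purely for illustration, and an empty `backtrace` vector stands in for a backtrace that was never flushed to the data pool.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for the first object of a file in the data pool.
// `backtrace` lists the ancestor directory names followed by the file name,
// root first. An empty vector models "no backtrace was flushed".
struct DataObject {
    uint64_t ino;
    std::vector<std::string> backtrace;
};

// Hypothetical helper: decide where a recovered inode would be linked in the
// rebuilt metadata tree.
std::string placement_path(const DataObject& obj) {
    if (!obj.backtrace.empty()) {
        // Backtrace present: the original path can be rebuilt level by level.
        std::string path;
        for (const auto& name : obj.backtrace) {
            path += "/" + name;
        }
        return path;
    }
    // No backtrace: the directory structure is unknown, so the inode is
    // parked under /lost+found (naming by hex inode number is an assumption).
    std::ostringstream name;
    name << std::hex << obj.ino;
    return "/lost+found/" + name.str();
}
```

In this model a file such as /mnt/cephfs/a1/a2/3 is restored to /a1/a2/3 only if its backtrace reached the data pool before the metadata was lost; otherwise it ends up under /lost+found.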

The root directory object holds the entry for the lost+found folder.

The lost+found object holds the entries for the objects that have no backtrace.

At this point the root directory in the recovery pool already contains the corresponding metadata, yet the MDS did not load the 1.00000000 object when it started.

The MDS source shows that the root directory is loaded by the open_root() function in MDCache.cc:

void MDCache::open_root()
{
  dout(10) << "open_root" << dendl;

  if (!root) {
    open_root_inode(new C_MDS_RetryOpenRoot(this));
    return;
  }
  if (mds->get_nodeid() == mds->mdsmap->get_root()) {
    assert(root->is_auth());
    CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
    assert(rootdir);
    if (!rootdir->is_subtree_root())
      adjust_subtree_auth(rootdir, mds->get_nodeid());
    if (!rootdir->is_complete()) {
      rootdir->fetch(new C_MDS_RetryOpenRoot(this));  // reads the omap of the 1.00000000 object
      return;
    }
  } else {
    assert(!root->is_auth());
    CDir *rootdir = root->get_dirfrag(frag_t());
    if (!rootdir) {
      open_remote_dirfrag(root, frag_t(), new C_MDS_RetryOpenRoot(this));
      return;
    }
  }

  if (!myin) {
    CInode *in = create_system_inode(MDS_INO_MDSDIR(mds->get_nodeid()), S_IFDIR|0755);  // initially inaccurate!
    in->fetch(new C_MDS_RetryOpenRoot(this));
    return;
  }
  CDir *mydir = myin->get_or_open_dirfrag(this, frag_t());
  assert(mydir);
  adjust_subtree_auth(mydir, mds->get_nodeid());

  populate_mydir();
}

The code shows that the omap of the 1.00000000 object is read only when rootdir->is_complete() is false. A true result from rootdir->is_complete() means the root directory contents are already fully loaded in memory.

When the new file system recovery-fs starts, the MDS goes straight from the creating state to the active state. During creating it calls boot_create() to set up the initial MDS state and in-memory structures, and as part of that create_empty_hierarchy() builds the root directory:

void MDCache::create_empty_hierarchy(MDSGather *gather)
{
  // create root dir
  CInode *root = create_root_inode();

  // force empty root dir
  CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
  adjust_subtree_auth(rootdir, mds->get_nodeid());
  rootdir->dir_rep = CDir::REP_ALL;   //NONE;

  assert(rootdir->fnode.accounted_fragstat == rootdir->fnode.fragstat);
  assert(rootdir->fnode.fragstat == root->inode.dirstat);
  assert(rootdir->fnode.accounted_rstat == rootdir->fnode.rstat);
  /* Do no update rootdir rstat information of the fragment, rstat upkeep magic
   * assume version 0 is stale/invalid.
   */

  rootdir->mark_complete();
  rootdir->mark_dirty(rootdir->dirty(), mds->mdlog->get_current_segment());
  rootdir->commit(0, gather->new_sub());

  root->mark_clean();
  root->mark_dirty(root->dirty(), mds->mdlog->get_current_segment());
  root->mark_dirty_parent(mds->mdlog->get_current_segment(), true);
  root->flush(gather->new_sub());
}

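Here rootdir->mark_complete() marks the root directory as complete, so once the MDS becomes active it never reads the root directory from the metadata pool. The entries that cephfs-data-scan wrote under the root, including lost+found, are therefore not loaded.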
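
The interaction between the two functions can be reproduced with a minimal sketch. It is a toy model, not MDS code: `ToyDir`, `toy_open_root`, and `toy_create_empty_hierarchy` are hypothetical names, `backing` plays the role of the omap of the root directory object in the metadata pool, and `cache` plays the role of the in-memory dentries.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy directory fragment. `backing` models what is stored in the metadata
// pool, `cache` models what the MDS holds in memory.
struct ToyDir {
    std::map<std::string, int> backing;
    std::map<std::string, int> cache;
    bool complete = false;

    void mark_complete() { complete = true; }
    bool is_complete() const { return complete; }
    // Load everything from the backing store, after which the dir is complete.
    void fetch() { cache = backing; complete = true; }
};

// Mirrors the branch in open_root(): the backing store is read only when the
// directory is not yet complete. Returns true if a fetch happened.
bool toy_open_root(ToyDir& rootdir) {
    if (!rootdir.is_complete()) {
        rootdir.fetch();
        return true;
    }
    return false;
}

// Mirrors the relevant part of create_empty_hierarchy(): force an empty
// in-memory root and mark it complete without looking at the backing store.
void toy_create_empty_hierarchy(ToyDir& rootdir) {
    rootdir.cache.clear();
    rootdir.mark_complete();
}
```

If `backing` already holds a lost+found entry and `toy_create_empty_hierarchy` runs first, `toy_open_root` skips the fetch and `cache` stays empty, which matches the empty listing seen in the test. Calling `toy_open_root` on a directory that was never marked complete loads the entry as expected.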