Original poster | Posted on 2022-8-11 15:39:59
Create and initialize the file system

ceph osd pool create metadata_pool 64 64 replicated cache_rule
(cache_rule is the CRUSH rule for the SSDs)
ceph osd pool create data_pool 256 256
ceph fs new cephfs metadata_pool data_pool
systemctl start ceph-mds.target

Mount a client and write test data

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs
mkdir -p /mnt/cephfs/a1/a2/a3/a4
echo hello1 > /mnt/cephfs/1
echo hello2 > /mnt/cephfs/a1/2
echo hello3 > /mnt/cephfs/a1/a2/3
echo hello4 > /mnt/cephfs/a1/a2/a3/4
umount -l /mnt/cephfs

Simulate the failure

Delete every object in metadata_pool, then restart the MDS

for i in $(rados -p metadata_pool ls); do rados -p metadata_pool rm "$i"; done
systemctl restart ceph-mds.target

Recovery steps

Allow multiple file systems

ceph fs flag set enable_multiple true --yes-i-really-mean-it

Create a new metadata pool, so that the original metadata pool is left untouched and cannot be damaged any further

ceph osd pool create recovery 8

Associate the old data pool with the new metadata pool recovery and create a new file system, recovery-fs

ceph fs new recovery-fs recovery data_pool --allow-dangerous-metadata-overlay

Initialize the new file system

cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery

Reset the new file system

ceph fs reset recovery-fs --yes-i-really-mean-it
cephfs-table-tool recovery-fs:all reset session
cephfs-table-tool recovery-fs:all reset snap
cephfs-table-tool recovery-fs:all reset inode

Run the recovery scans

cephfs-data-scan scan_extents --force-pool --alternate-pool recovery --filesystem cephfs data_pool
cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem cephfs --force-corrupt --force-init data_pool
cephfs-data-scan scan_links --filesystem recovery-fs
systemctl start ceph-mds.target

Wait until the MDS is active before continuing

ceph daemon mds.mon0 scrub_path / recursive repair
ceph fs set-default recovery-fs
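
The wait for the MDS to become active can be scripted instead of watched by hand. This is only a minimal sketch. It assumes that ceph mds stat reports the state as up:active (the exact output differs between releases) and that no other file system has an active MDS that would match first. The function name wait_for_mds_active and its two arguments are made up for illustration.

```shell
# Poll "ceph mds stat" until an MDS reports up:active.
# $1 = maximum number of polls (default 60), $2 = seconds between polls (default 5).
# Returns 0 once up:active is seen, 1 if the polls run out.
wait_for_mds_active() {
    tries=${1:-60}
    interval=${2:-5}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ceph mds stat 2>/dev/null | grep -q 'up:active'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

It could then gate the scrub step, for example: wait_for_mds_active 60 5 && ceph daemon mds.mon0 scrub_path / recursive repair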

Mount a client to verify

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs
ls -la /mnt/cephfs

In the actual test, the lost+found folder did not appear at all.

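Analysis

Reading the cephfs-data-scan source shows that the tool is pointed at the file system's data pool and scans the backtrace of the objects that hold file contents. The backtrace records the file name of the inode and its chain of parent directories, and the tool uses it to build the inode objects for each directory level in the new metadata pool. For data that was still in the journal and had not been flushed to the data pool, the data pool holds no backtrace, so the directory structure of that inode cannot be determined. In that case a lost+found folder is created under the root directory, and the data that was never flushed to the data pool is placed in it. The metadata of that folder is the object 4.00000000.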
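
To look at a backtrace by hand, one option is to read the parent xattr of a file's first data object and decode it with ceph-dencoder. This is only a sketch. It assumes the first object of an inode is named <inode number in hex>.00000000 and that the backtrace is stored in the xattr called parent. The helper names first_object_name and dump_backtrace are invented here.

```shell
# Name of the first data object of a file: <inode number in hex>.00000000
first_object_name() {
    printf '%x.%08x\n' "$1" 0
}

# Fetch the "parent" xattr (the encoded backtrace) and decode it to JSON.
# Usage: dump_backtrace <data pool> <inode number>
dump_backtrace() {
    tmp=$(mktemp) || return 1
    if rados -p "$1" getxattr "$(first_object_name "$2")" parent > "$tmp"; then
        ceph-dencoder type inode_backtrace_t import "$tmp" decode dump_json
        rc=$?
    else
        # No xattr (or no object): this is the case that ends up in lost+found.
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

For example, dump_backtrace data_pool 0x10000000000 would query a hypothetical inode number. A non-zero return for an object that exists suggests it carries no backtrace.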

The root directory object first stores the entry for the lost+found folder.

The lost+found object stores the entries for the objects that have no backtrace.

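
The two statements above can be checked directly against the recovery pool by listing the omap keys of the directory objects. This is a rough sketch. It assumes an unfragmented directory stored in <inode number in hex>.00000000 and dentry keys of the form <name>_head (snapshot entries use a different suffix and are left as they are). The function name list_dir_entries is hypothetical.

```shell
# List the dentry names stored in a directory object's omap.
# Usage: list_dir_entries <metadata pool> <directory inode number>
list_dir_entries() {
    rados -p "$1" listomapkeys "$(printf '%x.%08x' "$2" 0)" | sed 's/_head$//'
}
```

list_dir_entries recovery 1 would show the root directory, and list_dir_entries recovery 4 the lost+found folder.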
At this point the root directory in the recovery pool already holds the corresponding metadata, yet the MDS does not load the 1.00000000 object when it starts.

According to the MDS source, the root directory is loaded by the open_root() function in MDCache.cc:

void MDCache::open_root()
{
  dout(10) << "open_root" << dendl;

  if (!root) {
    open_root_inode(new C_MDS_RetryOpenRoot(this));
    return;
  }
  if (mds->get_nodeid() == mds->mdsmap->get_root()) {
    assert(root->is_auth());
    CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
    assert(rootdir);
    if (!rootdir->is_subtree_root())
      adjust_subtree_auth(rootdir, mds->get_nodeid());
    if (!rootdir->is_complete()) {
      rootdir->fetch(new C_MDS_RetryOpenRoot(this)); // read the omap of object 1.00000000
      return;
    }
  } else {
    assert(!root->is_auth());
    CDir *rootdir = root->get_dirfrag(frag_t());
    if (!rootdir) {
      open_remote_dirfrag(root, frag_t(), new C_MDS_RetryOpenRoot(this));
      return;
    }
  }

  if (!myin) {
    CInode *in = create_system_inode(MDS_INO_MDSDIR(mds->get_nodeid()), S_IFDIR|0755); // initially inaccurate!
    in->fetch(new C_MDS_RetryOpenRoot(this));
    return;
  }
  CDir *mydir = myin->get_or_open_dirfrag(this, frag_t());
  assert(mydir);
  adjust_subtree_auth(mydir, mds->get_nodeid());

  populate_mydir();
}

The code shows that the omap of object 1.00000000 is read only when rootdir->is_complete() is false. When rootdir->is_complete() is true, the root directory contents are considered fully loaded in memory.

When the new file system recovery-fs starts, the MDS goes straight from the creating state to active. During creating it calls boot_create() to build the initial MDS state and in-memory structures, and as part of that create_empty_hierarchy() builds the root directory:

void MDCache::create_empty_hierarchy(MDSGather *gather)
{
  // create root dir
  CInode *root = create_root_inode();

  // force empty root dir
  CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
  adjust_subtree_auth(rootdir, mds->get_nodeid());
  rootdir->dir_rep = CDir::REP_ALL; //NONE;

  assert(rootdir->fnode.accounted_fragstat == rootdir->fnode.fragstat);
  assert(rootdir->fnode.fragstat == root->inode.dirstat);
  assert(rootdir->fnode.accounted_rstat == rootdir->fnode.rstat);
  /* Do no update rootdir rstat information of the fragment, rstat upkeep magic
   * assume version 0 is stale/invalid.
   */

  rootdir->mark_complete();
  rootdir->mark_dirty(rootdir->dirty(), mds->mdlog->get_current_segment());
  rootdir->commit(0, gather->new_sub());

  root->mark_clean();
  root->mark_dirty(root->dirty(), mds->mdlog->get_current_segment());
  root->mark_dirty_parent(mds->mdlog->get_current_segment(), true);
  root->flush(gather->new_sub());
}

Here rootdir->mark_complete() marks the root dirfrag as complete, so once the MDS becomes active it never reads the root directory contents back from the pool.