Original poster | Posted on 2022-8-11 15:39:59
Create and initialize the file system

ceph osd pool create metadata_pool 64 64 replicated cache_rule
(cache_rule is the CRUSH rule for the SSDs)

ceph osd pool create data_pool 256 256

ceph fs new cephfs metadata_pool data_pool

systemctl start ceph-mds.target

Mount the client and write test data

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs

mkdir -p /mnt/cephfs/a1/a2/a3/a4

echo hello1 > /mnt/cephfs/1

echo hello2 > /mnt/cephfs/a1/2

echo hello3 > /mnt/cephfs/a1/a2/3

echo hello4 > /mnt/cephfs/a1/a2/a3/4

umount -l /mnt/cephfs

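The mkdir/echo sequence above can also be scripted so that the same layout is easy to recreate on every test run. This is only a sketch for a POSIX shell: the function name make_test_tree and its single argument (the directory to populate) are made up for illustration and are not part of any Ceph tooling.

```shell
#!/bin/sh
# Hypothetical helper: recreate the test layout used in this post under $1.
# It writes hello1..hello4 into files 1..4, each one directory level deeper:
#   $1/1, $1/a1/2, $1/a1/a2/3, $1/a1/a2/a3/4, plus the empty directory a4.
make_test_tree() {
    root=$1
    mkdir -p "$root/a1/a2/a3/a4" || return 1
    dir=$root
    n=1
    while [ "$n" -le 4 ]; do
        echo "hello$n" > "$dir/$n" || return 1
        dir="$dir/a$n"
        n=$((n + 1))
    done
}
```

With the client mounted, it would be called as: make_test_tree /mnt/cephfs
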
Simulate the failure

Delete every object in metadata_pool, then restart the MDS:

for i in $(rados -p metadata_pool ls); do rados -p metadata_pool rm "$i"; done

systemctl restart ceph-mds.target

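For repeated test runs, the deletion loop can be wrapped in a function that also reports how many objects it removed. This is a sketch under assumptions: purge_pool_objects is a hypothetical name, it relies only on the rados ls and rados rm subcommands already used above, and it is destructive, so it belongs on a disposable test cluster only.

```shell
#!/bin/sh
# Hypothetical helper: remove every object from pool $1 and print how many
# objects were removed. Reading names line by line keeps object names that
# contain spaces intact, which an unquoted `for i in $(...)` would split.
purge_pool_objects() {
    pool=$1
    count=0
    # The here-document feeds the listing to the loop without a pipeline,
    # so $count is still visible after the loop ends.
    while IFS= read -r obj; do
        [ -n "$obj" ] || continue
        rados -p "$pool" rm "$obj" < /dev/null || return 1
        count=$((count + 1))
    done <<EOF
$(rados -p "$pool" ls)
EOF
    echo "$count"
}
```

Usage for the scenario in this post would be: purge_pool_objects metadata_pool
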
Recovery steps

Allow multiple file systems:

ceph fs flag set enable_multiple true --yes-i-really-mean-it

Create a new metadata pool. This leaves the original metadata pool untouched, so the original metadata cannot be damaged any further:

ceph osd pool create recovery 8

Associate the old data pool with the new metadata pool recovery and create a new file system, recovery-fs:

ceph fs new recovery-fs recovery data_pool --allow-dangerous-metadata-overlay

Initialize the new file system:

cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery

Reset the new fs:

ceph fs reset recovery-fs --yes-i-really-mean-it

cephfs-table-tool recovery-fs:all reset session

cephfs-table-tool recovery-fs:all reset snap

cephfs-table-tool recovery-fs:all reset inode

Run the recovery:

cephfs-data-scan scan_extents --force-pool --alternate-pool recovery --filesystem cephfs data_pool

cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem cephfs --force-corrupt --force-init data_pool

cephfs-data-scan scan_links --filesystem recovery-fs

systemctl start ceph-mds.target

Wait until the MDS is active before continuing with the steps below.
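
Waiting for the MDS can be automated with a small polling loop. The sketch below assumes that the output of ceph mds stat contains the state string up:active once an MDS is active; the function name wait_for_mds_active and its two arguments (number of attempts, delay in seconds) are hypothetical. It does not distinguish between file systems, so it reports success as soon as any MDS is active.

```shell
#!/bin/sh
# Hypothetical helper: poll `ceph mds stat` until its output mentions
# "up:active", or give up after $1 attempts spaced $2 seconds apart.
# Defaults: 30 attempts, 2 seconds between attempts.
wait_for_mds_active() {
    attempts=${1:-30}
    delay=${2:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ceph mds stat 2>/dev/null | grep -q 'up:active'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A call such as wait_for_mds_active 60 5 followed by the scrub command would then only scrub once the MDS has come up.
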

ceph daemon mds.mon0 scrub_path / recursive repair

ceph fs set-default recovery-fs

Mount the client and verify

ceph-fuse -n client.admin -m 10.10.0.7 --key=AQDncgVfX4NDChAA4yAy6AmbK6YbfLha3zGA7w== /mnt/cephfs

ls -la /mnt/cephfs

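Beyond eyeballing the ls output, the four test files written earlier can be checked by content. This is a sketch only: verify_test_tree is a hypothetical name, and the paths and contents are the ones used in the test-data step of this post. A file that was recovered into lost+found instead of its original path is reported as missing here.

```shell
#!/bin/sh
# Hypothetical helper: check that the four test files written before the
# simulated failure exist under $1 with their original contents.
# Prints one line per file; returns non-zero if any file is missing or differs.
verify_test_tree() {
    root=$1
    status=0
    n=1
    for rel in 1 a1/2 a1/a2/3 a1/a2/a3/4; do
        if [ -f "$root/$rel" ] && [ "$(cat "$root/$rel")" = "hello$n" ]; then
            echo "ok       $rel"
        else
            echo "MISSING  $rel"
            status=1
        fi
        n=$((n + 1))
    done
    return "$status"
}
```

With the recovered file system mounted, it would be called as: verify_test_tree /mnt/cephfs
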
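In the actual test, the lost+found folder did not appear.

Analysis

Reading the cephfs-data-scan source shows how it works: given the file system's data pool, it scans the backtrace of the objects that hold file contents. The backtrace records the file name of the inode and the directory hierarchy above it, and cephfs-data-scan uses it to build the inode objects for each directory level in the new metadata pool. For data that was still in the journal and had not been flushed to the data pool, the data pool holds no backtrace, so the directory hierarchy of that inode cannot be determined. In that case a lost+found folder is created under the root directory, and the data that was never flushed to the data pool is placed in it. This folder corresponds to the metadata object 4.00000000.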
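
To look at a backtrace by hand, the usual approach is to read the parent xattr of a file's first data object and decode it with ceph-dencoder. The sketch below assumes that rados and ceph-dencoder are installed and that the backtrace is stored in the parent xattr; the function names first_object_name and dump_backtrace are hypothetical. The inode number is the decimal value printed by ls -i.

```shell
#!/bin/sh
# Hypothetical helper: map a decimal inode number (as printed by `ls -i`)
# to the name of the first data-pool object of that file.
# Object names are <inode in hex>.<8-digit hex object index>.
first_object_name() {
    printf '%x.%08x\n' "$1" 0
}

# Hypothetical helper: dump the backtrace of inode $2 stored in data pool $1.
# Assumes rados and ceph-dencoder are installed and a keyring is available.
dump_backtrace() {
    pool=$1
    obj=$(first_object_name "$2")
    tmpfile=$(mktemp) || return 1
    # Assumption: the backtrace lives in the "parent" xattr of the first object.
    rados -p "$pool" getxattr "$obj" parent > "$tmpfile" &&
        ceph-dencoder type inode_backtrace_t import "$tmpfile" decode dump_json
    rc=$?
    rm -f "$tmpfile"
    return "$rc"
}
```

For example, dump_backtrace data_pool 1099511627776 would query the object 10000000000.00000000 in data_pool.
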

The root directory object first stores the entry for the lost+found folder.

The lost+found object stores the entries for the objects that have no backtrace.

At this point the root directory in the recovery pool already has the corresponding metadata, yet the MDS does not load the 1.00000000 object when it starts.
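
Whether the root directory object really holds a given entry can be checked directly in the metadata pool. The sketch below is based on two assumptions: that the root dirfrag object is named 1.00000000, and that the live (non-snapshot) dentry for a name is stored under the omap key name_head. The function name root_has_dentry is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the root directory object in metadata
# pool $1 has an omap key for the dentry named $2.
# Assumption: the live dentry for NAME is stored under the key "NAME_head".
root_has_dentry() {
    pool=$1
    name=$2
    # -F: match the key literally, -x: match the whole line only.
    rados -p "$pool" listomapkeys 1.00000000 | grep -qxF -- "${name}_head"
}
```

For the recovery pool in this post, the check would be: root_has_dentry recovery 'lost+found' && echo present
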

Reading the MDS code shows that the root directory is loaded by the open_root() function in MDCache.cc:

void MDCache::open_root()
{
  dout(10) << "open_root" << dendl;

  if (!root) {
    open_root_inode(new C_MDS_RetryOpenRoot(this));
    return;
  }
  if (mds->get_nodeid() == mds->mdsmap->get_root()) {
    assert(root->is_auth());
    CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
    assert(rootdir);
    if (!rootdir->is_subtree_root())
      adjust_subtree_auth(rootdir, mds->get_nodeid());
    if (!rootdir->is_complete()) {
      rootdir->fetch(new C_MDS_RetryOpenRoot(this));  // reads the omap of the 1.00000000 object
      return;
    }
  } else {
    assert(!root->is_auth());
    CDir *rootdir = root->get_dirfrag(frag_t());
    if (!rootdir) {
      open_remote_dirfrag(root, frag_t(), new C_MDS_RetryOpenRoot(this));
      return;
    }
  }

  if (!myin) {
    CInode *in = create_system_inode(MDS_INO_MDSDIR(mds->get_nodeid()), S_IFDIR|0755);  // initially inaccurate!
    in->fetch(new C_MDS_RetryOpenRoot(this));
    return;
  }
  CDir *mydir = myin->get_or_open_dirfrag(this, frag_t());
  assert(mydir);
  adjust_subtree_auth(mydir, mds->get_nodeid());

  populate_mydir();
}

The code shows that the omap of the 1.00000000 object is read only when rootdir->is_complete() is false. When rootdir->is_complete() is true, the MDS considers the root directory to be fully loaded into memory already.

The new file system recovery-fs goes straight from the creating state to the active state when it starts. During creating, boot_create() is called to set up the MDS's initial data and in-memory structures, and within it create_empty_hierarchy() builds the root directory:

void MDCache::create_empty_hierarchy(MDSGather *gather)
{
  // create root dir
  CInode *root = create_root_inode();

  // force empty root dir
  CDir *rootdir = root->get_or_open_dirfrag(this, frag_t());
  adjust_subtree_auth(rootdir, mds->get_nodeid());
  rootdir->dir_rep = CDir::REP_ALL;   //NONE;

  assert(rootdir->fnode.accounted_fragstat == rootdir->fnode.fragstat);
  assert(rootdir->fnode.fragstat == root->inode.dirstat);
  assert(rootdir->fnode.accounted_rstat == rootdir->fnode.rstat);
  /* Do no update rootdir rstat information of the fragment, rstat upkeep magic
   * assume version 0 is stale/invalid.
   */

  rootdir->mark_complete();
  rootdir->mark_dirty(rootdir->pre_dirty(), mds->mdlog->get_current_segment());
  rootdir->commit(0, gather->new_sub());

  root->mark_clean();
  root->mark_dirty(root->pre_dirty(), mds->mdlog->get_current_segment());
  root->mark_dirty_parent(mds->mdlog->get_current_segment(), true);
  root->flush(gather->new_sub());
}

Here rootdir->mark_complete() marks the root directory as complete, so once the MDS becomes active it never reads the root directory object from the pool. That is consistent with the lost+found entry written by cephfs-data-scan not showing up on the client.