Ceph Troubleshooting Methods: A Summary
Posted 2022-7-27 17:39:17

RBD image cannot be deleted
Deleting an RBD image fails with the following error:

$ rbd rm nextcloud/mysql
2020-05-13 16:27:46.155 7f024bfff700 -1 librbd::image::RemoveRequest: 0x557a7af027a0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

$ rbd info nextcloud/mysql
rbd image 'mysql':
        size 40 GiB in 10240 objects
        order 22 (4 MiB objects)
        id: 17e006b8b4567
        block_name_prefix: rbd_data.17e006b8b4567
        format: 2
        features: layering
        op_features:
        flags:
        create_timestamp: Tue Oct 15 10:47:34 2019

Check the image's current status:

$ rbd status nextcloud/mysql
Watchers:
        watcher=10.100.21.95:0/115493307 client.67866 cookie=7

A client node still has the image mapped. Log in to the machine at the watcher's address (10.100.21.95 here) and check:

$ rbd showmapped
id pool       image                                                       snap device
...
3  nextcloud  mysql                                                       -    /dev/rbd3

Unmap it:

$ rbd unmap nextcloud/mysql

Then run the delete again:

$ rbd rm nextcloud/mysql
Removing image: 100% complete...done.

Brute-force alternative: add the watcher's address to the OSD blacklist so the node that still has the image mapped is cut off, then delete (on Ceph Pacific and later the subcommand is named blocklist):

$ ceph osd blacklist add 10.100.21.95:0/115493307
$ rbd rm nextcloud/mysql

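To find which node to log in to, or which address to blacklist, the watcher addresses can be pulled out of the rbd status output. A minimal sketch, assuming the watcher=<addr> client.<id> cookie=<n> line format shown above; the function name rbd_watcher_addrs is hypothetical:

```shell
# Hypothetical helper: print the client address of every watcher in `rbd status` output.
# Expects lines like: watcher=10.100.21.95:0/115493307 client.67866 cookie=7
rbd_watcher_addrs() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^watcher=/) {
        sub(/^watcher=/, "", $i)   # drop the "watcher=" prefix, keep ip:port/nonce
        print $i
      }
    }
  }'
}

# Example: rbd status nextcloud/mysql | rbd_watcher_addrs
# The part before the first ":" is the IP of the node that still has the image open.
```

Each printed address is in the form ceph osd blacklist add expects, but unmapping on the owning node remains the cleaner fix.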
OSD latency
Check whether any OSD shows high latency:

$ ceph osd perf
osd commit_latency(ms) apply_latency(ms)
  2                  0                 0
  1                  0                 0
  0                  0                 0

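All three OSDs above report 0 ms; on a larger cluster it helps to filter for outliers. A sketch, assuming the three-column ceph osd perf layout shown above; the 100 ms default is an arbitrary example threshold, not a Ceph recommendation, and the helper name is hypothetical:

```shell
# Hypothetical helper: list OSDs whose commit or apply latency (ms) exceeds a limit.
# Reads `ceph osd perf` output on stdin; data rows are "<osd id> <commit ms> <apply ms>".
osd_slow_latency() {
  local limit="${1:-100}"   # arbitrary example threshold in ms
  awk -v limit="$limit" '
    $1 ~ /^[0-9]+$/ && ($2 > limit || $3 > limit) {
      printf "osd.%s commit=%sms apply=%sms\n", $1, $2, $3
    }'
}

# Example: ceph osd perf | osd_slow_latency 100
```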
Defragmentation
Check the fragmentation level:

$ xfs_db -c frag -r /dev/mapper/VolGroup-lv_data1

Defragment:
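For the defragmentation step itself, the standard XFS tool is xfs_fsr, which reorganizes files on a mounted XFS filesystem. A sketch that runs it only when the fragmentation factor reported by xfs_db crosses a threshold; the 20% trigger and the helper names are assumptions, and the parser assumes xfs_db prints a line of the form "actual N, ideal M, fragmentation factor X%":

```shell
# Hypothetical helpers around xfs_db / xfs_fsr.
# Print the fragmentation factor (without the % sign) from `xfs_db -c frag -r <dev>` output.
xfs_frag_factor() {
  sed -n 's/.*fragmentation factor \([0-9.]*\)%.*/\1/p'
}

# Succeed (exit 0) when factor $1 is above threshold $2 (percent, default 20).
needs_defrag() {
  awk -v f="$1" -v t="${2:-20}" 'BEGIN { exit !(f + 0 > t + 0) }'
}

# Example:
#   dev=/dev/mapper/VolGroup-lv_data1
#   factor=$(xfs_db -c frag -r "$dev" | xfs_frag_factor)
#   if needs_defrag "$factor" 20; then xfs_fsr -v "$dev"; fi
```

The fragmentation factor is a coarse indicator, so treat the threshold as a heuristic rather than a hard rule.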

Power-on time
Check a disk's power-on time (SMART attribute Power_On_Hours). SMART data is reported by the physical drive, so if the path below is an LVM logical volume, run smartctl against the underlying disk instead:

$ smartctl -A /dev/mapper/VolGroup-lv_data1

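To pull just the power-on time out of the attribute table, a sketch assuming the standard ATA attribute layout of smartctl -A (RAW_VALUE in the tenth column); SAS and NVMe drives print a different format, and the helper name is hypothetical:

```shell
# Hypothetical helper: print the raw Power_On_Hours value from `smartctl -A <disk>` output.
# ATA rows: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
# Some drives append extra text to the raw value (e.g. "1234h+05m"); that is printed as-is.
power_on_hours() {
  awk '$2 == "Power_On_Hours" { print $10; exit }'
}

# Example (the device name is an assumption; use the physical disk behind the volume):
#   hours=$(smartctl -A /dev/sda | power_on_hours)
#   echo "$(( hours / 24 )) days"
```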
Changing the replica count
Change a pool's replica count (size) and the minimum number of replicas required to serve I/O (min_size). With min_size 1 a placement group keeps accepting writes when only one copy is left, which risks data loss if that OSD then fails:

$ ceph osd pool set fs_data2 min_size 1
$ ceph osd pool set fs_data2 size 2

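Since min_size 1 trades safety for availability, it is worth checking which pools are left in that state. A sketch, assuming the "pool <id> '<name>' replicated size N min_size M ..." line format of ceph osd pool ls detail; the helper name is hypothetical:

```shell
# Hypothetical helper: list pools whose min_size is 1, from `ceph osd pool ls detail` output.
pools_with_min_size_1() {
  awk -v q="'" '$1 == "pool" {
    name = $3
    gsub(q, "", name)                 # strip the quotes around the pool name
    size = ""; min = ""
    for (i = 4; i < NF; i++) {
      if ($i == "size") size = $(i + 1)
      if ($i == "min_size") min = $(i + 1)
    }
    if (min == 1) printf "%s size=%s min_size=%s\n", name, size, min
  }'
}

# Example: ceph osd pool ls detail | pools_with_min_size_1
```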
Adding / removing a data pool
Add a data pool to, or remove it from, the CephFS file system named fs (the pool itself must already exist):

$ ceph fs add_data_pool fs fs_data2
$ ceph fs rm_data_pool fs fs_data2

Balancing OSD data distribution
Even out data across OSDs with the balancer module:

$ ceph balancer status
$ ceph balancer on
$ ceph balancer mode crush-compat

MDS status cannot be queried
ceph fs status fails with a traceback from the mgr status module:

$ ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/status/module.py", line 311, in handle_command
    return self.handle_fs_status(cmd)
  File "/usr/lib64/ceph/mgr/status/module.py", line 177, in handle_fs_status
    mds_versions[metadata.get('ceph_version', "unknown")].append(info['name'])
AttributeError: 'NoneType' object has no attribute 'get'

$ ceph mds metadata
[
    {
        "name": "BJ-YZ-CEPH-94-54"
    },
    {
        "name": "BJ-YZ-CEPH-94-53",
        "addr": "10.100.94.53:6825/4233274463",
        "arch": "x86_64",
        "ceph_release": "mimic",
        "ceph_version": "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)",
        "ceph_version_short": "13.2.10",
        "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
        "distro": "centos",
        "distro_description": "CentOS Linux 7 (Core)",
        "distro_version": "7",
        "hostname": "BJ-YZ-CEPH-94-53",
        "kernel_description": "#1 SMP Sat Dec 10 18:16:05 EST 2016",
        "kernel_version": "4.4.38-1.el7.elrepo.x86_64",
        "mem_swap_kb": "67108860",
        "mem_total_kb": "131914936",
        "os": "Linux"
    },
    {
        "name": "BJ-YZ-CEPH-94-52",
        "addr": "10.100.94.52:6800/3956121270",
        "arch": "x86_64",
        "ceph_release": "mimic",
        "ceph_version": "ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)",
        "ceph_version_short": "13.2.10",
        "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
        "distro": "centos",
        "distro_description": "CentOS Linux 7 (Core)",
        "distro_version": "7",
        "hostname": "BJ-YZ-CEPH-94-52",
        "kernel_description": "#1 SMP Sat Dec 10 18:16:05 EST 2016",
        "kernel_version": "4.4.38-1.el7.elrepo.x86_64",
        "mem_swap_kb": "67108860",
        "mem_total_kb": "131914936",
        "os": "Linux"
    }
]

The metadata entry for BJ-YZ-CEPH-94-54 contains only its name, so the status module gets no metadata (None) for that daemon and the .get call fails. Restarting the MDS fixes it.
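On a cluster with many MDS daemons, the incomplete entry is easy to miss in the JSON. A rough sketch that prints every MDS whose metadata block lacks a ceph_version field; it assumes the pretty-printed layout above (one field per line, each object closed by a line starting with "}") and is a line-based heuristic, not a JSON parser. The helper name is hypothetical:

```shell
# Hypothetical helper: print names of MDS entries in `ceph mds metadata` output
# that have no "ceph_version" field (the case that breaks `ceph fs status` above).
mds_missing_metadata() {
  awk '
    /"name":/         { split($0, a, "\""); name = a[4]; has_version = 0 }
    /"ceph_version":/ { has_version = 1 }
    /^[ \t]*}/        { if (name != "" && !has_version) print name; name = "" }
  '
}

# Example: ceph mds metadata | mds_missing_metadata
```

If jq is installed, a JSON-aware query would be more robust than this line matching.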

CephFS reports a healthy status but cannot be written to
When CephFS looks healthy but is unusable, the cause is usually a misbehaving client. List the client sessions on the MDS, then try evicting the suspect session:

$ ceph tell mds.BJ-YZ-CEPH-94-52 session ls
$ ceph tell mds.BJ-YZ-CEPH-94-52 session evict id=834283

Session IDs are specific to each MDS: a session must be evicted on the MDS that holds it, not through a different MDS node.

Adding an MDS to a file system
Allow a second active MDS for the file system fs (a standby MDS must be available to take the new rank):

$ ceph fs set fs max_mds 2

Monitor clock problems
Some monitors report errors because the clocks on their hosts are out of sync:

$ ceph -s
  cluster:
    id:     2f77b028-ed2a-4010-9b79-90fd3052afc6
    health: HEALTH_WARN
            9 slow ops, oldest one blocked for 211643 sec, daemons [mon.BJ-YZ-CEPH-94-53,mon.BJ-YZ-CEPH-94-54] have slow ops.

  services:
    mon: 3 daemons, quorum BJ-YZ-CEPH-94-52,BJ-YZ-CEPH-94-53,BJ-YZ-CEPH-94-54
    mgr: BJ-YZ-CEPH-94-52(active), standbys: BJ-YZ-CEPH-94-54, BJ-YZ-CEPH-94-53
    mds: fs-2/2/2 up  {0=BJ-YZ-CEPH-94-52=up:active,1=BJ-YZ-CEPH-94-53=up:active}, 1 up:standby-replay
    osd: 36 osds: 36 up, 36 in

  data:
    pools:   7 pools, 1152 pgs
    objects: 37.66 M objects, 67 TiB
    usage:   136 TiB used, 126 TiB / 262 TiB avail
    pgs:     1148 active+clean
             4    active+clean+scrubbing+deep

  io:
    client:   13 KiB/s rd, 27 MiB/s wr, 2 op/s rd, 19 op/s wr

Configure the NTP service and make sure it is running (also run systemctl enable ntpd so it starts at boot):

$ systemctl status ntpd
$ systemctl start ntpd

Then restart ceph-mon.target on the affected monitor nodes:

$ systemctl status ceph-mon.target
$ systemctl restart ceph-mon.target

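To confirm the clock is actually the problem before restarting anything, compare the NTP offset with the monitors' allowed drift. A sketch, assuming ntpd's ntpq -p peer table (offset in milliseconds in the ninth column, selected peer marked with "*") and Ceph's default mon clock drift allowed of 0.05 s; the helper names are hypothetical:

```shell
# Hypothetical helpers for checking clock offset against the monitor drift limit.
# Print the absolute offset (ms) of the selected peer (line starting with "*") in `ntpq -p` output.
# Prints nothing when no peer is selected, i.e. the clock is not synchronized at all.
ntp_offset_ms() {
  awk '/^\*/ { o = $9; if (o < 0) o = -o; print o; exit }'
}

# Succeed when offset $1 (ms) exceeds the allowed drift $2 (seconds, default 0.05).
offset_exceeds_drift() {
  awk -v o="$1" -v d="${2:-0.05}" 'BEGIN { exit !(o + 0 > d * 1000) }'
}

# Example:
#   off=$(ntpq -p | ntp_offset_ms)
#   offset_exceeds_drift "$off" 0.05 && echo "offset ${off} ms exceeds mon clock drift allowed"
```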
1 MDSs report slow requests
The error is:

$ ceph -s
  cluster:
    id:     b313ec26-5aa0-4db2-9fb5-a38b207471ee
    health: HEALTH_WARN
            1 MDSs report slow requests
            Reduced data availability: 38 pgs inactive
            Degraded data redundancy: 122006/1192166 objects degraded (10.234%), 102 pgs degraded, 116 pgs undersized
            101 slow ops, oldest one blocked for 81045 sec, daemons [osd.1,osd.2] have slow ops.

Restarting the monitors fixes it:

$ systemctl restart ceph-mon.target

If that does not help, restart the MDS. The MDS daemons here are named after the short hostname (master001 and so on), so use hostname -s rather than $HOSTNAME, which may hold the fully qualified name:

$ systemctl restart ceph-mds@$(hostname -s)

Reduced data availability: 38 pgs inactive
The error is as follows (reference: https://zhuanlan.zhihu.com/p/74323736 [1]):

$ ceph -s
  cluster:
    id:     b313ec26-5aa0-4db2-9fb5-a38b207471ee
    health: HEALTH_WARN
            1 MDSs report slow requests
            Reduced data availability: 38 pgs inactive
            145 slow ops, oldest one blocked for 184238 sec, daemons [osd.1,osd.2] have slow ops.

  services:
    mon: 3 daemons, quorum master001,master002,master003
    mgr: master001(active), standbys: master002, master003
    mds: kubernetes-2/2/2 up  {0=master001=up:active,1=master002=up:active}, 1 up:standby
    osd: 3 osds: 3 up, 3 in
    rgw: 1 daemon active

  data:
    pools:   9 pools, 244 pgs
    objects: 535.1 k objects, 177 GiB
    usage:   470 GiB used, 4.1 TiB / 4.6 TiB avail
    pgs:     15.574% pgs unknown
             206 active+clean
             38  unknown

  io:
    client:   35 KiB/s wr, 0 op/s rd, 2 op/s wr

These PGs have lost their data and cannot recover on their own. The fix is to clear the PG data and let Ceph recreate the PGs, which can lose data (with a pool size of 1, data loss is certain).

First, identify the problematic PGs.

Then query one of them for details:

$ ceph pg 1.6e query
Error ENOENT: i don't have pgid 1.6e

The query cannot find the PG, so list the stuck PGs instead:

$ ceph pg dump_stuck unclean
ok
PG_STAT STATE   UP UP_PRIMARY ACTING ACTING_PRIMARY
1.74    unknown []         -1     []             -1
1.70    unknown []         -1     []             -1
1.6a    unknown []         -1     []             -1
1.2d    unknown []         -1     []             -1
1.20    unknown []         -1     []             -1
1.1e    unknown []         -1     []             -1
1.1c    unknown []         -1     []             -1
1.17    unknown []         -1     []             -1
1.9     unknown []         -1     []             -1
1.29    unknown []         -1     []             -1
1.56    unknown []         -1     []             -1
1.72    unknown []         -1     []             -1
1.45    unknown []         -1     []             -1
1.4e    unknown []         -1     []             -1
1.46    unknown []         -1     []             -1
1.22    unknown []         -1     []             -1
1.53    unknown []         -1     []             -1
1.59    unknown []         -1     []             -1
1.24    unknown []         -1     []             -1
1.55    unknown []         -1     []             -1
1.3f    unknown []         -1     []             -1
1.38    unknown []         -1     []             -1
1.a     unknown []         -1     []             -1
1.7     unknown []         -1     []             -1
1.34    unknown []         -1     []             -1
1.64    unknown []         -1     []             -1
1.6     unknown []         -1     []             -1
1.32    unknown []         -1     []             -1
1.4     unknown []         -1     []             -1
1.2e    unknown []         -1     []             -1
1.31    unknown []         -1     []             -1
1.5e    unknown []         -1     []             -1
1.0     unknown []         -1     []             -1
1.42    unknown []         -1     []             -1
1.15    unknown []         -1     []             -1
1.6e    unknown []         -1     []             -1
1.41    unknown []         -1     []             -1
1.10    unknown []         -1     []             -1

Force-recreate each PG, discarding its data (see https://docs.ceph.com/docs/mimic ... troubleshooting-pg/ [2]):

$ ceph osd force-create-pg 1.74 --yes-i-really-mean-it

# Batch version; the awk filter skips the "ok" and header lines so only PG IDs reach xargs
$ ceph pg dump_stuck unclean | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' | xargs -I{} ceph osd force-create-pg {} --yes-i-really-mean-it

Once this completes, the cluster recovers.
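Because force-create-pg throws away whatever the PG held, it can be safer to generate the commands first and review them rather than piping straight into xargs. A sketch that keeps only PGs in the unknown state, assuming the dump_stuck column layout shown above; the helper names are hypothetical:

```shell
# Hypothetical helpers: build (but do not run) force-create-pg commands for review.
# Keep only data rows (pgid like 1.6e) whose STATE column is "unknown".
stuck_unknown_pgids() {
  awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && $2 == "unknown" { print $1 }'
}

# Turn each pgid on stdin into the corresponding command line.
print_force_create_cmds() {
  local pgid
  while read -r pgid; do
    printf 'ceph osd force-create-pg %s --yes-i-really-mean-it\n' "$pgid"
  done
}

# Example:
#   ceph pg dump_stuck unclean | stuck_unknown_pgids | print_force_create_cmds > force-create.sh
#   # review force-create.sh, then: sh force-create.sh
```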

1 clients failing to respond to capability release
The error is:

$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdsmaster001(mds.0): Client master003.k8s.shileizcc-ops.com: failing to respond to capability release client_id: 284951

Evict this client ID to clear it (reference: https://blog.csdn.net/zuoyang1990/article/details/98530070 [3]):

$ ceph daemon mds.master003 session ls|grep 284951
$ ceph tell mds.master003 session evict id=284951

If it fails with the following error:

$ ceph tell mds.master003 session evict id=284951
2020-08-13 10:45:03.869 7f271b7fe700  0 client.306366 ms_handle_reset on 10.100.21.95:6800/1646216103
2020-08-13 10:45:03.881 7f2730ff9700  0 client.316415 ms_handle_reset on 10.100.21.95:6800/1646216103
Error EAGAIN: MDS is replaying log

the command must target rank mds.0, which the health output above shows as master001; otherwise the client cannot be found. Note that ceph daemon talks to the local admin socket, so it only works on the node hosting that MDS, while ceph tell mds.master001 works from any node.
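The MDS to target and the client ID can both be read from the health detail line. A sketch that only prints the command, assuming the "mds<name>(mds.<rank>): ... client_id: <id>" format shown above, where the daemon name follows a literal mds prefix (mdsmaster001 is daemon master001); the helper name is hypothetical:

```shell
# Hypothetical helper: turn MDS_CLIENT_LATE_RELEASE lines from `ceph health detail`
# into `ceph tell ... session evict` commands aimed at the MDS that reported them.
late_release_evict_cmds() {
  awk '/failing to respond to capability release client_id:/ {
    mds = $1
    sub(/^mds/, "", mds)     # "mdsmaster001(mds.0):" -> "master001(mds.0):"
    sub(/\(.*/, "", mds)     # "master001(mds.0):"    -> "master001"
    printf "ceph tell mds.%s session evict id=%s\n", mds, $NF
  }'
}

# Example: ceph health detail | late_release_evict_cmds
```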

Kernel tuning
Reference: https://blog.csdn.net/fuzhongfaya/article/details/80932766 [4]. The sysfs settings below take effect immediately but are lost on reboot:

$ echo "8192" > /sys/block/sda/queue/read_ahead_kb
$ echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
$ sysctl -p
$ echo "deadline" > /sys/block/sd[x]/queue/scheduler

# for SSDs
# echo "noop" > /sys/block/sd[x]/queue/scheduler

It is best to turn swap off entirely; tuning memory parameters such as vm.swappiness is only partly effective.
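The two scheduler lines above can be applied per disk automatically by reading each disk's rotational flag. A sketch, assuming a legacy (non-blk-mq) block layer such as the 4.4 elrepo kernel in the MDS metadata above, where deadline and noop exist (blk-mq kernels name them mq-deadline and none). It must run as root and does not persist across reboots (a udev rule is the usual way to make it stick); the sysfs root is a parameter only so the function can be tried against a scratch directory, and the function name is hypothetical:

```shell
# Hypothetical helper: pick an I/O scheduler per sd* disk from its rotational flag.
# $1: sysfs block directory (default /sys/block).
set_schedulers() {
  local sysfs="${1:-/sys/block}" dev
  for dev in "$sysfs"/sd*; do
    [ -r "$dev/queue/rotational" ] || continue
    if [ "$(cat "$dev/queue/rotational")" = "1" ]; then
      echo deadline > "$dev/queue/scheduler"   # spinning disk (HDD)
    else
      echo noop > "$dev/queue/scheduler"       # SSD
    fi
  done
}

# Example (as root): set_schedulers
```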

Configuration file
Configuration used on nodes with 40 cores and 128 GB of RAM. The filestore_* and journal_* options apply only to FileStore OSDs; BlueStore OSDs ignore them:

[global]
fsid = 2f77b028-ed2a-4010-9b79-90fd3052afc6
mon_initial_members = BJ-YZ-CEPH-94-52, BJ-YZ-CEPH-94-53, BJ-YZ-CEPH-94-54
mon_host = 10.100.94.52,10.100.94.53,10.100.94.54
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public network = 10.100.94.0/24
cluster network = 10.100.94.0/24

[mon.a]
host = BJ-YZ-CEPH-94-52
mon addr = 10.100.94.52:6789

[mon.b]
host = BJ-YZ-CEPH-94-53
mon addr = 10.100.94.53:6789

[mon.c]
host = BJ-YZ-CEPH-94-54
mon addr = 10.100.94.54:6789

[mon]
mon data = /var/lib/ceph/mon/ceph-$id

# Allowed clock drift between monitors (seconds), default 0.05
mon clock drift allowed = 1

# Minimum number of OSDs that must report an OSD down to the monitor, default 1
mon osd min down reporters = 1

# Seconds Ceph waits before marking a down OSD out, default 300
mon osd down out interval = 600

mon_allow_pool_delete = true

[osd]
# OSD data path
osd data = /var/lib/ceph/osd/ceph-$id

# Default pg and pgp count for new pools
osd pool default pg num = 1200
osd pool default pgp num = 1200

# OSD journal size (MB), default 5120
osd journal size = 20000

# Filesystem type used when formatting
osd mkfs type = xfs

# Extra options passed when formatting
osd mkfs options xfs = -f

# Store XATTRs in the object map; needed on EXT4, also usable on XFS or btrfs, default false
filestore xattr use omap = true

# Minimum interval (seconds) for syncing from the journal to the data disk, default 0.1
filestore min sync interval = 10

# Maximum interval (seconds) for syncing from the journal to the data disk, default 5
filestore max sync interval = 15

# Maximum number of operations the data disk queue accepts, default 500
filestore queue max ops = 25000

# Maximum bytes the data disk queue accepts, default 100 MB
filestore queue max bytes = 10485760

# Maximum number of operations the data disk can commit at once, default 500
filestore queue committing max ops = 5000

# Maximum bytes the data disk can commit at once, default 100 MB
filestore queue committing max bytes = 10485760000

# Controls the maximum number of files in a subdirectory before it is split, default 2
filestore split multiple = 8

# Minimum number of files in a subdirectory before it is merged into its parent, default 10
filestore merge threshold = 40

# Object file handle cache size, default 128
filestore fd cache size = 1024

# Number of concurrent filesystem operation threads, default 2
filestore op threads = 32

# Maximum bytes the journal writes at once, default 10485760
journal max write bytes = 1073741824

# Maximum number of entries the journal writes at once, default 100
journal max write entries = 10000

# Maximum number of operations queued for the journal, default 50
journal queue max ops = 50000

# Maximum bytes queued for the journal, default 33554432
journal queue max bytes = 10485760000

# Maximum size of a single OSD write (MB), default 90
osd max write size = 512

# Maximum client data allowed in memory (bytes), default 100
osd client message size cap = 2147483648

# Bytes read at a time during deep scrub, default 524288
osd deep scrub stride = 1310720

# Number of concurrent operation threads, default 2
osd op threads = 32

# Threads for disk-intensive work such as recovery and scrubbing, default 1
osd disk threads = 10

# Number of OSD maps to keep cached, default 500
osd map cache size = 10240

# Number of OSD maps the OSD process keeps in memory, default 50
osd map cache bl size = 1280

# XFS mount options for OSDs, default rw,noatime,inode64
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"

# Recovery operation priority, range 1-63; higher values use more resources, default 10
osd recovery op priority = 20

# Number of recovery requests active at the same time, default 15
osd recovery max active = 15

# Maximum number of backfills allowed per OSD, default 10
osd max backfills = 10

# Enable the strict queue cut-off for operations
osd op queue cut off = high

osd_deep_scrub_large_omap_object_key_threshold = 800000
osd_deep_scrub_large_omap_object_value_sum_threshold = 10737418240

[mds]
# MDS cache memory limit, about 60 GB
mds cache memory limit = 62212254726

# Cap revoke timeout (seconds), default 60
mds_revoke_cap_timeout = 360

mds log max segments = 51200
mds log max expiring = 51200

mds_beacon_grace = 300

# Hard limit on directory fragment size, default 100000
# https://docs.ceph.com/docs/master/cephfs/dirfrags/
mds_bal_fragment_size_max = 500000

## Official reference https://ceph.readthedocs.io/en/latest/cephfs/mds-config-ref/

[client]

# RBD cache, default true
rbd cache = true

# RBD cache size (bytes), default 33554432 (32 MB)
rbd cache size = 268435456

# Maximum dirty bytes allowed in write-back mode; 0 means write-through, default 25165824
rbd cache max dirty = 134217728

# How long (seconds) dirty data stays in the cache before being flushed to disk, default 1
rbd cache max dirty age = 5

client_try_dentry_invalidate = false

[mgr]
mgr modules = dashboard

# Huawei Cloud tuning guide https://support.huaweicloud.com/ ... object_05_0008.html
# https://poph163.com/2020/02/18/c ... %E8%B0%83%E4%BC%98/

full osd
8 p* {8 U+ M5 J/ N% Nfull osd 每个 osd 已经写满上限:https://docs.ceph.com/en/latest/ ... no-free-drive-space[5]
& K% m/ t' a" j4 B
" y# J; J3 e% r; f9 L. x6 ~, e$ ceph osd dump | grep full_ratio
- Y8 W9 v# c5 z% }# D: d- n% pfull_ratio 0.959 ~6 ?4 a) O) ^" ~: i0 d5 M
backfillfull_ratio 0.9
7 q9 N; H5 M9 `& Dnearfull_ratio 0.85; G5 W" F( N. F2 @2 t) r% z
复制5 h8 a  r# i, f  N: H! l  ]
集群状态:
# D& T1 S1 d6 T* b# X2 c1 [4 E; n* ]! G2 ]. x# _' L( q" f% m; r
$ ceph -s
6 g7 X" C$ n/ o. k; V  cluster:
( ?3 |9 `8 _1 w! {7 v    id:     2f77b028-ed2a-4010-9b79-90fd3052afc6
6 B  ]1 w: \+ s6 q* Y" ~    health: HEALTH_ERR
  C/ X+ |) o# F: V$ ]! _; P7 I* ]            2 backfillfull osd(s)5 W% t  W* ]% ]6 C" j; g
            1 full osd(s)3 X+ c, B0 K  i: z- f! J: ~- k; w
            2 nearfull osd(s); ]% R; q& Y; j' ?& v: w
            7 pool(s) full
5 X& T% h6 n" H' K复制* Q6 z7 h8 Q6 [3 B8 Z, t7 x
执行 osd 磁盘状态时,如果已经有超过 95% 使用率时则会报错 full osd 则会造成 cluster 无法正常使用:: u$ V. G) A( x1 s. g! R3 f$ O

  N0 S& w" w& d+ P$ ceph osd df
2 o! m* A' v( ~* D" P! V5 F* V9 r. IID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP     META    AVAIL   %USE  VAR  PGS
6 t7 L4 F3 ^2 H4 K5 q! n 0   hdd 7.27689  1.00000 7.3 TiB 4.7 TiB 4.7 TiB  918 MiB 9.1 GiB 2.5 TiB 65.15 0.84  68
8 x0 t; G9 t' r6 H 1   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  327 MiB  11 GiB 1.2 TiB 84.07 1.09  67
& w# a# ?; k" }7 U 2   hdd 7.27689  1.00000 7.3 TiB 4.3 TiB 4.3 TiB  924 MiB 8.4 GiB 2.9 TiB 59.70 0.77  67
- e( m1 F: g' B0 H 3   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB  807 MiB 9.8 GiB 2.1 TiB 70.57 0.91  66! x- u, F7 w% K( m0 q" S. y# g$ C
4   hdd 7.27689  1.00000 7.3 TiB 6.7 TiB 6.7 TiB  770 MiB  13 GiB 583 GiB 92.18 1.19  66
, |: C$ _4 k4 U1 a: F, ^' F 5   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB  623 MiB  10 GiB 1.8 TiB 75.87 0.98  66
, m! z- d, R! a 6   hdd 7.27689  1.00000 7.3 TiB 5.7 TiB 5.7 TiB  602 MiB  11 GiB 1.6 TiB 78.67 1.02  64; v. V. {9 i' m+ v& k2 z
7   hdd 7.27689  1.00000 7.3 TiB 5.3 TiB 5.3 TiB  1.1 GiB  10 GiB 1.9 TiB 73.35 0.95  65, a7 g' M0 R0 a% l# P, A) a
8   hdd 7.27689  1.00000 7.3 TiB 5.9 TiB 5.9 TiB  498 MiB  11 GiB 1.4 TiB 81.29 1.05  68
4 _- s( q+ u& |& t 9   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB  1.1 GiB 9.8 GiB 2.1 TiB 70.59 0.91  65+ ?) J( g% n0 W" a  }  D6 S; v% W* G
10   hdd 7.27689  1.00000 7.3 TiB 6.3 TiB 6.3 TiB  297 MiB  12 GiB 985 GiB 86.78 1.12  61
. m5 y! P5 }! T0 A8 {# l11   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB  923 MiB 9.7 GiB 2.1 TiB 70.56 0.91  67
5 C3 c  A5 r2 `5 j6 O12   hdd 7.27689  1.00000 7.3 TiB 5.9 TiB 5.9 TiB  203 MiB  11 GiB 1.4 TiB 81.39 1.05  65! W8 G3 I8 o" L. Q. _; Z0 @
13   hdd 7.27689  1.00000 7.3 TiB 5.3 TiB 5.3 TiB  799 MiB  10 GiB 1.9 TiB 73.29 0.95  66
' }5 A( a2 a: V14   hdd 7.27689  1.00000 7.3 TiB 4.9 TiB 4.9 TiB  873 MiB 9.4 GiB 2.3 TiB 67.77 0.88  71% {( G/ V! ]( Y4 o) E
15   hdd 0.29999  1.00000 7.3 TiB 6.9 TiB 6.9 TiB  191 MiB  13 GiB 387 GiB 94.81 1.23  39
  o; [1 t2 X! @% s, ^' H  Y6 u6 l! S16   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB  548 MiB  11 GiB 1.8 TiB 75.91 0.98  69  G" l1 ?0 m4 a' d& |( L/ F
17   hdd 7.27689  1.00000 7.3 TiB 6.7 TiB 6.7 TiB  806 MiB  13 GiB 581 GiB 92.20 1.20  66  b( ]. t4 \  B8 P1 U( ~
18   hdd 7.27689  1.00000 7.3 TiB 4.5 TiB 4.5 TiB  1.4 GiB 8.5 GiB 2.7 TiB 62.43 0.81  66
19   hdd 7.27689  1.00000 7.3 TiB 5.3 TiB 5.3 TiB  1.4 GiB  10 GiB 1.9 TiB 73.28 0.95  65
20   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB  705 MiB  11 GiB 1.8 TiB 75.91 0.98  64
21   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  911 MiB  11 GiB 1.2 TiB 84.11 1.09  62
22   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  301 MiB  11 GiB 1.2 TiB 84.03 1.09  66
23   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB  401 MiB 9.8 GiB 1.7 TiB 75.96 0.98  67
24   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB  1.3 GiB 9.6 GiB 2.1 TiB 70.58 0.91  63
25   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB  1.1 GiB 9.7 GiB 2.1 TiB 70.56 0.91  65
26   hdd 7.27689  1.00000 7.3 TiB 5.3 TiB 5.3 TiB  730 MiB  10 GiB 1.9 TiB 73.32 0.95  68
27   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  818 MiB  12 GiB 1.2 TiB 84.08 1.09  62
28   hdd 7.27689  1.00000 7.3 TiB 4.9 TiB 4.9 TiB  587 MiB 9.3 GiB 2.3 TiB 67.84 0.88  68
29   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  215 MiB  11 GiB 1.2 TiB 84.09 1.09  66
30   hdd 7.27689  1.00000 7.3 TiB 6.1 TiB 6.1 TiB  690 MiB  12 GiB 1.2 TiB 84.15 1.09  64
31   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB 1020 MiB  10 GiB 1.8 TiB 75.94 0.98  64
32   hdd 7.27689  1.00000 7.3 TiB 6.5 TiB 6.5 TiB  616 MiB  12 GiB 786 GiB 89.45 1.16  66
33   hdd 7.27689  1.00000 7.3 TiB 4.9 TiB 4.9 TiB  622 MiB 8.9 GiB 2.3 TiB 67.84 0.88  66
34   hdd 7.27689  1.00000 7.3 TiB 5.7 TiB 5.7 TiB  102 MiB  11 GiB 1.6 TiB 78.56 1.02  65
35   hdd 7.27689  1.00000 7.3 TiB 5.9 TiB 5.9 TiB  723 MiB  11 GiB 1.4 TiB 81.31 1.05  63
                    TOTAL 262 TiB 202 TiB 202 TiB   25 GiB 381 GiB  60 TiB 77.15
This can be fixed by manually adjusting the CRUSH weight:

$ ceph osd crush reweight osd.4 0.3
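Before reweighting by hand, it helps to know which OSDs are actually above a given fill level. A minimal sketch, assuming the ceph osd df layout shown above, where the last three fields of each OSD row are %USE, VAR and PGS (releases that append a STATUS column would shift them); the function name osds_over_threshold is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "osd.<id> <%USE>" for every OSD row whose %USE
# is above the given threshold. Reads `ceph osd df` text on stdin and assumes
# %USE, VAR and PGS are the last three fields of each OSD row.
osds_over_threshold() {
  local threshold="$1"
  awk -v t="$threshold" '
    # OSD rows start with a numeric ID; the header, TOTAL and MIN/MAX lines do not
    $1 ~ /^[0-9]+$/ && ($(NF-2) + 0) > (t + 0) {
      printf "osd.%s %s\n", $1, $(NF-2)
    }'
}

# Usage on an admin node:
#   ceph osd df | osds_over_threshold 85
```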
PG balancing
The default PG allocation can be uneven across OSDs (see https://cloud.tencent.com/developer/article/1664655[6]). List the PG count and %USE of each OSD:

$ ceph osd df tree | awk '/osd\./{print $NF" "$(NF-1)" "$(NF-3) }'
osd.0 89 71.20
osd.1 38 94.80
osd.2 92 68.44
osd.3 92 72.36
osd.4 28 76.86
osd.5 64 81.37
osd.6 62 87.90
osd.7 89 78.78
osd.8 52 86.18
osd.9 89 75.44
osd.10 37 96.33
osd.11 102 75.26
osd.12 33 91.41
osd.13 34 95.98
osd.14 59 84.97
osd.15 20 70.92
osd.16 113 89.46
osd.17 30 77.12
osd.18 124 77.11
osd.19 44 95.23
osd.20 65 84.63
osd.21 98 96.71
osd.22 34 95.93
osd.23 62 84.56
osd.24 110 76.63
osd.25 64 82.32
osd.26 59 88.26
osd.27 38 95.83
osd.28 105 79.19
osd.29 36 94.94
osd.30 94 90.79
osd.31 91 81.74
osd.32 12 42.44
osd.33 94 81.32
osd.34 46 86.51
osd.35 37 92.68
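To put a number on the imbalance in the listing above, a sketch that reads those "osd.N pgs %use" lines and prints the mean, standard deviation, minimum and maximum PG count. It uses the population standard deviation, which is an assumption; Ceph's own "stddev" figure may be computed differently. The function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise "osd.N <pgs> <%use>" lines from stdin.
pg_spread() {
  awk '
    $1 ~ /^osd\./ {
      v = $2 + 0
      n++; sum += v; sumsq += v * v
      if (n == 1 || v < min) { min = v; minosd = $1 }
      if (n == 1 || v > max) { max = v; maxosd = $1 }
    }
    END {
      if (n == 0) exit 1              # no OSD lines on input
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0            # guard against rounding error
      printf "n=%d mean=%.2f stddev=%.2f min=%s(%d) max=%s(%d)\n",
             n, mean, sqrt(var), minosd, min, maxosd, max
    }'
}

# Usage:
#   ceph osd df tree | awk '/osd\./{print $NF" "$(NF-1)" "$(NF-3) }' | pg_spread
```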
reweight-by-pg adjusts OSD weights based on the placement group (PG) distribution:

$ ceph osd reweight-by-pg
moved 0 / 2336 (0%)
avg 64.8889
stddev 58.677 -> 58.677 (expected baseline 7.9427)
min osd.1 with 0 -> 0 pgs (0 -> 0 * mean)
max osd.18 with 168 -> 168 pgs (2.58904 -> 2.58904 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 18.2677
overload_utilization 21.9212
osd.19 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500
osd.27 weight 1.0000 -> 0.9500
osd.10 weight 1.0000 -> 0.9500
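Reweight commands change weights immediately. Ceph also provides dry-run variants, ceph osd test-reweight-by-pg and ceph osd test-reweight-by-utilization, which report the proposed changes without applying them. A minimal sketch wrapping them; the wrapper name is hypothetical, and the default arguments simply mirror the oload, max_change and max_change_osds values printed above:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: preview a reweight without applying it.
#   $1 mode: "pg" or "utilization"
#   $2 oload (default 120), $3 max_change (default 0.05), $4 max OSDs (default 4)
reweight_preview() {
  local mode="$1" oload="${2:-120}" max_change="${3:-0.05}" max_osds="${4:-4}"
  case "$mode" in
    pg|utilization) ;;
    *) echo "usage: reweight_preview pg|utilization [oload] [max_change] [max_osds]" >&2
       return 2 ;;
  esac
  # The test-* variants only report what would change
  ceph osd "test-reweight-by-${mode}" "$oload" "$max_change" "$max_osds"
}

# After reviewing the preview, apply with the same arguments, e.g.:
#   ceph osd reweight-by-pg 120 0.05 4
```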
reweight-by-utilization adjusts OSD weights based on utilization:

$ ceph osd reweight-by-utilization
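In the reweight output shown earlier in this post, overload_utilization appears to be average_utilization scaled by oload percent: 18.2677 * 120 / 100 = 21.9212, which matches the printed value; OSDs above that threshold are the ones considered for a weight reduction. A one-function check of that arithmetic (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: overload threshold = average utilization * oload / 100
#   $1 average utilization (%), $2 oload (default 120)
overload_threshold() {
  awk -v avg="$1" -v oload="${2:-120}" 'BEGIN { printf "%.4f\n", avg * oload / 100 }'
}

# overload_threshold 18.2677 120   -> 21.9212
```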
Adjust the override weight (the REWEIGHT column, 0.0 to 1.0) of a single OSD:

$ ceph osd reweight osd.35 0.001
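ceph osd reweight sets the override weight, which must lie between 0.0 and 1.0; it is separate from the CRUSH weight changed by ceph osd crush reweight. A small sketch of a guard around it, assuming the numeric OSD id is passed; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `ceph osd reweight` that refuses values outside
# the 0.0-1.0 range accepted for the override weight.
#   $1 numeric OSD id, $2 new override weight
set_override_weight() {
  local osd_id="$1" weight="$2"
  # Accept only a non-negative decimal number that is <= 1
  if ! awk -v w="$weight" 'BEGIN { exit !(w ~ /^[0-9]*\.?[0-9]+$/ && w + 0 <= 1) }'; then
    echo "override weight must be a number between 0 and 1: $weight" >&2
    return 1
  fi
  ceph osd reweight "osd.${osd_id}" "$weight"
}
```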
Check the current OSD information:

$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     DATA    OMAP    META    AVAIL   %USE  VAR  PGS
 0   hdd 7.27689  1.00000 7.3 TiB 5.2 TiB 5.2 TiB 1.0 GiB 9.4 GiB 2.0 TiB 71.96 0.86  39
 1   hdd 0.00999  0.90002 7.3 TiB 6.9 TiB 6.9 TiB 604 MiB  12 GiB 382 GiB 94.88 1.13  37
 2   hdd 7.27689  1.00000 7.3 TiB 5.1 TiB 5.1 TiB 1.2 GiB 8.8 GiB 2.2 TiB 69.55 0.83  34
 3   hdd 7.27689  1.00000 7.3 TiB 5.3 TiB 5.3 TiB 812 MiB 9.9 GiB 2.0 TiB 73.15 0.87  34
 4   hdd 0.29999  1.00000 7.3 TiB 5.6 TiB 5.6 TiB 185 MiB  12 GiB 1.7 TiB 77.01 0.92  26
 5   hdd 3.00000  1.00000 7.3 TiB 6.0 TiB 5.9 TiB 443 MiB  11 GiB 1.3 TiB 81.90 0.98  36
 6   hdd 3.00000  1.00000 7.3 TiB 6.5 TiB 6.5 TiB 499 MiB  11 GiB 809 GiB 89.14 1.06  38
 7   hdd 7.27689  1.00000 7.3 TiB 5.8 TiB 5.8 TiB 1.2 GiB  11 GiB 1.4 TiB 80.10 0.96  43
 8   hdd 3.00000  1.00000 7.3 TiB 6.3 TiB 6.3 TiB 502 MiB  11 GiB 992 GiB 86.69 1.03  36
 9   hdd 7.27689  1.00000 7.3 TiB 5.6 TiB 5.6 TiB 1.5 GiB 9.8 GiB 1.7 TiB 76.57 0.91  42
10   hdd 0.00999  0.00099 7.3 TiB 7.0 TiB 7.0 TiB 295 MiB  12 GiB 267 GiB 96.41 1.15  37
11   hdd 7.27689  1.00000 7.3 TiB 5.5 TiB 5.5 TiB 1.2 GiB 9.8 GiB 1.7 TiB 76.13 0.91  37
12   hdd 0.00999  1.00000 7.3 TiB 6.7 TiB 6.6 TiB  95 MiB  12 GiB 635 GiB 91.48 1.09  32
13   hdd 0.00999  1.00000 7.3 TiB 7.0 TiB 7.0 TiB 584 MiB  12 GiB 315 GiB 95.78 1.14  34
14   hdd 3.00000  1.00000 7.3 TiB 6.2 TiB 6.2 TiB 974 MiB  11 GiB 1.0 TiB 85.86 1.02  40
15   hdd 0.00999  1.00000 7.3 TiB 5.1 TiB 5.1 TiB 116 KiB  10 GiB 2.2 TiB 70.43 0.84  20
16   hdd 7.27689  1.00000 7.3 TiB 6.6 TiB 6.6 TiB 1.2 GiB  11 GiB 697 GiB 90.64 1.08  43
17   hdd 0.29999  1.00000 7.3 TiB 5.6 TiB 5.6 TiB  40 KiB  12 GiB 1.7 TiB 76.75 0.92  26
18   hdd 7.27689  1.00000 7.3 TiB 5.7 TiB 5.7 TiB 1.9 GiB 9.3 GiB 1.6 TiB 78.01 0.93  53
19   hdd 0.00999  0.00099 7.3 TiB 6.9 TiB 6.9 TiB 1.5 GiB  13 GiB 371 GiB 95.02 1.13  40
20   hdd 3.00000  1.00000 7.3 TiB 6.2 TiB 6.2 TiB 744 MiB  12 GiB 1.0 TiB 85.86 1.02  37
21   hdd 7.27689  0.00099 7.3 TiB 7.0 TiB 7.0 TiB 913 MiB  12 GiB 239 GiB 96.79 1.15  40
22   hdd 0.00999  0.00099 7.3 TiB 7.0 TiB 7.0 TiB 283 MiB  12 GiB 298 GiB 96.00 1.14  34
23   hdd 3.00000  1.00000 7.3 TiB 6.2 TiB 6.2 TiB 515 MiB  11 GiB 1.1 TiB 85.30 1.02  35
24   hdd 7.27689  1.00000 7.3 TiB 5.6 TiB 5.6 TiB 1.4 GiB 9.8 GiB 1.6 TiB 77.63 0.93  42
25   hdd 3.00000  1.00000 7.3 TiB 6.0 TiB 6.0 TiB 1.2 GiB  10 GiB 1.3 TiB 82.66 0.99  40
26   hdd 2.00000  1.00000 7.3 TiB 6.5 TiB 6.5 TiB 737 MiB  11 GiB 823 GiB 88.95 1.06  36
27   hdd 0.00999  0.00099 7.3 TiB 7.0 TiB 6.9 TiB 822 MiB  12 GiB 327 GiB 95.61 1.14  37
28   hdd 7.27689  1.00000 7.3 TiB 5.8 TiB 5.8 TiB 859 MiB  10 GiB 1.4 TiB 80.23 0.96  40
29   hdd 0.00999  0.00099 7.3 TiB 6.9 TiB 6.9 TiB 215 MiB  12 GiB 371 GiB 95.02 1.13  36
30   hdd 7.27689  1.00000 7.3 TiB 6.7 TiB 6.7 TiB 1.0 GiB  12 GiB 607 GiB 91.85 1.10  47
31   hdd 7.27689  1.00000 7.3 TiB 6.0 TiB 6.0 TiB 1.2 GiB  10 GiB 1.3 TiB 82.81 0.99  41
32   hdd 0.29999  1.00000 7.3 TiB 3.0 TiB 3.0 TiB  32 KiB 7.1 GiB 4.3 TiB 41.47 0.49  10
33   hdd 7.27689  1.00000 7.3 TiB 6.0 TiB 6.0 TiB 827 MiB 9.7 GiB 1.3 TiB 82.06 0.98  41
34   hdd 2.00000  1.00000 7.3 TiB 6.3 TiB 6.3 TiB 308 MiB  11 GiB 976 GiB 86.90 1.04  33
35   hdd 0.00999  0.00099 7.3 TiB 6.7 TiB 6.7 TiB 613 MiB  12 GiB 540 GiB 92.75 1.11  36
                    TOTAL 262 TiB 220 TiB 219 TiB  27 GiB 391 GiB  42 TiB 83.87
MIN/MAX VAR: 0.49/1.15  STDDEV: 10.62
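The listing above contains many OSDs with a manually lowered REWEIGHT (for example 0.00099). To audit such overrides, a sketch that prints every OSD whose REWEIGHT is below 1, assuming REWEIGHT is the fourth whitespace-separated field and the CRUSH weight the third, as in the layout above; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list OSDs whose override weight (REWEIGHT, field 4)
# is below 1.0, together with their CRUSH weight (field 3).
list_overridden_osds() {
  awk '$1 ~ /^[0-9]+$/ && ($4 + 0) < 1 {
    printf "osd.%s crush_weight=%s reweight=%s\n", $1, $3, $4
  }'
}

# Usage:
#   ceph osd df | list_overridden_osds
# An override can later be cleared with: ceph osd reweight osd.<id> 1
```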
Deleting CephFS

Stop all MDS services. This has to be done manually by logging in to each server:

$ systemctl stop ceph-mds@${HOSTNAME}

Delete the target file system (named data here):

$ ceph fs ls
$ ceph fs rm data --yes-i-really-mean-it
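When several hosts run MDS daemons, the manual steps above can be scripted. A minimal sketch, assuming root SSH access to each MDS host, a bash login shell there, and that each daemon's unit is named ceph-mds@${HOSTNAME} on its own host as in the command above; the function name and host names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop the MDS on every listed host, then remove the fs.
#   $1 file system name, remaining arguments: MDS hosts
remove_cephfs() {
  local fs_name="$1" host
  shift
  for host in "$@"; do
    # Single quotes: ${HOSTNAME} is expanded by the remote shell, not locally
    ssh "root@${host}" 'systemctl stop "ceph-mds@${HOSTNAME}"' || return
  done
  ceph fs rm "$fs_name" --yes-i-really-mean-it
}

# Example (hypothetical host names):
#   remove_cephfs data mds-node1 mds-node2
```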
Using SSDs

Check the current OSD device classes (related article: https://blog.csdn.net/kozazyh/article/details/79904219[7]):

$ ceph osd crush class ls
[
    "ssd"
]
If an SSD has been assigned the wrong device class, change it manually. The following command removes the class from osd.0 to osd.2:

$ for i in 0 1 2;do ceph osd crush rm-device-class osd.$i;done
Set the class of osd.0 to osd.2 to ssd:

$ for i in 0 1 2;do ceph osd crush set-device-class ssd osd.$i;done
Create a CRUSH rule:

$ ceph osd crush rule create-replicated rule-ssd default host ssd
$ ceph osd crush rule ls
Then specify the rule name when creating the pools. Note that ceph fs new takes the metadata pool before the data pool:

$ ceph osd pool create fs_data 96 rule-ssd
$ ceph osd pool create fs_metadata 16 rule-ssd
$ ceph fs new fs fs_metadata fs_data
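The rule can also be applied to pools that already exist, and the assignment checked afterwards with ceph osd pool set <pool> crush_rule and ceph osd pool get <pool> crush_rule. Moving a pool that already holds data to a new rule triggers data migration. A sketch; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: point existing pools at a CRUSH rule and print the result.
#   $1 rule name, remaining arguments: pool names
apply_crush_rule() {
  local rule="$1" pool
  shift
  for pool in "$@"; do
    ceph osd pool set "$pool" crush_rule "$rule" || return
    ceph osd pool get "$pool" crush_rule
  done
}

# Example:
#   apply_crush_rule rule-ssd fs_data fs_metadata
```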
Viewing the crushmap
Run the following commands (decompile to a separate text file so the binary map is not overwritten):

$ ceph osd getcrushmap -o crushmap
$ crushtool -d crushmap -o crushmap.txt
$ cat crushmap.txt
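To change the map rather than only read it, the decompiled text is edited, recompiled with crushtool -c, checked with crushtool --test, and injected with ceph osd setcrushmap. A sketch of the last three steps; the file names, the function name and the defaults (rule id 0, 3 replicas) are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compile an edited crushmap.txt, check it for bad
# mappings, and inject it only if the check is clean.
#   $1 rule id to test (default 0), $2 replica count (default 3)
crush_apply() {
  local rule="${1:-0}" reps="${2:-3}" bad
  crushtool -c crushmap.txt -o crushmap.new || return
  # --show-bad-mappings prints only placements that could not be satisfied
  bad="$(crushtool -i crushmap.new --test --show-bad-mappings \
           --rule "$rule" --num-rep "$reps")" || return
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad" >&2
    return 1
  fi
  ceph osd setcrushmap -i crushmap.new
}
```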
3 monitors have not enabled msgr2
Fix:

$ ceph mon enable-msgr2
2 daemons have recently crashed
Fix (see https://blog.csdn.net/QTM_Gitee/article/details/106004435[8]):

$ ceph crash ls
$ ceph crash archive-all

OP | Posted on 2022-7-29 09:56:46
# systemctl stop ceph-mds@p04.service
# cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
# cephfs-journal-tool --rank=cephfs:0 journal reset
# cephfs-table-tool all reset session
# systemctl start ceph-mds@p04.service
# ceph mds repaired 0
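The documentation quoted below recommends exporting the journal before such destructive steps. A sketch that runs the same sequence as above with an export first and stops at the first failure; it assumes the MDS has already been stopped as in the first command, and the function name, default backup file name and the CEPHFS_JOURNAL_TOOL / CEPHFS_TABLE_TOOL overrides are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: back up, salvage and reset the journal of one MDS rank.
#   $1 file system name, $2 rank (default 0), $3 backup file (default journal-backup.bin)
reset_mds_journal() {
  local fs="$1" rank="${2:-0}" backup="${3:-journal-backup.bin}"
  local jt="${CEPHFS_JOURNAL_TOOL:-cephfs-journal-tool}"
  local tt="${CEPHFS_TABLE_TOOL:-cephfs-table-tool}"
  local target="${fs}:${rank}"
  # 1. Keep a copy of the journal; this may fail if it is badly corrupted
  "$jt" --rank="$target" journal export "$backup" || {
    echo "journal export failed; make a RADOS-level copy before continuing" >&2
    return 1
  }
  # 2. Write recoverable dentries/inodes back to the metadata pool
  "$jt" --rank="$target" event recover_dentries summary || return
  # 3. Truncate the journal, then 4. wipe the session table
  "$jt" --rank="$target" journal reset || return
  "$tt" all reset session
}
```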

JOURNAL EXPORT

Before attempting dangerous operations, make a copy of the journal like so:

cephfs-journal-tool journal export backup.bin

Note that this command may not always work if the journal is badly corrupted, in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).

DENTRY RECOVERY FROM JOURNAL

If a journal is damaged or for any reason an MDS is incapable of replaying it, attempt to recover what file metadata we can like so:

cephfs-journal-tool event recover_dentries summary

This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks.

This command will write any inodes/dentries recoverable from the journal into the backing store, if these inodes/dentries are higher-versioned than the previous contents of the backing store. If any regions of the journal are missing/damaged, they will be skipped.

Note that in addition to writing out dentries and inodes, this command will update the InoTables of each ‘in’ MDS rank, to indicate that any written inodes’ numbers are now in use. In simple cases, this will result in an entirely valid backing store state.

Warning

The resulting state of the backing store is not guaranteed to be self-consistent, and an online MDS scrub will be required afterwards. The journal contents will not be modified by this command; you should truncate the journal separately after recovering what you can.

JOURNAL TRUNCATION

If the journal is corrupt or MDSs cannot replay it for any reason, you can truncate it like so:

cephfs-journal-tool [--rank=N] journal reset

Specify the MDS rank using the --rank option when the file system has/had multiple active MDS.

Warning

Resetting the journal will lose metadata unless you have extracted it by other means such as recover_dentries. It is likely to leave some orphaned objects in the data pool. It may result in re-allocation of already-written inodes, such that permissions rules could be violated.

MDS TABLE WIPES

After the journal has been reset, it may no longer be consistent with respect to the contents of the MDS tables (InoTable, SessionMap, SnapServer).

To reset the SessionMap (erase all sessions), use:

cephfs-table-tool all reset session

This command acts on the tables of all ‘in’ MDS ranks. Replace ‘all’ with an MDS rank to operate on that rank only.

The session table is the table most likely to need resetting, but if you know you also need to reset the other tables then replace ‘session’ with ‘snap’ or ‘inode’.

MDS MAP RESET

Once the in-RADOS state of the file system (i.e. contents of the metadata pool) is somewhat recovered, it may be necessary to update the MDS map to reflect the contents of the metadata pool. Use the following command to reset the MDS map to a single MDS:

ceph fs reset <fs name> --yes-i-really-mean-it

Once this is run, any in-RADOS state for MDS ranks other than 0 will be ignored: as a result it is possible for this to result in data loss.

One might wonder what the difference is between ‘fs reset’ and ‘fs remove; fs new’. The key distinction is that doing a remove/new will leave rank 0 in ‘creating’ state, such that it would overwrite any existing root inode on disk and orphan any existing files. In contrast, the ‘reset’ command will leave rank 0 in ‘active’ state such that the next MDS daemon to claim the rank will go ahead and use the existing in-RADOS metadata.

RECOVERY FROM MISSING METADATA OBJECTS

Depending on what objects are missing or corrupt, you may need to run various commands to regenerate default versions of the objects.

# Session table
cephfs-table-tool 0 reset session
# SnapServer
cephfs-table-tool 0 reset snap
# InoTable
cephfs-table-tool 0 reset inode
# Journal
cephfs-journal-tool --rank=0 journal reset
# Root inodes ("/" and MDS directory)
cephfs-data-scan init

Finally, you can regenerate metadata objects for missing files and directories based on the contents of a data pool. This is a three-phase process. First, scanning all objects to calculate size and mtime metadata for inodes. Second, scanning the first object from every file to collect this metadata and inject it into the metadata pool. Third, checking inode linkages and fixing found errors.

cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links

‘scan_extents’ and ‘scan_inodes’ commands may take a very long time if there are many files or very large files in the data pool.

To accelerate the process, run multiple instances of the tool.

Decide on a number of workers, and pass each worker a number within the range 0-(worker_m - 1).

The example below shows how to run 4 workers simultaneously:

# Worker 0
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
# Worker 1
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
# Worker 2
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
# Worker 3
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>

# Worker 0
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
# Worker 1
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
# Worker 2
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
# Worker 3
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>

It is important to ensure that all workers have completed the scan_extents phase before any workers enter the scan_inodes phase.
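This ordering is easy to break when launching workers by hand. A sketch that runs each phase as background jobs and waits on every worker before moving on, then runs scan_links once; the function name and the CEPHFS_DATA_SCAN override are hypothetical, and the flags are the --worker_n/--worker_m options shown above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: parallel scan_extents, then parallel scan_inodes,
# then scan_links, never starting a phase before the previous one finished.
#   $1 data pool, $2 number of workers (default 4)
parallel_data_scan() {
  local pool="$1" workers="${2:-4}" tool="${CEPHFS_DATA_SCAN:-cephfs-data-scan}"
  local phase n pids pid failed
  for phase in scan_extents scan_inodes; do
    pids="" failed=0 n=0
    while [ "$n" -lt "$workers" ]; do
      "$tool" "$phase" --worker_n "$n" --worker_m "$workers" "$pool" &
      pids="$pids $!"
      n=$((n + 1))
    done
    # Wait on each PID individually so a failed worker is noticed
    for pid in $pids; do
      wait "$pid" || failed=1
    done
    if [ "$failed" -ne 0 ]; then
      echo "$phase failed; not starting the next phase" >&2
      return 1
    fi
  done
  "$tool" scan_links
}
```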

After completing the metadata recovery, you may want to run a cleanup operation to delete ancillary data generated during recovery.

cephfs-data-scan cleanup <data pool>

USING AN ALTERNATE METADATA POOL FOR RECOVERY
Warning

There has not been extensive testing of this procedure. It should be undertaken with great care.



If an existing file system is damaged and inoperative, it is possible to create a fresh metadata pool and attempt to reconstruct the file system metadata into this new pool, leaving the old metadata in place. This could be used to make a safer attempt at recovery since the existing metadata pool would not be modified.

Caution

During this process, multiple metadata pools will contain data referring to the same data pool. Extreme caution must be exercised to avoid changing the data pool contents while this is the case. Once recovery is complete, the damaged metadata pool should be archived or deleted.


To begin, the existing file system should be taken down, if not done already, to prevent further modification of the data pool. Unmount all clients and then mark the file system failed:

ceph fs fail <fs_name>

Next, create a recovery file system in which we will populate a new metadata pool backed by the original data pool.

ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create cephfs_recovery_meta
ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --allow-dangerous-metadata-overlay

The recovery file system starts with an MDS rank that will initialize the new metadata pool with some metadata. This is necessary to bootstrap recovery. However, now we will take the MDS down as we do not want it interacting with the metadata pool further.

ceph fs fail cephfs_recovery

Next, we will reset the initial metadata the MDS created:

cephfs-table-tool cephfs_recovery:all reset session
cephfs-table-tool cephfs_recovery:all reset snap
cephfs-table-tool cephfs_recovery:all reset inode

Now perform the recovery of the metadata pool from the data pool:

cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
cephfs-data-scan scan_links --filesystem cephfs_recovery

Note

Each scan procedure above goes through the entire data pool. This may take a significant amount of time. See the previous section on how to distribute this task among workers.


If the damaged file system contains dirty journal data, it may be recovered next with:

cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force

After recovery, some recovered directories will have incorrect statistics. Ensure the parameters mds_verify_scatter and mds_debug_scatterstat are set to false (the default) to prevent the MDS from checking the statistics:

ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat

(Note, the config may also have been set globally or via a ceph.conf file.) Now, allow an MDS to join the recovery file system:

ceph fs set cephfs_recovery joinable true

Finally, run a forward scrub to repair the statistics. Ensure you have an MDS running and issue:

ceph fs status # get active MDS
ceph tell mds.<id> scrub start / recursive repair

Note

Symbolic links are recovered as empty regular files. Symbolic link recovery is scheduled to be supported in Pacific.


It is recommended to migrate any data from the recovery file system as soon as possible. Do not restore the old file system while the recovery file system is operational.

Note

If the data pool is also corrupt, some files may not be restored because backtrace information is lost. If any data objects are missing (due to issues like lost Placement Groups on the data pool), the recovered files will contain holes in place of the missing data.

OP | Posted on 2022-7-29 10:05:00
1. Creating a Ceph file system
1. First create two pools, cephfs-data and cephfs-metadata, to store file data and file metadata respectively. The PG counts can also be set smaller; size them according to the number of OSDs.

#ceph osd pool create cephfs-data 256 256
#ceph osd pool create cephfs-metadata 64 64
Verify that the pools were created:

[root@cephnode01 my-cluster]# ceph osd lspools
1 .rgw.root
2 default.rgw.control
3 default.rgw.meta
4 default.rgw.log
5 rbd
6 cephfs-data
7 cephfs-metadata
Ceph logs can be found under /var/log/ceph:

[root@cephnode01 my-cluster]# tail -f /var/log/ceph/ceph
ceph.audit.log                  ceph.log                        ceph-mgr.cephnode01.log         ceph-osd.0.log
ceph-client.rgw.cephnode01.log  ceph-mds.cephnode01.log         ceph-mon.cephnode01.log         ceph-volume.log
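Note: the metadata pool can usually start with relatively few PGs, and more can be added later as needed. Because the metadata pool stores the CephFS metadata, it is safest to give it a higher replica count. For lower latency, consider placing the metadata on SSDs.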
2. Create a CephFS named cephfs, specifying the two pools created above (metadata pool first, then data pool):

#ceph fs new cephfs cephfs-metadata cephfs-data
new fs with metadata pool 7 and data pool 6
3. Verify that at least one MDS has entered the active state.
The output also shows the two standby MDS daemons, cephnode01 and cephnode03.

#ceph fs status cephfs
cephfs - 0 clients

+------+--------+------------+---------------+-------+-------+
| Rank | State  |    MDS     |    Activity   |  dns  |  inos |
+------+--------+------------+---------------+-------+-------+
|  0   | active | cephnode02 | Reqs:    0 /s |   10  |   13  |
+------+--------+------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs-metadata | metadata | 1536k | 17.0G |
|   cephfs-data   |   data   |    0  | 17.0G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|  cephnode01 |
|  cephnode03 |
+-------------+
MDS version: ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
4. On a monitor node, create a user named client.cephfs for accessing CephFS:

#ceph auth get-or-create client.cephfs mon 'allow r' mds 'allow rw' osd 'allow rw pool=cephfs-data, allow rw pool=cephfs-metadata'
This generates a key, which the user needs in order to access the file system:
[client.cephfs]
    key = AQA5IV5eNCwMGRAAy4dIZ8+ISfBcwZegFTYD6Q==
List the auth entries to see which users have been granted capabilities:

[root@cephnode01 my-cluster]# ceph auth list
client.cephfs
    key: AQA5IV5eNCwMGRAAy4dIZ8+ISfBcwZegFTYD6Q==
    caps: [mds] allow rw
    caps: [mon] allow r
    caps: [osd] allow rw pool=cephfs-data, allow rw pool=cephfs-metadata
client.rgw.cephnode01
    key: AQBOAl5eGVL/HBAAYH93c4wPiBlD7YhuPY0u7Q==
    caps: [mon] allow rw
    caps: [osd] allow r
5. Verify that the key is in effect:

#ceph auth get client.cephfs
The output shows that this user has read/write access to CephFS:
exported keyring for client.cephfs
[client.cephfs]
    key = AQA5IV5eNCwMGRAAy4dIZ8+ISfBcwZegFTYD6Q==
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rw pool=cephfs-data, allow rw pool=cephfs-metadata"
6. Check the CephFS and MDS status:

#ceph -s   (the MDS daemons now appear in the cluster status)
  cluster:
    id:     75aade75-8a3a-47d5-ae44-ec3a84394033
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cephnode01,cephnode02,cephnode03 (age 2h)
    mgr: cephnode01(active, since 2h), standbys: cephnode02, cephnode03
    mds: cephfs:1 {0=cephnode02=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)
    rgw: 1 daemon active (cephnode01)

  data:
    pools:   7 pools, 96 pgs
    objects: 263 objects, 29 MiB
    usage:   3.1 GiB used, 54 GiB / 57 GiB avail
    pgs:     96 active+clean

#ceph mds stat
One MDS is active and two are standby:
cephfs:1 {0=cephnode02=up:active} 2 up:standby

#ceph fs ls
The file system uses two pools:
name: cephfs, metadata pool: cephfs-metadata, data pools: [cephfs-data ]
#ceph fs status
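For choosing PG counts such as those in step 1, a commonly cited rule of thumb from the Ceph documentation is a total of about (number of OSDs * 100) / replica count PGs, rounded up to a power of two and shared by all pools. A sketch of that arithmetic; the function name is hypothetical and the result is only a starting point (Nautilus can also manage PG counts with the pg_autoscaler module):

```shell
#!/usr/bin/env bash
# Hypothetical helper: rule-of-thumb total PG count for a cluster.
#   $1 number of OSDs, $2 replica count (default 3), $3 target PGs per OSD (default 100)
suggest_pg_num() {
  local osds="$1" size="${2:-3}" target="${3:-100}"
  local raw=$(( osds * target / size )) pg=1
  # Round up to the next power of two
  while [ "$pg" -lt "$raw" ]; do
    pg=$(( pg * 2 ))
  done
  echo "$pg"
}

# suggest_pg_num 3 3   -> 128 (3 OSDs, 3 replicas)
```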
1.1 Mounting CephFS with the kernel client
Here the file system is mounted from another machine, the prometheus host, although it can be mounted from any host. With the kernel client, the file system is mounted through, and interacts directly with, the operating system kernel.
1. Create the mount directory /cephfs:
#mkdir /cephfs

2. Mount the file system. Give the addresses of the cluster's monitor nodes, followed by the name and key of the user created for accessing the cluster:

#mount -t ceph 192.168.1.10:6789,192.168.1.11:6789,192.168.1.12:6789:/ /cephfs/ -o name=cephfs,secret=AQDHjeddHlktJhAAxDClZh9mvBxRea5EI2xD9w==
3. Mount automatically at boot via /etc/fstab:
#echo "mon1:6789,mon2:6789,mon3:6789:/ /cephfs ceph name=cephfs,secretfile=/etc/ceph/cephfs.key,_netdev,noatime 0 0" | sudo tee -a /etc/fstab

4. Verify that the mount succeeded:

#stat -f /cephfs
  File: "/cephfs"
    ID: 4f32eedbe607030e Namelen: 255     Type: ceph
Block size: 4194304    Fundamental block size: 4194304
Blocks: Total: 4357       Free: 4357       Available: 4357
Inodes: Total: 0          Free: -1
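Step 3 reads the key from /etc/ceph/cephfs.key, and using secretfile= instead of secret= also keeps the key out of the shell history. A sketch that writes that file with ceph auth get-key on a node holding admin credentials, creating it with owner-only permissions; the function name is hypothetical, and the file still has to be copied to the client:

```shell
#!/usr/bin/env bash
# Hypothetical helper: store the bare key of a CephFS client in a secret file
# usable with the kernel client's secretfile= option.
#   $1 client name (default client.cephfs), $2 output path (default /etc/ceph/cephfs.key)
write_secretfile() {
  local client="${1:-client.cephfs}" path="${2:-/etc/ceph/cephfs.key}"
  # umask 077 inside a subshell so a newly created file gets mode 600
  ( umask 077 && ceph auth get-key "$client" > "$path" ) || return
  echo "$path"
}

# Then, on the client:
#   mount -t ceph 192.168.1.10:6789:/ /cephfs -o name=cephfs,secretfile=/etc/ceph/cephfs.key
```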
1.2 Mounting CephFS with the FUSE client
1. Install ceph-common, which provides the rbd and ceph commands.
The packages are again installed from our internal yum repository:

yum -y install epel-release
yum install -y ceph-common
2. Install ceph-fuse, the Ceph client tool that mounts the file system through FUSE (in user space):
yum install -y ceph-fuse

3. Copy the cluster's ceph.conf to the client:

scp root@192.168.1.10:/etc/ceph/ceph.conf /etc/ceph/
chmod 644 /etc/ceph/ceph.conf
4. Mount CephFS with ceph-fuse.
When mounting from another host, you need the client.cephfs keyring created earlier;
copy it to this server and use it directly:

[root@prometheus ~]# more /etc/ceph/ceph.client.cephfs.keyring
exported keyring for client.cephfs
[client.cephfs]
    key = AQA5IV5eNCwMGRAAy4dIZ8+ISfBcwZegFTYD6Q==
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rw pool=cephfs-data, allow rw pool=cephfs-metadata"

#ceph-fuse --keyring /etc/ceph/ceph.client.cephfs.keyring --name client.cephfs -m 192.168.1.10:6789,192.168.1.11:6789,192.168.1.12:6789 /cephfs/
5. Verify that CephFS is mounted:

#df -h
ceph-fuse                          18G     0   18G    0% /cephfs

#stat -f /cephfs
  File: "/cephfs/"
    ID: 0        Namelen: 255     Type: fuseblk
Block size: 4194304    Fundamental block size: 4194304
Blocks: Total: 4357       Free: 4357       Available: 4357
Inodes: Total: 1          Free: 0
6. Mount automatically at boot via /etc/fstab:

#echo "none /cephfs fuse.ceph ceph.id=cephfs,ceph.conf=/etc/ceph/ceph.conf,_netdev,defaults 0 0"| sudo tee -a /etc/fstab

Alternatively:
#echo "id=cephfs,conf=/etc/ceph/ceph.conf /mnt/ceph3 fuse.ceph _netdev,defaults 0 0"| sudo tee -a /etc/fstab
7. Unmount:
#fusermount -u /cephfs

易陆发现技术论坛 ( 蜀ICP备2026014127号-1 )


Powered by Discuz! X5.0

© 2001-2026 Discuz! Team.
