|
|
# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
    pool pool-hdd-2 objects per pg (5503) is more than 12.9482 times cluster average (425)

Locating the problem
[root@lab8106 ~]# ceph -s
    cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
     health HEALTH_WARN
            pool rbd has many more objects per pg than average (too few pgs?)
     monmap e1: 1 mons at {lab8106=192.168.8.106:6789/0}
            election epoch 30, quorum 0 lab8106
     osdmap e157: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v1023: 417 pgs, 13 pools, 18519 MB data, 15920 objects
            18668 MB used, 538 GB / 556 GB avail
                 417 active+clean
The cluster reports this warning: pool rbd has many more objects per pg than average (too few pgs?). In the hammer release the same warning read: pool rbd has too few pgs.
Check the detailed health information for the cluster:
[root@lab8106 ~]# ceph health detail
HEALTH_WARN pool rbd has many more objects per pg than average (too few pgs?); mon.lab8106 low disk space
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
Look at the object count of each pool in the cluster:
[root@lab8106 ~]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    556G      538G       18668M          3.28
POOLS:
    NAME       ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd        6      16071M      2.82          536G       15296
    pool1      7        204M      0.04          536G          52
    pool2      8        184M      0.03          536G          47
    pool3      9        188M      0.03          536G          48
    pool4      10       192M      0.03          536G          49
    pool5      11       204M      0.04          536G          52
    pool6      12       148M      0.03          536G          38
    pool7      13       184M      0.03          536G          47
    pool8      14       200M      0.04          536G          51
    pool9      15       200M      0.04          536G          51
    pool10     16       248M      0.04          536G          63
    pool11     17       232M      0.04          536G          59
    pool12     18       264M      0.05          536G          67
Check the number of PGs in each pool:
[root@lab8106 ~]# ceph osd dump | grep pool
pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0
pool 7 'pool1' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 134 flags hashpspool stripe_width 0
pool 8 'pool2' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 136 flags hashpspool stripe_width 0
pool 9 'pool3' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 138 flags hashpspool stripe_width 0
pool 10 'pool4' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 140 flags hashpspool stripe_width 0
pool 11 'pool5' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 142 flags hashpspool stripe_width 0
pool 12 'pool6' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 144 flags hashpspool stripe_width 0
pool 13 'pool7' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 146 flags hashpspool stripe_width 0
pool 14 'pool8' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 148 flags hashpspool stripe_width 0
pool 15 'pool9' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 150 flags hashpspool stripe_width 0
pool 16 'pool10' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 152 flags hashpspool stripe_width 0
pool 17 'pool11' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 154 flags hashpspool stripe_width 0
pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0
Let's see how the numbers in this message are derived:
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
rbd objects_per_pg = 15296 / 8 = 1912
objects_per_pg = 15920 / 417 ≈ 38
50.3158 = rbd objects_per_pg / objects_per_pg = 1912 / 38
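The arithmetic above can be reproduced with a small sketch. The helper names `objects_per_pg` and `object_skew` are hypothetical and exist only for illustration; the integer truncation is assumed to mirror the `int` divisions in the PGMonitor.cc excerpt quoted in this post.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: objects per PG, truncated to an integer the same
// way the quoted monitor code truncates it (integer division).
inline int64_t objects_per_pg(int64_t num_objects, int64_t pg_num) {
    assert(pg_num > 0);
    return num_objects / pg_num;
}

// Hypothetical helper: ratio of one pool's objects-per-PG to the
// cluster-wide average objects-per-PG.
inline float object_skew(int64_t pool_objects, int64_t pool_pg_num,
                         int64_t cluster_objects, int64_t cluster_pg_num) {
    const int64_t average = objects_per_pg(cluster_objects, cluster_pg_num);
    assert(average > 0);
    return static_cast<float>(objects_per_pg(pool_objects, pool_pg_num)) /
           static_cast<float>(average);
}
```

With the figures from this cluster, `objects_per_pg(15296, 8)` gives 1912, `objects_per_pg(15920, 417)` gives 38, and `object_skew(15296, 8, 15920, 417)` is roughly 50.3158.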
In other words, the other pools hold very few objects while this pool has few PGs and many objects, and that combination triggers the message. Here is the check in the code:
https://github.com/ceph/ceph/blob/master/src/mon/PGMonitor.cc
int average_objects_per_pg = pg_map.pg_sum.stats.sum.num_objects / pg_map.pg_stat.size();
if (average_objects_per_pg > 0 &&
    pg_map.pg_sum.stats.sum.num_objects >= g_conf->mon_pg_warn_min_objects &&
    p->second.stats.sum.num_objects >= g_conf->mon_pg_warn_min_pool_objects) {
  int objects_per_pg = p->second.stats.sum.num_objects / pi->get_pg_num();
  float ratio = (float)objects_per_pg / (float)average_objects_per_pg;
  if (g_conf->mon_pg_warn_max_object_skew > 0 &&
      ratio > g_conf->mon_pg_warn_max_object_skew) {
    ostringstream ss;
    ss << "pool " << name << " has many more objects per pg than average (too few pgs?)";
    summary.push_back(make_pair(HEALTH_WARN, ss.str()));
    if (detail) {
      ostringstream ss;
      ss << "pool " << name << " objects per pg ("
         << objects_per_pg << ") is more than " << ratio << " times cluster average ("
         << average_objects_per_pg << ")";
      detail->push_back(make_pair(HEALTH_WARN, ss.str()));
    }
  }
}
The check depends mainly on these limits:
mon_pg_warn_min_objects = 10000 // the cluster must hold at least 10000 objects in total
mon_pg_warn_min_pool_objects = 1000 // the pool must hold at least 1000 objects
mon_pg_warn_max_object_skew = 10 // maximum allowed ratio of the pool's objects per PG to the average objects per PG across all PGs
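To see how the three limits combine, here is a minimal standalone sketch of the same decision. `PgWarnConfig` and `pool_triggers_warning` are hypothetical names introduced for illustration, not part of Ceph; the defaults are the values listed above, and the logic is assumed to follow the quoted excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three limits, using the defaults above.
struct PgWarnConfig {
    int64_t min_objects = 10000;      // mon_pg_warn_min_objects
    int64_t min_pool_objects = 1000;  // mon_pg_warn_min_pool_objects
    float max_object_skew = 10.0f;    // mon_pg_warn_max_object_skew
};

// Returns true when a pool would be flagged under the quoted check.
bool pool_triggers_warning(int64_t pool_objects, int64_t pool_pg_num,
                           int64_t cluster_objects, int64_t cluster_pg_num,
                           const PgWarnConfig& cfg = PgWarnConfig()) {
    assert(pool_pg_num > 0 && cluster_pg_num > 0);
    const int64_t average = cluster_objects / cluster_pg_num;
    if (average <= 0) return false;                           // nothing to compare against
    if (cluster_objects < cfg.min_objects) return false;      // cluster too small
    if (pool_objects < cfg.min_pool_objects) return false;    // pool too small
    if (cfg.max_object_skew <= 0) return false;               // skew check disabled
    const int64_t per_pg = pool_objects / pool_pg_num;
    const float ratio = static_cast<float>(per_pg) / static_cast<float>(average);
    return ratio > cfg.max_object_skew;
}
```

For this cluster, the rbd pool (15296 objects, 8 PGs) is flagged, while pool12 (67 objects, 200 PGs) is not, because it falls below the per-pool object minimum.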
Solving the problem
There are three ways to clear this warning:
Delete unused pools
If the cluster has unused pools with relatively high PG counts, deleting some of them raises the cluster-wide average objects per PG. The flagged pool's ratio then drops below mon_pg_warn_max_object_skew and the warning goes away.
Increase the PG count of the flagged pool
The pool may simply have had too few PGs from the start. Increasing its pg_num and pgp_num lowers its objects per PG, which lowers the same ratio.
Raise the mon_pg_warn_max_object_skew value
If the cluster already has plenty of PGs and adding more would make it unstable, raise this parameter to suppress the warning. The default is 10.
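For the second option, a rough way to estimate how far pg_num has to grow is sketched below. The function `min_pg_num_below_skew` is hypothetical, and two assumptions are baked in: candidates are restricted to powers of two (a common convention, not a requirement stated in this post), and the cluster average is recomputed for each candidate because adding PGs to one pool also raises the total PG count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimator: smallest power-of-two pg_num (>= the current one)
// at which the pool's skew ratio no longer exceeds max_skew.
// Returns -1 if no candidate up to pg_limit is sufficient.
int64_t min_pg_num_below_skew(int64_t pool_objects, int64_t pool_pg_num,
                              int64_t cluster_objects, int64_t cluster_pg_num,
                              float max_skew, int64_t pg_limit = 65536) {
    assert(pool_pg_num > 0 && cluster_pg_num >= pool_pg_num && max_skew > 0);
    // Start from the smallest power of two that is >= the current pg_num.
    int64_t candidate = 1;
    while (candidate < pool_pg_num) candidate *= 2;
    for (; candidate <= pg_limit; candidate *= 2) {
        // Growing this pool changes the cluster-wide PG total as well.
        const int64_t total_pgs = cluster_pg_num - pool_pg_num + candidate;
        const int64_t average = cluster_objects / total_pgs;
        if (average <= 0) return candidate;  // the check is skipped when the average is 0
        const int64_t per_pg = pool_objects / candidate;
        const float ratio = static_cast<float>(per_pg) / static_cast<float>(average);
        if (ratio <= max_skew) return candidate;
    }
    return -1;
}
```

For the rbd pool in this post (15296 objects, 8 PGs, 15920 objects and 417 PGs cluster-wide, skew limit 10), the sketch returns 64. It only considers the warning threshold; real PG sizing should also account for the number of OSDs and the data movement that a pg_num change causes.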
Summary
This warning compares the objects per PG in a pool with the average objects per PG across the whole cluster. If the deviation is too large, the warning is raised.
Steps to check:
ceph health detail
ceph df
ceph osd dump | grep pool
mon_pg_warn_max_object_skew = 10.0
The warning appears when ((objects/pg_num) in the affected pool)/(objects/pg_num in the entire system) > 10.0
|
|