ceph health detail HEALTH_WARN 1 pools have many more objects per pg than average

Posted on 2021-9-8 17:27:55
# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
    pool pool-hdd-2 objects per pg (5503) is more than 12.9482 times cluster average (425)

Locating the problem
[root@lab8106 ~]# ceph -s
    cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
     health HEALTH_WARN
            pool rbd has many more objects per pg than average (too few pgs?)
     monmap e1: 1 mons at {lab8106=192.168.8.106:6789/0}
            election epoch 30, quorum 0 lab8106
     osdmap e157: 2 osds: 2 up, 2 in
            flags sortbitwise
     pgmap v1023: 417 pgs, 13 pools, 18519 MB data, 15920 objects
            18668 MB used, 538 GB / 556 GB avail
                 417 active+clean
The cluster raised this warning: pool rbd has many more objects per pg than average (too few pgs?). In the hammer release the same warning reads: pool rbd has too few pgs
Check the detailed health information:
[root@lab8106 ~]# ceph health detail
HEALTH_WARN pool rbd has many more objects per pg than average (too few pgs?); mon.lab8106 low disk space
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
Look at the object count of each pool in the cluster:
[root@lab8106 ~]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    556G      538G       18668M          3.28
POOLS:
    NAME       ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd        6      16071M      2.82          536G       15296
    pool1      7        204M      0.04          536G          52
    pool2      8        184M      0.03          536G          47
    pool3      9        188M      0.03          536G          48
    pool4      10       192M      0.03          536G          49
    pool5      11       204M      0.04          536G          52
    pool6      12       148M      0.03          536G          38
    pool7      13       184M      0.03          536G          47
    pool8      14       200M      0.04          536G          51
    pool9      15       200M      0.04          536G          51
    pool10     16       248M      0.04          536G          63
    pool11     17       232M      0.04          536G          59
    pool12     18       264M      0.05          536G          67
Check the number of PGs in each pool:
[root@lab8106 ~]# ceph osd dump|grep pool
pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0
pool 7 'pool1' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 134 flags hashpspool stripe_width 0
pool 8 'pool2' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 136 flags hashpspool stripe_width 0
pool 9 'pool3' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 138 flags hashpspool stripe_width 0
pool 10 'pool4' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 140 flags hashpspool stripe_width 0
pool 11 'pool5' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 142 flags hashpspool stripe_width 0
pool 12 'pool6' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 144 flags hashpspool stripe_width 0
pool 13 'pool7' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 146 flags hashpspool stripe_width 0
pool 14 'pool8' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 148 flags hashpspool stripe_width 0
pool 15 'pool9' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 150 flags hashpspool stripe_width 0
pool 16 'pool10' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 152 flags hashpspool stripe_width 0
pool 17 'pool11' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 154 flags hashpspool stripe_width 0
pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0
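The pg_num values can also be pulled out of this output with a script instead of by eye. A minimal Python sketch, assuming the `pool ...` line layout shown above (the layout varies between Ceph releases, so the regular expression is an assumption; `parse_pool_line` is a hypothetical helper, not part of any Ceph tooling):

```python
import re

# Matches lines such as:
#   pool 6 'rbd' replicated size 1 ... pg_num 8 pgp_num 8 ...
POOL_LINE = re.compile(
    r"^pool (?P<id>\d+) '(?P<name>[^']+)' .*? pg_num (?P<pg_num>\d+) pgp_num (?P<pgp_num>\d+)"
)

def parse_pool_line(line):
    """Return (name, pg_num, pgp_num) for one pool line, or None if it does not match."""
    m = POOL_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("name"), int(m.group("pg_num")), int(m.group("pgp_num"))

# Two lines copied from the output above.
sample = [
    "pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0",
    "pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0",
]
pg_nums = {name: pg for name, pg, _ in map(parse_pool_line, sample)}
print(pg_nums)
```

Lines that do not look like a pool entry return None, so the helper can be fed the unfiltered `ceph osd dump` output as long as the None results are skipped first.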
Let's see how this line is derived:
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
rbd objects_per_pg = 15296 / 8 = 1912
objects_per_pg = 15920 / 417 ≈ 38
50.3158 = rbd objects_per_pg / objects_per_pg = 1912 / 38
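The same arithmetic as a small Python sketch. It assumes truncating integer division for both per-PG figures, which is what the `int` declarations in the PGMonitor.cc excerpt quoted later in this post imply; the variable names are illustrative only.

```python
# Figures taken from the `ceph df`, `ceph osd dump` and `ceph -s` output above.
rbd_objects, rbd_pg_num = 15296, 8
cluster_objects, cluster_pgs = 15920, 417

# Integer division mirrors the truncated values shown in the warning.
rbd_objects_per_pg = rbd_objects // rbd_pg_num           # 1912
average_objects_per_pg = cluster_objects // cluster_pgs  # 38 (38.18 truncated)

ratio = rbd_objects_per_pg / average_objects_per_pg
print(f"{ratio:.4f}")  # 50.3158

# Without truncating the average, the ratio would come out slightly lower.
untruncated = rbd_objects_per_pg / (cluster_objects / cluster_pgs)
print(f"{untruncated:.2f}")  # 50.08
```

The truncation explains why the reported ratio is 50.3158 rather than roughly 50.08: the cluster average is rounded down to 38 before the division.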
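In other words, the warning appears when the other pools hold few objects while this pool has few PGs and many objects. Here is the check in the code: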
https://github.com/ceph/ceph/blob/master/src/mon/PGMonitor.cc
int average_objects_per_pg = pg_map.pg_sum.stats.sum.num_objects / pg_map.pg_stat.size();
if (average_objects_per_pg > 0 &&
    pg_map.pg_sum.stats.sum.num_objects >= g_conf->mon_pg_warn_min_objects &&
    p->second.stats.sum.num_objects >= g_conf->mon_pg_warn_min_pool_objects) {
  int objects_per_pg = p->second.stats.sum.num_objects / pi->get_pg_num();
  float ratio = (float)objects_per_pg / (float)average_objects_per_pg;
  if (g_conf->mon_pg_warn_max_object_skew > 0 &&
      ratio > g_conf->mon_pg_warn_max_object_skew) {
    ostringstream ss;
    ss << "pool " << name << " has many more objects per pg than average (too few pgs?)";
    summary.push_back(make_pair(HEALTH_WARN, ss.str()));
    if (detail) {
      ostringstream ss;
      ss << "pool " << name << " objects per pg ("
         << objects_per_pg << ") is more than " << ratio << " times cluster average ("
         << average_objects_per_pg << ")";
      detail->push_back(make_pair(HEALTH_WARN, ss.str()));
    }
  }
}
The check depends mainly on these limits:
mon_pg_warn_min_objects = 10000 // the cluster must hold at least 10000 objects in total
mon_pg_warn_min_pool_objects = 1000 // the pool must hold at least 1000 objects
mon_pg_warn_max_object_skew = 10 // maximum allowed ratio of the pool's objects per PG to the cluster-wide average objects per PG
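Putting the three limits together with the excerpt above, the check can be sketched in Python. This is a re-implementation for illustration, not Ceph code: `Pool` and `skewed_pools` are hypothetical names, and the pool data is copied from the `ceph df` and `ceph osd dump` output earlier in this post.

```python
from dataclasses import dataclass

# Default limits as listed above.
MON_PG_WARN_MIN_OBJECTS = 10000
MON_PG_WARN_MIN_POOL_OBJECTS = 1000
MON_PG_WARN_MAX_OBJECT_SKEW = 10.0

@dataclass
class Pool:
    name: str
    objects: int
    pg_num: int

def skewed_pools(pools, max_skew=MON_PG_WARN_MAX_OBJECT_SKEW):
    """Return (name, objects_per_pg, ratio, average) for each pool that would warn."""
    total_objects = sum(p.objects for p in pools)
    total_pgs = sum(p.pg_num for p in pools)
    if total_pgs == 0:
        return []
    average = total_objects // total_pgs  # truncating, like the int in the excerpt
    if average <= 0 or total_objects < MON_PG_WARN_MIN_OBJECTS:
        return []
    warnings = []
    for p in pools:
        if p.objects < MON_PG_WARN_MIN_POOL_OBJECTS:
            continue  # small pools are never reported
        per_pg = p.objects // p.pg_num
        ratio = per_pg / average
        if max_skew > 0 and ratio > max_skew:
            warnings.append((p.name, per_pg, ratio, average))
    return warnings

# Object counts from `ceph df`, pg_num from `ceph osd dump`.
pools = [Pool("rbd", 15296, 8)]
pools += [Pool(f"pool{i}", n, 1)
          for i, n in enumerate([52, 47, 48, 49, 52, 38, 47, 51, 51], start=1)]
pools += [Pool("pool10", 63, 100), Pool("pool11", 59, 100), Pool("pool12", 67, 200)]

for name, per_pg, ratio, average in skewed_pools(pools):
    print(f"pool {name} objects per pg ({per_pg}) is more than "
          f"{ratio:.4f} times cluster average ({average})")
```

With this data only rbd passes the 1000-object minimum, so it is the only pool that can be reported, and its ratio of about 50.3 is well above the default skew of 10.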
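Solving the problem
There are three ways to clear this warning:
Delete unused pools
If the cluster has pools that are not in use but hold a relatively large number of PGs, deleting some of them reduces the total PG count. That raises the cluster average objects per PG and lowers the computed ratio; once the ratio is no longer above mon_pg_warn_max_object_skew, the warning disappears.
Increase the PG count of the reported pool
The pool may simply have had too few PGs from the start. Increasing its pg_num and pgp_num lowers its objects per PG, which lowers the same ratio.
Raise the mon_pg_warn_max_object_skew parameter
If the cluster already has plenty of PGs and adding more would make it unstable, the warning can be silenced by raising this parameter. The default is 10.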
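The effect of the first two options can be estimated before touching the cluster. The sketch below is a what-if calculation in Python on the numbers from this post; `skew` is a hypothetical helper, and stepping pg_num in powers of two is a common convention assumed here rather than something this post prescribes.

```python
def skew(pools, name):
    """pools maps pool name -> (objects, pg_num); returns the ratio for pool `name`."""
    total_objects = sum(o for o, _ in pools.values())
    total_pgs = sum(pg for _, pg in pools.values())
    average = total_objects // total_pgs
    objects, pg_num = pools[name]
    return (objects // pg_num) / average

pools = {"rbd": (15296, 8), "pool10": (63, 100), "pool11": (59, 100), "pool12": (67, 200)}
small = [52, 47, 48, 49, 52, 38, 47, 51, 51]
pools.update({f"pool{i}": (n, 1) for i, n in enumerate(small, start=1)})

baseline = skew(pools, "rbd")

# Option 1: drop the three nearly empty pools that hold 400 of the 417 PGs.
without_big = {k: v for k, v in pools.items()
               if k not in ("pool10", "pool11", "pool12")}
after_delete = skew(without_big, "rbd")

# Option 2: smallest power-of-two pg_num for rbd that brings the ratio to 10 or less.
# Raising pg_num also raises the cluster PG total, so the average is recomputed each time.
pg_num = 8
while skew({**pools, "rbd": (15296, pg_num)}, "rbd") > 10:
    pg_num *= 2

print(round(baseline, 2), round(after_delete, 2), pg_num)  # 50.32 2.07 64
```

On this cluster either deleting pool10 through pool12 or raising rbd to 64 PGs would bring the ratio under the default limit. The PG count itself is changed with `ceph osd pool set rbd pg_num 64` followed by `ceph osd pool set rbd pgp_num 64`.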
Summary
This warning compares the objects per PG of a pool with the average objects per PG across the whole cluster, and it is raised when the deviation is too large.
Steps to check:
ceph health detail
ceph df
ceph osd dump | grep pool
mon_pg_warn_max_object_skew = 10.0
The warning appears when ((objects/pg_num) in the affected pool)/(objects/pg_num in the entire system) > 10.0