
ceph health detail HEALTH_WARN 1 pools have many more objects per pg than average

Posted on 2021-9-8 17:27:55
# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
    pool pool-hdd-2 objects per pg (5503) is more than 12.9482 times cluster average (425)

Locating the problem
[root@lab8106 ~]# ceph -s
    cluster fa7ec1a1-662a-4ba3-b478-7cb570482b62
     health HEALTH_WARN
            pool rbd has many more objects per pg than average (too few pgs?)
     monmap e1: 1 mons at {lab8106=192.168.8.106:6789/0}
            election epoch 30, quorum 0 lab8106
     osdmap e157: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v1023: 417 pgs, 13 pools, 18519 MB data, 15920 objects
            18668 MB used, 538 GB / 556 GB avail
                 417 active+clean
The cluster reports the warning "pool rbd has many more objects per pg than average (too few pgs?)". In the Hammer release the same warning read "pool rbd has too few pgs".
Check the cluster's detailed health information:
[root@lab8106 ~]# ceph health detail
HEALTH_WARN pool rbd has many more objects per pg than average (too few pgs?); mon.lab8106 low disk space
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
Check the object count of each pool:
[root@lab8106 ~]# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    556G      538G       18668M          3.28
POOLS:
    NAME       ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd        6      16071M      2.82          536G       15296
    pool1      7        204M      0.04          536G          52
    pool2      8        184M      0.03          536G          47
    pool3      9        188M      0.03          536G          48
    pool4      10       192M      0.03          536G          49
    pool5      11       204M      0.04          536G          52
    pool6      12       148M      0.03          536G          38
    pool7      13       184M      0.03          536G          47
    pool8      14       200M      0.04          536G          51
    pool9      15       200M      0.04          536G          51
    pool10     16       248M      0.04          536G          63
    pool11     17       232M      0.04          536G          59
    pool12     18       264M      0.05          536G          67
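As a cross-check, the OBJECTS column can be summed to confirm it matches the 15920 objects reported in the pgmap line of ceph -s. The following is only a sketch: the rows are pasted from the output above into a hypothetical variable named pools_section, whereas on a live cluster the text would come from ceph df itself.

```shell
# Rows copied from the POOLS section of the `ceph df` output above.
# pools_section is an illustrative variable name, not part of any Ceph tool.
pools_section='
    rbd        6      16071M      2.82          536G       15296
    pool1      7        204M      0.04          536G          52
    pool2      8        184M      0.03          536G          47
    pool3      9        188M      0.03          536G          48
    pool4      10       192M      0.03          536G          49
    pool5      11       204M      0.04          536G          52
    pool6      12       148M      0.03          536G          38
    pool7      13       184M      0.03          536G          47
    pool8      14       200M      0.04          536G          51
    pool9      15       200M      0.04          536G          51
    pool10     16       248M      0.04          536G          63
    pool11     17       232M      0.04          536G          59
    pool12     18       264M      0.05          536G          67
'

# Sum the last column (OBJECTS); blank lines are skipped by the NF test.
total_objects=$(printf '%s\n' "$pools_section" | awk 'NF >= 6 { sum += $NF } END { print sum }')
echo "total objects: $total_objects"
```

The rbd pool alone accounts for 15296 of these objects, which is why its objects-per-PG figure dominates the comparison.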
Check the PG count of each pool:
[root@lab8106 ~]# ceph osd dump|grep pool
pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0
pool 7 'pool1' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 134 flags hashpspool stripe_width 0
pool 8 'pool2' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 136 flags hashpspool stripe_width 0
pool 9 'pool3' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 138 flags hashpspool stripe_width 0
pool 10 'pool4' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 140 flags hashpspool stripe_width 0
pool 11 'pool5' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 142 flags hashpspool stripe_width 0
pool 12 'pool6' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 144 flags hashpspool stripe_width 0
pool 13 'pool7' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 146 flags hashpspool stripe_width 0
pool 14 'pool8' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 148 flags hashpspool stripe_width 0
pool 15 'pool9' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1 pgp_num 1 last_change 150 flags hashpspool stripe_width 0
pool 16 'pool10' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 152 flags hashpspool stripe_width 0
pool 17 'pool11' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 100 pgp_num 100 last_change 154 flags hashpspool stripe_width 0
pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0
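The dump lines are long, so it can help to reduce them to just the pool name and pg_num. A minimal sketch, assuming the line layout shown above (the helper name pool_pg_nums is made up for illustration, and pool names containing spaces are not handled):

```shell
# Print "<pool name> <pg_num>" for each pool line of `ceph osd dump` read from stdin.
pool_pg_nums() {
    # Strip the single quotes around the pool name, then look for the
    # "pg_num" keyword and print the field that follows it.
    tr -d "'" | awk '$1 == "pool" {
        for (i = 1; i < NF; i++)
            if ($i == "pg_num") print $3, $(i + 1)
    }'
}

# Two lines copied from the dump above, used here instead of a live cluster.
sample="pool 6 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 132 flags hashpspool stripe_width 0
pool 18 'pool12' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 200 pgp_num 200 last_change 156 flags hashpspool stripe_width 0"

printf '%s\n' "$sample" | pool_pg_nums
```

On a live cluster the same helper would be fed with ceph osd dump | pool_pg_nums. Adding up pg_num for all 13 pools in the dump gives 8 + 9 + 100 + 100 + 200 = 417, the PG total shown by ceph -s.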
Here is how the numbers in the warning are derived:
pool rbd objects per pg (1912) is more than 50.3158 times cluster average (38)
rbd objects_per_pg = 15296 / 8 = 1912
average objects_per_pg = 15920 / 417 ≈ 38 (integer division)
50.3158 = rbd objects_per_pg / average objects_per_pg = 1912 / 38
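The same arithmetic can be reproduced in a few lines of shell. This is only a worked example with the numbers from this cluster; the variable names are illustrative. Shell integer division truncates, which is what yields 38 rather than 38.18 for the cluster average.

```shell
# Figures taken from `ceph df`, `ceph osd dump` and `ceph -s` in this post.
pool_objects=15296; pool_pg_num=8
total_objects=15920; total_pgs=417

# Both per-PG figures use truncating integer division.
pool_per_pg=$(( pool_objects / pool_pg_num ))
avg_per_pg=$(( total_objects / total_pgs ))

# The ratio itself is a floating-point division, so awk is used for it.
ratio=$(awk -v a="$pool_per_pg" -v b="$avg_per_pg" 'BEGIN { printf "%.4f", a / b }')
echo "$pool_per_pg / $avg_per_pg = $ratio"
```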
In other words, the warning appears when the other pools hold few objects while this pool has few PGs but many objects. The check in the source code looks like this:
https://github.com/ceph/ceph/blob/master/src/mon/PGMonitor.cc
// Excerpt only: p, pi and name are defined by the surrounding code (not shown).
int average_objects_per_pg = pg_map.pg_sum.stats.sum.num_objects / pg_map.pg_stat.size();
if (average_objects_per_pg > 0 &&
    pg_map.pg_sum.stats.sum.num_objects >= g_conf->mon_pg_warn_min_objects &&
    p->second.stats.sum.num_objects >= g_conf->mon_pg_warn_min_pool_objects) {
  // Integer division for the per-PG count, float division for the ratio.
  int objects_per_pg = p->second.stats.sum.num_objects / pi->get_pg_num();
  float ratio = (float)objects_per_pg / (float)average_objects_per_pg;
  if (g_conf->mon_pg_warn_max_object_skew > 0 &&
      ratio > g_conf->mon_pg_warn_max_object_skew) {
    ostringstream ss;
    ss << "pool " << name << " has many more objects per pg than average (too few pgs?)";
    summary.push_back(make_pair(HEALTH_WARN, ss.str()));
    if (detail) {
      ostringstream ss;
      ss << "pool " << name << " objects per pg ("
         << objects_per_pg << ") is more than " << ratio << " times cluster average ("
         << average_objects_per_pg << ")";
      detail->push_back(make_pair(HEALTH_WARN, ss.str()));
    }
  }
}
The check depends mainly on these settings:
mon_pg_warn_min_objects = 10000 // the cluster holds at least 10000 objects in total
mon_pg_warn_min_pool_objects = 1000 // the pool holds at least 1000 objects
mon_pg_warn_max_object_skew = 10 // maximum allowed ratio of the pool's objects per PG to the cluster-wide average objects per PG
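To see how the three thresholds combine, the condition can be restated as a small shell function. This is a sketch under stated assumptions rather than Ceph's own code: would_warn and the MIN_OBJECTS, MIN_POOL_OBJECTS and MAX_SKEW variables are hypothetical names that simply stand in for the three settings, with the default values quoted above.

```shell
# Defaults quoted in the post; override through the environment to experiment.
MIN_OBJECTS=${MIN_OBJECTS:-10000}
MIN_POOL_OBJECTS=${MIN_POOL_OBJECTS:-1000}
MAX_SKEW=${MAX_SKEW:-10}

# would_warn TOTAL_OBJECTS TOTAL_PGS POOL_OBJECTS POOL_PG_NUM
# Exit status 0 means the warning would be raised for this pool.
would_warn() {
    avg=$(( $1 / $2 ))                              # cluster average, truncated
    [ "$avg" -gt 0 ] || return 1
    [ "$1" -ge "$MIN_OBJECTS" ] || return 1         # cluster too small to judge
    [ "$3" -ge "$MIN_POOL_OBJECTS" ] || return 1    # pool too small to judge
    per_pg=$(( $3 / $4 ))                           # pool objects per PG, truncated
    # Floating-point comparison of the ratio against the skew limit.
    awk -v p="$per_pg" -v a="$avg" -v s="$MAX_SKEW" \
        'BEGIN { exit !(s > 0 && p / a > s) }'
}

would_warn 15920 417 15296 8 && echo "rbd: warning"
would_warn 15920 417 67 200 || echo "pool12: no warning"
```

With the figures from this cluster, rbd triggers the warning (ratio about 50.3), while pool12 is skipped because it holds only 67 objects, below the per-pool minimum of 1000.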
Solving the problem
There are three ways to clear this warning:
Delete unused pools
If the cluster has unused pools with relatively high PG counts, deleting some of them raises the cluster-wide average objects per PG. That lowers the ratio compared against mon_pg_warn_max_object_skew, and the warning goes away.
Increase the PG count of the pool named in the warning
The pool may simply have had too few PGs from the start. Increasing its pg_num and pgp_num lowers its objects per PG, and with it the same ratio.
Increase the value of mon_pg_warn_max_object_skew
If the cluster already has plenty of PGs and adding more could make it unstable, raising this parameter removes the warning. The default is 10.
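For the second option, a rough estimate of how far pg_num has to grow can be computed ahead of time. The sketch below uses a hypothetical helper, min_pg_num, that doubles the pool's current pg_num until the skew test passes. It assumes an integer skew limit, assumes the object counts stay the same, and accounts for the fact that adding PGs to one pool also raises the cluster-wide PG total and so lowers the average.

```shell
# min_pg_num TOTAL_OBJECTS TOTAL_PGS POOL_OBJECTS POOL_PG_NUM [MAX_SKEW]
# Prints the first pg_num, reached by repeated doubling, at which the pool's
# objects per PG is no more than MAX_SKEW times the cluster average.
min_pg_num() {
    total_objects=$1; total_pgs=$2; pool_objects=$3; old=$4; skew=${5:-10}
    new=$old
    while :; do
        pgs=$(( total_pgs - old + new ))       # cluster PG total after the change
        avg=$(( total_objects / pgs ))         # new cluster average, truncated
        per_pg=$(( pool_objects / new ))       # new pool objects per PG, truncated
        # Integer form of "per_pg / avg <= skew"; also stop if the average hits 0,
        # because the check is skipped entirely in that case.
        if [ "$avg" -eq 0 ] || [ "$per_pg" -le $(( avg * skew )) ]; then
            echo "$new"
            return
        fi
        new=$(( new * 2 ))
    done
}

min_pg_num 15920 417 15296 8
```

For the rbd pool in this post the estimate is 64 PGs, which would be applied with ceph osd pool set rbd pg_num 64 followed by ceph osd pool set rbd pgp_num 64. Whether the cluster's OSDs can comfortably carry the extra PGs still has to be judged separately.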
Summary
This warning compares a pool's objects per PG against the average objects per PG across the whole cluster, and is raised when the deviation is too large.
Steps to check:
ceph health detail
ceph df
ceph osd dump | grep pool
mon_pg_warn_max_object_skew = 10.0
The warning appears when ((objects/pg_num) in the affected pool)/(objects/pg_num in the entire system) > 10.0