找回密码
 注册
查看: 9158|回复: 1

REQUEST_SLOW: 115 slow requests are blocked > 32 sec. Implicated osds 2

[复制链接]

1

主题

0

回帖

12

积分

管理员

积分
12
QQ
发表于 2019-5-18 17:22:35 | 显示全部楼层 |阅读模式
REQUEST_SLOW: 115 slow requests are blocked > 32 sec. Implicated osds 26 v! w: u  V3 C/ x; H6 i5 X3 J
2019-05-18 17:23:04.713654 [WRN]  slow request 60.115662 seconds old, received at 2019-05-18 05:22:04.597888: osd_op(client.14197.0:5989 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
- s% f# i$ @1 r/ s2 C/ x$ U2019-05-18 17:23:04.713650 [WRN]  slow request 240.111689 seconds old, received at 2019-05-18 05:19:04.601861: osd_op(client.14197.0:5953 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 9 l5 Y0 o% n0 I: D" F( L+ X" x; W( j
2019-05-18 17:23:04.713646 [WRN]  slow request 480.106078 seconds old, received at 2019-05-18 05:15:04.607472: osd_op(client.14197.0:5905 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
1 W/ ?7 A- ?  A) m3 S2019-05-18 17:23:04.713642 [WRN]  slow request 30.116352 seconds old, received at 2019-05-18 05:22:34.597198: osd_op(client.14197.0:5995 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks ' R2 y) ?* F& j! y+ Y0 i
5 o# v. j* m6 T% u3 R( w

, U" p; ~! |) V6 W, o. ?, d% t7 @( w, C
虚拟机遭遇停电后,用磁盘文件记录恢复后,虚机内的 ceph 出现故障
故障初始状态:  ~ ceph -s  cluster:    id:     cf09d650-4629-4509-81d1-3e7005ca3595    health: HEALTH_ERR            2 scrub errors            Possible data damage: 3 pgs inconsistent            2 slow requests are blocked > 32 sec. Implicated osds 10,15  services:    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03    mgr: ceph02(active), standbys: ceph01, ceph03, controller    mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby    osd: 16 osds: 16 up, 16 in  data:    pools:   8 pools, 1280 pgs    objects: 72 objects, 342MiB    usage:   18.2GiB used, 15.1TiB / 15.1TiB avail    pgs:     1276 active+clean             3    active+clean+inconsistent             1    active+clean+scrubbing针对 slow requests 设置超时时长➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_timeout 30'osd.0: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.1: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.2: osd_op_thread_timeout = '30' (not observed, change may require restart)......➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 300'osd.0: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.1: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.2: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)......
ceph osd down 和 ceph osd in,依然无效
  ~ ceph health detailHEALTH_ERR nodown flag(s) set; 2 scrub errors; Possible data damage: 2 pgs inconsistent; 2 slow requests are blocked > 32 sec. Implicated osds 10,15OSD_SCRUB_ERRORS 2 scrub errorsPG_DAMAGED Possible data damage: 2 pgs inconsistent    pg 10.f is active+clean+inconsistent, acting [11,9,10]    pg 10.1d is active+clean+inconsistent, acting [11,9,10]REQUEST_SLOW 2 slow requests are blocked > 32 sec. Implicated osds 10,15    2 ops are blocked > 262.144 sec    osds 10,15 have blocked requests > 262.144 sec修复 pg➜  ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜  ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair
毫无变化,查看故障 pg 所在 osd
➜  ~ ceph pg 10.f query | grep primary            "same_primary_since": 763,                "num_objects_missing_on_primary": 0,            "up_primary": 11,            "acting_primary": 11                "same_primary_since": 763,                    "num_objects_missing_on_primary": 0,                "up_primary": 11,                "acting_primary": 11
关掉这个 osd 再修复
➜  ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜  ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair➜  ~ ceph health detailHEALTH_WARN nodown flag(s) set; 1 osds down; 1 host (1 osds) down; Reduced data availability: 374 pgs inactive, 374 pgs peering; Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degradedOSDMAP_FLAGS nodown flag(s) setOSD_DOWN 1 osds down    osd.11 (root=root-ssd,host=ceph03-ssd) is downOSD_HOST_DOWN 1 host (1 osds) down    host ceph03-ssd (root=root-ssd) (1 osds) is downPG_AVAILABILITY Reduced data availability: 374 pgs inactive, 374 pgs peering    pg 2.5c is stuck peering for 2391.685850, current state peering, last acting [10,15]    pg 2.5e is stuck peering for 2535.599036, current state stale+peering, last acting [11,9]    pg 2.5f is stuck inactive for 2387.032031, current state peering, last acting [10,15]   ......PG_DEGRADED Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded    pg 10.7 is active+undersized+degraded, acting [10,9]    pg 10.b is active+undersized+degraded, acting [10,9]    pg 10.f is active+undersized+degraded, acting [9,10]    pg 10.13 is active+undersized+degraded, acting [10,9]    pg 10.15 is active+undersized+degraded, acting [9,10]    pg 10.1d is active+undersized+degraded, acting [9,10]➜  ~ ceph pg 10.f query | grep primary            "same_primary_since": 829,                "num_objects_missing_on_primary": 0,            "up_primary": 9,            "acting_primary": 9                "same_primary_since": 829,                    "num_objects_missing_on_primary": 0,                "up_primary": 9,                "acting_primary": 9➜  ~ ceph pg 10.1d query | grep primary            "same_primary_since": 829,                "num_objects_missing_on_primary": 0,            "up_primary": 9,            "acting_primary": 9                "same_primary_since": 829,                    "num_objects_missing_on_primary": 0,                "up_primary": 9,                "acting_primary": 9
重启 osd 11,稍等一会
  ~ ceph -s  cluster:    id:     cf09d650-4629-4509-81d1-3e7005ca3595    health: HEALTH_WARN            Reduced data availability: 374 pgs inactive, 374 pgs peering, 55 pgs stale  services:    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03    mgr: ceph02(active), standbys: ceph01, ceph03, controller    mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby    osd: 16 osds: 16 up, 16 in  data:    pools:   8 pools, 1280 pgs    objects: 72 objects, 342MiB    usage:   16.9GiB used, 14.3TiB / 14.3TiB avail    pgs:     29.219% pgs not active             906 active+clean             319 peering             55  stale+peering  ~ ceph health detailHEALTH_WARN 15/216 objects misplaced (6.944%); 1 slow requests are blocked > 32 sec. Implicated osds 11OBJECT_MISPLACED 15/216 objects misplaced (6.944%)REQUEST_SLOW 1 slow requests are blocked > 32 sec. Implicated osds 11    1 ops are blocked > 32.768 sec    osd.11 has blocked requests > 32.768 sec  ~ ceph health detailHEALTH_WARN 5/216 objects misplaced (2.315%)OBJECT_MISPLACED 5/216 objects misplaced (2.315%)  ~ ceph health detailHEALTH_OK
解决

1

主题

0

回帖

12

积分

管理员

积分
12
QQ
 楼主| 发表于 2019-5-18 19:54:28 | 显示全部楼层
等半个小时候之后,发现好了,自动修复了。分布式存储还是比较靠谱了。/ k9 X8 R  v' R4 k
2019-05-18 17:26:00.734924 [INF]  Cluster is now healthy
: P  R2 M9 F+ L( Y2019-05-18 17:26:00.734874 [INF]  Health check cleared: REQUEST_SLOW (was: 160 slow requests are blocked > 32 sec. Implicated osds 2)
7 o! R  I$ z9 S: P/ n, h2019-05-18 17:25:59.388999 [WRN]  Health check update: 160 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW)
2 b  I% h- `& O1 ~7 z  ^$ d2019-05-18 17:26:15.130154 [WRN]  slow request 240.532377 seconds old, received at 2019-05-18 05:22:14.597670: osd_op(client.14197.0:5991 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks + L. _  l$ r2 X' }  V
2019-05-18 17:26:15.130150 [WRN]  slow request 480.526993 seconds old, received at 2019-05-18 05:18:14.603053: osd_op(client.14197.0:5943 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 0 C; E/ x: F" E
2019-05-18 17:26:15.130146 [WRN]  slow request 30.536851 seconds old, received at 2019-05-18 05:25:44.593195: osd_op(client.14197.0:6033 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
9 n% q5 ~/ A7 F9 a2019-05-18 17:26:15.130142 [WRN]  slow request 120.535038 seconds old, received at 2019-05-18 05:24:14.595008: osd_op(client.14197.0:6015 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
5 Y: d( z3 F; j1 S  U8 ]2019-05-18 17:26:15.130138 [WRN]  slow request 60.536286 seconds old, received at 2019-05-18 05:25:14.593760: osd_op(client.14197.0:6027 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks $ [2 X! ?9 U. e/ Q6 P6 Z( w
2019-05-18 17:26:15.130132 [WRN]  158 slow requests, 5 included below; oldest blocked for > 948.867328 secs
1 X4 L" |: a( Z  V! O2019-05-18 17:26:10.247783 [WRN]  slow request 240.649906 seconds old, received at 2019-05-18 05:22:09.597779: osd_op(client.14197.0:5990 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks ! [+ B4 X" F( o
2019-05-18 17:26:10.247779 [WRN]  slow request 480.644498 seconds old, received at 2019-05-18 05:18:09.603186: osd_op(client.14197.0:5942 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
- F* @' M6 G$ h9 x2019-05-18 17:26:10.247776 [WRN]  slow request 30.654432 seconds old, received at 2019-05-18 05:25:39.593253: osd_op(client.14197.0:6032 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 1 Z2 ?9 I& R1 d5 e4 j
2019-05-18 17:26:10.247771 [WRN]  slow request 120.652546 seconds old, received at 2019-05-18 05:24:09.595139: osd_op(client.14197.0:6014 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
: X6 E0 K) }3 P$ `# U# |# u8 k2019-05-18 17:26:10.247767 [WRN]  slow request 60.653823 seconds old, received at 2019-05-18 05:25:09.593862: osd_op(client.14197.0:6026 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
/ z. W6 }' a+ X( d7 x- O2019-05-18 17:26:10.247761 [WRN]  157 slow requests, 5 included below; oldest blocked for > 943.984967 secs
6 g6 z# F* A# ~2019-05-18 17:25:54.388692 [WRN]  Health check update: 159 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW) - z) @3 p. ^6 D  q3 |& v
2019-05-18 17:25:49.388351 [WRN]  Health check update: 158 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW)
您需要登录后才可以回帖 登录 | 注册

本版积分规则

返回首页|Archiver|手机版|小黑屋|易陆发现技术论坛 ( 蜀ICP备2026014127号-1 )

GMT+8, 2026-6-12 00:03 , Processed in 0.013937 second(s), 22 queries .

Powered by Discuz! X5.0

© 2001-2026 Discuz! Team.

快速回复 返回顶部 返回列表