找回密码
 注册
查看: 9157|回复: 1

REQUEST_SLOW: 115 slow requests are blocked > 32 sec. Implicated osds 2

[复制链接]

1

主题

0

回帖

12

积分

管理员

积分
12
QQ
发表于 2019-5-18 17:22:35 | 显示全部楼层 |阅读模式
REQUEST_SLOW: 115 slow requests are blocked > 32 sec. Implicated osds 2
' W, w* m! h, k5 b/ j. Y2019-05-18 17:23:04.713654 [WRN]  slow request 60.115662 seconds old, received at 2019-05-18 05:22:04.597888: osd_op(client.14197.0:5989 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks : Z% c' J9 D) o$ @  m+ t
2019-05-18 17:23:04.713650 [WRN]  slow request 240.111689 seconds old, received at 2019-05-18 05:19:04.601861: osd_op(client.14197.0:5953 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
( Y, t9 {8 j" G% z' f( _5 @( j2019-05-18 17:23:04.713646 [WRN]  slow request 480.106078 seconds old, received at 2019-05-18 05:15:04.607472: osd_op(client.14197.0:5905 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
' Z% W7 Z4 K  _) W% l( A4 U, r2019-05-18 17:23:04.713642 [WRN]  slow request 30.116352 seconds old, received at 2019-05-18 05:22:34.597198: osd_op(client.14197.0:5995 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 6 w6 L5 l. k5 ~9 M6 L

: V+ F# \$ R- \4 H  x# \  @2 A" y2 d4 J' Z* k6 `
$ v) C$ L* t1 A$ s3 ~) a( }! ^
虚拟机遭遇停电后,用磁盘文件记录恢复后,虚机内的 ceph 出现故障
故障初始状态:  ~ ceph -s  cluster:    id:     cf09d650-4629-4509-81d1-3e7005ca3595    health: HEALTH_ERR            2 scrub errors            Possible data damage: 3 pgs inconsistent            2 slow requests are blocked > 32 sec. Implicated osds 10,15  services:    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03    mgr: ceph02(active), standbys: ceph01, ceph03, controller    mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby    osd: 16 osds: 16 up, 16 in  data:    pools:   8 pools, 1280 pgs    objects: 72 objects, 342MiB    usage:   18.2GiB used, 15.1TiB / 15.1TiB avail    pgs:     1276 active+clean             3    active+clean+inconsistent             1    active+clean+scrubbing针对 slow requests 设置超时时长➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_timeout 30'osd.0: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.1: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.2: osd_op_thread_timeout = '30' (not observed, change may require restart)......➜  ~ ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 300'osd.0: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.1: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.2: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)......
ceph osd down 和 ceph osd in,依然无效
  ~ ceph health detailHEALTH_ERR nodown flag(s) set; 2 scrub errors; Possible data damage: 2 pgs inconsistent; 2 slow requests are blocked > 32 sec. Implicated osds 10,15OSD_SCRUB_ERRORS 2 scrub errorsPG_DAMAGED Possible data damage: 2 pgs inconsistent    pg 10.f is active+clean+inconsistent, acting [11,9,10]    pg 10.1d is active+clean+inconsistent, acting [11,9,10]REQUEST_SLOW 2 slow requests are blocked > 32 sec. Implicated osds 10,15    2 ops are blocked > 262.144 sec    osds 10,15 have blocked requests > 262.144 sec修复 pg➜  ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜  ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair
毫无变化,查看故障 pg 所在 osd
➜  ~ ceph pg 10.f query | grep primary            "same_primary_since": 763,                "num_objects_missing_on_primary": 0,            "up_primary": 11,            "acting_primary": 11                "same_primary_since": 763,                    "num_objects_missing_on_primary": 0,                "up_primary": 11,                "acting_primary": 11
关掉这个 osd 再修复
➜  ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜  ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair➜  ~ ceph health detailHEALTH_WARN nodown flag(s) set; 1 osds down; 1 host (1 osds) down; Reduced data availability: 374 pgs inactive, 374 pgs peering; Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degradedOSDMAP_FLAGS nodown flag(s) setOSD_DOWN 1 osds down    osd.11 (root=root-ssd,host=ceph03-ssd) is downOSD_HOST_DOWN 1 host (1 osds) down    host ceph03-ssd (root=root-ssd) (1 osds) is downPG_AVAILABILITY Reduced data availability: 374 pgs inactive, 374 pgs peering    pg 2.5c is stuck peering for 2391.685850, current state peering, last acting [10,15]    pg 2.5e is stuck peering for 2535.599036, current state stale+peering, last acting [11,9]    pg 2.5f is stuck inactive for 2387.032031, current state peering, last acting [10,15]   ......PG_DEGRADED Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded    pg 10.7 is active+undersized+degraded, acting [10,9]    pg 10.b is active+undersized+degraded, acting [10,9]    pg 10.f is active+undersized+degraded, acting [9,10]    pg 10.13 is active+undersized+degraded, acting [10,9]    pg 10.15 is active+undersized+degraded, acting [9,10]    pg 10.1d is active+undersized+degraded, acting [9,10]➜  ~ ceph pg 10.f query | grep primary            "same_primary_since": 829,                "num_objects_missing_on_primary": 0,            "up_primary": 9,            "acting_primary": 9                "same_primary_since": 829,                    "num_objects_missing_on_primary": 0,                "up_primary": 9,                "acting_primary": 9➜  ~ ceph pg 10.1d query | grep primary            "same_primary_since": 829,                "num_objects_missing_on_primary": 0,            "up_primary": 9,            "acting_primary": 9                "same_primary_since": 829,                    "num_objects_missing_on_primary": 0,                "up_primary": 9,                "acting_primary": 9
重启 osd 11,稍等一会
  ~ ceph -s  cluster:    id:     cf09d650-4629-4509-81d1-3e7005ca3595    health: HEALTH_WARN            Reduced data availability: 374 pgs inactive, 374 pgs peering, 55 pgs stale  services:    mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03    mgr: ceph02(active), standbys: ceph01, ceph03, controller    mds: cephfs-1/1/1 up  {0=ceph03=up:active}, 3 up:standby    osd: 16 osds: 16 up, 16 in  data:    pools:   8 pools, 1280 pgs    objects: 72 objects, 342MiB    usage:   16.9GiB used, 14.3TiB / 14.3TiB avail    pgs:     29.219% pgs not active             906 active+clean             319 peering             55  stale+peering  ~ ceph health detailHEALTH_WARN 15/216 objects misplaced (6.944%); 1 slow requests are blocked > 32 sec. Implicated osds 11OBJECT_MISPLACED 15/216 objects misplaced (6.944%)REQUEST_SLOW 1 slow requests are blocked > 32 sec. Implicated osds 11    1 ops are blocked > 32.768 sec    osd.11 has blocked requests > 32.768 sec  ~ ceph health detailHEALTH_WARN 5/216 objects misplaced (2.315%)OBJECT_MISPLACED 5/216 objects misplaced (2.315%)  ~ ceph health detailHEALTH_OK
解决

1

主题

0

回帖

12

积分

管理员

积分
12
QQ
 楼主| 发表于 2019-5-18 19:54:28 | 显示全部楼层
等半个小时候之后,发现好了,自动修复了。分布式存储还是比较靠谱了。& A$ e( O# {" ?' n) v
2019-05-18 17:26:00.734924 [INF]  Cluster is now healthy
4 l. K2 m' p/ E' a* g, m1 ?2019-05-18 17:26:00.734874 [INF]  Health check cleared: REQUEST_SLOW (was: 160 slow requests are blocked > 32 sec. Implicated osds 2) 1 Q% E- f$ ^; O) V. ^7 Z! p3 Z8 a5 E
2019-05-18 17:25:59.388999 [WRN]  Health check update: 160 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW)
3 A* M' o# }5 E- E2 `- ~2019-05-18 17:26:15.130154 [WRN]  slow request 240.532377 seconds old, received at 2019-05-18 05:22:14.597670: osd_op(client.14197.0:5991 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
: z  D- R8 w" ?, j- x7 I* ^2019-05-18 17:26:15.130150 [WRN]  slow request 480.526993 seconds old, received at 2019-05-18 05:18:14.603053: osd_op(client.14197.0:5943 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
; ?( z: A3 V0 i' S* m7 }7 U9 }3 Y  I7 Z2019-05-18 17:26:15.130146 [WRN]  slow request 30.536851 seconds old, received at 2019-05-18 05:25:44.593195: osd_op(client.14197.0:6033 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks $ ?" t+ s( ~) F( y& n( C
2019-05-18 17:26:15.130142 [WRN]  slow request 120.535038 seconds old, received at 2019-05-18 05:24:14.595008: osd_op(client.14197.0:6015 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 6 i$ W- m& U; _3 A3 ]: G, e8 c
2019-05-18 17:26:15.130138 [WRN]  slow request 60.536286 seconds old, received at 2019-05-18 05:25:14.593760: osd_op(client.14197.0:6027 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
  I& o1 Y0 q, r/ O, v" N2019-05-18 17:26:15.130132 [WRN]  158 slow requests, 5 included below; oldest blocked for > 948.867328 secs
& u; y; e4 ^  t% z2019-05-18 17:26:10.247783 [WRN]  slow request 240.649906 seconds old, received at 2019-05-18 05:22:09.597779: osd_op(client.14197.0:5990 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks ' D- {/ X. [4 n3 ?
2019-05-18 17:26:10.247779 [WRN]  slow request 480.644498 seconds old, received at 2019-05-18 05:18:09.603186: osd_op(client.14197.0:5942 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
# ^. d+ W/ B0 X2019-05-18 17:26:10.247776 [WRN]  slow request 30.654432 seconds old, received at 2019-05-18 05:25:39.593253: osd_op(client.14197.0:6032 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks ; T* D+ C) d6 m
2019-05-18 17:26:10.247771 [WRN]  slow request 120.652546 seconds old, received at 2019-05-18 05:24:09.595139: osd_op(client.14197.0:6014 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
3 _, F; N( X. x$ h$ H2019-05-18 17:26:10.247767 [WRN]  slow request 60.653823 seconds old, received at 2019-05-18 05:25:09.593862: osd_op(client.14197.0:6026 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
7 g- k2 {0 r2 \" r. r3 r" p5 n9 K2019-05-18 17:26:10.247761 [WRN]  157 slow requests, 5 included below; oldest blocked for > 943.984967 secs 6 C9 z5 _. d, F+ K9 ^5 f) o! ~
2019-05-18 17:25:54.388692 [WRN]  Health check update: 159 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW)
3 h, r( k; M- ^% P2 H" Z6 X2019-05-18 17:25:49.388351 [WRN]  Health check update: 158 slow requests are blocked > 32 sec. Implicated osds 2 (REQUEST_SLOW)
您需要登录后才可以回帖 登录 | 注册

本版积分规则

返回首页|Archiver|手机版|小黑屋|易陆发现技术论坛 ( 蜀ICP备2026014127号-1 )

GMT+8, 2026-6-12 00:00 , Processed in 0.020710 second(s), 22 queries .

Powered by Discuz! X5.0

© 2001-2026 Discuz! Team.

快速回复 返回顶部 返回列表