REQUEST_SLOW: 115 slow requests are blocked > 32 sec. Implicated osds 26 v! w: u V3 C/ x; H6 i5 X3 J
2019-05-18 17:23:04.713654 [WRN] slow request 60.115662 seconds old, received at 2019-05-18 05:22:04.597888: osd_op(client.14197.0:5989 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
- s% f# i$ @1 r/ s2 C/ x$ U2019-05-18 17:23:04.713650 [WRN] slow request 240.111689 seconds old, received at 2019-05-18 05:19:04.601861: osd_op(client.14197.0:5953 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks 9 l5 Y0 o% n0 I: D" F( L+ X" x; W( j
2019-05-18 17:23:04.713646 [WRN] slow request 480.106078 seconds old, received at 2019-05-18 05:15:04.607472: osd_op(client.14197.0:5905 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks
1 W/ ?7 A- ? A) m3 S2019-05-18 17:23:04.713642 [WRN] slow request 30.116352 seconds old, received at 2019-05-18 05:22:34.597198: osd_op(client.14197.0:5995 1.7f 1:ff13974d:::rbd_header.114f6b8b4567:head [watch ping cookie 94500249493504] snapc 0=[] ondisk+write+known_if_redirected e30) currently waiting for rw locks ' R2 y) ?* F& j! y+ Y0 i
5 o# v. j* m6 T% u3 R( w
, U" p; ~! |) V6 W, o. ?, d% t7 @( w, C
虚拟机遭遇停电后,用磁盘文件记录恢复后,虚机内的 ceph 出现故障 故障初始状态:➜ ~ ceph -s cluster: id: cf09d650-4629-4509-81d1-3e7005ca3595 health: HEALTH_ERR 2 scrub errors Possible data damage: 3 pgs inconsistent 2 slow requests are blocked > 32 sec. Implicated osds 10,15 services: mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03 mgr: ceph02(active), standbys: ceph01, ceph03, controller mds: cephfs-1/1/1 up {0=ceph03=up:active}, 3 up:standby osd: 16 osds: 16 up, 16 in data: pools: 8 pools, 1280 pgs objects: 72 objects, 342MiB usage: 18.2GiB used, 15.1TiB / 15.1TiB avail pgs: 1276 active+clean 3 active+clean+inconsistent 1 active+clean+scrubbing针对 slow requests 设置超时时长➜ ~ ceph tell osd.\* injectargs '--osd_op_thread_timeout 30'osd.0: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.1: osd_op_thread_timeout = '30' (not observed, change may require restart)osd.2: osd_op_thread_timeout = '30' (not observed, change may require restart)......➜ ~ ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 300'osd.0: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.1: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)osd.2: osd_op_thread_suicide_timeout = '300' (not observed, change may require restart)......ceph osd down 和 ceph osd in,依然无效 ➜ ~ ceph health detailHEALTH_ERR nodown flag(s) set; 2 scrub errors; Possible data damage: 2 pgs inconsistent; 2 slow requests are blocked > 32 sec. Implicated osds 10,15OSD_SCRUB_ERRORS 2 scrub errorsPG_DAMAGED Possible data damage: 2 pgs inconsistent pg 10.f is active+clean+inconsistent, acting [11,9,10] pg 10.1d is active+clean+inconsistent, acting [11,9,10]REQUEST_SLOW 2 slow requests are blocked > 32 sec. Implicated osds 10,15 2 ops are blocked > 262.144 sec osds 10,15 have blocked requests > 262.144 sec修复 pg➜ ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜ ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair毫无变化,查看故障 pg 所在 osd ➜ ~ ceph pg 10.f query | grep primary "same_primary_since": 763, "num_objects_missing_on_primary": 0, "up_primary": 11, "acting_primary": 11 "same_primary_since": 763, "num_objects_missing_on_primary": 0, "up_primary": 11, "acting_primary": 11关掉这个 osd 再修复 ➜ ~ ceph pg repair 10.finstructing pg 10.f on osd.11 to repair➜ ~ ceph pg repair 10.1dinstructing pg 10.1d on osd.11 to repair➜ ~ ceph health detailHEALTH_WARN nodown flag(s) set; 1 osds down; 1 host (1 osds) down; Reduced data availability: 374 pgs inactive, 374 pgs peering; Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degradedOSDMAP_FLAGS nodown flag(s) setOSD_DOWN 1 osds down osd.11 (root=root-ssd,host=ceph03-ssd) is downOSD_HOST_DOWN 1 host (1 osds) down host ceph03-ssd (root=root-ssd) (1 osds) is downPG_AVAILABILITY Reduced data availability: 374 pgs inactive, 374 pgs peering pg 2.5c is stuck peering for 2391.685850, current state peering, last acting [10,15] pg 2.5e is stuck peering for 2535.599036, current state stale+peering, last acting [11,9] pg 2.5f is stuck inactive for 2387.032031, current state peering, last acting [10,15] ......PG_DEGRADED Degraded data redundancy: 6/216 objects degraded (2.778%), 6 pgs degraded pg 10.7 is active+undersized+degraded, acting [10,9] pg 10.b is active+undersized+degraded, acting [10,9] pg 10.f is active+undersized+degraded, acting [9,10] pg 10.13 is active+undersized+degraded, acting [10,9] pg 10.15 is active+undersized+degraded, acting [9,10] pg 10.1d is active+undersized+degraded, acting [9,10]➜ ~ ceph pg 10.f query | grep primary "same_primary_since": 829, "num_objects_missing_on_primary": 0, "up_primary": 9, "acting_primary": 9 "same_primary_since": 829, "num_objects_missing_on_primary": 0, "up_primary": 9, "acting_primary": 9➜ ~ ceph pg 10.1d query | grep primary "same_primary_since": 829, "num_objects_missing_on_primary": 0, "up_primary": 9, "acting_primary": 9 "same_primary_since": 829, "num_objects_missing_on_primary": 0, "up_primary": 9, "acting_primary": 9重启 osd 11,稍等一会 ➜ ~ ceph -s cluster: id: cf09d650-4629-4509-81d1-3e7005ca3595 health: HEALTH_WARN Reduced data availability: 374 pgs inactive, 374 pgs peering, 55 pgs stale services: mon: 4 daemons, quorum controller,ceph01,ceph02,ceph03 mgr: ceph02(active), standbys: ceph01, ceph03, controller mds: cephfs-1/1/1 up {0=ceph03=up:active}, 3 up:standby osd: 16 osds: 16 up, 16 in data: pools: 8 pools, 1280 pgs objects: 72 objects, 342MiB usage: 16.9GiB used, 14.3TiB / 14.3TiB avail pgs: 29.219% pgs not active 906 active+clean 319 peering 55 stale+peering➜ ~ ceph health detailHEALTH_WARN 15/216 objects misplaced (6.944%); 1 slow requests are blocked > 32 sec. Implicated osds 11OBJECT_MISPLACED 15/216 objects misplaced (6.944%)REQUEST_SLOW 1 slow requests are blocked > 32 sec. Implicated osds 11 1 ops are blocked > 32.768 sec osd.11 has blocked requests > 32.768 sec➜ ~ ceph health detailHEALTH_WARN 5/216 objects misplaced (2.315%)OBJECT_MISPLACED 5/216 objects misplaced (2.315%)➜ ~ ceph health detailHEALTH_OK解决 |