Posted on the 易陆发现 technical forum, 2018-12-20 02:02:18
Recover from a failed compute node¶
updated: 2018-12-18 05:07
If you deploy Compute with a shared file system, you can use several methods to quickly recover from a node failure. This section discusses manual recovery.

Evacuate instances¶
If a hardware malfunction or other error causes the cloud compute node to fail, you can use the nova evacuate command to evacuate instances. See evacuate instances for more information on using the command.
Manual recovery¶

To manually recover a failed compute node:

Identify the VMs on the affected hosts by using a combination of the openstack server list and openstack server show commands or the euca-describe-instances command.
For example, this command displays information about the i-000015b9 instance that runs on the np-rcc54 node:

$ euca-describe-instances
i-000015b9 at3-ui02 running nectarkey (376, np-rcc54) 0 m1.xxlarge 2012-06-19T00:48:11.000Z 115.146.93.60
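If many hosts are affected, the instance ID and host can be pulled out of each output line with standard tools; a minimal sketch, assuming the exact field layout shown above:

```shell
# Sample line copied from the euca-describe-instances output above
line="i-000015b9 at3-ui02 running nectarkey (376, np-rcc54) 0 m1.xxlarge 2012-06-19T00:48:11.000Z 115.146.93.60"

# Field 1 is the instance ID; field 6 is the host, wrapped in "(...)"
instance=$(echo "$line" | awk '{print $1}')
host=$(echo "$line" | awk '{gsub(/[(),]/, "", $6); print $6}')
echo "$instance on $host"   # prints: i-000015b9 on np-rcc54
```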
Query the Compute database for the status of the host. This example converts an EC2 API instance ID to an OpenStack ID. If you use the nova commands, you can substitute the ID directly. This example output is truncated:

mysql> SELECT * FROM instances WHERE id = CONV('15b9', 16, 10) \G;
*************************** 1. row ***************************
 created_at: 2012-06-19 00:48:11
 updated_at: 2012-07-03 00:35:11
 deleted_at: NULL
...
         id: 5561
...
power_state: 5
   vm_state: shutoff
...
   hostname: at3-ui02
       host: np-rcc54
...
       uuid: 3f57699a-e773-4650-a443-b4b37eed5a06
...
 task_state: NULL
...
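The CONV('15b9', 16, 10) call converts the hex suffix of the EC2-style ID into the decimal id column. The same conversion can be done in the shell before you query; a minimal sketch:

```shell
# Strip the "i-" prefix from the EC2-style ID, then print the hex
# value as decimal; this matches CONV('15b9', 16, 10) in MySQL.
ec2_id="i-000015b9"
hex_part=${ec2_id#i-}
printf 'database id: %d\n' "0x$hex_part"   # prints: database id: 5561
```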
Note

Find the credentials for your database in the /etc/nova.conf file.
Decide to which compute host to move the affected VM. Run this database command to move the VM to that host:

mysql> UPDATE instances SET host = 'np-rcc46' WHERE uuid = '3f57699a-e773-4650-a443-b4b37eed5a06';
If you use a hypervisor that relies on libvirt, such as KVM, update the libvirt.xml file in /var/lib/nova/instances/[instance ID] with these changes:

Change the DHCPSERVER value to the host IP address of the new compute host.
Update the VNC IP to 0.0.0.0.

Reboot the VM:

$ openstack server reboot 3f57699a-e773-4650-a443-b4b37eed5a06
Typically, the database update and openstack server reboot command recover a VM from a failed host. However, if problems persist, try one of these actions:

Use virsh to recreate the network filter configuration.
Restart Compute services.
Update the vm_state and power_state fields in the Compute database.

Recover from a UID/GID mismatch¶
Sometimes when you run Compute with a shared file system or an automated configuration tool, files on your compute node might use the wrong UID or GID. This UID or GID mismatch can prevent you from running live migrations or starting virtual machines.

This procedure runs on nova-compute hosts, based on the KVM hypervisor:

Set the nova UID to the same number in /etc/passwd on all hosts. For example, set the UID to 112.

Note

Choose UIDs or GIDs that are not in use by other users or groups.
Set the libvirt-qemu UID to the same number in the /etc/passwd file on all hosts. For example, set the UID to 119.
Set the nova group to the same number in the /etc/group file on all hosts. For example, set the group to 120.

Set the libvirtd group to the same number in the /etc/group file on all hosts. For example, set the group to 119.

Stop the services on the compute node.
Change all files that the nova user or group owns. For example:

# find / -uid 108 -exec chown nova {} \;
# note the 108 here is the old nova UID before the change
# find / -gid 120 -exec chgrp nova {} \;
Repeat all steps for the libvirt-qemu files, if required.

Restart the services.

To verify that all files use the correct IDs, run the find command.
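One way to run that check is to search by the old numeric IDs, which still match files even though no user name maps to them anymore. A sketch, assuming the old UID 108 and GID 120 from the example above (the /var/lib/nova path is a placeholder; widen it to / for a full sweep):

```shell
# List files still owned by the old numeric IDs; an empty result means
# the chown/chgrp pass converted everything it should have.
leftovers=$(find /var/lib/nova -uid 108 -o -gid 120 2>/dev/null)
if [ -z "$leftovers" ]; then
    echo "ownership OK"
else
    printf '%s\n' "$leftovers"
fi
```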
Recover cloud after disaster¶
This section describes how to manage your cloud after a disaster and back up persistent storage volumes. Backups are mandatory, even outside of disaster scenarios.
For a definition of a disaster recovery plan (DRP), see https://en.wikipedia.org/wiki/Disaster_Recovery_Plan.

A disk crash, network loss, or power failure can affect several components in your cloud architecture. The worst disaster for a cloud is a power loss. A power loss affects these components:
A cloud controller (nova-api, nova-objectstore, nova-network)
A compute node (nova-compute)
A storage area network (SAN) used by OpenStack Block Storage (cinder-volumes)

Before a power loss:

Create an active iSCSI session from the SAN to the cloud controller (used for the cinder-volumes LVM's VG).
Create an active iSCSI session from the cloud controller to the compute node (managed by cinder-volume).
Create an iSCSI session for every volume (so 14 EBS volumes require 14 iSCSI sessions).
Create iptables or ebtables rules from the cloud controller to the compute node. This allows access from the cloud controller to the running instance.
Save the current state of the database, the current state of the running instances, and the attached volumes (mount point, volume ID, volume status, etc.), at least from the cloud controller to the compute node.
After power resumes and all hardware components restart:

The iSCSI session from the SAN to the cloud no longer exists.

The iSCSI session from the cloud controller to the compute node no longer exists.

nova-network reapplies configurations on boot and, as a result, recreates the iptables and ebtables from the cloud controller to the compute node.

Instances stop running.

Instances are not lost because neither destroy nor terminate ran. The files for the instances remain on the compute node.

The database does not update.
Begin recovery

Warning

Do not add any steps or change the order of steps in this procedure.

Check the current relationship between the volume and its instance, so that you can recreate the attachment.

Use the openstack volume list command to get this information. Note that the openstack client can get volume information from OpenStack Block Storage.

Update the database to clean the stalled state. Do this for every volume by using these queries:

mysql> use cinder;
mysql> update volumes set mountpoint=NULL;
mysql> update volumes set status="available" where status <> "error_deleting";
mysql> update volumes set attach_status="detached";
mysql> update volumes set instance_id=0;
Use the openstack volume list command to list all volumes.

Restart the instances by using the openstack server reboot INSTANCE command.

Important

Some instances completely reboot and become reachable, while some might stop at the plymouth stage. This is expected behavior. DO NOT reboot a second time.

Instance state at this stage depends on whether you added an /etc/fstab entry for that volume. Images built with the cloud-init package remain in a pending state, while others skip the missing volume and start. You perform this step to ask Compute to reboot every instance so that the stored state is preserved. It does not matter if not all instances come up successfully. For more information about cloud-init, see help.ubuntu.com/community/CloudInit/.

If required, run the openstack server add volume command to reattach the volumes to their respective instances. This example uses a file of listed volumes to reattach them:
#!/bin/bash

# Each line of $volumes_tmp_file holds: <volume ID> <instance ID> <mount point>
while read line; do
    volume=$(echo $line | cut -f 1 -d " ")
    instance=$(echo $line | cut -f 2 -d " ")
    mount_point=$(echo $line | cut -f 3 -d " ")
    echo "ATTACHING VOLUME FOR INSTANCE - $instance"
    openstack server add volume $instance $volume $mount_point
    sleep 2
done < $volumes_tmp_file
Instances that were stopped at the plymouth stage now automatically continue booting and start normally. Instances that previously started successfully can now see the volume.

Log in to the instances with SSH and reboot them.
If some services depend on the volume or if a volume has an entry in fstab, you can now restart the instance. Restart directly from the instance itself and not through nova:

# shutdown -r now
When you plan for and complete a disaster recovery, follow these tips:

Use the errors=remount option in the fstab file to prevent data corruption.

In the event of an I/O error, this option prevents writes to the disk. Add this configuration option into the cinder-volume server that performs the iSCSI connection to the SAN and into the instances' fstab files.
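A sketch of such an fstab entry (the device, mount point, and filesystem are placeholders; on ext3/ext4 the option is spelled errors=remount-ro, which remounts the filesystem read-only on an I/O error):

```
# /etc/fstab fragment for a volume attached inside an instance
/dev/vdb  /mnt/volume  ext4  defaults,errors=remount-ro  0  2
```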
Do not add the entry for the SAN's disks to the cinder-volume's fstab file.

Some systems hang on that step, which means you could lose access to your cloud controller. To re-run the session manually, run these commands before performing the mount:

# iscsiadm -m discovery -t st -p $SAN_IP
# iscsiadm -m node --target-name $IQN -p $SAN_IP -l
On your instances, if you have the whole /home/ directory on the disk, leave a user's directory with the user's bash files and the authorized_keys file instead of emptying the /home/ directory and mapping the disk on it.

This action enables you to connect to the instance without the volume attached, if you allow only connections through public keys.

To script the disaster recovery plan (DRP), use the https://github.com/Razique bash script.

This script completes these steps:

Creates an array for instances and their attached volumes.
Updates the MySQL database.
Restarts all instances with euca2ools.
Reattaches the volumes.
Uses Compute credentials to make an SSH connection into every instance.

The script includes a test mode, which enables you to perform the sequence for only one instance.

To reproduce the power loss, connect to the compute node that runs that instance and close the iSCSI session. Do not detach the volume by using the openstack server remove volume command. You must manually close the iSCSI session. This example closes an iSCSI session with the number 15:

# iscsiadm -m session -u -r 15

Do not forget the -r option. Otherwise, all sessions close.

Warning

There is potential for data loss while running instances during this procedure. If you are using Liberty or earlier, ensure you have the correct patch and set the options appropriately.