Tuesday, November 27, 2018

Oracle RAC: Using a Second NIC for Interconnect HA

Introduction


One of the most important features of Oracle RAC is high availability: the more resilient your components are, the better your Clusterware copes with failures. For Oracle Clusterware the interconnect network plays a big role in your environment; suppose you lose connectivity on the private network, Oracle will choose some nodes to be evicted. For that reason we should look a little closer at this.

To make the network resilient we use link aggregation, which can be implemented over a variety of hardware components such as NICs and network switches. We can also use OS techniques to implement link aggregation, like bonding, where the system administrator is responsible for guaranteeing the resilience of this network. There is nothing wrong with that approach, and a large number of environments use it, but Oracle has its own approach too.
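Just for comparison, this is roughly what the OS-level approach looks like. Below is a minimal sketch of an active-backup bond on RHEL/Oracle Linux 7 using network-scripts; the device names, IP address and bonding options are illustrative only, not taken from this environment.

# /etc/sysconfig/network-scripts/ifcfg-bond0   (illustrative values)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.0.10
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-enp0s8   (one of the slave NICs)
DEVICE=enp0s8
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

With bonding, Clusterware only sees the single bond0 interface and the failover is handled entirely by the OS.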

Oracle High Availability IP


Starting with Oracle 11.2.0.2 we can use HAIP (Highly Available IP) instead of the OS method. For this we have to configure a second network card on a different subnet from the first interconnect, which is what guarantees HA. Keep in mind that you should use the same MTU and the same network interface names on all nodes.

You can have up to four active devices for HAIP. You can configure more, but Oracle will only use four at a time; if you lose one, Oracle will choose another configured device to replace it. Even in the case of a failure of a single device you will not suffer bounces or disconnects, and Oracle will remain available on all nodes.
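Under the covers, HAIP works by assigning link-local 169.254.x.x addresses on top of the private interfaces, and it is these addresses that the database and ASM instances actually use for interconnect traffic. A quick way to see which addresses each instance is using is the gv$cluster_interconnects view (the addresses you see will obviously be your own):

SQL> select inst_id, name, ip_address, source from gv$cluster_interconnects;

With HAIP in place you should see 169.254.* addresses listed against sub-interfaces of the private NICs instead of the physical private IPs.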

Checking the current environment


We can use oifcfg getif to verify all the networks used by Oracle Clusterware. The main idea is to add the 192.168.2.0 subnet as a second interconnect network.

[oracle@srv-ora-rac01 ~]$ oifcfg getif
enp0s3  192.168.1.0  global  public
enp0s8  192.168.0.0  global  cluster_interconnect
[oracle@srv-ora-rac01 ~]$ 
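If you are not sure which interfaces and subnets each node can actually see, oifcfg iflist lists them (adding -p -n also prints the subnet type and netmask). Note that the PRIVATE/PUBLIC flag in that output refers to the addressing range, not to the cluster role of the interface. The output below is just an illustration of the format for this environment:

[oracle@srv-ora-rac01 ~]$ oifcfg iflist -p -n
enp0s3  192.168.1.0  PRIVATE  255.255.255.0
enp0s8  192.168.0.0  PRIVATE  255.255.255.0
enp0s9  192.168.2.0  PRIVATE  255.255.255.0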

Check that the new private IPs are reachable on both nodes:

[root@srv-ora-rac01 ~]# ping srv-ora-rac01-priv2
PING srv-ora-rac01-priv2 (192.168.2.74) 56(84) bytes of data.
64 bytes from srv-ora-rac01-priv2 (192.168.2.74): icmp_seq=1 ttl=64 time=0.049 ms
64 bytes from srv-ora-rac01-priv2 (192.168.2.74): icmp_seq=2 ttl=64 time=0.039 ms
^C
--- srv-ora-rac01-priv2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1053ms
rtt min/avg/max/mdev = 0.039/0.044/0.049/0.005 ms

[root@srv-ora-rac01 ~]# ping srv-ora-rac02-priv2
PING srv-ora-rac02-priv2 (192.168.2.75) 56(84) bytes of data.
64 bytes from srv-ora-rac02-priv2 (192.168.2.75): icmp_seq=1 ttl=64 time=1.98 ms
64 bytes from srv-ora-rac02-priv2 (192.168.2.75): icmp_seq=2 ttl=64 time=0.253 ms
^C
--- srv-ora-rac02-priv2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.253/1.119/1.985/0.866 ms
[root@srv-ora-rac01 ~]# 
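Since HAIP expects the same MTU on every private interface, it is also worth a quick check on both nodes before going further (1500 here; adjust to whatever your network uses):

[root@srv-ora-rac01 ~]# ip link show enp0s8 | grep mtu
[root@srv-ora-rac01 ~]# ip link show enp0s9 | grep mtu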


Configuring HAIP


OK, now we can configure a second network for our interconnect. For this we only have to run:

[root@srv-ora-rac01 ~]# oifcfg setif -global enp0s9/192.168.2.0:cluster_interconnect
[root@srv-ora-rac01 ~]# 
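If you ever need to back this change out, the inverse operation is oifcfg delif with the same interface/subnet pair, something like:

[root@srv-ora-rac01 ~]# oifcfg delif -global enp0s9/192.168.2.0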

Now we can check the Clusterware network configuration and validate that we now have two interconnect networks:

[root@srv-ora-rac01 ~]# oifcfg getif
enp0s3  192.168.1.0  global  public
enp0s8  192.168.0.0  global  cluster_interconnect
enp0s9  192.168.2.0  global  cluster_interconnect
[root@srv-ora-rac01 ~]# 

From this point we already have an interconnect capable of failover, but we need to restart CRS on all nodes to make full use of HAIP, as sketched below.
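A rolling restart, one node at a time and run as root from the Grid Infrastructure home, is enough; the path below is just an example, adjust it to your own GRID_HOME:

[root@srv-ora-rac01 ~]# /u01/app/12.2.0/grid/bin/crsctl stop crs
[root@srv-ora-rac01 ~]# /u01/app/12.2.0/grid/bin/crsctl start crs

After the stack comes back you can also confirm the HAIP resource is online in the lower stack with crsctl stat res -t -init, looking for ora.cluster_interconnect.haip.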


POC - Failure of one Network Interface


With HAIP configured we can deal with the failure of one interconnect path. I'll simulate this by bringing down one interface; the Clusterware should stay up and running:

[root@srv-ora-rac01 ~]# oifcfg getif
enp0s3  192.168.1.0  global  public
enp0s8  192.168.0.0  global  cluster_interconnect
enp0s9  192.168.2.0  global  cluster_interconnect

[root@srv-ora-rac01 ~]# crsctl check cluster -all
**************************************************************
srv-ora-rac01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
srv-ora-rac02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
[root@srv-ora-rac01 ~]# 

[root@srv-ora-rac01 ~]# ifconfig -a enp0s9
enp0s9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.2.74  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::117f:3ef7:eb06:9c26  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:e6:c9:32  txqueuelen 1000  (Ethernet)
        RX packets 4120  bytes 2929493 (2.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4645  bytes 3708582 (3.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@srv-ora-rac01 ~]# 

[root@srv-ora-rac01 ~]# ifdown enp0s9
Device 'enp0s9' successfully disconnected.
[root@srv-ora-rac01 ~]# ifconfig -a enp0s9
enp0s9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 08:00:27:e6:c9:32  txqueuelen 1000  (Ethernet)
        RX packets 5070  bytes 3622826 (3.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5458  bytes 4208014 (4.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@srv-ora-rac01 ~]# 


[root@srv-ora-rac01 ~]# oifcfg getif
enp0s3  192.168.1.0  global  public
enp0s8  192.168.0.0  global  cluster_interconnect
enp0s9  192.168.2.0  global  cluster_interconnect
[root@srv-ora-rac01 ~]# crsctl check cluster -all

**************************************************************
srv-ora-rac01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
srv-ora-rac02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************

OK, nothing happened to the node; all services remained available as expected.
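If you want to watch the failover itself, one way (assuming the usual 169.254.* HAIP addresses) is to list the link-local addresses before and after the ifdown:

[root@srv-ora-rac01 ~]# ip addr show | grep "169.254"

Before the failure you should see one 169.254.* address on a sub-interface of each private NIC; after enp0s9 goes down, the address it was hosting should reappear on the surviving enp0s8, and move back once the interface is brought up again.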

Hope you enjoyed it!

Diogo 

