Re: Remote "Local Address" of netstat and SCAN listener and VIP on different hosts

I find that "Local Address" of netstat is not reliable and I can reproduce its 
odd behavior, although I still can't reproduce the Oracle problem.

[root@dcprperschdb1b oracle]# export PS1='NodeB# ' <- make prompt shorter
NodeB# netstat -anp | grep 1234         <- port is not taken
NodeB# ifconfig | grep 10.111.108.167   <- this IP is on this box
          inet addr:10.111.108.167  Bcast:10.111.108.255  Mask:255.255.255.128
NodeB# cat myserver.pl          <- my dummy server/listener will use that IP
#!/usr/bin/perl

use Socket;

$server_port = 1234;
socket(Server, PF_INET, SOCK_STREAM, getprotobyname('tcp'));
setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
#$my_addr = sockaddr_in($server_port, INADDR_ANY);
$my_addr = sockaddr_in($server_port, inet_aton("10.111.108.167"));

bind(Server, $my_addr) or die "Can't bind to port $server_port: $!\n";
listen(Server, SOMAXCONN) or die "Can't listen on port $server_port: $!\n";
while (accept(Client, Server))
{
}
close(Server);
NodeB# ./myserver.pl &          <- run dummy listener on that IP on port 1234
[1] 21092
NodeB# su - oracle
dcprperschdb1b ~ $ srvctl config scan -i 3 <- verify 10.111.108.167 is SCAN VIP
SCAN name: scan_erschp.mdanderson.edu, Network: 1/10.111.108.128/255.255.255.128/bond0
SCAN VIP name: scan3, IP: /scan_erschp.mdanderson.edu/10.111.108.167
dcprperschdb1b ~ $ srvctl status scan -i 3 <- it runs on node b
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node dcprperschdb1b
dcprperschdb1b ~ $ srvctl status scan_listener -i 3 <- also on node b
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node dcprperschdb1b
dcprperschdb1b ~ $ srvctl relocate scan -i 3
dcprperschdb1b ~ $ srvctl status scan -i 3      <- changes to node a
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node dcprperschdb1a
dcprperschdb1b ~ $ srvctl status scan_listener -i 3 <- also changes to node a
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node dcprperschdb1a

The above shows that the SCAN VIP and the SCAN listener are relocated
together even if the IP is being used by another process (whether that
process is owned by root or oracle makes no difference).

The following shows the non-existent "Local Address" in `netstat -an':

dcprperschdb1b ~ $ exit
logout
NodeB# ifconfig | grep 10.111.108.167           <- IP no longer on this box
NodeB# netstat -anp | grep 10.111.108.167       <- but netstat Local Address still has it!
tcp        0      0 10.111.108.167:1234         0.0.0.0:*                   LISTEN      21092/perl
tcp        0      0 10.111.108.160:20909        10.111.108.167:1521         ESTABLISHED 17743/ora_pmon_ersc
NodeB#
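
To look for this kind of leftover binding on any node, the "Local Address"
column of netstat can be cross-checked against the addresses that ifconfig
reports. The following is only a sketch under some assumptions:
stale_local_addrs is a hypothetical helper name, it reads saved command
output from two files, and it expects the RHEL 5 style "inet addr:" format
of ifconfig shown above.

```shell
#!/bin/sh
# Sketch: print every IPv4 "Local Address" found in saved `netstat -an`
# output that is not configured on any interface in saved `ifconfig` output.
# stale_local_addrs is a hypothetical helper, not an existing tool.
#   $1 = file holding `netstat -an` (or -anp) output
#   $2 = file holding `ifconfig` output
# Usage (file names are arbitrary):
#   netstat -an > /tmp/netstat.out; ifconfig > /tmp/ifconfig.out
#   stale_local_addrs /tmp/netstat.out /tmp/ifconfig.out
stale_local_addrs() {
    # Field 4 of each tcp/udp line is the Local Address as IP:port.
    awk '$1 ~ /^(tcp|udp)$/ { sub(/:[0-9*]+$/, "", $4); print $4 }' "$1" |
    sort -u |
    while read -r ip; do
        case "$ip" in
            0.0.0.0|127.*) continue ;;  # wildcard and loopback binds
            *:*)           continue ;;  # IPv6 addresses are out of scope here
        esac
        # Assumes the "inet addr:<ip>  Bcast:..." layout; -F avoids treating
        # the dots in the address as regex wildcards.
        grep -qF -- "inet addr:$ip " "$2" || echo "$ip"
    done
}
```

Any IP it prints is bound by a socket but not present on an interface, which
is the state node b is in after the relocate above.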

On node a, everything is normal:

dcprperschdb1a ~ $ sudo netstat -anp | grep 10.111.108.167 <- node a runs LISTENER_SCAN3 now
[sudo] password for oracle:
tcp        0      0 10.111.108.167:1521         0.0.0.0:*                   LISTEN      29943/tnslsnr
tcp        0      0 10.111.108.167:53           0.0.0.0:*                   LISTEN      8328/named
tcp        0      0 10.111.108.159:60641        10.111.108.167:1521         ESTABLISHED 26129/ora_pmon_ersc
tcp        0      0 10.111.108.167:1521         10.111.108.159:60641        ESTABLISHED 29943/tnslsnr
tcp        0      0 10.111.108.167:1521         10.111.108.160:20909        ESTABLISHED 29943/tnslsnr
udp        0      0 10.111.108.167:53           0.0.0.0:*                               8328/named
dcprperschdb1a ~ $ ps -fp 29943
UID        PID  PPID  C STIME TTY          TIME CMD
oracle   29943     1  0 09:29 ?        00:00:00 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit

As expected, from a third box I can no longer connect to my dummy server
(I was able to before the relocate):

$ telnet 10.111.108.167 1234
Trying 10.111.108.167...
telnet: connect to address 10.111.108.167: Connection refused

Yong Huang

------ Original message ------

2-node Oracle 11.2.0.1 RAC, RHEL 5.7 x86_64, kernel 2.6.18-274.7.1.el5

One of the 3 SCAN listeners listens on an IP which exists on the other
node of this 2-node RAC.

C:\>nslookup scancs4.<domainname>
...
Name:    scancs4.<domainname>
Addresses:  10.111.76.85
          10.111.76.84
          10.111.76.86

dcsrpcora4a ~ $ ifconfig | egrep '10.111.76.84|10.111.76.85|10.111.76.86'
          inet addr:10.111.76.85  Bcast:10.111.76.127  Mask:255.255.255.128
          inet addr:10.111.76.84  Bcast:10.111.76.127  Mask:255.255.255.128

dcsrpcora4b ~ $ ifconfig | egrep '10.111.76.84|10.111.76.85|10.111.76.86'
          inet addr:10.111.76.86  Bcast:10.111.76.127  Mask:255.255.255.128

The problem is that the 3 IPs, supposedly each backed by one Oracle
SCAN listener, do not all have SCAN listeners listening on them.
Specifically, 10.111.76.84 on node a has no listener, while on node b there
*is* a SCAN listener that claims to be listening on that IP. (Note that
the 4th field of `netstat -an' is "Local Address".)

dcsrpcora4b ~ $ netstat -anp 2>/dev/null | grep 10.111.76.84 <-- this IP exists on node a
tcp        0      0 10.111.76.84:1521           0.0.0.0:*                   LISTEN      15130/tnslsnr
tcp        0      0 10.111.76.70:55578          10.111.76.84:1521           ESTABLISHED 12061/ora_pmon_orac
dcsrpcora4b ~ $ ps -fp 15130
UID        PID  PPID  C STIME TTY          TIME CMD
oracle   15130     1  0 Jan27 ?        00:04:40 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit

How can a listener process running on its own server (node b) claim to be
listening on an IP which is physically located on a different server (node
a)? On node a, everything looks normal from the OS perspective, and there
actually is a process, named, listening on 10.111.76.84 using port 53.
(Not sure why named uses a virtual interface created by Oracle.)

[root@dcsrpcora4a ~]# netstat -anp | grep 10.111.76.84
tcp        0      0 10.111.76.84:53             0.0.0.0:*                   LISTEN      5001/named
udp        0      0 10.111.76.84:53             0.0.0.0:*                               5001/named

We know a SCAN VIP can "float" or relocate between the 2 nodes. But at any
given point in time, when netstat says a specific IP is local to a specific
host, that IP must be held by that host (as shown by ifconfig), not by a
different host, regardless of what magic Oracle's SCAN listener software does.

Checking with srvctl:

dcsrpcora4a ~ $ srvctl status scan -i 3
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node dcsrpcora4a
dcsrpcora4a ~ $ srvctl status scan_listener -i 3
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node dcsrpcora4b <-- not dcsrpcora4a!
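
The comparison above can be scripted so that a split VIP/listener pair is
flagged automatically. This is a rough sketch, not an Oracle-supplied check:
scan_node and check_scan_pair are hypothetical helper names, and the parsing
assumes the "is running on node <name>" wording that srvctl prints above.

```shell
#!/bin/sh
# scan_node (hypothetical helper): read srvctl status output on stdin and
# print the node name from the first "... is running on node <name>" line.
scan_node() {
    sed -n 's/.* is running on node \([^ ]*\).*/\1/p' | head -n 1
}

# check_scan_pair (hypothetical helper):
#   $1 = output of `srvctl status scan -i N`
#   $2 = output of `srvctl status scan_listener -i N`
# Prints OK or MISMATCH and returns non-zero when the nodes differ or
# when either node name could not be parsed.
check_scan_pair() {
    vip_node=$(printf '%s\n' "$1" | scan_node)
    lsnr_node=$(printf '%s\n' "$2" | scan_node)
    if [ -n "$vip_node" ] && [ "$vip_node" = "$lsnr_node" ]; then
        echo "OK: VIP and listener both on $vip_node"
    else
        echo "MISMATCH: VIP on ${vip_node:-unknown}, listener on ${lsnr_node:-unknown}"
        return 1
    fi
}
```

As the Grid owner, it could be called for SCAN 3 as
check_scan_pair "$(srvctl status scan -i 3)" "$(srvctl status scan_listener -i 3)".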

On another RAC cluster, I tried 'srvctl relocate scan' and 'srvctl
relocate scan_listener'. In both cases, both the SCAN VIP and SCAN
listener are relocated *together* to a different node. It's not possible
to reproduce relocating one but not the other.

I believe that to correct the problem we have now, we can just run srvctl
relocate on either the VIP (...84) or the SCAN listener (LISTENER_SCAN3).
But I'd like to find out what caused this situation.

Yong Huang

--
http://www.freelists.org/webpage/oracle-l

