Friday, 7 February 2014

Troubleshooting 11.2 Clusterware Node Evictions

Troubleshooting 11.2 Clusterware Node Evictions (Note 1050693.1)

In 11.2 Clusterware, a node eviction may not actually reboot the machine.  This is called a rebootless restart.

To identify which process initiated a reboot, review the following files:

  • Clusterware alert log in /log/alertnodename
  • The cssdagent log(s) in /log//agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in /log//agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in /log//cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data.  IPD/OS is the old name for the Cluster Health Monitor.  The names can be used interchangeably, although Oracle now calls the tool Cluster Health Monitor
  • 'opatch lsinventory -detail' output for the GRID home
  • The OS messages file, e.g. /var/log/messages on Linux
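
As a starting point for reviewing the files above, here is a minimal Python sketch that scans a set of log files for lines that may relate to an eviction. The function name find_suspect_lines and the search terms are assumptions made for illustration. Real Clusterware messages may be worded differently, and the log paths must be supplied for your own GRID home and node name.

```python
from pathlib import Path

# Hypothetical search terms, chosen for illustration only.
# Real Clusterware messages may be worded differently.
KEYWORDS = ("evict", "reboot", "fatal")


def find_suspect_lines(paths, keywords=KEYWORDS):
    """Return (path, line_number, text) for every line containing a keyword."""
    hits = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip logs that do not exist on this node
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()  # case-insensitive match
                if any(word in lowered for word in keywords):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits
```

Call it with the list of log files collected from the evicted node and read the hits in timestamp order. The earliest message usually narrows down which process to look at next.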

Common causes of eviction:

OCSSD eviction:

  1) Network failure or latency between nodes.  By default (as set by the CSS misscount), it takes 30 consecutive missed check-ins to cause a node eviction.
  2) Problems reading from or writing to the voting disk.
  3) A member kill escalation.  For example, the LMON process may ask CSS to remove an instance from the cluster via the instance eviction mechanism.  If that request times out, it can escalate to a node eviction.
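
The "30 consecutive missed check-ins" rule can be illustrated with a toy Python model. This is only a sketch of the counting logic, not how ocssd is implemented. The function name first_eviction_index and the boolean input format are assumptions made for the example.

```python
MISSCOUNT = 30  # consecutive missed check-ins, the figure quoted above


def first_eviction_index(checkins, threshold=MISSCOUNT):
    """Return the index of the check-in at which eviction would trigger, or None.

    checkins is an iterable of booleans: True = received, False = missed.
    """
    missed_in_a_row = 0
    for index, received in enumerate(checkins):
        if received:
            missed_in_a_row = 0  # any successful check-in resets the count
        else:
            missed_in_a_row += 1
            if missed_in_a_row >= threshold:
                return index
    return None
```

The point of the model is that the misses must be consecutive: 29 misses followed by one successful check-in resets the count, so an intermittently slow network behaves very differently from a dead interconnect.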

CSSDAGENT or CSSDMONITOR eviction:

  1) An OS scheduler problem, for example because the OS is locked up or the server is under excessive load (such as CPU utilization at or near 100%).
  2) The CSS process is hung.
  3) An Oracle bug.
