Friday, 7 February 2014

Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)


Summary

In Oracle RAC environments, the RDBMS gathers global cache workload statistics, which are reported in STATSPACK, AWR reports and Grid Control. Global cache lost block statistics ("gc cr block lost" and/or "gc current block lost"), both per node and aggregated for the cluster, indicate problems or inefficiencies in packet processing for the interconnect traffic. These statistics should be monitored and evaluated regularly to ensure efficient interconnect Global Cache and Enqueue Service (GCS/GES) and cluster processing. Any block loss indicates a problem in network packet processing and should be investigated.
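
As a minimal sketch of the "any block loss should be investigated" rule, the Python below flags instances whose lost-block counters are non-zero. The input layout (a dictionary of per-instance counters) and the function names are assumptions made for illustration; only the two event names come from the text above.

```python
from typing import Dict, List

# Event names as quoted in the text; everything else here is hypothetical.
LOST_EVENTS = ("gc cr block lost", "gc current block lost")


def instances_with_lost_blocks(stats: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the instances (sorted by name) with any lost-block count above zero."""
    flagged = []
    for instance, counters in sorted(stats.items()):
        lost = sum(counters.get(name, 0) for name in LOST_EVENTS)
        if lost > 0:
            flagged.append(instance)
    return flagged


def cluster_lost_total(stats: Dict[str, Dict[str, int]]) -> int:
    """Aggregate the lost-block counters across all instances in the cluster."""
    return sum(
        counters.get(name, 0)
        for counters in stats.values()
        for name in LOST_EVENTS
    )
```

The counters would have to be collected separately, for example from an AWR or STATSPACK report; the sketch only shows the per-node and aggregate evaluation.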

The vast majority of escalations attributed to RDBMS global cache lost blocks can be directly related to faulty or misconfigured interconnects. This document serves as a guide for evaluating and investigating common (and sometimes obvious) causes.

Even though much of the discussion focuses on performance issues, these problems can also cause a node or instance eviction. Oracle Clusterware and Oracle RAC instances rely on heartbeats for node membership. If network heartbeats are consistently dropped, instance/node eviction may occur. The symptoms below are therefore also relevant for node/instance evictions.

Symptoms:

Primary:
  • "gc cr block lost" / "gc current block lost" among the top 5 wait events or otherwise significant
Secondary:

  • SQL traces report multiple gc cr requests / gc current requests / gc cr multiblock requests with long and uniform elapsed times
  • Poor application performance / throughput
  • Packet send/receive errors as displayed in ifconfig or a vendor-supplied utility
  • Netstat reports errors/retransmits/reassembly failures
  • Node failures and node integration failures
  • Abnormal CPU consumption attributed to network processing
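
The network-level symptoms above can be checked with a short script. The following is a sketch only, assuming a Linux host where the kernel exposes protocol counters in /proc/net/snmp as pairs of header and value lines; the function names are hypothetical, and other platforms need a different data source.

```python
from typing import Dict

# Counters that correspond to the symptoms listed above:
# reassembly failures, retransmits and receive errors.
WATCHED = (("Ip", "ReasmFails"), ("Tcp", "RetransSegs"), ("Udp", "InErrors"))


def parse_proc_net_snmp(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    result: Dict[str, Dict[str, int]] = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: "Proto: name1 name2 ..." then "Proto: val1 val2 ..."
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, vals = values.split(":", 1)
        if proto != value_proto:
            continue  # skip a pair that does not belong together
        result[proto] = {n: int(v) for n, v in zip(names.split(), vals.split())}
    return result


def network_symptoms(text: str) -> Dict[str, int]:
    """Return only the watched counters, defaulting to 0 when one is absent."""
    stats = parse_proc_net_snmp(text)
    return {f"{p}.{c}": stats.get(p, {}).get(c, 0) for p, c in WATCHED}
```

On a Linux node this could be called as network_symptoms(open("/proc/net/snmp").read()). The counters are cumulative since boot, so it is the difference between two samples taken some time apart that shows whether errors are still occurring.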

Troubleshooting 11.2 Clusterware Node Evictions

Troubleshooting 11.2 Clusterware Node Evictions (Note 1050693.1)

Starting with 11.2.0.2, a node eviction may not actually reboot the machine. This is called a rebootless restart.

To identify which process initiated a reboot, review the following important files:

  • Clusterware alert log in /log/alertnodename
  • The cssdagent log(s) in /log//agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in /log//agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in /log//cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data. IPD/OS is an old name for the Cluster Health Monitor. The names can be used interchangeably, although Oracle now calls the tool Cluster Health Monitor
  • 'opatch lsinventory -detail' output for the GRID home
  • OS message files, e.g. /var/log/messages
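
As a small illustration of the first collection step, the sketch below checks which of the two lastgasp locations named in the list exists on a node and lists its contents. The function name is hypothetical, and the other log locations are left out because they depend on the Grid home and node name.

```python
import os
from typing import Dict, List, Sequence

# The two lastgasp locations named in the list above.
LASTGASP_DIRS = ("/etc/oracle/lastgasp", "/var/opt/oracle/lastgasp")


def find_lastgasp_logs(dirs: Sequence[str] = LASTGASP_DIRS) -> Dict[str, List[str]]:
    """Map each existing directory to a sorted list of the entries it contains."""
    found: Dict[str, List[str]] = {}
    for directory in dirs:
        if os.path.isdir(directory):
            found[directory] = sorted(os.listdir(directory))
    return found
```

An empty result simply means that neither directory is present on the host where the script runs.
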
Common Causes of eviction:

OCSSD Eviction: 1) Network failure or latency between nodes. It takes 30 consecutive missed checkins to cause a node eviction. 2) Problems writing to or reading from the voting disk. 3) A member kill escalation: for example, the LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it could escalate to a node eviction.
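
The "30 consecutive missed checkins" rule can be pictured with a simplified model. This is a sketch for illustration only and not how CSS is implemented; the function name and the boolean input (True for a received checkin, False for a missed one) are assumptions.

```python
from typing import Iterable, Optional

MISSED_CHECKIN_LIMIT = 30  # consecutive missed checkins, per the text above


def eviction_index(
    checkins: Iterable[bool], limit: int = MISSED_CHECKIN_LIMIT
) -> Optional[int]:
    """Return the index of the checkin at which the limit is reached, else None."""
    missed = 0
    for i, received in enumerate(checkins):
        if received:
            missed = 0  # a single successful checkin resets the counter
        else:
            missed += 1
            if missed >= limit:
                return i
    return None
```

The point of the model is that the misses must be consecutive: 29 misses, one successful checkin, then 29 more misses do not reach the limit.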

CSSDAGENT or CSSDMONITOR Eviction: 1) An OS scheduler problem, for example because the OS is locked up or the server is under excessive load (such as CPU utilization as high as 100%). 2) The CSS process is hung. 3) An Oracle bug.