Oracle RAC 11g Overview
Before introducing the details for building a RAC cluster, it might be helpful to first clarify what a cluster is. A cluster is a group of two or more interconnected computers or servers that appear as if they are one server to end users and applications and generally share the same set of physical disks. The key benefit of clustering is to provide a highly available framework where the failure of one node (for example a database server running an instance of Oracle) does not bring down an entire application. In the case of failure with one of the servers, the other surviving server (or servers) can take over the workload from the failed server and the application continues to function normally as if nothing has happened.
The concept of clustering computers actually started several decades ago. The first successful cluster product, ARCnet, was developed by DataPoint in 1977. ARCnet enjoyed much success in academic research labs, but didn't really take off in the commercial market. It wasn't until the 1980s, when Digital Equipment Corporation (DEC) released its VAX cluster product for the VAX/VMS operating system, that clustering gained commercial traction.
With the release of Oracle 6 for the Digital VAX cluster product, Oracle was the first commercial database to support clustering at the database level. It wasn't long, however, before Oracle realized the need for a more efficient and scalable distributed lock manager (DLM), as the one included with the VAX/VMS cluster product was not well suited for database applications. Oracle decided to design and write its own DLM for the VAX/VMS cluster product, which provided the fine-grained block-level locking required by the database. Oracle's own DLM was included in Oracle 6.2, which gave birth to Oracle Parallel Server (OPS), the first database to run the parallel server.
By Oracle 7, OPS was extended to include support for not only the VAX/VMS cluster product but also most flavors of UNIX. This framework required vendor-supplied clusterware, which worked well but made for a complex environment to set up and manage given the multiple layers involved. By Oracle8, Oracle introduced a generic lock manager that was integrated into the Oracle kernel. In later releases of Oracle, this became known as the Integrated Distributed Lock Manager (IDLM) and relied on an additional layer known as the Operating System Dependent (OSD) layer. This new model paved the way for Oracle to not only have its own DLM, but to also create its own clusterware product in future releases.
Oracle Real Application Clusters (RAC), introduced with Oracle9i, is the successor to Oracle Parallel Server. Using the same IDLM, Oracle 9i could still rely on external clusterware but was the first release to include Oracle's own clusterware product, named Cluster Ready Services (CRS). With Oracle 9i, CRS was only available for Windows and Linux. By Oracle 10g Release 1, Oracle's clusterware product was available for all operating systems and was the required cluster technology for Oracle RAC. With the release of Oracle Database 10g Release 2 (10.2), Cluster Ready Services was renamed to Oracle Clusterware. When using Oracle 10g or higher, Oracle Clusterware is the only clusterware that you need for most platforms on which Oracle RAC operates (except for TruCluster, in which case you need vendor clusterware). You can still use clusterware from other vendors if the clusterware is certified, but keep in mind that Oracle RAC still requires Oracle Clusterware, as it is fully integrated with the database software. This guide uses Oracle Clusterware which, as of 11g Release 2 (11.2), is a component of Oracle Grid Infrastructure.
Like OPS, Oracle RAC allows multiple instances to access the same database (storage) simultaneously. RAC provides fault tolerance, load balancing, and performance benefits by allowing the system to scale out, and at the same time since all instances access the same database, the failure of one node will not cause the loss of access to the database.
At the heart of Oracle RAC is a shared disk subsystem. Each instance in the cluster must be able to access all of the data, redo log files, control files and parameter file for all other instances in the cluster. The data disks must be globally available in order to allow all instances to access the database. Each instance has its own redo log files and UNDO tablespace that are locally read-writeable. The other instances in the cluster must be able to access them (read-only) in order to recover that instance in the event of a system failure. The redo log files for an instance are only writeable by that instance and will only be read from another instance during system failure. The UNDO, on the other hand, is read all the time during normal database operation (e.g. for CR fabrication).
A big difference between Oracle RAC and OPS is the addition of Cache Fusion. With OPS, a request for data from one instance to another required the data to be written to disk first; only then could the requesting instance read that data (after acquiring the required locks). This process was called disk pinging. With Cache Fusion, data is passed along a high-speed interconnect using a sophisticated locking algorithm.
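The difference between disk pinging and Cache Fusion can be illustrated with a deliberately simple toy model. The sketch below is not Oracle code: the class and function names are hypothetical, locking is ignored entirely, and each block handoff is assumed to cost exactly one disk write plus one disk read under OPS, or one interconnect transfer under RAC.

```python
from dataclasses import dataclass


@dataclass
class TransferLog:
    """Counts the operations needed to hand blocks from one instance to another."""
    disk_writes: int = 0
    disk_reads: int = 0
    interconnect_transfers: int = 0


def ops_disk_ping(log: TransferLog) -> None:
    # OPS (assumed model): the holding instance writes the block to shared
    # disk, then the requesting instance reads it back from disk.
    log.disk_writes += 1
    log.disk_reads += 1


def rac_cache_fusion(log: TransferLog) -> None:
    # RAC (assumed model): the block travels cache-to-cache over the
    # interconnect, so no disk I/O is needed for the handoff itself.
    log.interconnect_transfers += 1


def simulate(handoff, requests: int) -> TransferLog:
    """Apply one handoff strategy to a number of inter-instance block requests."""
    log = TransferLog()
    for _ in range(requests):
        handoff(log)
    return log


if __name__ == "__main__":
    print("OPS:", simulate(ops_disk_ping, 1000))
    print("RAC:", simulate(rac_cache_fusion, 1000))
```

In this model every inter-instance request under OPS costs two disk operations, while under Cache Fusion the same request costs none. This is the intuition behind the performance gain, though real workloads are far more nuanced.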
Not all database clustering solutions use shared storage. Some vendors use an approach known as a Federated Cluster, in which data is spread across several machines rather than shared by all. With Oracle RAC, however, multiple instances use the same set of disks for storing data. Oracle's approach to clustering leverages the collective processing power of all the nodes in the cluster and at the same time provides failover security.
Pre-configured Oracle RAC solutions are available from vendors such as Dell, IBM and HP for production environments. This article, however, focuses on putting together your own Oracle RAC 11g environment for development and testing by using Linux servers and a low-cost shared disk solution: iSCSI.
Shared-Storage Overview
Today, fibre channel is one of the most popular solutions for shared storage. As mentioned earlier, fibre channel is a high-speed serial-transfer interface that is used to connect systems and storage devices in either point-to-point (FC-P2P), arbitrated loop (FC-AL), or switched topologies (FC-SW). Protocols supported by Fibre Channel include SCSI and IP. Fibre channel configurations can support as many as 127 nodes and have a throughput of up to 2.12 Gigabits per second in each direction, and 4.25 Gbps is expected.
Fibre channel, however, is very expensive. The fibre channel switch alone can start at around US$1,000. This does not even include the fibre channel storage array and high-end drives, which can reach prices of about US$300 for a single 36GB drive. A typical fibre channel setup, including fibre channel cards for the servers, costs roughly US$10,000, not counting the servers that make up the cluster.
A less expensive alternative to fibre channel is SCSI. SCSI technology provides acceptable performance for shared storage, but for administrators and developers who are used to GPL-based Linux prices, even SCSI can come in over budget, at around US$2,000 to US$5,000 for a two-node cluster.
Another popular solution is the Sun NFS (Network File System) found on a NAS. It can be used for shared storage but only if you are using a network appliance or something similar. Specifically, you need servers that guarantee direct I/O over NFS, TCP as the transport protocol, and read/write block sizes of 32K. See the Certify page on Oracle Metalink for supported Network Attached Storage (NAS) devices that can be used with Oracle RAC. Two key drawbacks have limited the benefits of using NFS and NAS for database storage: performance degradation and complex configuration requirements. Standard NFS client software (client systems that use the operating system provided NFS driver) is not optimized for Oracle database file I/O access patterns. With the introduction of Oracle 11g, a new feature known as Direct NFS Client integrates the NFS client functionality directly in the Oracle software. Through this integration, Oracle is able to optimize the I/O path between the Oracle software and the NFS server, resulting in significant performance gains. Direct NFS Client can simplify, and in many cases automate, the performance optimization of the NFS client configuration for database workloads. To learn more about Direct NFS Client, see the Oracle White Paper entitled "Oracle Database 11g Direct NFS Client".
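As a rough illustration of the NFS requirements above, the following sketch checks a Linux-style NFS mount option string for TCP transport and 32K read/write sizes. The function names are hypothetical and the sample option strings are assumptions for illustration only. Direct I/O is a property of the server and of how files are opened, so it is not checked here. Consult the Oracle documentation for the options certified on your platform.

```python
REQUIRED_BLOCK_SIZE = 32 * 1024  # the 32K read/write block size named in the text


def parse_mount_options(options: str) -> dict:
    """Turn 'rw,tcp,rsize=32768' into {'rw': None, 'tcp': None, 'rsize': '32768'}."""
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value if value else None
    return parsed


def check_nfs_options(options: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    opts = parse_mount_options(options)
    problems = []
    # TCP may be requested with the bare 'tcp' flag or with 'proto=tcp'.
    if "tcp" not in opts and opts.get("proto") != "tcp":
        problems.append("transport is not TCP")
    for key in ("rsize", "wsize"):
        if opts.get(key) != str(REQUIRED_BLOCK_SIZE):
            problems.append(f"{key} is not {REQUIRED_BLOCK_SIZE}")
    return problems


if __name__ == "__main__":
    # Sample option strings are illustrative assumptions, not certified settings.
    print(check_nfs_options("rw,bg,hard,tcp,rsize=32768,wsize=32768"))
    print(check_nfs_options("rw,udp,rsize=8192,wsize=32768"))
```

A check like this only validates what the client requested at mount time; the NFS server may still negotiate smaller transfer sizes.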
The shared storage that will be used for this article is based on iSCSI technology using a network storage server installed with Openfiler. This solution offers a low-cost alternative to fibre channel for testing and educational purposes, but given the low-end hardware being used, it should not be used in a production environment.
iSCSI Technology
For many years, the only technology that existed for building a network based storage solution was a Fibre Channel Storage Area Network (FC SAN). Based on an earlier set of ANSI protocols called Fiber Distributed Data Interface (FDDI), Fibre Channel was developed to move SCSI commands over a storage network.
The advantages of an FC SAN include greater performance, increased disk utilization, improved availability, better scalability, and, most important to us, support for server clustering! Still today, however, FC SANs suffer from three major disadvantages. The first is price. While the costs involved in building an FC SAN have come down in recent years, the cost of entry still remains prohibitive for small companies with limited IT budgets. The second is incompatible hardware components. Since its adoption, many product manufacturers have interpreted the Fibre Channel specifications differently from each other, which has resulted in scores of interconnect problems. When purchasing Fibre Channel components from a common manufacturer, this is usually not a problem. The third disadvantage is the fact that a Fibre Channel network is not Ethernet! It requires a separate network technology along with a second set of skills that need to exist within the data center staff.
With the popularity of Gigabit Ethernet and the demand for lower cost, Fibre Channel has recently been given a run for its money by iSCSI-based storage systems. Today, iSCSI SANs remain the leading competitor to FC SANs.
Ratified on February 11, 2003 by the Internet Engineering Task Force (IETF), the Internet Small Computer System Interface, better known as iSCSI, is an Internet Protocol (IP)-based storage networking standard for establishing and managing connections between IP-based storage devices, hosts, and clients. iSCSI is a data transport protocol defined in the SCSI-3 specifications framework and is similar to Fibre Channel in that it is responsible for carrying block-level data over a storage network. Block-level communication means that data is transferred between the host and the client in chunks called blocks. Database servers depend on this type of communication (as opposed to the file-level communication used by most NAS systems) in order to work properly. Like an FC SAN, an iSCSI SAN should be a separate physical network devoted entirely to storage; its components, however, can be much the same as in a typical IP network (LAN).
While iSCSI has a promising future, many of its early critics were quick to point out some of its inherent shortcomings with regard to performance. The beauty of iSCSI is its ability to utilize an already familiar IP network as its transport mechanism. The TCP/IP protocol, however, is very complex and CPU intensive. With iSCSI, most of the processing of the data (both TCP and iSCSI) is handled in software and is much slower than Fibre Channel, which is handled completely in hardware. The overhead incurred in mapping every SCSI command onto an equivalent iSCSI transaction is excessive. For many, the solution is to do away with iSCSI software initiators and invest in specialized cards that can offload TCP/IP and iSCSI processing from a server's CPU. These specialized cards are sometimes referred to as an iSCSI Host Bus Adaptor (HBA) or a TCP Offload Engine (TOE) card. Also consider that 10-Gigabit Ethernet is a reality today!
As with any new technology, iSCSI comes with its own set of acronyms and terminology. For the purpose of this article, it is only important to understand the difference between an iSCSI initiator and an iSCSI target.
iSCSI Initiator
Basically, an iSCSI initiator is a client device that connects and initiates requests to some service offered by a server (in this case an iSCSI target). The iSCSI initiator software will need to exist on each of the Oracle RAC nodes (racnode1 and racnode2).
An iSCSI initiator can be implemented using either software or hardware. Software iSCSI initiators are available for most major operating system platforms. For this article, we will be using the free Linux Open-iSCSI software driver found in the iscsi-initiator-utils RPM. The iSCSI software initiator is generally used with a standard network interface card (NIC) — a Gigabit Ethernet card in most cases. A hardware initiator is an iSCSI HBA (or a TCP Offload Engine (TOE) card), which is basically just a specialized Ethernet card with a SCSI ASIC on-board to offload all the work (TCP and SCSI commands) from the system CPU. iSCSI HBAs are available from a number of vendors, including Adaptec, Alacritech, Intel, and QLogic.
iSCSI Target
An iSCSI target is the "server" component of an iSCSI network. This is typically the storage device that contains the information you want and answers requests from the initiator(s). For the purpose of this article, the node openfiler1 will be the iSCSI target.
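To make the initiator/target relationship more concrete, the sketch below builds (but does not run) the two `iscsiadm` command lines an Open-iSCSI initiator would typically use: one to discover the targets offered by a portal and one to log in to a target. The host name openfiler1 comes from this article. The target name (IQN) is a made-up placeholder, and the helper function names are hypothetical.

```python
ISCSI_DEFAULT_PORT = 3260  # well-known TCP port for iSCSI

# Hypothetical target name used purely for illustration; a real IQN is
# reported by the target during discovery.
EXAMPLE_TARGET_IQN = "iqn.2006-01.com.example:storage.vol1"


def discovery_command(target_host: str, port: int = ISCSI_DEFAULT_PORT) -> list:
    """Command an initiator runs to ask a portal which targets it offers."""
    return ["iscsiadm", "-m", "discovery", "-t", "sendtargets",
            "-p", f"{target_host}:{port}"]


def login_command(target_iqn: str, target_host: str,
                  port: int = ISCSI_DEFAULT_PORT) -> list:
    """Command an initiator runs to log in to one discovered target."""
    return ["iscsiadm", "-m", "node", "-T", target_iqn,
            "-p", f"{target_host}:{port}", "--login"]


if __name__ == "__main__":
    # Each RAC node (the initiators) would run these against the target.
    print(" ".join(discovery_command("openfiler1")))
    print(" ".join(login_command(EXAMPLE_TARGET_IQN, "openfiler1")))
```

The commands are only assembled as argument lists here so the example can be read and tested without root privileges or a live target. On a real node they would be run as root after the iscsi-initiator-utils package is installed.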
So with all of this talk about iSCSI, does this mean the death of Fibre Channel anytime soon? Probably not. Fibre Channel has clearly demonstrated its capabilities over the years with its capacity for extremely high speeds, flexibility, and robust reliability. Customers who have strict requirements for high performance storage, large complex connectivity, and mission critical reliability will undoubtedly continue to choose Fibre Channel.
Before closing out this section, I thought it would be appropriate to present the following chart that shows speed comparisons of the various types of disk interfaces and network technologies. For each interface, I provide the maximum transfer rates in kilobits (Kb), kilobytes (KB), megabits (Mb), megabytes (MB), gigabits (Gb), and gigabytes (GB) per second.
Disk Interface / Network / Bus | Kb/s | KB/s | Mb/s | MB/s | Gb/s | GB/s |
Serial | 115 | 14.375 | 0.115 | 0.014 | | |
Parallel (standard) | 920 | 115 | 0.92 | 0.115 | | |
10Base-T Ethernet | | | 10 | 1.25 | | |
IEEE 802.11b wireless Wi-Fi (2.4 GHz band) | | | 11 | 1.375 | | |
USB 1.1 | | | 12 | 1.5 | | |
Parallel (ECP/EPP) | | | 24 | 3 | | |
SCSI-1 | | | 40 | 5 | | |
IEEE 802.11g wireless WLAN (2.4 GHz band) | | | 54 | 6.75 | | |
SCSI-2 (Fast SCSI / Fast Narrow SCSI) | | | 80 | 10 | | |
100Base-T Ethernet (Fast Ethernet) | | | 100 | 12.5 | | |
IDE | | | 133.6 | 16.7 | | |
Fast Wide SCSI (Wide SCSI) | | | 160 | 20 | | |
Ultra SCSI (SCSI-3 / Fast-20 / Ultra Narrow) | | | 160 | 20 | | |
Ultra IDE | | | 264 | 33 | | |
Wide Ultra SCSI (Fast Wide 20) | | | 320 | 40 | | |
Ultra2 SCSI | | | 320 | 40 | | |
FireWire 400 - (IEEE1394a) | | | 400 | 50 | | |
USB 2.0 | | | 480 | 60 | | |
Wide Ultra2 SCSI | | | 640 | 80 | | |
Ultra3 SCSI | | | 640 | 80 | | |
FireWire 800 - (IEEE1394b) | | | 800 | 100 | | |
ATA/100 (parallel) | | | 800 | 100 | | |
Gigabit Ethernet | | | 1000 | 125 | 1 | |
PCI - (33 MHz / 32-bit) | | | 1064 | 133 | 1.064 | |
Serial ATA I - (SATA I) | | | 1200 | 150 | 1.2 | |
Wide Ultra3 SCSI | | | 1280 | 160 | 1.28 | |
Ultra160 SCSI | | | 1280 | 160 | 1.28 | |
PCI - (33 MHz / 64-bit) | | | 2128 | 266 | 2.128 | |
PCI - (66 MHz / 32-bit) | | | 2128 | 266 | 2.128 | |
AGP 1x - (66 MHz / 32-bit) | | | 2128 | 266 | 2.128 | |
Serial ATA II - (SATA II) | | | 2400 | 300 | 2.4 | |
Ultra320 SCSI | | | 2560 | 320 | 2.56 | |
FC-AL Fibre Channel | | | 3200 | 400 | 3.2 | |
PCI-Express x1 - (bidirectional) | | | 4000 | 500 | 4 | |
PCI - (66 MHz / 64-bit) | | | 4256 | 532 | 4.256 | |
AGP 2x - (133 MHz / 32-bit) | | | 4264 | 533 | 4.264 | |
Serial ATA III - (SATA III) | | | 4800 | 600 | 4.8 | |
PCI-X - (100 MHz / 64-bit) | | | 6400 | 800 | 6.4 | |
PCI-X - (133 MHz / 64-bit) | | | | 1064 | 8.512 | 1 |
AGP 4x - (266 MHz / 32-bit) | | | | 1066 | 8.528 | 1 |
10G Ethernet - (IEEE 802.3ae) | | | | 1250 | 10 | 1.25 |
PCI-Express x4 - (bidirectional) | | | | 2000 | 16 | 2 |
AGP 8x - (533 MHz / 32-bit) | | | | 2133 | 17.064 | 2.1 |
PCI-Express x8 - (bidirectional) | | | | 4000 | 32 | 4 |
PCI-Express x16 - (bidirectional) | | | | 8000 | 64 | 8 |
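The columns in the chart are related by simple arithmetic: 8 bits per byte and, as the chart's own figures imply, decimal prefixes (1 Gb = 1000 Mb). A minimal sketch of that conversion follows; the function name is hypothetical.

```python
def rates_from_megabits(mbps: float) -> dict:
    """Derive the chart's six columns from a rate given in megabits per second.

    Uses decimal prefixes (1000 per step) and 8 bits per byte, matching the
    figures in the chart (for example, Gigabit Ethernet: 1000 Mb = 125 MB = 1 Gb).
    """
    return {
        "Kb": mbps * 1000,      # kilobits per second
        "KB": mbps * 1000 / 8,  # kilobytes per second
        "Mb": mbps,             # megabits per second
        "MB": mbps / 8,         # megabytes per second
        "Gb": mbps / 1000,      # gigabits per second
        "GB": mbps / 8000,      # gigabytes per second
    }


if __name__ == "__main__":
    for name, mbps in [("USB 2.0", 480), ("Gigabit Ethernet", 1000),
                       ("10G Ethernet", 10000)]:
        rates = rates_from_megabits(mbps)
        print(f"{name}: {rates['MB']} MB/s, {rates['Gb']} Gb/s")
```

Keep in mind that these are theoretical maximum signalling rates. Protocol overhead (for example, TCP/IP and iSCSI headers on an Ethernet link) means usable throughput is lower.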