Tuesday 6 January 2009

Oracle 10.2 RAC reboottime tweakables


http://mengmark.spaces.live.com/blog/cns!9BA8E9209B123692!179.entry


Oracle Database 10g Release 2 CSS (Cluster Synchronization Service) parameters:

 

Different patch sets of Oracle Database 10g Release 2 provide
different timeout parameters, which CSS uses while accessing
storage data. This document covers the following Oracle Database
10g Release 2 patch-set versions:

1. Oracle Database 10.2.0.1

2. Oracle Database 10.2.0.1 + Patch for Bug 4896338

3. Oracle Database 10.2.0.2

4. Oracle Database 10.2.0.3

 

1. Oracle Database 10.2.0.1

 

Only one CSS parameter is available in this version of Oracle. It is
called misscount, and it represents both the maximum time in seconds
that a heartbeat can be missed before the cluster enters
reconfiguration to evict the node, and the maximum time allowed for a
voting file I/O to complete.

The default value for misscount is 60 seconds.

 

2. Oracle Database 10.2.0.1 + Patch 4896338 and Oracle Database 10.2.0.2

Bug 4896338, filed against Oracle Database 10.2.0.1, is a placeholder
bug for the PCW 10.2.0.1 merge for very low brownout. Please refer to
www.metalink.oracle.com for more details.

Oracle Database 10.2.0.2 has a fix for this bug.

There are three CSS parameters available in 10.2.0.2 and 10.2.0.1 + patch for bug 4896338; they are as follows:

 

a) misscount - The maximum time in seconds that a heartbeat can be
missed before the cluster enters a reconfiguration to evict the node.

 

b) disktimeout - The maximum amount of time allowed for a voting
file I/O to complete; if this time is exceeded, the voting disk is
marked as offline.

 

c) reboottime - It is the amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.

 

Default values for these parameters are as follows:

misscount = 60 seconds

disktimeout = 200 seconds

reboottime = 3 seconds

Running "crsctl get css disktimeout" or "crsctl get css reboottime"
does not show a value unless the parameter has been modified
explicitly. You can check the values in effect in ocssd.log under
the $CRS_HOME directory.

CRS internally calculates two parameters, diskshorttimeout and disklongtimeout (both can be checked in ocssd.log), where:

 

a) diskshorttimeout = misscount - reboottime: This value is used during
reconfiguration and initial cluster formation as the timeout for voting
file I/O to complete.

 

b) disklongtimeout = disktimeout: This value is used during normal
operation of RAC as the timeout for voting file I/O to complete.
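
The two derived values can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above for 10.2.0.2, not Oracle's implementation; the function name derived_timeouts and its keyword arguments are made up for this example.

```python
def derived_timeouts(misscount=60, disktimeout=200, reboottime=3):
    """Return (diskshorttimeout, disklongtimeout) in seconds.

    Hypothetical helper: it mirrors the 10.2.0.2 rules described in the
    text, with the documented defaults as keyword defaults.
    """
    # Used during reconfiguration and initial cluster formation.
    diskshorttimeout = misscount - reboottime
    # Used during normal operation of RAC.
    disklongtimeout = disktimeout
    return diskshorttimeout, disklongtimeout


print(derived_timeouts())               # all defaults
print(derived_timeouts(misscount=120))  # misscount raised to 120 seconds
```

With the defaults the short timeout works out to 57 seconds; raising misscount to 120 widens it to 117 seconds while the long timeout stays at 200.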

 

3. Oracle Database 10.2.0.3

 

This version has the same parameters as Oracle Database 10.2.0.2,
with the same default values. There is a slight difference in the
internal calculation of these parameter values: if disktimeout is
less than the misscount value, then during cluster formation and
throughout cluster operation misscount - reboottime is used as the
disk timeout, and the modified disktimeout parameter is ignored.

That is, in Oracle Database 10.2.0.3, diskshorttimeout = disklongtimeout if the css disktimeout parameter is less than css misscount.
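
As a rough sketch of the 10.2.0.3 behaviour, the rule can be written out in Python. The function name derived_timeouts_10203 is hypothetical, and the code only restates the rule given in this section; it is not taken from Oracle's code.

```python
def derived_timeouts_10203(misscount=60, disktimeout=200, reboottime=3):
    """Return (diskshorttimeout, disklongtimeout) in seconds for 10.2.0.3.

    Hypothetical helper restating the rule in the text: a disktimeout
    smaller than misscount is ignored in favour of misscount - reboottime.
    """
    diskshorttimeout = misscount - reboottime
    if disktimeout < misscount:
        # The configured disktimeout is ignored in this case.
        disklongtimeout = diskshorttimeout
    else:
        disklongtimeout = disktimeout
    return diskshorttimeout, disklongtimeout


print(derived_timeouts_10203())                                # defaults
print(derived_timeouts_10203(misscount=120, disktimeout=100))  # disktimeout < misscount
```

With the defaults nothing changes compared with 10.2.0.2. In the second call the configured disktimeout of 100 seconds is below misscount, so both derived values come out as 117 seconds.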

 

4. Recommendations for Oracle Database 10g Release 2 CSS parameter values to be used with NetApp storage:

 

Because diskshorttimeout = misscount - reboottime, leaving misscount
and reboottime at their default values of 60 seconds and 3 seconds
gives CSS a window of 57 seconds for accessing the voting file. If a
reconfiguration happens during the NetApp storage takeover or giveback
process, there is a chance of a CRS reboot taking place. The following
values are therefore recommended for the CSS timeout parameters so
that Oracle Database 10g Release 2 RAC works smoothly during NetApp
storage takeover and giveback:

1. Oracle Database 10.2.0.1

misscount = 120 seconds (default is 60 seconds)

2. Oracle Database 10.2.0.1 + Patch for Bug 4896338

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

3. Oracle Database 10.2.0.2

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

4. Oracle Database 10.2.0.3

misscount = 120 seconds (default is 60 seconds)

disktimeout = 200 seconds (default)

reboottime = 3 seconds (default)

All of the above recommendations are for the Linux operating system.
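
To see why raising misscount helps, the following Python sketch compares the voting-file window for the default and the recommended misscount against an I/O stall. The 90-second stall is a made-up figure for illustration only, not a measured NetApp takeover or giveback time, and the function name survives_stall is hypothetical.

```python
def survives_stall(stall_seconds, misscount=60, reboottime=3):
    """True if a voting-file I/O stall fits inside diskshorttimeout.

    Hypothetical helper: diskshorttimeout = misscount - reboottime,
    as described in the text.
    """
    return stall_seconds < misscount - reboottime


stall = 90  # assumed stall in seconds; replace with a value measured on your storage
for misscount in (60, 120):
    window = misscount - 3  # default reboottime of 3 seconds
    ok = survives_stall(stall, misscount=misscount)
    print(f"misscount={misscount}: window={window}s, survives {stall}s stall: {ok}")
```

Under this assumption the default window of 57 seconds is too short, while the recommended misscount of 120 seconds leaves a 117-second window.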

Note:
Stock versions of Oracle Database 10g Release 2 lower than 10.2.0.2
do not provide all the configurable CSS parameters; hence it is
advisable to upgrade Oracle Database to 10.2.0.2 or higher.

 

Appendix

Commands to check / modify CSS parameters:

1. crsctl get css misscount ---------- to check misscount value

2. crsctl get css disktimeout --------- to check disktimeout value

3. crsctl get css reboottime ---------- to check reboottime value

4. crsctl set css misscount 120 --------- to set misscount to 120 seconds

5. crsctl set css disktimeout 200 ------- to set disktimeout to 200 seconds

6. crsctl set css reboottime 3 ----------- to set reboottime to 3 seconds 
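
For scripting, the command lines above can be generated rather than typed by hand. The sketch below only builds and prints the command strings and does not run crsctl; the helper names crsctl_get and crsctl_set are invented for this example, and the only parameters accepted are the three named in this document.

```python
CSS_PARAMETERS = ("misscount", "disktimeout", "reboottime")


def crsctl_get(parameter):
    """Build the argument list for 'crsctl get css <parameter>'."""
    if parameter not in CSS_PARAMETERS:
        raise ValueError(f"unknown CSS parameter: {parameter}")
    return ["crsctl", "get", "css", parameter]


def crsctl_set(parameter, seconds):
    """Build the argument list for 'crsctl set css <parameter> <seconds>'."""
    if parameter not in CSS_PARAMETERS:
        raise ValueError(f"unknown CSS parameter: {parameter}")
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    return ["crsctl", "set", "css", parameter, str(seconds)]


# Recommended values from this document.
recommended = {"misscount": 120, "disktimeout": 200, "reboottime": 3}
for name, value in recommended.items():
    print(" ".join(crsctl_set(name, value)))
    print(" ".join(crsctl_get(name)))
```

The argument lists could be handed to subprocess.run on a cluster node, but that step is deliberately left out here because the set commands must be run as root.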


http://el-caro.blogspot.com/2006/10/case-study-on-how-to-diagnose-node.html


There have been a couple of additional CSS related parameters introduced in the
10.2.0.2 patchset to address long I/O requirements of storage vendors such as
EMC and NetApp.

• reboottime (default 3 seconds)
The amount of time allowed for a node to complete a reboot after the
CSS daemon has been evicted.
This parameter can be set via the command
crsctl set css reboottime R [-force] (R is seconds)

• disktimeout (default 200 seconds)
The maximum amount of time allowed for a voting file I/O to complete; if
this time is exceeded the voting disk will be marked as unavailable.
This parameter can be set via the command
crsctl set css disktimeout D [-force] (D is seconds)

These commands must be run as root on a node where the CRSD is up,
unless '-force' is specified, in which case the CRS stack should
not be up on any other node.
After the commands are run, all CSS daemons must be restarted for the
settings to take effect on all nodes as soon as possible.
You can verify the settings by doing an ocrdump and checking the
values in the OCRDUMPFILE.

 
