AIX Ver. 4 SysAdmin IV: Storage Management (Unit 7) – Mirroring and Quorum (2 of 2)

Error Reporter Components and Data Flow
————————————————————–
When the system is booted, errdemon starts and keeps running for as long as the system is up. It records problems in the system, including hardware, software, and undetermined errors as well as operator-generated errors.

errdemon – regularly checks the “in-memory buffer” in “/dev/error” for problems and matches them against the error templates in “errtmplt” by error ID. It then looks in the ODM for information on the resource that has the error and checks “errnotify” for notify methods. The errors are then stored in “/var/adm/ras/errlog”. You can use “errpt” to look at these errors.

Notify methods defined in the ODM can let you know when errors occur, for example by mail.

Error Attributes and Options of errpt (1 of 2)
—————————————————————
Long listing: errpt -a
Error Label:
– Unique name identifying the kind of error (for example, DISK_ERR4)
– errpt -J label

Error Id:
– Unique hexadecimal identifier (1:1 correspondence to an error label)
– errpt -j id

Error Class:
– Hardware, Software, Operator Message or Undetermined
– errpt -d { H | S | O | U }

Error Type:
– PERManent, TEMPorary, PERFormance degradation, imPENDing loss of availability, UNKNown, INFOrmation only
– errpt -T type

errlogger “error message” – sends the message to the error log

Error Attributes and Options of errpt (2 of 2)
—————————————————————
Resource Class:
– Class name of the device affected as seen with lsdev -PH
Ex. disk, tape, adapter
– errpt -S resclass

Resource Type:
– Type name of the device affected as seen with lsdev -PH
Ex. 1000mb, 2000mb16bitde, 8mm, 4mm2gb2, tokenring
– errpt -R restype

Resource Name:
– The name of the device affected as seen with lsdev -C
Ex. hdisk0, hdisk1, rmt0, tok0, ent1
– errpt -N resname

Further Options: time range, sequence number, and so forth.

Using errpt: Examples
———————————
errpt -d H
– All hardware errors.

errpt -S adapter -R ascsi,vscsi
– All errors of fast&wide SCSI adapters or their protocol devices.

errpt -S disk
– All disk errors

errpt -S disk -R array
– All errors of RAIDiant LUNs

errpt -N hdisk3
– All errors of hdisk3

errpt -N LVDD
– All errors logged by the LVM

Disk-Related Errors
—————————-
DISK_ERR1
– Failure of physical volume media (heads crashed or “glued” on the surface)
– class=H, type=PERM

Note: This is normal wear and tear; the disk needs to be replaced.

DISK_ERR2 and DISK_ERR3
– Device does not respond (failure of power supply or disk’s own controller)
– class=H, type=PERM

DISK_ERR4
– Hardware bad block relocation occurred
– class=H, type=TEMP

Note: You should monitor for this error, since only a limited amount of space is set aside for relocating bad blocks. When that space fills up, you lose the disk. Replace the disk before that happens.
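
The monitoring advice above can be scripted. What follows is a minimal sketch, not a standard AIX tool: the function names count_errpt_entries and check_disk_err4 and the threshold of 5 are assumptions chosen for illustration. It counts the summary lines that errpt -J DISK_ERR4 prints and compares the count with a limit.

```shell
#!/bin/sh
# Sketch only: function names and the threshold are hypothetical.

# Count error entries in errpt summary output read from stdin.
# The header line (first field "IDENTIFIER") and blank lines are skipped.
count_errpt_entries() {
    awk 'NF > 0 && $1 != "IDENTIFIER" { n++ } END { print n + 0 }'
}

# Print a warning when the number of entries reaches the given limit.
#   $1 = number of entries found, $2 = limit
check_disk_err4() {
    count=$1
    limit=$2
    if [ "$count" -ge "$limit" ]; then
        echo "WARNING: $count DISK_ERR4 entries logged, plan a disk replacement"
    else
        echo "OK: $count DISK_ERR4 entries logged"
    fi
}

# On an AIX system (not executed here):
#   check_disk_err4 "$(errpt -J DISK_ERR4 | count_errpt_entries)" 5
```

The counting is kept separate from the errpt call so the same function works on saved errpt output. A sensible limit depends on the disk and is left to the administrator.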

SCSI_ERR*
– Communication problems
– Possible reasons: cable length; cable quality; bad, missing or extraneous terminators; duplicate ids.

Error Log Entries generated by LVM
—————————————————–
LVM_SA_PVMISS
– Physical volume declared missing
– Class=H, type=UNKN

LVM_MISSPVADDED
– Physical volume defined as missing
– Class=S, type=UNKN

LVM_SA_STALEPP
– Physical partition marked stale
– Class=S, type=UNKN

LVM_SA_QUORCLOSE
– Quorum lost, volume group closing
– Class=H, type=UNKN

errpt Sample Output (same as smit – Error Summary)
—————————————————————————–
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION

errpt -a Sample Output (same as smit – Error Detail)
—————————————————————————————
LABEL:
IDENTIFIER
etc….

Error Notification
————————-
Concept:
ODM class errnotify: key to dynamic reactions to errors
Contains fields corresponding to errpt options
Your own methods should have a symbolic name
A single error can trigger more than one entry
Parameters ($1-$9) allow creation of generic methods
No smit support except on HACMP server nodes

Possible Applications
Write a message to the console, send E-mail or call a pager
Create an SNMP trap (see trapgend in TME10 Netview)
Unconfigure a failed device and/or initiate a failover
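
As an illustration of the concept, a notify object can be written as an ODM stanza and loaded with odmadd. The sketch below is an assumption-laden example, not taken from the course material: the helper name make_errnotify_stanza, the choice to mail root, and the file name in the usage comment are all hypothetical. The descriptors en_name, en_persistenceflg, en_label and en_method belong to the errnotify class; in the method, $1 is the sequence number of the log entry and $9 is the error label.

```shell
#!/bin/sh
# Sketch only: make_errnotify_stanza is a hypothetical helper, not an AIX command.

# Print an errnotify stanza that mails the detailed log entry to root
# whenever an error with the given label is logged.
#   $1 = symbolic name of the notify object (en_name)
#   $2 = error label to react to (en_label)
make_errnotify_stanza() {
    name=$1
    label=$2
    # \$1 and \$9 are escaped so they reach the stanza literally;
    # errdemon substitutes them when it runs the method.
    cat <<EOF
errnotify:
        en_name = "$name"
        en_persistenceflg = 1
        en_label = "$label"
        en_method = "errpt -a -l \$1 | mail -s \$9 root"
EOF
}

# On an AIX system (not executed here):
#   make_errnotify_stanza disk_err4_mail DISK_ERR4 > /tmp/disk_err4.add
#   odmadd /tmp/disk_err4.add
```

Setting en_persistenceflg to 1 keeps the object across reboots. An object added this way can be removed again with odmdelete, using the symbolic name as the selection criterion.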

Adding Your Own Notify Methods
————————————————–
smit – Add a Notification Method

diag-Panels for Disk Management
————————————————–
Hints:
Disk-related tasks can be found in the “Service Aids” menu
The menu structure is slightly different in older releases
diag is not available on older PCI desktop systems

What can you do with diag?
Initiate a low-level format and/or surface certification
Physically erase all data on a disk
Copy a disk sector-by-sector (including PVID)
Alter the contents of individual disk sectors
Send a test transmission to a specific SCSI address

Formatting Media
—————————
diag – Service Aids

Physical Disk Copy
—————————-
diag – Service Aids

Considerations before Disk Replacement
————————————————————
Disk still working?
– Error rate
– Age of last backup (Can you do a backup now?)

Mirror copy available?
How much downtime allowed (if any)?
Free capacity in VG?
Hardware
– Type of disk (SCSI, SSA)
– Hot swap possible?
– Free slots in cabinet?
– Cable length at its limit?
– Suitable cables available?
– Unused SCSI ids?

Key Techniques of Disk Replacement
——————————————————-
If the disk is still working
– Disk-to-Disk copy via diag
– migratepv or replacepv

If there is a working mirror copy
– Remove mirror copies from failed disk
– Remove failed disk from VG and add new disk
– Create new mirror copies
– May also be automated using replacepv

Otherwise
– Remove failed disk from VG and add new disk
– Restore a backup

Migrating Logical Volumes
————————————–
Syntax:
migratepv [ -l LVname ] FromPV toPV…

Effect:
Creates a temporary mirror of the data you want to move. Once the copy is complete, the mirror is broken and the data in the old location is deleted.

Note: If the disk holds hd5, the boot logical volume, you have to move hd5 first and then run bosboot to rebuild the boot logical volume so that the boot locations are updated. You also have to change the bootlist.
Thereafter, you can run migratepv again to move everything else.

General Disk Replacement Procedure
———————————————————
Remove:
1. Deallocate partitions:
migratepv (if possible)
rmlvcopy LVname 1 hdiskX
Note: This turns off mirroring. 1 is the number of copies that remain.

2. Remove the disk from the VG:
reducevg -d VGname hdiskX
Note: Removes the information from the LVM.

3. Delete the disk at device level
rmdev -d -l hdiskX
rmdev -d -l pdiskY
Note: This removes the information from the ODM.

4. Physically disconnect the drive.

Add:
1. Physically connect the drive.

2. Run cfgmgr. It creates the ODM entries at device level; the devices may be renamed for consistency reasons.

3. Add the disk to a VG:
extendvg VGname hdiskN

4. Allocate partitions:
mklv, extendlv or mklvcopy LVname 2 hdiskN (2 is the total number of copies)
Note: Turns mirroring back on.

replacepv Command (System does most of the work for you)
—————————————————————————————–
Syntax:
replacepv [ -f ] [ -R wrkdir ] hdiskOld hdiskNew

Actions taken:
– extendvg [ -f ] VGname hdiskNew
– Migrate all LVs or LV copies on hdiskOld to hdiskNew
– reducevg VGname hdiskOld

Features and Limitations:
– Recovery (option -R) in case of an intermittent crash
– Dump devices are reset and set again
– Message issued if bosboot and bootlist are necessary
– No unmirrored LVs on hdiskOld if VG is at its limit of disks

replacepv Example
—————————
# lspv
hdisk0 … rootvg
hdisk1 … None

# replacepv hdisk0 hdisk1
Messages
If the system crashes before this process is finished, run the following command to bring everything back to its original state:
replacepv -R /tmp/replacepv10888
Note: The replacepv command will even handle the hd5 logical volume issue.

# lspv
hdisk0 … None
hdisk1 … rootvg

# bootlist -m normal -o
hdisk0

# bosboot -a -d /dev/hdisk1
bosboot: Boot image is 6579 512 byte blocks

# bootlist -m normal hdisk1
#
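
The before and after lspv listings in this example can also be checked by a script. The helper below is a hedged sketch: the name pv_volume_group is hypothetical, and it assumes the lspv layout of disk name, PVID and volume group name, so that the volume group is the third field.

```shell
#!/bin/sh
# Sketch only: pv_volume_group is a hypothetical helper.

# Read lspv output on stdin and print the volume group of one disk.
# Assumes the layout "name PVID VGname", so the VG is the third field.
#   $1 = disk name, for example hdisk1
pv_volume_group() {
    awk -v disk="$1" '$1 == disk { print $3 }'
}

# On an AIX system (not executed here):
#   lspv | pv_volume_group hdisk1
```

After a successful replacepv hdisk0 hdisk1 on the system shown above, the call for hdisk1 would be expected to print rootvg and the call for hdisk0 to print None.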

Troubleshooting Disk Replacement
—————————————————
Situation 1:
– Physical disk has been replaced
– No system administration had taken place before (someone forgot to run reducevg)

Corrective Action:
– If possible: deactivate the new disk physically
– Perform removal steps 1-3 as described before
– Add the new disk to the system (steps 1-4)

If rmdev -d or diag -a has been run …
– You will probably get “unable to find device id …”
– Use the id displayed with high level commands in place of the disk’s name.
– or: ldeletepv -g VGid -p PVid

Example: reducevg VGname PVid (the PVid takes the place of hdisk#)

Migrating rootvg
————————
Special handling is required for BLV and bootlist:
– hd5 requires separate invocation of migratepv
– bosboot -a -d /dev/hdiskDest is also needed.
– bootlist -m normal hdiskDest
Note: Then you run migratepv again to move everything else over.

Dump devices:
– Easiest way is to disable the dump device before migration
sysdumpdev -p /dev/sysdumpnull
– Reactivate the dump device after migration:
sysdumpdev -p /dev/dumplv (meaning hd6)
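
Putting the BLV, bootlist and dump device steps in order gives roughly the sequence below. This is a sketch under stated assumptions rather than a tested procedure: the helper name plan_rootvg_migration is hypothetical, the destination disk is assumed to belong to rootvg already, and the dump logical volume name is passed in by the caller. The function only prints the commands for review; it does not run them.

```shell
#!/bin/sh
# Sketch only: plan_rootvg_migration is a hypothetical helper. It prints the
# commands in order so they can be reviewed; it does not execute them.
#   $1 = source disk
#   $2 = destination disk (assumed to be a member of rootvg already)
#   $3 = dump logical volume to restore as primary dump device
plan_rootvg_migration() {
    src=$1
    dst=$2
    dumplv=$3
    cat <<EOF
sysdumpdev -p /dev/sysdumpnull
migratepv -l hd5 $src $dst
bosboot -a -d /dev/$dst
bootlist -m normal $dst
migratepv $src $dst
sysdumpdev -p /dev/$dumplv
EOF
}

# Example: plan_rootvg_migration hdisk0 hdisk1 hd6
```

Printing instead of executing keeps the sketch safe to try on any system; each printed line corresponds to one of the steps listed in this section.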

Total VG Failure
————————
If all disks of a VG have been lost
– exportvg VGname (removes it from the ODM)
– Check if /etc/filesystems needs cleanup (regarding the VGname)
– Connect new disks
– Recreate VG, LVs and filesystems as they were before
– Restore a backup

Miscellaneous Problems
————————————
Unable to find device id hexnumber in device configuration database
– Review your command. It may be just a typo.

If hexnumber is non-zero:
– Most likely the result of an “unattended” disk replacement
– ODM corruption (ID mismatch)
To find the correct PVID if it is not in the ODM, use lqueryvg to read it directly from the disk.
Enter the PVID back into the ODM, padded with 16 trailing zeros to 32 hex digits.

If hexnumber is zero:
– System crash or shutdown while running reorgvg or migratepv
– ODM corruption (less likely)
Come up in maintenance mode if necessary to discover the problem

Question marks in lspv command output
– ODM corruption
– LVCB was overwritten
– Errors in /etc/filesystems

LAB 5: Mirroring and Quorum
——————————————-
1. Collect information about your system
lsvg -p lab5vg
lsvg -o (make sure that the volume group is varied on)
lspv (look at the disks on the system)

5. df
file system is mounted and spans three disks
lsvg -o
umount /lab5fs
lsvg -o
varyoffvg lab5vg

diag
Task Selection
SSA Service Aids
Set or Reset Service Mode

varyonvg lab5vg (Use Ctrl-C to stop since it won’t work now)
varyonvg -f lab5vg (will varyon two of the disks)
hdisk7 … PVACTIVE
hdisk5 … PVMISSING
hdisk6 … PVACTIVE

errpt log

varyoffvg lab5vg
Set or Reset Service Mode
varyonvg lab5vg (didn’t have to force it this time)

mount /home/lab5
df (filesystem is mounted)

Part 2
Part 3
Part 4

lsvg
lspv
replacepv hdisk4 hdisk14
lspv (hdisk14 now has replvg volume group)

Check Point Questions:
————————————
1. What is necessary to reintegrate a previously failed disk with open mirrored LVs into production? varyonvg
2. Mirror write consistency is on by default for all types of logical volumes? FALSE
It is off for paging space but on for everything else.
3. Quorum checking should be turned off in general to improve data availability? FALSE
Turn it off if you are mirroring a volume group. Otherwise, leave it on.
4. How can you conclude from entries in the error log that a drive will need to be replaced soon? Many DISK_ERR4 entries
5. Which logical volumes require special handling during a rootvg migration? hd5 – boot logical volume, hd6 – Dump device
6. What should be deleted first when removing a failed disk from the system: VG membership or hdisk device? volume group membership (reducevg)
