Author: Jack Estes (*nix forums beginner)
Joined: 11 Oct 2005
Posts: 14
Posted: Tue Oct 25, 2005 1:13 am
Post subject: Re: Trucluster (5.1a pk6) CDSL problems on cluster_root
Eric, while I completely agree with you, unfortunately, as an instrument
of policy, I haven't the authority to command an immediate change. I
have made known the tragic flaws in the previous configuration and have
taken some steps to prevent a recurrence.

In the end, there was just no saving the cluster with all the damage
done to the CDSLs and member boot disks. Since the data domains were
safe and sound on the SAN, I just made a completely new 5.1B-3 build
disk, created a brand new cluster, and reattached the old data domains.
All of the Oracle configuration files were in these domains, so there
were no special tweaks that had to be recreated on the new cluster.

I was able to remove about 95% of the user accounts (this is a database
server and really only requires sysadmin accounts and a service account
for the app servers to log on with). I was also able to disable telnetd,
allow only SSH2, and implement TCP wrappers for all network communications.
That should go pretty far, since the system is completely segregated
from the rest of the intranet and has no connection to the Internet.

My thanks to Tom Smith @ HP and Adam Price for valuable assistance.
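
For anyone wanting to double-check this kind of lockdown, here is a minimal sketch of an audit that flags inetd services that are still enabled but should not be, or that are not started through the TCP wrappers daemon. It assumes the classic inetd.conf field layout (service, socket type, protocol, wait flag, user, server path, arguments) and that the wrapper binary is named tcpd. The sample entries and paths are made up for illustration and are not taken from the cluster described above.

```python
"""Sketch: flag inetd.conf entries that are enabled but banned or unwrapped."""


def parse_inetd_conf(text):
    """Yield (service, server_path) for every enabled (uncommented) entry."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a service disabled by commenting it out
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete service entry; ignore it
        # Assumed layout: service type proto wait user server-path [args...]
        yield fields[0], fields[5]


def audit(text, banned=("telnet",), wrapper="tcpd"):
    """Return findings as strings; an empty list means nothing was flagged."""
    findings = []
    for service, server in parse_inetd_conf(text):
        if service in banned:
            findings.append(f"{service}: still enabled")
        elif server != "internal" and not server.endswith("/" + wrapper):
            findings.append(f"{service}: not wrapped by {wrapper}")
    return findings


# Hypothetical sample: telnet commented out, ftp wrapped, shell not wrapped.
SAMPLE = """\
#telnet stream tcp nowait root /usr/sbin/telnetd telnetd
ftp     stream tcp nowait root /usr/sbin/tcpd    ftpd
shell   stream tcp nowait root /usr/sbin/rshd    rshd
"""
```

Calling `audit(SAMPLE)` reports only the unwrapped shell entry. Feeding it the text of a real inetd.conf would need the banned list adjusted to local policy.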
Author: Eric de Redelijkheid (*nix forums addict)
Joined: 29 Mar 2005
Posts: 55
Posted: Tue Oct 11, 2005 6:46 pm
Post subject: Re: Trucluster (5.1a pk6) CDSL problems on cluster_root
Anno Domini 11-10-2005 7:21, Adam Price spoke thus:
| Quote: | On 10 Oct 2005 18:53:06 -0700, Jack Estes wrote:
[snip]
Your best bet is to recover it all from the backup tapes, but if for some
reason you can't do that, then I think you are stuck with a re-install.
If you need a bare-metal recovery guide, the one on the Legato
NetWorker site is quite good where it describes recovering a cluster. Even
though the recovery commands are obviously tied to their backup product,
they are generic enough that you should be able to adapt them and specific
enough that you should get all that you need.
That said, I would say the upgrade to 5.1B-3 is a good idea anyway.
Adam |
No, don't do anything.
Take your hands off it until department policies change:
- no root password for anyone but the system administrator
- no remote login as root; the system administrator is a member of group
  system and does su to root (which is logged in the event log)
- no ftp for root
- secure shell for clients
Your cluster is connected to a LAN. There is no way of knowing whether some
wannabe techie has attached a modem to his PC, so don't assume safety;
ensure it.
With the current policies in effect, you should not waste any time if
the same problem recurs in a few months because someone thinks he/she
knows something about Tru64 clusters. It's not your problem; it's the
department's problem for messing up.
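
Two of the policies listed above (no remote root login, no ftp for root) can be spot-checked mechanically. The sketch below is only an illustration under stated assumptions: it assumes an OpenSSH-style configuration where the first uncommented `PermitRootLogin` directive wins, and an ftpusers-style file that lists one denied account per line. Whether the SSH2 product on a given Tru64 system reads its configuration the same way is not confirmed here, and the function names are hypothetical.

```python
"""Sketch: spot-check two root-access policies from configuration text."""


def sshd_root_login_disabled(config_text):
    """True only if the first uncommented PermitRootLogin directive says 'no'.

    A missing directive counts as not disabled, since the default may
    permit root logins.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        parts = line.replace("=", " ").split()
        if not parts:
            continue
        if parts[0].lower() == "permitrootlogin":
            return len(parts) > 1 and parts[1].lower() == "no"
    return False


def ftp_denied_for(user, ftpusers_text):
    """True if `user` appears on its own line in ftpusers-style text."""
    for raw in ftpusers_text.splitlines():
        name = raw.split("#", 1)[0].strip()
        if name == user:
            return True
    return False


def policy_report(sshd_text, ftpusers_text):
    """Collect both checks into one dict for printing or logging."""
    return {
        "remote root login disabled": sshd_root_login_disabled(sshd_text),
        "ftp denied for root": ftp_denied_for("root", ftpusers_text),
    }
```

In practice the two arguments to `policy_report` would be the contents of the SSH daemon configuration and the ftp deny list, read from wherever the installed products keep them.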
Author: Adam Price (*nix forums beginner)
Joined: 31 May 2005
Posts: 23
Posted: Tue Oct 11, 2005 9:21 am
Post subject: Re: Trucluster (5.1a pk6) CDSL problems on cluster_root
On 10 Oct 2005 18:53:06 -0700, Jack Estes wrote:
| Quote: | [snip] |
Your best bet is to recover it all from the backup tapes, but if for some
reason you can't do that, then I think you are stuck with a re-install.
If you need a bare-metal recovery guide, the one on the Legato
NetWorker site is quite good where it describes recovering a cluster. Even
though the recovery commands are obviously tied to their backup product,
they are generic enough that you should be able to adapt them and specific
enough that you should get all that you need.
That said, I would say the upgrade to 5.1B-3 is a good idea anyway.
Adam
Author: Jack Estes (*nix forums beginner)
Joined: 11 Oct 2005
Posts: 14
Posted: Tue Oct 11, 2005 5:53 am
Post subject: Trucluster (5.1a pk6) CDSL problems on cluster_root
I think I'm screwed on this one, but I thought I'd post to the group to
see if anyone else has seen this.

I have a two-node TruCluster serving an Oracle 8i database and an
application originally written as single-instance, so CAA manages it.
The cluster interconnect is Memory Channel and storage is Fibre
Channel-connected dual HSG80s into 3 StorageWorks SCSI enclosures.

For reasons passing understanding, about 40 people know the root
password on this cluster and, by department policy, root is allowed to
log in remotely. I'm assuming that because this cluster is on a private
network serving only this department and is physically disconnected
from anything else, they think this is a good idea.

On to my problem: A user reluctantly admitted to me that he forgot his
user account password, so he went ahead and logged in as root from his
PC. Thinking he was in his home directory, he executed an mv on /,
expecting to move only the files in his home directory to another
NFS-mounted drive he used as a backup device. Shortly into this
process, I think he figured out he was getting the wrong stuff, killed
the command, then tried to mv the stuff back to /. In doing so, he
obviously destroyed all the CDSLs for both nodes on the cluster /,
/usr, /etc, and /var filesystems. Within about 10 minutes of this
catastrophe, a thermal breaker blew in the cabinet and unceremoniously
shut off power to both nodes. Neither node can boot now because, among
other things, the cluster database is unavailable to them and neither
node has a unique identity (there's just one /etc/* between them).

I CAN boot the first member via the old build disk on the internal bus
and see down the HSGs to the AdvFS domains on the other side, and
they're all good. In fact, I mounted them read-only by monkeying
around in /etc/fdmns on the build disk to make temp domains and
filesets, so I'm not worried about them. That's how I found out there
were no more CDSLs on the cluster filesystem. The quorum disk out there
is perfectly intact as well. The entire application and Oracle database
are on some filesets out on the SAN and the startup scripts are
manually executed, so I'm not worried about recreating anything other
than users (I know the old UIDs to make home directories match up right
away, etc...)

I'm wondering if there's a reliable way to recreate the links and put
things back where they belong (um, like the correct /etc/sysconfigtab),
maybe with some combination of clu_delete_member and clu_add_member, or
if I'm better off, time-wise, just finally upgrading to 5.1B-3 (I'm
licensed and have the media) by doing a fresh install and remounting
the AdvFS data domains on the SAN. LSM was not managing any disks.

Any help is gratefully accepted. Thanks!

Jack
________________________
Jack M. Estes II, Ph.D.
Cinergy Corporation
1000 Main Street
Plainfield, Indiana 46168
PHONE #s STRIPPED FOR USENET
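
As background for the check described in the post (finding that no CDSLs were left on the cluster filesystem): a CDSL is a symbolic link whose target contains the literal token `{memb}`, which the cluster resolves per member to a directory such as member1. Below is a minimal sketch of how a read-only mounted tree could be scanned for such links. The function names are hypothetical, and this is not a substitute for the vendor's own CDSL tools.

```python
"""Sketch: find symlinks that look like TruCluster CDSLs under a mounted tree."""
import os

MEMBER_TOKEN = "{memb}"  # literal token that marks a CDSL target


def is_cdsl_target(target):
    """True if a symlink target contains the member token."""
    return MEMBER_TOKEN in target


def expand_cdsl(target, member_id):
    """Show what a CDSL target resolves to for one member, e.g. member1."""
    return target.replace(MEMBER_TOKEN, f"member{member_id}")


def find_cdsls(root):
    """Return sorted (path, target) pairs for CDSL-style symlinks under root."""
    found = []
    # followlinks=False: links are inspected, never traversed.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.readlink(path)
                if is_cdsl_target(target):
                    found.append((path, target))
    return sorted(found)
```

Running `find_cdsls` against the mount point of a temporarily mounted cluster root fileset and getting an empty list back would match the situation in the post: the links are gone, and each former CDSL path holds a single shared copy.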