Trucluster (5.1a pk6) CDSL problems on cluster_root
Jack Estes
Posted: Tue Oct 25, 2005 1:13 am

Eric, while I completely agree with you, I am unfortunately an instrument
of policy and haven't the authority to command an immediate change. I
have made known the tragic flaws in the previous configuration and have
taken some steps to prevent a recurrence.

In the end, there was just no saving the cluster with all the damage
done to the CDSLs and member boot disks. Since the data domains were
safe and sound on the SAN, I just made a completely new 5.1B-3 build
disk, created a brand new cluster, and reattached the old data domains.
All of the Oracle configuration files were in these domains, so there
were no special tweaks that had to be recreated on the new cluster. I
was able to remove about 95% of the user accounts (this is a database
server and really only requires sysadmin accounts and a service account
for the app servers to log on with). I was able to disable telnetd, allow
only SSH2, and implement TCP wrappers for all network communications.
That should go pretty far, since the system is completely segregated
from the rest of the intranet and has no connection to the Internet.
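
For illustration only, a default-deny TCP wrappers policy of the sort mentioned above might be generated like this. The `daemon_list: client_list` line format and the `ALL: ALL` catch-all are standard `hosts.allow`/`hosts.deny` syntax; the function name, the daemon name `sshd2`, and the `10.1.2.` subnet prefix are made-up placeholders, not values from this cluster.

```python
def render_wrappers(daemon, allowed_prefixes):
    """Return (hosts_allow_text, hosts_deny_text) for a default-deny policy.

    daemon           -- service name as the wrapper sees it (placeholder here)
    allowed_prefixes -- client patterns, e.g. "10.1.2." matches that subnet
    """
    if not allowed_prefixes:
        raise ValueError("refusing to write a policy that locks everyone out")
    # hosts.allow: one rule letting the listed clients reach the daemon.
    hosts_allow = f"{daemon}: {' '.join(allowed_prefixes)}\n"
    # hosts.deny: everything not explicitly allowed is refused.
    hosts_deny = "ALL: ALL\n"
    return hosts_allow, hosts_deny
```

Calling `render_wrappers("sshd2", ["10.1.2."])` returns the two file bodies as strings; writing them out is left to the admin so nothing is changed by accident. Only services that run under the wrapper or are linked against it would honour these files.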

My thanks to Tom Smith @ HP and Adam Price for valuable assistance.
Eric de Redelijkheid
Posted: Tue Oct 11, 2005 6:46 pm

On 11-10-2005 at 7:21, Adam Price wrote:

Quote:
Your best bet is to recover it all from the backup tapes, but if for some
reason you can't do that then I think you are stuck with a re-install.
If you need a bare metal recovery guide, the one on the Legato
NetWorker site is quite good when it describes recovering a cluster. Even
though the recovery commands are obviously tied to their backup product,
they are generic enough that you should be able to adapt them and specific
enough that you should get all that you need.
That said, I would say the upgrade to 5.1B-3 is a good idea anyway.
Adam


No, don't do anything.


Take your hands off it until department policies change:

- no root password for anyone but the system administrator
- no remote login as root; the system administrator is a member of group
system and does su to root (which is recorded in the event log)
- no ftp for root
- secure shell for clients

Your cluster is connected to a LAN. There is no way of knowing whether some
wannabe techie has attached a modem to his PC, so don't assume safety;
ensure it.

With the current policies in effect, you should not waste any time if
the same problem recurs in a few months because someone thinks he/she
knows something about Tru64 clusters. It's not your problem, it's the
department's problem for messing up.
Adam Price
Posted: Tue Oct 11, 2005 9:21 am

Your best bet is to recover it all from the backup tapes, but if for some
reason you can't do that then I think you are stuck with a re-install.
If you need a bare metal recovery guide, the one on the Legato
NetWorker site is quite good when it describes recovering a cluster. Even
though the recovery commands are obviously tied to their backup product,
they are generic enough that you should be able to adapt them and specific
enough that you should get all that you need.
That said, I would say the upgrade to 5.1B-3 is a good idea anyway.
Adam
Jack Estes
Posted: Tue Oct 11, 2005 5:53 am

I think I'm screwed on this one, but I thought I'd post to the group to
see if anyone else has seen this.

I have a two-node TruCluster serving an Oracle 8i database and an
application originally written as single-instance, so CAA manages it.
The cluster interconnect is Memory Channel, and storage is Fibre
Channel-connected dual HSG80s into 3 StorageWorks SCSI enclosures.

For reasons passing understanding, about 40 people know the root
password on this cluster and, by department policy, root is allowed to
log in remotely. I'm assuming that because this cluster is on a private
network serving only this department and is physically disconnected
from anything else, they think this is a good idea.

On to my problem: A user reluctantly admitted to me that he forgot his
user account password, so he went ahead and logged in as root from his
PC. Believing he was in his home directory, he executed an mv on /,
expecting to move only the files in his home directory to another
NFS-mounted drive he used as a backup device. Shortly into this
process, I think he figured out he was getting the wrong stuff, killed
the command, then tried to mv the stuff back to /. In doing so, he
obviously destroyed all the CDSLs for both nodes on the cluster /,
/usr, /etc, and /var filesystems. Within about 10 minutes of this
catastrophe, a thermal breaker blew in the cabinet and unceremoniously
shut off power to both nodes. Neither node can boot now because, among
other things, the cluster database is unavailable to them and neither
node has a unique identity (there's just one /etc/* between them).

I CAN boot the first member via the old build disk on the internal bus
and see down the HSGs to the AdvFS domains on the other side, and
they're all good. In fact, I mounted them read-only by monkeying
around in /etc/fdmns on the build disk to make temp domains and
filesets, so I'm not worried about them. That's how I found out there
were no more CDSLs on the cluster filesystem. The quorum disk out there is
perfectly intact as well. The entire application and Oracle database
are on some filesets out on the SAN and the startup scripts are
manually executed, so I'm not worried about recreating anything other
than users (I know the old UIDs to make home directories match up right
away, etc...)
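
A minimal sketch of the kind of check described above: walking a read-only mount of the cluster filesystem and listing any symbolic links that still look like CDSLs. It assumes CDSL targets contain the literal `{memb}` token (as in `../cluster/members/{memb}/...`) and that Python is available on whatever system does the inspecting; the function name and the mount point in the usage note are hypothetical.

```python
import os

# Assumption: a CDSL is a symlink whose target embeds this literal token.
CDSL_MARKER = "{memb}"

def find_cdsls(root):
    """Return sorted (path, target) pairs for CDSL-like symlinks under root."""
    found = []
    # followlinks=False: report links but never descend through them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Links to directories show up in dirnames, all others in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.readlink(path)
                if CDSL_MARKER in target:
                    found.append((path, target))
    return sorted(found)
```

Calling `find_cdsls("/mnt/cluster_root")` (a made-up mount point) and getting an empty list back would be consistent with the links having been replaced by ordinary files or directories during the mv.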

I'm wondering if there's a reliable way to recreate the links and put
things back where they belong (um, like the correct /etc/sysconfigtab),
maybe with some combination of clu_delete_member and clu_add_member, or
if I'm better off time-wise just finally upgrading to 5.1B-3 (I'm
licensed and have the media) by doing a fresh install and remounting
the AdvFS data domains on the SAN. LSM was not managing any disks.

Any help is graciously accepted. Thanks!

Jack
________________________
Jack M. Estes II, Ph.D.
Cinergy Corporation
1000 Main Street
Plainfield, Indiana 46168
PHONE #s STRIPPED FOR USENET