ColNet Randoms: ClusterXL and FIBMGR down

[ClusterXL marks FIB as “problem”]

I was doing some work the other day on an R75 H/A cluster to enable advanced routing and ran into some problems with ClusterXL. “My” problem was trying this on during business hours instead of scheduling a maintenance window, writing up change control, getting approval, etc., etc.

Cowboy boots on, yeeharrrrr....

So the environment is distributed management plus a couple of UTM-1 1070s in H/A running R75.10. When you build the UTMs, the install just lays down SecurePlatform. To get advanced routing, you need to enable the “pro” version. Conveniently, Check Point have a command for that. So on the standby member, I ran the following command and rebooted:

# pro enable

When I'm working on clusters, I usually run the 'cphaprob state' command to keep an eye on the cluster state. Handy for installs, controlled fail overs, etc. So on the other member, I'm running:

# watch cphaprob state
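If you'd rather have just the local member's state than eyeball the whole table, something like the following might do. It's only a sketch: it assumes the 'cphaprob state' output marks the local member's row with "(local)" and puts the state in the last column, so check that against what your version actually prints. The function name local_state is made up for this example.

```shell
# Print the state reported for the local cluster member.
# Reads 'cphaprob state' output on stdin.
# ASSUMPTION: the local member's row contains "(local)" and the
# state is the last whitespace-separated field on that row.
local_state() {
    awk '/\(local\)/ { print $NF }'
}

# Hypothetical usage on a cluster member:
#   cphaprob state | local_state
```

A multi-word state would only show its last word with this approach, so treat it as a quick check rather than anything to build monitoring on.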

After a while, I realised the secondary unit wasn't going standby and would stay down regardless of how long I looked at it. Ran a 'cphaprob list' on it and found that FIB was being reported as down. Nice. Spent a bit of time looking through logs, scraping SecureKnowledge, etc. Even put FIBMGR into debug mode... great help that was. The logs showed it was trying to contact the active unit, and since advanced routing wasn't enabled on that member yet, it was failing.
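For picking the offender out of a long 'cphaprob list', a small filter along these lines could help. It's a sketch that assumes the output uses "Device Name:" and "Current state:" labels, one per line, which is how I remember it looking; the function name problem_devices is invented here.

```shell
# List devices whose reported state is not OK.
# Reads 'cphaprob list' output on stdin.
# ASSUMPTION: each device appears as a "Device Name: <name>" line
# followed later by a "Current state: <state>" line.
problem_devices() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }          # drop trailing whitespace
        /^Device Name:/   { name = $2 }   # remember the current device
        /^Current state:/ { if (tolower($2) != "ok") print name ": " $2 }
    '
}

# Hypothetical usage on a cluster member:
#   cphaprob list | problem_devices
```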

Now, in a maintenance window, I probably would have just run 'pro enable' on the active unit and rebooted it. The standby member should have gone active; the connection table would have been lost, but whatever. Given it was business hours, I had to find another way. So...

Started by trying to get the FIB to report a good status by doing a:

# cphaprob -d FIB -s OK report

This did work, but the process sets it back to 'problem' fairly quickly. Doing a 'cphaprob list' shows the status and the time since it was last reported. Trying to set the timeout with -t didn't work. This did:

# cpwd_admin list

{get PID of FIBMGR}

# cpwd_admin stop -name FIBMGRD -path "$ADVRDIR/bin/fibmgrd" -command "kill -TERM {FIBMGR PID}"

# cphaprob -d FIB -s OK report
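The "get PID" step can be done by eye, but if you want to script it, a sketch like this might work. It assumes 'cpwd_admin list' prints the application name in the first column and the PID in the second; verify that on your box first. The function name pid_of_app is made up for illustration.

```shell
# Print the PID that the watchdog reports for a given application name.
# Reads 'cpwd_admin list' output on stdin; the app name is argument 1.
# ASSUMPTION: column 1 is the application name and column 2 is the PID.
pid_of_app() {
    awk -v app="$1" '$1 == app { print $2 }'
}

# Hypothetical usage on the member with the FIB problem:
#   cpwd_admin list | pid_of_app FIBMGRD
```

If it prints nothing, the name didn't match, so look at the list output before feeding an empty PID into anything.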

The secondary firewall should go standby. Fail over, run 'pro enable' on the other member and reboot it. When it's back to standby, fail over again and reboot the member that had FIBMGR terminated...

Happy days...

exit 0

20110910
