Troubleshooting the database, services and replication
DB2 problems
Q replication problems
- Basic Troubleshooting Methodology
- Recovering from a missing message
- Missing message is not lost: Recovering from a missing message using the dead letter queue
- Missing message is permanently lost: Recovering from a missing message by a manual resynchronization
- Deleting the Message Queues
- Uninstall WebSphere MQ
Basic Troubleshooting Methodology
The following basic steps are recommened for troubleshooting a problem:- WebSphere MQ file system check The file system check ensures that there is enough space available for the WebSphere MQ log. If the disk where it resides fills up, you will get WebSphere MQ transaction errors. Assuming that the log file is at the default location of /var/mqm, use the following command to check sys tem space:
- WebSphere MQ log check
- Check /var/mqm/errors for any global exceptions related to WebSphere MQ.
- Check /var/mqm/qmgrs/
/errors for any queue manager exceptions. - Check queue manager status
- Check if Q Capture and Q Apply programs are running on both send and receive servers To check capture program status, run, for example,
- Check the Q subscriptions states and receive queue states on both send and receive servers Look into capture program log file at /export/ldbd/var/db2/capture/ldbd.SEG_CIT.ASN.QCAP.log and look for the most recent message that looks like this:
- Look into Q Apply and Capture logs on both send and receive servers. Solve errors highlighted in the log, if any.
- Check the WebSphere MQ queues
- Transmit and receive queues' queue depth.
- Ensure listeners are running on both servers
- Ensure channels are running on both servers, run the command
df -k /var/mqm
dspmqwill list status for all queue managers.
asnqccmd capture_server=seg_cit capture_schema=ASN status, and you will see something like:[ldbd@ldas-cit ~]$ asnqccmd capture_server=seg_cit capture_schema=ASN status 2006-03-15-09.21.44.370661 ASN0520I "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "HoldLThread" thread is in the "is waiting" state. 2006-03-15-09.21.44.389233 ASN0520I "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "AdminThread" thread is in the "is resting" state. 2006-03-15-09.21.44.389505 ASN0520I "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "PruneThread" thread is in the "is resting" state. 2006-03-15-09.21.44.389586 ASN0520I "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "WorkerThread" thread is in the "is resting" state.To check apply program status, run, for example,
asnqacmd apply_server=seg_cit apply_schema=ASN status, and you will see something like this:
[ldbd@ldas-cit ~]$ asnqacmd apply_server=seg_cit apply_schema=ASN status 2008-07-15-09.25.17.681358 ASN0520I "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "HoldLThread" thread is in the "is waiting" state. 2008-07-15-09.25.17.682480 ASN0520I "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "AdminThread" thread is in the "is resting" state. 2008-07-15-09.25.17.682581 ASN0520I "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "MonitorThread" thread is in the "is resting" state. 2008-07-15-09.25.17.682652 ASN0520I "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "BR00000" thread is in the "is doing work" state. 2008-07-15-09.25.17.682723 ASN0520I "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "BR00001" thread is in the "is doing work" state.
2008-07-15-07.27.04.958564 ASN7000I "Q Capture" : "ASN" : "WorkerThread" : "40" subscriptions are active. "0" subscriptions are inactive. "0" subscriptions that were new and were successfully activated. "0" subscriptions that were new could not be activated and are now inactive.
runmqsc queue_manager_name dis ql(queue name) curdepth
ps -ef | grep lsrYou will see a running result like:
mqm 12456 1 0 07:25:23 ? 0:00 runmqlsr -t tcp -m QM3
ps -ef | grep chlYou will see the results like:
mqm 3123 1 0 06:45:45 ? 0:01 /opt/mqm/bin/runmqchl -c QM3_TO_QM2 -m QM3 mqm 3121 1 0 06:45:45 ? 0:01 /opt/mqm/bin/runmqchl -c QM3_TO_QM1 -m QM3Or use commands:
runmqsc queue_manager_name dis chstatus(channel name)You will see something like:
[ldbd@ldas-cit ~]$ runmqsc QM3
5724-H72 (C) Copyright IBM Corp. 1994, 2005. ALL RIGHTS RESERVED.
Starting MQSC for queue manager QM3.
dis chstatus(qm1_to_qm3)
1 : dis chstatus(qm1_to_qm3)
AMQ8417: Display Channel Status details.
CHANNEL(QM1_TO_QM3) CHLTYPE(RCVR)
CONNAME(198.129.208.245) CURRENT
RQMNAME(QM1) STATUS(RUNNING)
SUBSTATE(RECEIVE) XMITQ( )
Each replication message sent by the Q capture program to a Q apply program is tagged with a sequential number. If a message goes missing, the Q apply program will not process any further replication messages from the site with the missing message until the missing message is recovered.
The Q apply program at CIT will detect the missing message and print the error
ASN7551E "Q Apply" : "ASN" : "BR00000" : The Q Apply program detected a gap in message numbers on receive queue "ASN.QM2_TO_QM3.DATAQ", replication queue map "SEG_LLO_ASN_TO_SEG_CIT_ASN". It read message ID "51524550436871FF000000000000000000000000000E02FFFFFFFE", but expected to find message ID "51524550436871FF000000000000000000000000000E02FFFFFFFD". The Q Apply program cannot process any messages until if finds the expected message.to the apply program log file, which can be found under
/export/ldbd/var/db2/apply/ldbd.SEG_CIT.ASN.QAPP.logIt will refuse to process any subsequent transctions and continue to print the warning message
ASN7552W "Q Apply" : "ASN" : "BR00001" : The Q Apply program is polling receive queue "ASN.QM2_TO_QM3.DATAQ", replication queue map "SEG_LLO_ASN_TO_SEG_CIT_ASN" for message ID "51524550436871FF000000000000000000000000000E02FFFFFFFD".until the situation is manually resolved.
At each site there is a dead letter queue (DLQ) that stores messages that have not been sucessfully transmitted between sites. It is most likely that the missing message has been dumped into the dead letter queue and can be retrieved. If this is not possible, the message queues must be flushed and the asntdiff and asntrep programs run to re-synchronize the databases. The latter method takes a long time, so all efforts should be taken to recover the message from the dead letter queue.
Recovering from a missing message using the dead letter queue
All WebSphere MQ environments need a routine to process messages on the DLQ regularly. WebSphere MQ supplies a default routine, called the dead-letter queue handler (the DLQ handler), which can be invoked via the runmqdlq command. Instructions for processing messages on the DLQ are supplied to the DLQ handler by means of a user-written rules table (that is, the DLQ handler matches messages on the DLQ against entries in the rules table - when a DLQ message matches an entry in the rules table, the DLQ handler performs the action associated with that entry).
Attention: A user-written rules table is not provided for Q replication messages. Should one choose to define a DLQ for the Q replication environment, an appropriate DLQ handler rules table should be defined.
We assume that receive queue at CIT is full, and the incoming messages has been put into DLQ. Therefore, a missing message error ASN7551E accurs at CIT.
- Check the status of receive queue:
In this example, we assume receive queue ASN.QM1_TO_QM3.DATAQ is full. To confirm that, run:runmqsc QM3 dis ql(ASN.QM1_TO_QM3.DATAQ)
It returns something like:AMQ8409: Display Queue details. DESCR(WebSphere MQ Default Local Queue) PROCESS( ) BOQNAME( ) INITQ( ) TRIGDATA( ) CLUSTER( ) CLUSNL( ) QUEUE(ASN.QM1_TO_QM3.DATAQ) CRDATE(2005-11-01) CRTIME(21.42.20) ALTDATE(2007-11-09) ALTTIME(15.30.52) GET(ENABLED) PUT(ENABLED) DEFPRTY(0) DEFPSIST(NO) MAXDEPTH(50000) MAXMSGL(100000000) BOTHRESH(0) SHARE DEFSOPT(SHARED) HARDENBO MSGDLVSQ(PRIORITY) RETINTVL(999999999) USAGE(NORMAL) NOTRIGGER TRIGTYPE(FIRST) TRIGDPTH(1) TRIGMPRI(0) QDEPTHHI(80) QDEPTHLO(20) QDPMAXEV(ENABLED) QDPHIEV(DISABLED) QDPLOEV(DISABLED) QSVCINT(999999999) QSVCIEV(NONE) DISTL(NO) NPMCLASS(NORMAL) DEFTYPE(PREDEFINED) TYPE(QLOCAL) SCOPE(QMGR) DEFBIND(OPEN) IPPROCS(3) OPPROCS(1) CURDEPTH(50000) :Parameter MAXDEPTH identifies the maximum messages that can be waiting in this queue. As we can see MAXDEPTH(50000) defines the maximum number of messages to be 50000. Parameter CURDEPTH indicates the number of messages that are currently in the queue. In this example, ,CURDEPTH(50000) shows it reached the maximum number of messages that can be held in receive queue ASN.QM1_TO_QM3.DATAQ. - Find the dead letter queue name at CIT:
dis qmgr
Queue manager at CIT is QM3. LHO is QM1, LLO is QM2. Please change the queue manager name according to your site.
This will return something like:
- Display dead letter queue attributes
dis ql(QM3_DEADQ)
It returns:AMQ8409: Display Queue details. DESCR(WebSphere MQ Default Local Queue) PROCESS( ) BOQNAME( ) INITQ( ) TRIGDATA( ) CLUSTER( ) CLUSNL( ) QUEUE(QM3_DEADQ) CRDATE(2006-03-01) CRTIME(15.28.49) ALTDATE(2007-11-09) ALTTIME(15.30.52) GET(ENABLED) PUT(ENABLED) DEFPRTY(0) DEFPSIST(NO) MAXDEPTH(50000) MAXMSGL(100000000) BOTHRESH(0) SHARE DEFSOPT(SHARED) HARDENBO MSGDLVSQ(PRIORITY) RETINTVL(999999999) USAGE(NORMAL) NOTRIGGER TRIGTYPE(FIRST) TRIGDPTH(1) TRIGMPRI(0) QDEPTHHI(80) QDEPTHLO(20) QDPMAXEV(ENABLED) QDPHIEV(DISABLED) QDPLOEV(DISABLED) QSVCINT(999999999) QSVCIEV(NONE) DISTL(NO) NPMCLASS(NORMAL) DEFTYPE(PREDEFINED) TYPE(QLOCAL) SCOPE(QMGR) DEFBIND(OPEN) IPPROCS(0) OPPROCS(0) CURDEPTH(18900) :MAXDEPTH(50000) means that the maximum unscessfully transmitted messages dead letter queue QM3_DEADQ can hold is 50000. CURDEPTH indicates the number of messages QM3_DEADQ is currently holding. CURDEPTH(18900) shows there is 18900 messages in QM3_DEADQ at the moment. - Configure and run dead letter queue message handler
-
Create a txt file named dlqrule.txt (you can name your rules file differently), and write rules into the file:
INPUTQM(' ') INPUTQ(' ') WAIT(NO) RETRYINT(5) REASON(MQRC_Q_FULL) ACTION(RETRY) RETRY(5)INPUTQM(' ') identifies the name of the queue manager that owns this queue, which is QM3 in our case. INPUTQ(' ') identifies the name of the dead letter queue, which is QM3_DEADQ. REASON(MQRC_Q_FULL) shows that messages received because of a queue full condition. The RETRY keyword sets the number of tries made to implement an action. The RETRYINT keyword of the control data controls the interval between attempts. -
Run the dead letter queue message handler:
runmqdlq QM3_DEADQ QM3 < dlqrule.txt
Dead letter queue message handler will put the missing message back to its target queue (if the missing message is not yet lost).
-
Create a txt file named dlqrule.txt (you can name your rules file differently), and write rules into the file:
- Check to see if the number of messages in dead letter queue has decreased
dis ql(QM3_DEADQ) CURDEPTH
It returns:AMQ8409: Display Queue details. QUEUE(QM3_DEADQ) CURDEPTH(0)
CURDEPTH is indicating 0 messages in DLQ, which means DLQ handler has done its work. - To exit MQSC command environment, type in:
end
AMQ8408: Display Queue Manager details.
DESCR( ) DEADQ(QM3_DEADQ)
DEFXMITQ( ) CHADEXIT( )
CLWLEXIT( ) CLWLDATA( )
REPOS( ) REPOSNL( )
SSLKEYR(/var/mqm/qmgrs/QM3/ssl/key) SSLCRLNL( )
SSLCRYP( ) COMMANDQ(SYSTEM.ADMIN.COMMAND.QUEUE)
QMNAME(QM3) CRDATE(2005-11-01)
CRTIME(21.40.46) ALTDATE(2007-11-09)
ALTTIME(15.30.52) QMID(QM3_2005-11-01_21.40.46)
TRIGINT(999999999) MAXHANDS(256)
MAXUMSGS(10000) AUTHOREV(DISABLED)
INHIBTEV(DISABLED) LOCALEV(DISABLED)
REMOTEEV(DISABLED) PERFMEV(DISABLED)
STRSTPEV(ENABLED) CHAD(DISABLED)
CHADEV(DISABLED) CLWLLEN(100)
MAXMSGL(100000000) CCSID(819)
MAXPRTY(9) CMDLEVEL(530)
PLATFORM(UNIX) SYNCPT
DISTL(YES)
:
Note that DEADQ(QM3_DEADQ) means parameter DEADQ is defined as QM3_DEADQ. QM3_DEADQ is the dead letter queue at QM3.
Recovering from a missing message by a manual resynchronization
If there is a serious error, such as a RAID or OS crash on one of the gateway machines, it is possible that a replication message may be permenantly lost. In this case, the database tables need to be re-syncronized and replication restarted. Here, we show how to recover from a case in which a message sent from LLO to CIT is lost. To recover replication when the missing message cannot be retrieved from the dead letter queue, follow the steps below. Note that the running the asntdiff program to complare databases may take a very long time.
- In order to re-sychronize the database, no incomming data can be
inserted during the syncronization process. Shutdown all publishing
scripts and kill and ldbd servers are connected to the databases. This
includes
- All ldbdd servers that accept incoming segment publication (the LDBDServer processes at LHO, LLO and CIT).
- All publishing scripts that insert state segments from frames at LHO and LLO.
- Any cron jobs at CIT that insert data from the GEO frames into the database.
- Shut down the Q capture and apply programs with the commands
asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LHO LOGSTDOUT=Y STOP asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LHO LOGSTDOUT=Y STOP
This needs to be done at all three sites. Do not stop the WebSphere message queues, however.
- Deactivate the Q subscriptions by running the commands
db2 "UPDATE ASN.IBMQREP_SUBS SET STATE='I'" db2 "UPDATE ASN.IBMQREP_TARGETS SET STATE='I'"
at all three sites.
- Flush the message queues. At LHO run the commands
asnqmfmt ASN.QM1.ADMINQ QM1 -delmsg asnqmfmt ASN.QM1.RESTARTQ QM1 -delmsg asnqmfmt ASN.QM2_TO_QM1.DATAQ QM1 -delmsg asnqmfmt ASN.QM3_TO_QM1.DATAQ QM1 -delmsg
then at LLO run the commandsasnqmfmt ASN.QM2.ADMINQ QM2 -delmsg asnqmfmt ASN.QM2.RESTARTQ QM2 -delmsg asnqmfmt ASN.QM1_TO_QM2.DATAQ QM2 -delmsg asnqmfmt ASN.QM3_TO_QM2.DATAQ QM2 -delmsg
and finally at CIT run the commandsasnqmfmt ASN.QM3.ADMINQ QM3 -delmsg asnqmfmt ASN.QM3.RESTARTQ QM3 -delmsg asnqmfmt ASN.QM1_TO_QM3.DATAQ QM3 -delmsg asnqmfmt ASN.QM2_TO_QM3.DATAQ QM3 -delmsg
Once these commands are run, the message queues are flushed and cannot be recovered.
- In the glue src/glue/src/conf/db2/ directory, there is a file
called runstats.sql which updated the indices and run statistics on
all the database tables. Run this script at each site with the commands below.
It should be run simultaneously at all three sites, as will take a long time.
At LHO type:
db2 connect to seg_lho db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
at LLO type:db2 connect to seg_llo db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
and at CIT type:db2 connect to seg_cit db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
Although rebuilding the indices is not strictly necessary, it make the asntdiff command run much faster.
-
In this example, we are reparing segments between LLO (which has 2) and CIT
(which has ID 3). The commands from here on are specific to this example, but
can easily be modified to fixed replication between other databases, e.g.
between LLO and LHO by changing the relevant ID numbers and database
names.
At LLO, create a connection to the CIT database so that DB2 at LLO can talk directly to the database at CIT. Do this by typingdb2 "catalog tcpip node db2cit remote ldas-cit.ligo.caltech.edu server 50002" db2 "catalog database seg_cit as seg_cit at node db2cit"
Then create a password file for the replication repair programs withasnpwd init asnpwd add alias seg_cit id ldbd password xxxxxxxx
Replace the string xxxxxxxx with the login password of the ldbd user at the remote site.
- Next figure out the ASN ID number of the tables that need to be repared by
typing
db2 connect to seg_llo db2 "select substr(SUBNAME,1,30) as SUBNAME, substr(SENDQ,1,20) as SENDQ from ASN.IBMQREP_SUBS where SENDQ = 'ASN.QM2_TO_QM3.DATAQ'"
This will return something likeSUBNAME SENDQ ------------------------------ -------------------- PROCESS0006 ASN.QM2_TO_QM3.DATAQ PROCESS_PARAMS0006 ASN.QM2_TO_QM3.DATAQ LFN0006 ASN.QM2_TO_QM3.DATAQ GRIDCERT0006 ASN.QM2_TO_QM3.DATAQ FILTER0006 ASN.QM2_TO_QM3.DATAQ FILTER_PARAMS0006 ASN.QM2_TO_QM3.DATAQ FRAMESET_CHANLIST0006 ASN.QM2_TO_QM3.DATAQ FRAMESET_WRITER0006 ASN.QM2_TO_QM3.DATAQ FRAMESET0006 ASN.QM2_TO_QM3.DATAQ FRAMESET_LOC0006 ASN.QM2_TO_QM3.DATAQ SEGMENT_DEFINER0006 ASN.QM2_TO_QM3.DATAQ SEGMENT0006 ASN.QM2_TO_QM3.DATAQ SEGMENT_DEF_MAP0006 ASN.QM2_TO_QM3.DATAQ SEGMENT_LFN_MAP0006 ASN.QM2_TO_QM3.DATAQ SUMM_VALUE0006 ASN.QM2_TO_QM3.DATAQ SUMM_STATISTICS0006 ASN.QM2_TO_QM3.DATAQ SUMM_SPECTRUM0006 ASN.QM2_TO_QM3.DATAQ SUMM_CSD0006 ASN.QM2_TO_QM3.DATAQ SUMM_COMMENT0006 ASN.QM2_TO_QM3.DATAQ SUMM_MIME0006 ASN.QM2_TO_QM3.DATAQ 20 record(s) selected.
The ASN ID is the 4 digit number (including the zeroes) after each table name. In this case it is 0006. Note that not all these tables are used by the segment database (see below).
- Create a file called tables.txt which contains the list of used tables in
the segment database schema. The used tables is the subset of all the tables
that contains at least one row. You can get this by running "select count(*)
tablename" on each of the tables in the segment database. As of Dec 2006, the
list of used tables is
PROCESS PROCESS_PARAMS LFN GRIDCERT SEGMENT_DEFINER SEGMENT SEGMENT_DEF_MAP SEGMENT_LFN_MAP
The repair processes checks for differences between the source and the target database for each of these tables.
- The differences between the tables can now be repaired. There are two
methods of doing this. For small discrepancies, use the first method below
which relies on the asntdiff/asntrep programs. For larger differences, use the
methods based on export/load.
- Repair using asntdiff/asntrep.
- Next create diff tables for each of the above tables using the
asntdiff program. This can be done using the following command. Note
that in the clause where subname = '${table}0006 the ASN ID number
0006 should be replaced with the appropriate number from step 7 above.
To create the diff tables run the command:
(for table in `cat tables.txt` do echo ${table} asntdiff db=seg_llo schema=ASN where="where subname = '${table}0006'" diff=${table}DF done) 2>&1 | tee diffs.txt -
Now use the asntrep program and the diff tables created above to repair
the differences between the tables. Again, check that the ASN ID number in the
where clause is approriate:
(for table in `cat tables.txt` do echo ${table} asntrep db=seg_llo schema=ASN where="where subname = '${table}0006'" diff=${table}DF done)
- Next create diff tables for each of the above tables using the
asntdiff program. This can be done using the following command. Note
that in the clause where subname = '${table}0006 the ASN ID number
0006 should be replaced with the appropriate number from step 7 above.
To create the diff tables run the command:
- Repair using export/load
- Export the table that needs to be corrected at both source and target sites.
For example, if replication was broken from CIT to LLO and table segment needs to be corrected,
execute the following at both sites (after connecting to the corresponding database with db2 connect to seg_cit or db2 connect to seg_llo):
db2 'export to /export/ldbd/SEGMENT.txt of del messages SEGMENT.msg select * from SEGMENT'
- Sort the resulting dump files (or one could do it in the SQL query in the previous step adding order by segment_id) so that they can be compared with diff or md5sum.
sort /export/ldbd/SEGMENT.txt > /export/ldbd/SEGMENT.txt.s
- Compute md5sum of the sorted table dumps at both sites and compare if you are not sure if the corresponding tables are different or not.
- scp CIT's sorted gzipped dump to llo.
- Use GNU version of diff (the one supplied with Solaris does not seem to be able to handle binary fields in a dump) to find the entries present at CIT and missing at LLO:
/ldcg/bin/diff -a SEGMENT.llo.txt.s SEGMENT.cit.txt.s | perl -pe 's/< //' | perl -e '@a=<>;print join("",@a[1..$#a])' > diff.SEGMENT - Load the resulting file into the corresponding table:
db2 connect to seg_llo db2 'load from diff.SEGMENT of del messages segment.msg insert into segment allow no access'
Notice that after load operation the database would be in backup pending state and the table will be in check pending state and one must execute the two steps below to return them into a usable state. - Create an offline backup:
db2 BACKUP DATABASE SEG_LLO TO /archive/backup/database/gateway/offline WITH 2 BUFFERS BUFFER 1024
- Check if the inserted rows satisfy constraints:
db2 CONNECT TO SEG_LLO db2 SET INTEGRITY FOR LDBD.SEGMENT IMMEDIATE CHECKED INCREMENTAL db2 CONNECT RESET
- You might want to repeat steps 1-3 for LLO to obtain a sorted dump again and check if md5sums are the same now.
- Export the table that needs to be corrected at both source and target sites.
For example, if replication was broken from CIT to LLO and table segment needs to be corrected,
execute the following at both sites (after connecting to the corresponding database with db2 connect to seg_cit or db2 connect to seg_llo):
-
While asntdiff/asntrep works fine for small tables and small difference between tables, it is not recommended to use for large tables or large differences between tables.
For the current database size (6-7G), asntdiff/asntrep takes
couple days to complete and if the difference between tables is big, it would fail anyway (it might also fail just because the table is big regardless of differences).
Another pecularity of asntdiff/asntrep is that unless one provides a maxdiff=N option to asntdiff (which is not mentioned in the original IBM documentation but can be found with Google), only up to 100,001 of different rows are repaired (if repaired) and asntdiff does not complain that it could not handle the rest, it just reports how many differences were handled and it is up to a user to check if that's what is required.
For large tables and large differences, export/load approach should be used and it works quite fast. For example, to export or load a table of 1.5G takes only ~20 minutes. Sorting, executing diff, (un)gzip, md5sum, scp might take another 40 minutes. One disadvantage of export/load method is that the database gets into an unusable state after load and steps 7-8 must be executed to bring it back to life (but it takes only a few minutes to do that).
- Repair using asntdiff/asntrep.
-
First make sure all the message queues are clear. In order to do this, the
message queues must be active. Restart and flush the message queues with the
following commands.
- First stop any running message queues.
- At Hanford use
runmqsc QM1 << EOF stop channel (QM1_TO_QM2) stop channel (QM1_TO_QM3) end EOF endmqm -i QM1 endmqlsr -m QM1
- At Livingston use
runmqsc QM2 << EOF stop channel (QM2_TO_QM1) stop channel (QM2_TO_QM3) end EOF endmqm -i QM2 endmqlsr -m QM2
- At Caltech use
runmqsc QM3 << EOF stop channel (QM3_TO_QM1) stop channel (QM3_TO_QM2) end EOF endmqm -i QM3 endmqlsr -m QM3
- At Hanford use
- Start the message queues by running the following commands:
- At Hanford use
strmqm QM1 runmqsc QM1 << EOF start channel (QM1_TO_QM2) start channel (QM1_TO_QM3) end EOF nohup runmqlsr -t tcp -m QM1 &> /export/ldbd/var/log/mqlsr.out </dev/null &
- At Livingston use
strmqm QM2 runmqsc QM2 << EOF start channel (QM2_TO_QM1) start channel (QM2_TO_QM3) end EOF nohup runmqlsr -t tcp -m QM2 &> /export/ldbd/var/log/mqlsr.out </dev/null &
- At Caltech use
strmqm QM3 runmqsc QM3 << EOF start channel (QM3_TO_QM1) start channel (QM3_TO_QM2) end EOF nohup runmqlsr -t tcp -m QM3 &> /export/ldbd/var/log/mqlsr.out </dev/null &
- At Hanford use
- Now that the message queues are sucessfully running, run the following
commands to delete any pending messages on the queues:
- At Hanford:
asnqmfmt ASN.QM1.ADMINQ QM1 -delmsg asnqmfmt ASN.QM1.RESTARTQ QM1 -delmsg asnqmfmt ASN.QM2_TO_QM1.DATAQ QM1 -delmsg asnqmfmt ASN.QM3_TO_QM1.DATAQ QM1 -delmsg
- At LLO
asnqmfmt ASN.QM2.ADMINQ QM2 -delmsg asnqmfmt ASN.QM2.RESTARTQ QM2 -delmsg asnqmfmt ASN.QM1_TO_QM2.DATAQ QM2 -delmsg asnqmfmt ASN.QM3_TO_QM2.DATAQ QM2 -delmsg
- At CIT
asnqmfmt ASN.QM3.ADMINQ QM3 -delmsg asnqmfmt ASN.QM3.RESTARTQ QM3 -delmsg asnqmfmt ASN.QM1_TO_QM3.DATAQ QM3 -delmsg asnqmfmt ASN.QM2_TO_QM3.DATAQ QM3 -delmsg
- At Hanford:
-
Reset the capture and apply control tables by running the following DB2 commands at
all sites (after connecting to the segment database):
db2 DELETE FROM ASN.IBMQREP_CAPTRACE db2 DELETE FROM ASN.IBMQREP_CAPMON db2 DELETE FROM ASN.IBMQREP_CAPQMON db2 DELETE FROM ASN.IBMQREP_SIGNAL db2 "UPDATE ASN.IBMQREP_SENDQUEUES SET STATE='A'" db2 "UPDATE ASN.IBMQREP_SUBS SET HAS_LOADPHASE='E'" db2 "UPDATE ASN.IBMQREP_SUBS SET STATE='I'" db2 "UPDATE ASN.IBMQREP_SUBS SET GROUP_MEMBERS=NULL" db2 DELETE FROM ASN.IBMQREP_SAVERI db2 DELETE FROM ASN.IBMQREP_APPLYTRACE db2 DELETE FROM ASN.IBMQREP_APPLYMON db2 DELETE FROM ASN.IBMQREP_SPILLQS db2 DELETE FROM ASN.IBMQREP_DONEMSG db2 DELETE FROM ASN.IBMQREP_SPILLEDROW db2 DELETE FROM ASN.IBMQREP_EXCEPTIONS db2 "UPDATE ASN.IBMQREP_RECVQUEUES SET STATE='A'" db2 "UPDATE ASN.IBMQREP_TARGETS SET HAS_LOADPHASE='E'" db2 "UPDATE ASN.IBMQREP_TARGETS SET STATE='I'"
-
Cold start QCapture and QApply using startmode=COLD.
This needs to be done at all sites.
- At LHO:
export CAP_PATH=/export/ldbd/var/db2/capture export APP_PATH=/export/ldbd/var/db2/apply asnqcap startmode=COLD capture_server=seg_lho capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null & asnqapp apply_server=seg_lho apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null & - At LLO:
export CAP_PATH=/export/ldbd/var/db2/capture export APP_PATH=/export/ldbd/var/db2/apply asnqcap startmode=COLD capture_server=seg_llo capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null & asnqapp apply_server=seg_llo apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null & - At CIT:
export CAP_PATH=/export/ldbd/var/db2/capture export APP_PATH=/export/ldbd/var/db2/apply asnqcap startmode=COLD capture_server=seg_cit capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null & asnqapp apply_server=seg_cit apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null &
ASN7000I "Q Capture" : "ASN" : "WorkerThread" : "0" subscriptions are active. "40" subscriptions are inactive. "0" subscriptions that were new and were successfully activated. "0" subscriptions that were new could not be activated and are now inactive.
- At LHO:
- Send the Start signal to Q replication to activate the subscriptions.
The replication processes are started by running the following commands at
LHO, not LLO or CIT. This triggers the replication to start at all the
databases.
cd ~/src/glue/src/conf/q_replication db2 connect to seg_lho db2 -tf start_1.sql db2 -tf loaddone_1.sql db2 -tf start_2.sql db2 -tf loaddone_2.sql db2 connect reset
- Once replication is restored, test the replication by inserting and
deleteing a test segment at each site. Do this as follows:
- The test insertion uses the ldbdd servers at each site, however it is
essential that they be started with a restricted grid-map file so that
the DMT or Keith cannot insert segments while this test is being carried out.
To do this, rename the existing grid-map file with
mv /export/ldbd/etc/grid-security/ldbdserver/grid-mapfile /export/ldbd/etc/grid-security/ldbdserver/grid-mapfile.orig
and create a new file that contains only the DN of the administrator performing the test and NOT the DMT's grid certificate. Do this at each site and start ldbdd withldbdd -d -c /export/ldbd/etc/ldbdserver.ini
- On a client with glue installed, download the file badgers.xml and insert it into the LHO database with
ldbdc --server ldas.ligo-wa.caltech.edu --insert badgers.xml
This creates a fake segment at LHO which you can check for by usingdb2 "select * from process where program = 'BadgerMon'"
Repeat this select command at LLO and CIT to check that replication has copied this data from LHO to CIT and LLO.
- The creator_db and process_id obtained by querying the
process table for this inserted segment can be used to query the other tables
to check that replication is working for all tables.
- Once replication has been verified to be working, delete the test segments
by deleteing the entry in the process table using the creator_db and
process_id obtained from the above query. For example,
db2 "delete from process where creator_db = 2 and process_id = x'20070620013201702626000000'"
Check that the delete has been sucessfully replicated to all sites by querying for this creator_db,process_id pair in the process table.
- If insertion and deletion of the above test segment at LHO and replication to CIT and LLO worked, repeat the process to check that insertion and deleteion at LLO is correctly propagated to CIT and LHO and insertion and deleteion at CIT is correctly propagated to LLO and LHO.
- The test insertion uses the ldbdd servers at each site, however it is
essential that they be started with a restricted grid-map file so that
the DMT or Keith cannot insert segments while this test is being carried out.
To do this, rename the existing grid-map file with
You should perform a final check of the apply and capture stdout files for error messages, but everything should now be running.
Deleting the Message Queues
To display the message queues on a machine run
dspmq
You can delete each message queue with the commands
endmqm -i QM1 endmqlsr -m QM1 dltmqm QM1
endmqm -i QM2 endmqlsr -m QM2 dltmqm QM2
endmqm -i QM3 endmqlsr -m QM3 dltmqm QM3
Uninstall WebSphere MQ
- Shutdown Q replication and WebSphere Q
- At Hanford use
- At Livingston use
- At Caltech use
- Clean up IPC:
- Remove shared resources owned by the mqm user and mqm group To list shared resources use ipcs command.
- Delete Queue Managers Change the queue manager name to the one that you need to delete. In this example, we delete queue manager QM2 at ldas-llo:
- Get a list of the mq related packages, and you will get something like:
- Remove the packages in reverse order
- Remove the other MQ associated packages (basically gskit):
pkginfo | grep gsk application gsk6bas Certificate and SSL Base Runtime (gsk6bas) pkgrm gsk6bas
asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LHO LOGSTDOUT=Y STOP asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LHO LOGSTDOUT=Y STOP runmqsc QM1 << EOF stop channel (QM1_TO_QM2) stop channel (QM1_TO_QM3) end EOF endmqm -i QM1 endmqlsr -m QM1
asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LLO LOGSTDOUT=Y STOP asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LLO LOGSTDOUT=Y STOP runmqsc QM2 << EOF stop channel (QM2_TO_QM1) stop channel (QM2_TO_QM3) end EOF endmqm -i QM2 endmqlsr -m QM2
asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_CIT LOGSTDOUT=Y STOP asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_CIT LOGSTDOUT=Y STOP runmqsc QM3 << EOF stop channel (QM3_TO_QM1) stop channel (QM3_TO_QM2) end EOF endmqm -i QM3 endmqlsr -m QM3
ipclean
[ldbd@ldas.ldas-la ~]$ ipcs IPC status from
To remove the user mqm and group mqm owned shared resources, use command ipcrm:as of Wed Jul 2 09:07:00 PDT 2008 T ID KEY MODE OWNER GROUP Message Queues: Shared Memory: m 956301409 0x7401c2 --rw-rw---- mqm mqm m 956301408 0x7401c0 --rw-rw---- mqm mqm m 956301403 0x40740006 --rw-rw-rw- mqm mqm m 118 0x78003fb8 --rw-r--r-- grid grid m 117 0x78003557 --rw-r--r-- grid grid m 28 0x790195cd --rw-rw-rw- grid grid Semaphores: s 1375731729 0x740008 --ra-ra-ra- mqm mqm s 50331664 0x740011 --ra-ra---- mqm mqm s 1375731727 0x740010 --ra-ra---- mqm mqm s 50331662 0x740006 --ra-ra-ra- mqm mqm s 37 0x790195cd --ra-ra-ra- grid grid [ldbd@ldas.ldas-la ~]$ ipcs | grep mqm | awk '{print " ipcrm -"$1" "$2}' ipcrm -m 956301409 ipcrm -m 956301408 ipcrm -m 956301403 ipcrm -s 1375731729 ipcrm -s 50331664 ipcrm -s 1375731727 ipcrm -s 50331662 [ldbd@ldas.ldas-la ~]$ ipcrm -m 956301409 [ldbd@ldas.ldas-la ~]$ ipcrm -m 956301408 [ldbd@ldas.ldas-la ~]$ ipcrm -m 956301403 [ldbd@ldas.ldas-la ~]$ ipcrm -s 1375731729 [ldbd@ldas.ldas-la ~]$ ipcrm -s 50331664 [ldbd@ldas.ldas-la ~]$ ipcrm -s 1375731727 [ldbd@ldas.ldas-la ~]$ ipcrm -s 50331662dltmqm QM2
pkginfo | grep mqm [ldbd@ldas.ldas-wa ~]$ pkginfo | grep mqm application mqm WebSphere MQ for Sun Solaris application mqm-upd11 WebSphere MQ for Sun Solaris 2 - U802142
pkgrm mqm-upd11 pkgrm mqm
- First stop any running message queues.