Troubleshooting the database, services and replication

DB2 problems

Q replication problems


Basic Troubleshooting Methodology

The following basic steps are recommended for troubleshooting a problem:
  1. WebSphere MQ file system check
    The file system check ensures that there is enough space available for the WebSphere MQ log. If the disk where it resides fills up, you will get WebSphere MQ transaction errors. Assuming that the log file is at the default location of /var/mqm, use the following command to check system space:
    df -k /var/mqm

  2. WebSphere MQ log check
    1. Check /var/mqm/errors for any global exceptions related to WebSphere MQ.
    2. Check /var/mqm/qmgrs/<queue_manager_name>/errors for any queue manager exceptions.
  3. Check queue manager status
    dspmq
    will list the status of all queue managers.
  4. Check that the Q Capture and Q Apply programs are running on both send and receive servers
    To check capture program status, run, for example, asnqccmd capture_server=seg_cit capture_schema=ASN status, and you will see something like:
    [ldbd@ldas-cit ~]$ asnqccmd capture_server=seg_cit capture_schema=ASN status
    2006-03-15-09.21.44.370661 ASN0520I  "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "HoldLThread" thread is in the "is waiting" state.
    2006-03-15-09.21.44.389233 ASN0520I  "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "AdminThread" thread is in the "is resting" state.
    2006-03-15-09.21.44.389505 ASN0520I  "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "PruneThread" thread is in the "is resting" state.
    2006-03-15-09.21.44.389586 ASN0520I  "AsnQCcmd" : "ASN" : "Initial" : The STATUS command response: "WorkerThread" thread is in the "is resting" state.
    
    To check apply program status, run, for example, asnqacmd apply_server=seg_cit apply_schema=ASN status, and you will see something like this:

    [ldbd@ldas-cit ~]$ asnqacmd apply_server=seg_cit apply_schema=ASN status
    2008-07-15-09.25.17.681358 ASN0520I  "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "HoldLThread" thread is in the "is waiting" state.
    2008-07-15-09.25.17.682480 ASN0520I  "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "AdminThread" thread is in the "is resting" state.
    2008-07-15-09.25.17.682581 ASN0520I  "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "MonitorThread" thread is in the "is resting" state.
    2008-07-15-09.25.17.682652 ASN0520I  "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "BR00000" thread is in the "is doing work" state.
    2008-07-15-09.25.17.682723 ASN0520I  "AsnQAcmd" : "ASN" : "Initial" : The STATUS command response: "BR00001" thread is in the "is doing work" state.
    
  5. Check the Q subscription states and receive queue states on both send and receive servers
    Look in the capture program log file at /export/ldbd/var/db2/capture/ldbd.SEG_CIT.ASN.QCAP.log for the most recent message that looks like this:

    2008-07-15-07.27.04.958564 ASN7000I "Q Capture" : "ASN" : "WorkerThread" : "40" subscriptions are active. "0" subscriptions are inactive. "0" subscriptions that were new and were successfully activated. "0" subscriptions that were new could not be activated and are now inactive.

  6. Look into the Q Apply and Q Capture logs on both send and receive servers. Resolve any errors highlighted in the logs.
  7. Check the WebSphere MQ queues
    • Check the current depth of the transmit and receive queues:
      runmqsc queue_manager_name
      dis ql(queue name) curdepth
      
    • Ensure listeners are running on both servers:
      ps -ef | grep lsr
      You will see a running listener like:
      mqm 12456     1   0 07:25:23 ?           0:00 runmqlsr -t tcp -m QM3
    • Ensure channels are running on both servers:
      ps -ef | grep chl
      You will see results like:
      mqm  3123   1   0 06:45:45 ?     0:01 /opt/mqm/bin/runmqchl -c QM3_TO_QM2     -m QM3
      mqm  3121   1   0 06:45:45 ?     0:01 /opt/mqm/bin/runmqchl -c QM3_TO_QM1     -m QM3
      
      Or use commands:

      runmqsc queue_manager_name
      dis chstatus(channel name)
      
      You will see something like:
      [ldbd@ldas-cit ~]$ runmqsc QM3
      5724-H72 (C) Copyright IBM Corp. 1994, 2005.  ALL RIGHTS RESERVED.
      Starting MQSC for queue manager QM3.
      
      
      dis chstatus(qm1_to_qm3) 
           1 : dis chstatus(qm1_to_qm3)
      AMQ8417: Display Channel Status details.
         CHANNEL(QM1_TO_QM3)                     CHLTYPE(RCVR)
         CONNAME(198.129.208.245)                CURRENT
         RQMNAME(QM1)                            STATUS(RUNNING)
         SUBSTATE(RECEIVE)                       XMITQ( )
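
The checks above can be strung together into a quick health-check script for a gateway machine. The following is only a sketch, using the CIT names from the examples above (queue manager QM3, capture/apply server seg_cit, schema ASN); adjust the names for your site.

  #!/bin/sh
  # Quick replication health check (sketch only; adjust names for your site).
  df -k /var/mqm                                              # space on the MQ log file system
  ls -lt /var/mqm/errors | head                               # most recent global MQ error files
  dspmq                                                       # queue manager status
  asnqccmd capture_server=seg_cit capture_schema=ASN status   # Q Capture thread status
  asnqacmd apply_server=seg_cit apply_schema=ASN status       # Q Apply thread status
  ps -ef | grep runmqlsr | grep -v grep                       # is the listener running?
  ps -ef | grep runmqchl | grep -v grep                       # are the channels running?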
      
      

Each replication message sent by the Q capture program to a Q apply program is tagged with a sequential number. If a message goes missing, the Q apply program will not process any further replication messages from the site with the missing message until the missing message is recovered.

The Q apply program at CIT will detect the missing message and print the error

ASN7551E  "Q Apply" : "ASN" : "BR00000" : The Q Apply program detected a 
gap in message numbers on receive queue "ASN.QM2_TO_QM3.DATAQ", replication 
queue map "SEG_LLO_ASN_TO_SEG_CIT_ASN". It read message ID 
"51524550436871FF000000000000000000000000000E02FFFFFFFE", but expected 
to find message ID "51524550436871FF000000000000000000000000000E02FFFFFFFD".
The Q Apply program cannot process any messages until it finds the expected message.
to the apply program log file, which can be found under
/export/ldbd/var/db2/apply/ldbd.SEG_CIT.ASN.QAPP.log
It will refuse to process any subsequent transactions and will continue to print the warning message
ASN7552W "Q Apply" : "ASN" : "BR00001" : The Q Apply program is polling 
receive queue "ASN.QM2_TO_QM3.DATAQ", replication queue map 
"SEG_LLO_ASN_TO_SEG_CIT_ASN" for message ID 
"51524550436871FF000000000000000000000000000E02FFFFFFFD".
until the situation is manually resolved.

At each site there is a dead letter queue (DLQ) that stores messages that have not been successfully transmitted between sites. It is most likely that the missing message has been dumped into the dead letter queue and can be retrieved. If this is not possible, the message queues must be flushed and the asntdiff and asntrep programs run to re-synchronize the databases. The latter method takes a long time, so all efforts should be made to recover the message from the dead letter queue.

Recovering from a missing message using the dead letter queue

All WebSphere MQ environments need a routine to process messages on the DLQ regularly. WebSphere MQ supplies a default routine, called the dead-letter queue handler (the DLQ handler), which can be invoked via the runmqdlq command. Instructions for processing messages on the DLQ are supplied to the DLQ handler by means of a user-written rules table: the DLQ handler matches messages on the DLQ against entries in the rules table, and when a message matches an entry, the handler performs the action associated with that entry.

Attention: A user-written rules table is not provided for Q replication messages. Should one choose to define a DLQ for the Q replication environment, an appropriate DLQ handler rules table should be defined.

We assume that the receive queue at CIT is full and that incoming messages have been put into the DLQ. Therefore, a missing-message error ASN7551E occurs at CIT.

  1. Check the status of the receive queue:
    In this example, we assume receive queue ASN.QM1_TO_QM3.DATAQ is full. To confirm that, run:
    runmqsc QM3
    dis ql(ASN.QM1_TO_QM3.DATAQ)
    
    It returns something like:
    AMQ8409: Display Queue details.
       DESCR(WebSphere MQ Default Local Queue)
       PROCESS( )                              BOQNAME( )
       INITQ( )                                TRIGDATA( )
       CLUSTER( )                              CLUSNL( )
       QUEUE(ASN.QM1_TO_QM3.DATAQ)             CRDATE(2005-11-01)
       CRTIME(21.42.20)                        ALTDATE(2007-11-09)
       ALTTIME(15.30.52)                       GET(ENABLED)
       PUT(ENABLED)                            DEFPRTY(0)
       DEFPSIST(NO)                            MAXDEPTH(50000)
       MAXMSGL(100000000)                      BOTHRESH(0)
       SHARE                                   DEFSOPT(SHARED)
       HARDENBO                                MSGDLVSQ(PRIORITY)
       RETINTVL(999999999)                     USAGE(NORMAL)
       NOTRIGGER                               TRIGTYPE(FIRST)
       TRIGDPTH(1)                             TRIGMPRI(0)
       QDEPTHHI(80)                            QDEPTHLO(20)
       QDPMAXEV(ENABLED)                       QDPHIEV(DISABLED)
       QDPLOEV(DISABLED)                       QSVCINT(999999999)
       QSVCIEV(NONE)                           DISTL(NO)
       NPMCLASS(NORMAL)                        DEFTYPE(PREDEFINED)
       TYPE(QLOCAL)                            SCOPE(QMGR)
       DEFBIND(OPEN)                           IPPROCS(3)
       OPPROCS(1)                              CURDEPTH(50000)
    
           : 
    
    The parameter MAXDEPTH identifies the maximum number of messages that can be waiting in this queue; MAXDEPTH(50000) sets the maximum to 50000. The parameter CURDEPTH indicates the number of messages that are currently in the queue. In this example, CURDEPTH(50000) shows that receive queue ASN.QM1_TO_QM3.DATAQ has reached the maximum number of messages it can hold.

  2. Find the dead letter queue name at CIT:
    dis qmgr
    
  3. The queue manager at CIT is QM3; at LHO it is QM1 and at LLO it is QM2. Change the queue manager name according to your site.

    The dis qmgr command will return something like:

    AMQ8408: Display Queue Manager details.
       DESCR( )                                DEADQ(QM3_DEADQ)
       DEFXMITQ( )                             CHADEXIT( )
       CLWLEXIT( )                             CLWLDATA( )
       REPOS( )                                REPOSNL( )
       SSLKEYR(/var/mqm/qmgrs/QM3/ssl/key)     SSLCRLNL( )
       SSLCRYP( )                              COMMANDQ(SYSTEM.ADMIN.COMMAND.QUEUE)
       QMNAME(QM3)                             CRDATE(2005-11-01)
       CRTIME(21.40.46)                        ALTDATE(2007-11-09)
       ALTTIME(15.30.52)                       QMID(QM3_2005-11-01_21.40.46)
       TRIGINT(999999999)                      MAXHANDS(256)
       MAXUMSGS(10000)                         AUTHOREV(DISABLED)
       INHIBTEV(DISABLED)                      LOCALEV(DISABLED)
       REMOTEEV(DISABLED)                      PERFMEV(DISABLED)
       STRSTPEV(ENABLED)                       CHAD(DISABLED)
       CHADEV(DISABLED)                        CLWLLEN(100)
       MAXMSGL(100000000)                      CCSID(819)
       MAXPRTY(9)                              CMDLEVEL(530)
       PLATFORM(UNIX)                          SYNCPT
       DISTL(YES)                           
    
           : 
    
    Note that DEADQ(QM3_DEADQ) means the parameter DEADQ is set to QM3_DEADQ, which is the dead letter queue for queue manager QM3.

  4. Display dead letter queue attributes
    dis ql(QM3_DEADQ)
    
    It returns:
    AMQ8409: Display Queue details.
       DESCR(WebSphere MQ Default Local Queue)
       PROCESS( )                              BOQNAME( )
       INITQ( )                                TRIGDATA( )
       CLUSTER( )                              CLUSNL( )
       QUEUE(QM3_DEADQ)                        CRDATE(2006-03-01)
       CRTIME(15.28.49)                        ALTDATE(2007-11-09)
       ALTTIME(15.30.52)                       GET(ENABLED)
       PUT(ENABLED)                            DEFPRTY(0)
       DEFPSIST(NO)                            MAXDEPTH(50000)
       MAXMSGL(100000000)                      BOTHRESH(0)
       SHARE                                   DEFSOPT(SHARED)
       HARDENBO                                MSGDLVSQ(PRIORITY)
       RETINTVL(999999999)                     USAGE(NORMAL)
       NOTRIGGER                               TRIGTYPE(FIRST)
       TRIGDPTH(1)                             TRIGMPRI(0)
       QDEPTHHI(80)                            QDEPTHLO(20)
       QDPMAXEV(ENABLED)                       QDPHIEV(DISABLED)
       QDPLOEV(DISABLED)                       QSVCINT(999999999)
       QSVCIEV(NONE)                           DISTL(NO)
       NPMCLASS(NORMAL)                        DEFTYPE(PREDEFINED)
       TYPE(QLOCAL)                            SCOPE(QMGR)
       DEFBIND(OPEN)                           IPPROCS(0)
       OPPROCS(0)                              CURDEPTH(18900)
    
           : 
    
    MAXDEPTH(50000) means that the maximum number of unsuccessfully transmitted messages the dead letter queue QM3_DEADQ can hold is 50000. CURDEPTH indicates the number of messages QM3_DEADQ is currently holding; CURDEPTH(18900) shows there are 18900 messages in QM3_DEADQ at the moment.

  5. Configure and run dead letter queue message handler
    • Create a text file named dlqrule.txt (you can name your rules file differently) and write the following rules into it:
      INPUTQM(' ') INPUTQ(' ') WAIT(NO) RETRYINT(5)
      REASON(MQRC_Q_FULL) ACTION(RETRY) RETRY(5)
      
      INPUTQM(' ') identifies the name of the queue manager that owns the dead letter queue, which is QM3 in our case. INPUTQ(' ') identifies the name of the dead letter queue, which is QM3_DEADQ. REASON(MQRC_Q_FULL) matches messages that were dead-lettered because of a queue-full condition. The RETRY keyword sets the number of attempts made to implement an action, and the RETRYINT keyword of the control data sets the interval between attempts.

    • Run the dead letter queue message handler:
      runmqdlq QM3_DEADQ QM3 < dlqrule.txt
      
      The dead letter queue handler will put the missing message back on its target queue (provided the message has not been lost).

  6. Check to see if the number of messages in the dead letter queue has decreased:
    dis ql(QM3_DEADQ) CURDEPTH
    
    It returns:
    AMQ8409: Display Queue details.
       QUEUE(QM3_DEADQ)                        CURDEPTH(0)
    
    CURDEPTH(0) indicates there are no messages left in the DLQ, which means the DLQ handler has done its work. (A consolidated sketch of this check-and-retry sequence is given after this list.)
  7. To exit the MQSC command environment, type:
    end
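
For reference, the depth check and the handler run above can be strung together into a small script. This is only a sketch using the CIT names from this example (queue manager QM3, dead letter queue QM3_DEADQ, receive queue ASN.QM1_TO_QM3.DATAQ and the dlqrule.txt rules file created above); substitute your own names as needed.

  #!/bin/sh
  # Sketch: inspect the receive queue and the dead letter queue, run the DLQ
  # handler with the rules file created above, then confirm the DLQ has drained.
  QMGR=QM3
  RECVQ=ASN.QM1_TO_QM3.DATAQ
  DLQ=QM3_DEADQ
  echo "dis ql(${RECVQ}) curdepth maxdepth" | runmqsc ${QMGR}   # how full is the receive queue?
  echo "dis ql(${DLQ}) curdepth"            | runmqsc ${QMGR}   # how many dead-lettered messages?
  runmqdlq ${DLQ} ${QMGR} < dlqrule.txt                         # retry delivery to the target queues
  echo "dis ql(${DLQ}) curdepth"            | runmqsc ${QMGR}   # should now report CURDEPTH(0)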
    

Recovering from a missing message by a manual resynchronization

If there is a serious error, such as a RAID or OS crash on one of the gateway machines, it is possible that a replication message may be permanently lost. In this case, the database tables need to be re-synchronized and replication restarted. Here, we show how to recover from a case in which a message sent from LLO to CIT is lost. To recover replication when the missing message cannot be retrieved from the dead letter queue, follow the steps below. Note that running the asntdiff program to compare databases may take a very long time.

  1. In order to re-synchronize the databases, no incoming data can be inserted during the synchronization process. Shut down all publishing scripts and kill any ldbd servers that are connected to the databases. This includes:
    1. All ldbdd servers that accept incoming segment publication (the LDBDServer processes at LHO, LLO and CIT).
    2. All publishing scripts that insert state segments from frames at LHO and LLO.
    3. Any cron jobs at CIT that insert data from the GEO frames into the database.
    The LSCsegFindServer programs may be left running, but users should be warned to expect inconsistent results from the servers during the repair process.

  2. Shut down the Q capture and apply programs with the commands
    asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LHO LOGSTDOUT=Y STOP
    asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LHO LOGSTDOUT=Y STOP
    
    This needs to be done at all three sites, changing the capture and apply server names accordingly. Do not stop the WebSphere MQ queue managers, however.

  3. Deactivate the Q subscriptions by running the commands
    db2 "UPDATE ASN.IBMQREP_SUBS SET STATE='I'"
    db2 "UPDATE ASN.IBMQREP_TARGETS SET STATE='I'"
    
    at all three sites.

  4. Flush the message queues. At LHO run the commands
    asnqmfmt ASN.QM1.ADMINQ QM1 -delmsg
    asnqmfmt ASN.QM1.RESTARTQ QM1 -delmsg 
    asnqmfmt ASN.QM2_TO_QM1.DATAQ QM1 -delmsg 
    asnqmfmt ASN.QM3_TO_QM1.DATAQ QM1 -delmsg 
    
    then at LLO run the commands
    asnqmfmt ASN.QM2.ADMINQ QM2 -delmsg
    asnqmfmt ASN.QM2.RESTARTQ QM2 -delmsg 
    asnqmfmt ASN.QM1_TO_QM2.DATAQ QM2 -delmsg 
    asnqmfmt ASN.QM3_TO_QM2.DATAQ QM2 -delmsg 
    
    and finally at CIT run the commands
    asnqmfmt ASN.QM3.ADMINQ QM3 -delmsg
    asnqmfmt ASN.QM3.RESTARTQ QM3 -delmsg 
    asnqmfmt ASN.QM1_TO_QM3.DATAQ QM3 -delmsg 
    asnqmfmt ASN.QM2_TO_QM3.DATAQ QM3 -delmsg 
    
    Once these commands are run, the message queues are flushed and cannot be recovered.

  5. In the glue src/glue/src/conf/db2/ directory, there is a file called runstats.sql which updates the indices and runs statistics on all the database tables. Run this script at each site with the commands below. It should be run simultaneously at all three sites, as it will take a long time. At LHO type:
    db2 connect to seg_lho
    db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
    
    at LLO type:
    db2 connect to seg_llo
    db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
    
    and at CIT type:
    db2 connect to seg_cit
    db2 -tvf ~/src/glue/src/conf/db2/runstats.sql
    
    Although rebuilding the indices is not strictly necessary, it makes the asntdiff command run much faster.

  6. In this example, we are repairing segments between LLO (which has ID 2) and CIT (which has ID 3). The commands from here on are specific to this example, but can easily be modified to fix replication between other databases, e.g. between LLO and LHO, by changing the relevant ID numbers and database names.

    At LLO, create a connection to the CIT database so that DB2 at LLO can talk directly to the database at CIT. Do this by typing
    db2 "catalog tcpip node db2cit remote ldas-cit.ligo.caltech.edu server 50002"
    db2 "catalog database seg_cit as seg_cit at node db2cit"
    
    Then create a password file for the replication repair programs with
    asnpwd init
    asnpwd add alias seg_cit id ldbd password xxxxxxxx
    
    Replace the string xxxxxxxx with the login password of the ldbd user at the remote site.

  7. Next figure out the ASN ID number of the tables that need to be repaired by typing
    db2 connect to seg_llo
    db2 "select substr(SUBNAME,1,30) as SUBNAME, substr(SENDQ,1,20) as SENDQ from ASN.IBMQREP_SUBS where SENDQ = 'ASN.QM2_TO_QM3.DATAQ'"
    
    This will return something like
    SUBNAME                        SENDQ               
    ------------------------------ --------------------
    PROCESS0006                    ASN.QM2_TO_QM3.DATAQ
    PROCESS_PARAMS0006             ASN.QM2_TO_QM3.DATAQ
    LFN0006                        ASN.QM2_TO_QM3.DATAQ
    GRIDCERT0006                   ASN.QM2_TO_QM3.DATAQ
    FILTER0006                     ASN.QM2_TO_QM3.DATAQ
    FILTER_PARAMS0006              ASN.QM2_TO_QM3.DATAQ
    FRAMESET_CHANLIST0006          ASN.QM2_TO_QM3.DATAQ
    FRAMESET_WRITER0006            ASN.QM2_TO_QM3.DATAQ
    FRAMESET0006                   ASN.QM2_TO_QM3.DATAQ
    FRAMESET_LOC0006               ASN.QM2_TO_QM3.DATAQ
    SEGMENT_DEFINER0006            ASN.QM2_TO_QM3.DATAQ
    SEGMENT0006                    ASN.QM2_TO_QM3.DATAQ
    SEGMENT_DEF_MAP0006            ASN.QM2_TO_QM3.DATAQ
    SEGMENT_LFN_MAP0006            ASN.QM2_TO_QM3.DATAQ
    SUMM_VALUE0006                 ASN.QM2_TO_QM3.DATAQ
    SUMM_STATISTICS0006            ASN.QM2_TO_QM3.DATAQ
    SUMM_SPECTRUM0006              ASN.QM2_TO_QM3.DATAQ
    SUMM_CSD0006                   ASN.QM2_TO_QM3.DATAQ
    SUMM_COMMENT0006               ASN.QM2_TO_QM3.DATAQ
    SUMM_MIME0006                  ASN.QM2_TO_QM3.DATAQ
    
      20 record(s) selected.
    
    The ASN ID is the 4-digit number (including the zeroes) after each table name. In this case it is 0006. Note that not all of these tables are used by the segment database (see below).
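
    If you prefer not to read the suffix off by eye, the trailing four digits can be extracted directly with a small query. This is just a convenience sketch and assumes the same send queue as above:
    db2 connect to seg_llo
    db2 -x "select distinct substr(rtrim(SUBNAME), length(rtrim(SUBNAME))-3, 4) from ASN.IBMQREP_SUBS where SENDQ = 'ASN.QM2_TO_QM3.DATAQ'"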

  8. Create a file called tables.txt which contains the list of used tables in the segment database schema. The used tables are the subset of all the tables that contain at least one row. You can get this by running "select count(*) from tablename" on each of the tables in the segment database. As of Dec 2006, the list of used tables is
    PROCESS
    PROCESS_PARAMS
    LFN
    GRIDCERT
    SEGMENT_DEFINER
    SEGMENT
    SEGMENT_DEF_MAP
    SEGMENT_LFN_MAP
    
    The repair process checks for differences between the source and the target database for each of these tables.
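
    A quick way to generate this list is to count the rows in each candidate table and keep only the non-empty ones. The loop below is only a sketch: it assumes you are already connected to the database, and that the full table list from step 7 has been saved (one name per line, without the ASN ID suffix) in a file called all_tables.txt, a name chosen here purely for illustration.
    (for table in `cat all_tables.txt`
      do count=`db2 -x "select count(*) from ${table}" | tr -d ' '`
      if [ "${count}" != "0" ]; then echo ${table}; fi
    done) > tables.txt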

  9. The differences between the tables can now be repaired. There are two methods of doing this. For small discrepancies, use the first method below, which relies on the asntdiff/asntrep programs. For larger differences, use the method based on export/load.

    • Repair using asntdiff/asntrep.

      1. Next create diff tables for each of the above tables using the asntdiff program. This can be done using the following command. Note that in the clause where subname = '${table}0006' the ASN ID number 0006 should be replaced with the appropriate number from step 7 above. To create the diff tables run the command:
        (for table in `cat tables.txt`
          do echo ${table}
          asntdiff db=seg_llo schema=ASN where="where subname = '${table}0006'" diff=${table}DF
        done) 2>&1 | tee diffs.txt
        
      2. Now use the asntrep program and the diff tables created above to repair the differences between the tables. Again, check that the ASN ID number in the where clause is appropriate:
        (for table in `cat tables.txt` 
          do echo ${table} 
          asntrep db=seg_llo schema=ASN where="where subname = '${table}0006'" diff=${table}DF
        done)
        

    • Repair using export/load

      1. Export the table that needs to be corrected at both source and target sites. For example, if replication was broken from CIT to LLO and table segment needs to be corrected, execute the following at both sites (after connecting to the corresponding database with db2 connect to seg_cit or db2 connect to seg_llo):
        db2 'export to /export/ldbd/SEGMENT.txt of del messages SEGMENT.msg select * from SEGMENT'
        
      2. Sort the resulting dump files (or add order by segment_id to the SQL query in the previous step) so that they can be compared with diff or md5sum.
        sort /export/ldbd/SEGMENT.txt > /export/ldbd/SEGMENT.txt.s
        
      3. Compute the md5sum of the sorted table dumps at both sites and compare them if you are not sure whether the corresponding tables differ.
      4. scp CIT's sorted, gzipped dump to LLO.
      5. Use the GNU version of diff (the one supplied with Solaris does not seem to be able to handle binary fields in a dump) to find the entries present at CIT and missing at LLO:
        /ldcg/bin/diff -a SEGMENT.llo.txt.s SEGMENT.cit.txt.s | perl -pe 's/< //' | perl -e '@a=<>;print join("",@a[1..$#a])' > diff.SEGMENT
        
      6. Load the resulting file into the corresponding table:
        db2 connect to seg_llo
        db2 'load from diff.SEGMENT of del messages segment.msg insert into segment allow no access'
        
        Note that after the load operation the database will be in backup pending state and the table will be in check pending state; you must execute the two steps below to return them to a usable state.
      7. Create an offline backup:
        db2 BACKUP DATABASE SEG_LLO TO /archive/backup/database/gateway/offline  WITH 2 BUFFERS BUFFER 1024
        
      8. Check if the inserted rows satisfy constraints:
        db2 CONNECT TO SEG_LLO
        db2 SET INTEGRITY FOR LDBD.SEGMENT IMMEDIATE CHECKED INCREMENTAL
        db2 CONNECT RESET
        
      9. You might want to repeat steps 1-3 at LLO to obtain a sorted dump again and check that the md5sums now match.

    • While asntdiff/asntrep works fine for small tables and small differences between tables, it is not recommended for large tables or large differences. For the current database size (6-7 GB), asntdiff/asntrep takes a couple of days to complete, and if the difference between the tables is big it will fail anyway (it might also fail simply because the table is big, regardless of the differences).
      Another peculiarity of asntdiff/asntrep is that unless one provides a maxdiff=N option to asntdiff (which is not mentioned in the original IBM documentation but can be found with Google), only up to 100,001 differing rows are repaired. asntdiff does not complain that it could not handle the rest; it just reports how many differences were handled, and it is up to the user to check whether that is what is required.
      For large tables and large differences, the export/load approach should be used, and it works quite fast. For example, exporting or loading a 1.5 GB table takes only ~20 minutes. Sorting, diffing, (un)gzipping, computing md5sums and scp might take another 40 minutes. One disadvantage of the export/load method is that the database is left in an unusable state after the load, and steps 7-8 must be executed to bring it back to life (but that only takes a few minutes).
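      If you do use asntdiff on a larger table, the maxdiff option mentioned above can be raised so that more than the default 100,001 differences are recorded. For example (a sketch only, reusing the ASN ID from step 7; the value 1000000 is an arbitrary illustrative limit):
        asntdiff db=seg_llo schema=ASN where="where subname = 'SEGMENT0006'" diff=SEGMENTDF maxdiff=1000000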
  10. First make sure all the message queues are clear. In order to do this, the message queues must be active. Restart and flush the message queues with the following commands.

    1. First stop any running message queues.
      • At Hanford use
        runmqsc QM1 << EOF
        stop channel (QM1_TO_QM2)
        stop channel (QM1_TO_QM3)
        end
        EOF
        endmqm -i QM1
        endmqlsr -m QM1
        
        
      • At Livingston use
        runmqsc QM2 << EOF
        stop channel (QM2_TO_QM1)
        stop channel (QM2_TO_QM3)
        end
        EOF
        endmqm -i QM2
        endmqlsr -m QM2
        
      • At Caltech use
        runmqsc QM3 << EOF
        stop channel (QM3_TO_QM1)
        stop channel (QM3_TO_QM2)
        end
        EOF
        endmqm -i QM3
        endmqlsr -m QM3
        
      Ignore any error messages from these commands, as the message queues may not be running when you issue them.

    2. Start the message queues by running the following commands:
      • At Hanford use
        strmqm QM1
        runmqsc QM1 << EOF
        start channel (QM1_TO_QM2)
        start channel (QM1_TO_QM3)
        end
        EOF
        nohup runmqlsr -t tcp -m QM1 &> /export/ldbd/var/log/mqlsr.out </dev/null &
        
      • At Livingston use
        strmqm QM2
        runmqsc QM2 << EOF
        start channel (QM2_TO_QM1)
        start channel (QM2_TO_QM3)
        end
        EOF
        nohup runmqlsr -t tcp -m QM2 &> /export/ldbd/var/log/mqlsr.out </dev/null &
        
        
      • At Caltech use
        strmqm QM3
        runmqsc QM3 << EOF
        start channel (QM3_TO_QM1)
        start channel (QM3_TO_QM2)
        end
        EOF
        nohup runmqlsr -t tcp -m QM3 &> /export/ldbd/var/log/mqlsr.out </dev/null &
        

    3. Now that the message queues are successfully running, run the following commands to delete any pending messages on the queues:
      • At Hanford:
        asnqmfmt ASN.QM1.ADMINQ QM1 -delmsg
        asnqmfmt ASN.QM1.RESTARTQ QM1 -delmsg 
        asnqmfmt ASN.QM2_TO_QM1.DATAQ QM1 -delmsg 
        asnqmfmt ASN.QM3_TO_QM1.DATAQ QM1 -delmsg 
        
      • At LLO
        asnqmfmt ASN.QM2.ADMINQ QM2 -delmsg
        asnqmfmt ASN.QM2.RESTARTQ QM2 -delmsg 
        asnqmfmt ASN.QM1_TO_QM2.DATAQ QM2 -delmsg 
        asnqmfmt ASN.QM3_TO_QM2.DATAQ QM2 -delmsg 
        
      • At CIT
        asnqmfmt ASN.QM3.ADMINQ QM3 -delmsg
        asnqmfmt ASN.QM3.RESTARTQ QM3 -delmsg 
        asnqmfmt ASN.QM1_TO_QM3.DATAQ QM3 -delmsg 
        asnqmfmt ASN.QM2_TO_QM3.DATAQ QM3 -delmsg 
        
      Note: if there are any pending messages they will be printed to the screen and deleted. You may need to run the above commands several times at each site until all pending messages have been flushed and no more messages are printed to the screen.
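
      If you would rather confirm that a queue is empty than rerun the commands by hand, you can check its current depth directly; for example, for the CIT receive queue (a sketch; adjust the queue and queue manager names for each site and queue):
        echo "dis ql(ASN.QM2_TO_QM3.DATAQ) curdepth" | runmqsc QM3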

    4. Reset the capture and apply control tables by running the following DB2 commands at all sites (after connecting to the segment database):
      db2 DELETE FROM ASN.IBMQREP_CAPTRACE
      db2 DELETE FROM ASN.IBMQREP_CAPMON
      db2 DELETE FROM ASN.IBMQREP_CAPQMON
      db2 DELETE FROM ASN.IBMQREP_SIGNAL
      db2 "UPDATE ASN.IBMQREP_SENDQUEUES SET STATE='A'"
      db2 "UPDATE ASN.IBMQREP_SUBS SET HAS_LOADPHASE='E'"
      db2 "UPDATE ASN.IBMQREP_SUBS SET STATE='I'"
      db2 "UPDATE ASN.IBMQREP_SUBS SET GROUP_MEMBERS=NULL"
      
      db2 DELETE FROM ASN.IBMQREP_SAVERI
      db2 DELETE FROM ASN.IBMQREP_APPLYTRACE
      db2 DELETE FROM ASN.IBMQREP_APPLYMON
      db2 DELETE FROM ASN.IBMQREP_SPILLQS
      db2 DELETE FROM ASN.IBMQREP_DONEMSG
      db2 DELETE FROM ASN.IBMQREP_SPILLEDROW
      db2 DELETE FROM ASN.IBMQREP_EXCEPTIONS
      db2 "UPDATE ASN.IBMQREP_RECVQUEUES SET STATE='A'"
      db2 "UPDATE ASN.IBMQREP_TARGETS SET HAS_LOADPHASE='E'"
      db2 "UPDATE ASN.IBMQREP_TARGETS SET STATE='I'"
      

    5. Cold start QCapture and QApply using startmode=COLD. This needs to be done at all sites.
      • At LHO:
        export CAP_PATH=/export/ldbd/var/db2/capture
        export APP_PATH=/export/ldbd/var/db2/apply
        asnqcap startmode=COLD capture_server=seg_lho capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null &
        
        asnqapp apply_server=seg_lho apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null &
        
      • At LLO:
        export CAP_PATH=/export/ldbd/var/db2/capture
        export APP_PATH=/export/ldbd/var/db2/apply
        asnqcap startmode=COLD capture_server=seg_llo capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null &
        
        asnqapp apply_server=seg_llo apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null &
        
      • At CIT:
        export CAP_PATH=/export/ldbd/var/db2/capture
        export APP_PATH=/export/ldbd/var/db2/apply
        asnqcap startmode=COLD capture_server=seg_cit capture_schema=ASN CAPTURE_PATH=/export/ldbd/var/db2/capture 1>> ${CAP_PATH}/asnqcap.stdout 2>> ${CAP_PATH}/asnqcap.stderr < /dev/null &
        
        asnqapp apply_server=seg_cit apply_schema=ASN APPLY_PATH=/export/ldbd/var/db2/apply 1>> ${APP_PATH}/asnqapp.stdout 2>> ${APP_PATH}/asnqapp.stderr < /dev/null &
        
      The Q capture programs will be in a quiescent state with all subscriptions inactive. The capture program log will display the following message:
      ASN7000I  "Q Capture" : "ASN" : "WorkerThread" : "0" subscriptions are active. 
      "40" subscriptions are inactive. "0" subscriptions that were new and were successfully activated.
      "0" subscriptions that were new could not be activated and are now inactive.
      

    6. Send the Start signal to Q replication to activate the subscriptions. The replication processes are started by running the following commands at LHO, not LLO or CIT. This triggers the replication to start at all the databases.
      cd ~/src/glue/src/conf/q_replication
      db2 connect to seg_lho
      db2 -tf start_1.sql
      db2 -tf loaddone_1.sql
      db2 -tf start_2.sql
      db2 -tf loaddone_2.sql
      db2 connect reset
      
    7. Once replication is restored, test the replication by inserting and deleting a test segment at each site. Do this as follows:
      1. The test insertion uses the ldbdd servers at each site; however, it is essential that they be started with a restricted grid-map file so that the DMT or Keith cannot insert segments while this test is being carried out. To do this, rename the existing grid-map file with
        mv /export/ldbd/etc/grid-security/ldbdserver/grid-mapfile /export/ldbd/etc/grid-security/ldbdserver/grid-mapfile.orig
        
        and create a new file that contains only the DN of the administrator performing the test and NOT the DMT's grid certificate. Do this at each site and start ldbdd with
        ldbdd -d -c /export/ldbd/etc/ldbdserver.ini
        

      2. On a client with glue installed, download the file badgers.xml and insert it into the LHO database with
        ldbdc --server ldas.ligo-wa.caltech.edu --insert badgers.xml
        
        This creates a fake segment at LHO which you can check for by using
        db2 "select * from process where program = 'BadgerMon'"
        
        Repeat this select command at LLO and CIT to check that replication has copied this data from LHO to CIT and LLO.

      3. The creator_db and process_id obtained by querying the process table for this inserted segment can be used to query the other tables to check that replication is working for all tables.
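
        For example (a sketch; the creator_db and process_id values here are only illustrative and should be taken from your own query of the process table):
        db2 "select count(*) from segment where creator_db = 2 and process_id = x'20070620013201702626000000'"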

      4. Once replication has been verified to be working, delete the test segments by deleting the entry in the process table using the creator_db and process_id obtained from the above query. For example,
        db2 "delete from process where creator_db = 2 and process_id = x'20070620013201702626000000'"
        
        Check that the delete has been successfully replicated to all sites by querying for this creator_db, process_id pair in the process table.

      5. If insertion and deletion of the above test segment at LHO and replication to CIT and LLO worked, repeat the process to check that insertion and deletion at LLO is correctly propagated to CIT and LHO, and that insertion and deletion at CIT is correctly propagated to LLO and LHO.

    You should perform a final check of the apply and capture stdout files for error messages, but everything should now be running.
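
    One way to do this is to grep the capture and apply output files for ASN error codes (which end in "E", such as ASN7551E); a sketch, assuming the paths used when the programs were started above:
    grep -E "ASN[0-9]+E" /export/ldbd/var/db2/capture/asnqcap.stdout /export/ldbd/var/db2/capture/asnqcap.stderr
    grep -E "ASN[0-9]+E" /export/ldbd/var/db2/apply/asnqapp.stdout /export/ldbd/var/db2/apply/asnqapp.stderr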

    Deleting the Message Queues

    To display the queue managers on a machine, run

    dspmq
    

    You can stop and delete each queue manager with the commands

    
    endmqm -i QM1
    endmqlsr -m QM1
    dltmqm QM1
    
    endmqm -i QM2
    endmqlsr -m QM2
    dltmqm QM2
    
    endmqm -i QM3
    endmqlsr -m QM3
    dltmqm QM3
    

    Uninstall WebSphere MQ

    1. Shut down Q replication and WebSphere MQ
      • At Hanford use
        asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LHO LOGSTDOUT=Y STOP
        asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LHO LOGSTDOUT=Y STOP
        runmqsc QM1 << EOF
        stop channel (QM1_TO_QM2)
        stop channel (QM1_TO_QM3)
        end
        EOF
        endmqm -i QM1
        endmqlsr -m QM1
        
      • At Livingston use
        asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_LLO LOGSTDOUT=Y STOP
        asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_LLO LOGSTDOUT=Y STOP
        runmqsc QM2 << EOF
        stop channel (QM2_TO_QM1)
        stop channel (QM2_TO_QM3)
        end
        EOF
        endmqm -i QM2
        endmqlsr -m QM2
        
      • At Caltech use
        asnqccmd CAPTURE_SCHEMA=ASN CAPTURE_SERVER=SEG_CIT LOGSTDOUT=Y STOP
        asnqacmd APPLY_SCHEMA=ASN APPLY_SERVER=SEG_CIT LOGSTDOUT=Y STOP
        runmqsc QM3 << EOF
        stop channel (QM3_TO_QM1)
        stop channel (QM3_TO_QM2)
        end
        EOF
        endmqm -i QM3
        endmqlsr -m QM3
        
    2. Clean up IPC:
      ipclean
      
    3. Remove shared resources owned by the mqm user and mqm group
      To list shared resources, use the ipcs command:
      [ldbd@ldas.ldas-la ~]$ ipcs
      IPC status from  as of Wed Jul  2 09:07:00 PDT 2008
      T         ID      KEY        MODE        OWNER    GROUP
      Message Queues:
      Shared Memory:
      m  956301409   0x7401c2   --rw-rw----      mqm      mqm
      m  956301408   0x7401c0   --rw-rw----      mqm      mqm
      m  956301403   0x40740006 --rw-rw-rw-      mqm      mqm
      m        118   0x78003fb8 --rw-r--r--     grid     grid
      m        117   0x78003557 --rw-r--r--     grid     grid
      m         28   0x790195cd --rw-rw-rw-     grid     grid
      Semaphores:
      s 1375731729   0x740008   --ra-ra-ra-      mqm      mqm
      s   50331664   0x740011   --ra-ra----      mqm      mqm
      s 1375731727   0x740010   --ra-ra----      mqm      mqm
      s   50331662   0x740006   --ra-ra-ra-      mqm      mqm
      s         37   0x790195cd --ra-ra-ra-     grid     grid
      
      To remove the shared resources owned by the mqm user and group, use the ipcrm command:
      [ldbd@ldas.ldas-la ~]$ ipcs | grep mqm | awk '{print " ipcrm  -"$1"  "$2}'
       ipcrm  -m  956301409
       ipcrm  -m  956301408
       ipcrm  -m  956301403
       ipcrm  -s  1375731729
       ipcrm  -s  50331664
       ipcrm  -s  1375731727
       ipcrm  -s  50331662
      
       
      [ldbd@ldas.ldas-la ~]$ ipcrm  -m  956301409
      [ldbd@ldas.ldas-la ~]$ ipcrm  -m  956301408
      [ldbd@ldas.ldas-la ~]$ ipcrm  -m  956301403
      [ldbd@ldas.ldas-la ~]$ ipcrm  -s  1375731729
      [ldbd@ldas.ldas-la ~]$ ipcrm  -s  50331664
      [ldbd@ldas.ldas-la ~]$ ipcrm  -s  1375731727
      [ldbd@ldas.ldas-la ~]$ ipcrm  -s  50331662
      
    4. Delete queue managers
      Change the queue manager name to the one that you need to delete. In this example, we delete queue manager QM2 at ldas-llo:
      dltmqm QM2
      
    5. Get a list of the MQ-related packages, and you will get something like:
      [ldbd@ldas.ldas-wa ~]$ pkginfo | grep mqm
      application mqm                              WebSphere MQ for Sun Solaris
      application mqm-upd11                        WebSphere MQ for Sun Solaris 2 - U802142
      
    6. Remove the packages in reverse order:
      pkgrm mqm-upd11
      pkgrm mqm
      
    7. Remove the other MQ-associated packages (basically gskit):
      pkginfo | grep gsk
      application gsk6bas                          Certificate and SSL Base Runtime (gsk6bas)
      pkgrm gsk6bas
      
$Id$