LDAS Administration

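This page details the commands necessary to perform basic administration on the LDAS system at UWM. It covers starting and stopping the system, rebooting an API, updating and rebuilding the disk cache file, adding or removing a node from the mpiAPI, and manually killing jobs.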

Adding a user to LDAS is described on a separate page.

If you have not worked with LDAS before, please read through the introduction and interacting with APIs sections of this page before continuing.

Remember, anywhere you see the string ::MGRKEY, replace it with the LDAS manager key, which can be obtained from Duncan.


Introduction

The most important links are:

LDAS is composed of several programs, or APIs, that perform different functions in the system and run on different machines. The managerAPI is the supervisor program to which users submit jobs (or LDAS user commands). The managerAPI then directs the other APIs to perform the necessary actions to process the job.

Each API runs a TCL interpreter that interacts with the user or performs the commands that it is instructed to run. APIs that perform computation (such as the frameAPI) also have a C++ layer that contains data processing functions, which are called from the TCL layer.

The APIs and their functions are as follows:


Interacting with the APIs

Each API has an emergency port that contains a TCL interpreter. It allows users to interact with the API and execute any TCL function that the API has access to: the functions that the API has loaded as well as all of the regular TCL commands. Each API also has an operator port that accepts commands from other APIs or, in the case of the managerAPI, from the user.

The port number of the emergency port is one higher than the operator port of the API. For example, the manager's operator socket is port 10001 and its emergency port is port 10002. Since each API runs on a different machine, you also need to know the name of the machine that the API is running on. You can then log into the emergency port in the usual fashion using the telnet program. For example, the managerAPI runs on ldas.phys.uwm.edu:

[duncan@antares duncan]$ telnet ldas.phys.uwm.edu 10002
Trying 129.89.57.100...
Connected to ldas.phys.uwm.edu.
Escape character is '^]'.

No prompt is given, but the API is ready for commands. The first word that the API receives should be the manager key, referred to in this page as ::MGRKEY. This can be obtained from Duncan. After the manager key comes a list of TCL functions or commands. For example, we could use the LDAS function addLogEntry to create an entry in the manager's log file:

[duncan@antares duncan]$ telnet ldas.phys.uwm.edu 10002
Trying 129.89.57.100...
Connected to ldas.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY addLogEntry "hello world" orange
{managerAPI:emergency_port:executed: addLogEntry "hello world" orange}
Connection closed by foreign host.

If we now look at the managerAPI's log file we see:

09/19/02-11:25:38 CDT 
09/19/02-16:25:38 GMT 716487951 IDLE emergency accepted connection from: 129.89.57.174 antares.phys.uwm.edu 51532
09/19/02-11:25:38 CDT 
09/19/02-16:25:38 GMT 716487951 IDLE emergency hello world
09/19/02-11:25:38 CDT 
09/19/02-16:25:38 GMT 716487951 IDLE emergency executed: addLogEntry "hello world" blue

This should give you the basic idea of how to get the APIs to execute commands. This is the method used below to perform the most common administration actions that we do at UWM. Any command submitted to an API will be logged in that API's log file.
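The telnet session above can also be scripted. The following is a minimal Python sketch and not part of LDAS: the function names format_request and send_to_emergency_port are hypothetical, and the sketch assumes that the emergency port reads one newline-terminated line and closes the connection after replying, as the transcript suggests. Telnet itself normally sends a carriage return and line feed, so the plain newline used here is an assumption.

```python
import socket


def format_request(manager_key: str, tcl_command: str) -> bytes:
    """Build the single line sent to an emergency port: key first, then the TCL command."""
    if "\n" in tcl_command:
        raise ValueError("the command must fit on one line")
    return f"{manager_key} {tcl_command}\n".encode("ascii")


def send_to_emergency_port(host: str, port: int, manager_key: str,
                           tcl_command: str, timeout: float = 30.0) -> str:
    """Send one command and return whatever the API writes back before it closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(format_request(manager_key, tcl_command))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an empty read means the API closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

With the real manager key in a variable key, the example above would be sent as send_to_emergency_port("ldas.phys.uwm.edu", 10002, key, 'addLogEntry "hello world" orange').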

The other information that is needed is the location of the APIs' ports. This can be obtained from the API status page. For each running API this page lists the port number of the operator socket, for example

green ball diskcache API is running on datacon port ::BASEPORT+4

For LDAS UWM, BASEPORT is port 10000, so the operator port of the diskcacheAPI is port 10004 and the emergency port is 10005 on the machine datacon.phys.uwm.edu.
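The port arithmetic can be written out as a short Python sketch. The function name api_ports is hypothetical; the only facts used are the BASEPORT value of 10000 and the rule that the emergency port is one higher than the operator port.

```python
from typing import Tuple

BASEPORT = 10000  # value quoted on this page for LDAS at UWM


def api_ports(offset: int, baseport: int = BASEPORT) -> Tuple[int, int]:
    """Return (operator_port, emergency_port) for an API listed at ::BASEPORT+offset."""
    operator = baseport + offset
    return operator, operator + 1


# The diskcacheAPI is listed at ::BASEPORT+4.
print(api_ports(4))  # (10004, 10005)
```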


Starting the system

To start LDAS, ssh into the machine ldas.phys.uwm.edu as user ldas and execute the runLDAS command:

[duncan@antares duncan]$ ssh ldas@ldas.phys.uwm.edu
ldas@ldas.phys.uwm.edu's password: 
Last login: Wed Sep 18 19:19:23 2002 from antares.phys.uw
Sun Microsystems Inc.   SunOS 5.8       Generic February 2000
[ldas@ldas ldas_outgoing]$ /ldas/bin/runLDAS ::MGRKEY

You may be prompted for the ldas user's ssh passphrase if this is the first time that LDAS has been started since the machine ldas.phys.uwm.edu was rebooted. You can then watch the startup of the system in the managerAPI's log file. This will begin with

09/18/02-19:19:42 CDT 
09/19/02-00:19:42 GMT 716429995 STARTUP archiveLog /ldas_outgoing/logs/LDASmanager.log.html (file6) closed (archived as /ldas_outgoing/logs/archive/managerAPI/LDASmanager.716430013)
09/18/02-19:19:43 CDT 
09/19/02-00:19:43 GMT 716429996 STARTUP openListenSock port 10002 (emergency) opened on ldas as sock8

The manager will start all the other APIs on the various LDAS hardware and then start a number of internal services. The final message of the startup sequence is

09/18/02-19:39:30 CDT 
09/19/02-00:39:30 GMT 716431183 IDLE openListenSock port 10001 (operator) opened on ldas as sock7

Once you see this, LDAS is up and ready for commands. The runLDAS command should have terminated and you can log out of ldas.phys.uwm.edu.
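Watching for the final startup message can be automated. The Python sketch below is an illustration only: it works on log text as rendered on this page (the log file itself is an HTML file, which this sketch does not parse), and the names LogEntry, parse_gmt_line and operator_port_opened, as well as the field names, are hypothetical labels inferred from the excerpts shown here.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches the second line of each log entry, e.g.
# "09/19/02-00:39:30 GMT 716431183 IDLE openListenSock port 10001 (operator) ..."
GMT_LINE = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d-\d\d:\d\d:\d\d) GMT "
    r"(?P<seconds>\d+) (?P<state>\S+) (?P<source>\S+) (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    stamp: str
    seconds: int
    state: str
    source: str
    message: str


def parse_gmt_line(line: str) -> Optional[LogEntry]:
    """Split a GMT log line into its fields, or return None for any other line."""
    m = GMT_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["stamp"], int(m["seconds"]), m["state"], m["source"], m["message"])


def operator_port_opened(lines: Iterable[str], port: int = 10001) -> bool:
    """True once the log reports that the operator port has been opened."""
    wanted = f"port {port} (operator) opened"
    for line in lines:
        entry = parse_gmt_line(line)
        if entry and entry.source == "openListenSock" and entry.message.startswith(wanted):
            return True
    return False
```

The local-time (CDT) line of each entry does not match the pattern and is skipped.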


Stopping the System

To stop LDAS, log into the managerAPI's emergency port and issue the shutdown command mgr::sHuTdOwN:

[duncan@antares duncan]$ telnet ldas.phys.uwm.edu 10002
Trying 129.89.57.100...
Connected to ldas.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY mgr::sHuTdOwN

This should stop the entire system. Again, you can watch the progress in the managerAPI log file. You will see

09/18/02-15:16:34 CDT 
09/18/02-20:16:34 GMT 716415407 IDLE emergency accepted connection from: 129.89.57.174 antares.phys.uwm.edu 51351
09/18/02-15:16:34 CDT 
09/18/02-20:16:34 GMT 716415407 SHUTDOWN closeListenSock port 10001 (sock7) (operator) closed on ldas

and the manager will abort any running jobs with an error message, kill all the other APIs and save its current job number. The shutdown is complete when you see

09/18/02-15:18:18 CDT 
09/18/02-20:18:18 GMT 716415511 SHUTDOWN queue::save the QUEUE array has been saved to /ldas_outgoing/logs/manager.queue.
09/18/02-15:18:18 CDT 
09/18/02-20:18:18 GMT 716415511 SHUTDOWN key::increment incrKey: "key" array written to: seqKEYS
09/18/02-15:19:12 CDT 
09/18/02-20:19:12 GMT 716415565 SHUTDOWN closeListenSock port 10002 (sock8) (emergency) closed on ldas
09/18/02-15:19:12 CDT 
09/18/02-20:19:12 GMT 716415565 SHUTDOWN closeLog /ldas_outgoing/logs/LDASmanager.log.html (file6) closed

The TCL interpreter on the emergency port will then exit, and LDAS has been successfully shut down.


Rebooting an API

Sometimes it may be necessary to reboot an API without restarting the whole of LDAS. To do this, use the mgr::bootstrapAPI command, which takes as its argument the name of the API you want to restart (e.g. frame, datacond, ligolw, etc.). Note that you cannot bootstrap the managerAPI; you must restart LDAS to reboot the manager. For example, to bootstrap the eventmonAPI:

[duncan@antares duncan]$ telnet ldas 10002
Trying 129.89.57.100...
Connected to ldas.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY mgr::bootstrapAPI eventmon

The manager will attempt to shut down the eventmonAPI cleanly, or kill it if it cannot be shut down cleanly. You can watch this in the manager's logs. When the API has been successfully restarted you will see the message

09/19/02-11:44:12 CDT
09/19/02-16:44:12 GMT 716489065 IDLE mgr::bootstrapAPI done. eventmon API is now running as pid: 19065

in the managerAPI log and the telnet session will exit:

{managerAPI:emergency_port:executed: mgr::bootstrapAPI eventmon}
Connection closed by foreign host.

Forcing an Update of the Disk Cache File

Sometimes it may be necessary to tell LDAS to update the disk cache file on demand. The command cache::updateDirs, called with no arguments, does this. Log into the diskcacheAPI emergency port and execute the command

[duncan@antares duncan]$ telnet datacon.phys.uwm.edu 10005
Trying 129.89.57.115...
Connected to datacon.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY cache::updateDirs

and the cache file will be updated, which may take a while. Once it is done, the emergency port will close and the cache file will hold the new contents.


Forcing a Rebuild of the Disk Cache File

Sometimes it may be necessary to tell LDAS to purge and reconstruct the disk cache file. To do this, log into the diskcacheAPI emergency port and execute the command cache::updateDirs with three arguments. The first should be your name, which will be recorded in the log file, the second is 0 and the third is 1. For example, if I were to rebuild the cache file I would execute the command

[duncan@antares duncan]$ telnet datacon.phys.uwm.edu 10005
Trying 129.89.57.115...
Connected to datacon.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY cache::updateDirs duncan 0 1

You will see the message

09/19/02-12:20:35 CDT 
09/19/02-17:20:35 GMT 716491248 IDLE disk::cacheFile file /ldas_outgoing/.frame.cache deleted

and the cache file will be rebuilt, which may take a while. Once it is done, the emergency port will close and the cache file will hold the new contents.


Adding or removing a node from the mpiAPI

Sometimes it may be necessary to remove a node from the beowulf for maintenance. LDAS should be told about this so that it does not try to use a non-functioning node. To see if any jobs are running on a particular node, log into the mpiAPI emergency port and execute the following TCL command (you can copy and paste it from this page, replacing s099 with the name of the node that you wish to test):

[duncan@antares duncan]$ telnet medusa.phys.uwm.edu 10020
Trying 129.89.57.103...
Connected to medusa.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY set mynode s099 ; puts $cid "====================" ; foreach item
$::mpi::queue(NULL,nod) { foreach { name node score } $item { if { $mynode ==
$name } { if { $score } { puts $cid "$mynode is in use" } else { puts $cid
"$mynode is free" } } } } ; puts $cid "====================" 

The mpiAPI will then report whether the node is free or in use. No report means that LDAS is not currently using the node. If the node is in use, wait until the job using that node has finished and try again.
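To make the logic of the TCL one-liner easier to follow, here is the same loop as a Python sketch. It does not talk to LDAS. The data layout is inferred only from the TCL command above: each item of the queue is assumed to be a flat run of (name, node, score) triples, and a non-zero score is taken to mean that the node is in use. The function name node_reports is hypothetical.

```python
from typing import List, Sequence


def node_reports(queue: Sequence[Sequence], mynode: str) -> List[str]:
    """Return one report per matching triple; an empty list means no report."""
    reports = []
    for item in queue:
        # Walk the flat list three elements at a time: name, node, score.
        for i in range(0, len(item) - 2, 3):
            name, score = item[i], item[i + 2]
            if name == mynode:
                if int(score):
                    reports.append(f"{mynode} is in use")
                else:
                    reports.append(f"{mynode} is free")
    return reports


# Made-up queue contents, for illustration only.
example_queue = [["s098", "a", 0, "s099", "b", 2]]
print(node_reports(example_queue, "s099"))  # ['s099 is in use']
```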

The nodes can be managed by the command mpi::liveNodelistUpdate which takes three arguments:

  1. an identifying string: use your userid, e.g. duncan.
  2. an action, either add or remove.
  3. the node name of the node that you wish to add or remove, e.g. s099
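The three arguments can be checked before the command is sent. The Python sketch below only builds the command text; the helper name live_nodelist_update and its validation rules are hypothetical, not part of LDAS.

```python
def live_nodelist_update(userid: str, action: str, node: str) -> str:
    """Build the mpi::liveNodelistUpdate command from its three arguments."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    for word in (userid, node):
        # Each argument must be a single word, or the command would gain extra arguments.
        if not word or any(ch.isspace() for ch in word):
            raise ValueError(f"invalid argument: {word!r}")
    return f"mpi::liveNodelistUpdate {userid} {action} {node}"


print(live_nodelist_update("duncan", "remove", "s112"))
# mpi::liveNodelistUpdate duncan remove s112
```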

For example, suppose I want to remove node s112. I first make sure that it is free and then do the following:

[duncan@antares duncan]$ telnet medusa.phys.uwm.edu 10020
Trying 129.89.57.103...
Connected to medusa.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY mpi::liveNodelistUpdate duncan remove s112
{mpi :emergency:executed: mpi::liveNodelistUpdate duncan remove s112}
Connection closed by foreign host.

The mpiAPI log then shows

09/19/02-15:21:35 CDT 
09/19/02-20:21:35 GMT 716502108 IDLE mpi::updateRscFile ::NODENAMES updated!! (m001 s002 s003 s004 s005 s006 s007 s008 s009 s010 s011 s012 s013 s014 s015 s017 s018 s019 s020 s021 s022 s023 s024 s025 s026 s027 s028 s029 s030 s031 s032 s033 s034 s035 s036 s037 s038 s039 s040 s041 s042 s043 s045 s046 s047 s048 s049 s050 s051 s052 s053 s054 s055 s056 s057 s058 s059 s060 s061 s062 s063 s064 s065 s066 s067 s068 s069 s071 s072 s073 s074 s075 s076 s077 s078 s079 s080 s081 s082 s083 s084 s085 s086 s087 s088 s089 s090 s091 s092 s093 s094 s095 s096 s097 s098 s099 s100 s101 s102 s103 s104 s105 s106 s107 s108 s109 s110 s111 s113 s114 s115 s116 s117 s118 s119 s120 s121 s122 s123 s124 s125 s126 s127 s128 s129 s130 s131 s132 s133 s134 s135 s136 s137 s138 s139 s140 s141 s142 s143 s144 s145 s146 s148 s149 s150)
09/19/02-15:21:35 CDT 
09/19/02-20:21:35 GMT 716502108 IDLE emergency executed: mpi::liveNodelistUpdate duncan remove s112

Now I am free to perform maintenance on the node. When I have finished, I simply run the command

[duncan@antares duncan]$ telnet medusa.phys.uwm.edu 10020
Trying 129.89.57.103...
Connected to medusa.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY mpi::liveNodelistUpdate duncan add s112
{mpi :emergency:executed: mpi::liveNodelistUpdate duncan add s112}
Connection closed by foreign host.

to add the node back in. The mpiAPI log then contains

09/19/02-15:24:25 CDT 
09/19/02-20:24:25 GMT 716502278 IDLE mpi::updateRscFile ::NODENAMES updated!! (m001 s002 s003 s004 s005 s006 s007 s008 s009 s010 s011 s012 s013 s014 s015 s017 s018 s019 s020 s021 s022 s023 s024 s025 s026 s027 s028 s029 s030 s031 s032 s033 s034 s035 s036 s037 s038 s039 s040 s041 s042 s043 s045 s046 s047 s048 s049 s050 s051 s052 s053 s054 s055 s056 s057 s058 s059 s060 s061 s062 s063 s064 s065 s066 s067 s068 s069 s071 s072 s073 s074 s075 s076 s077 s078 s079 s080 s081 s082 s083 s084 s085 s086 s087 s088 s089 s090 s091 s092 s093 s094 s095 s096 s097 s098 s099 s100 s101 s102 s103 s104 s105 s106 s107 s108 s109 s110 s111 s113 s114 s115 s116 s117 s118 s119 s120 s121 s122 s123 s124 s125 s126 s127 s128 s129 s130 s131 s132 s133 s134 s135 s136 s137 s138 s139 s140 s141 s142 s143 s144 s145 s146 s148 s149 s150 s112)
09/19/02-15:24:25 CDT 
09/19/02-20:24:25 GMT 716502278 IDLE emergency executed: mpi::liveNodelistUpdate duncan add s112

and the node is available for use again.
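Because the node lists in these log messages are long, checking by eye whether a node is present is error-prone. A small Python sketch can extract the list from the ::NODENAMES message; the function name logged_node_names is hypothetical, and the pattern is based only on the two log lines shown above.

```python
import re
from typing import List


def logged_node_names(line: str) -> List[str]:
    """Return the node names listed in a '::NODENAMES updated!!' log line."""
    m = re.search(r"::NODENAMES updated!! \(([^)]*)\)", line)
    return m.group(1).split() if m else []


# Shortened log line, for illustration only.
line = "09/19/02-20:24:25 GMT 716502278 IDLE mpi::updateRscFile ::NODENAMES updated!! (m001 s002 s112)"
print("s112" in logged_node_names(line))  # True
```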

Manually Killing Jobs

Sometimes it is necessary to manually kill a job. In this case, use the command mgr::abortJob jobid jobid, replacing jobid with the numeric job ID. For example, to kill job LDAS-UWM43465, execute

[duncan@antares duncan]$ telnet ldas.phys.uwm.edu 10002
Trying 129.89.57.100...
Connected to ldas.phys.uwm.edu.
Escape character is '^]'.
::MGRKEY mgr::abortJob 43465 43465                        
{managerAPI:emergency_port:executed: mgr::abortJob 43465 43465}
Connection closed by foreign host.

Note that you have to give the job ID twice! If the job is successfully aborted, the manager log will contain a message similar to

09/19/02-15:34:40 CDT 
09/19/02-20:34:40 GMT 716502893 IDLE emergency accepted connection from: 129.89.57.174 antares.phys.uwm.edu 51850
09/19/02-15:34:40 CDT 
09/19/02-20:34:40 GMT 716502893 IDLE mgr::abortJob LDAS-UWM43465 aborted at user 'duncan' request
09/19/02-15:34:40 CDT 
09/19/02-20:34:40 GMT 716502893 LDAS-UWM43465 ::LDAS-UWM43465::delete closed channel sock9
09/19/02-15:34:40 CDT 
09/19/02-20:34:40 GMT 716502893 LDAS-UWM43465 ::LDAS-UWM43465::delete assistant manager "LDAS-UWM43465" destroyed.
09/19/02-15:34:41 CDT 
09/19/02-20:34:41 GMT 716502894 IDLE leakLogger Allocated 1736 kB in 71769 seconds

If the job does not exist, you will see a log message similar to

09/19/02-15:30:00 CDT 
09/19/02-20:30:00 GMT 716502613 IDLE emergency accepted connection from: 129.89.57.174 antares.phys.uwm.edu 51831
09/19/02-15:30:00 CDT 
09/19/02-20:30:00 GMT 716502613 IDLE mgr::abortJob LDAS-UWM43463 has already been aborted or has finished
09/19/02-15:30:00 CDT 
09/19/02-20:30:00 GMT 716502613 IDLE emergency executed: mgr::abortJob 43463 43463
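The rule that the numeric job ID is given twice is easy to get wrong when typing. The Python sketch below builds the command text from either the full job name or the bare number. The helper name abort_job_command is hypothetical, and the LDAS-UWM prefix is simply the one that appears in the job names on this page.

```python
import re


def abort_job_command(job: str) -> str:
    """Build 'mgr::abortJob <id> <id>' from a job name such as LDAS-UWM43465 or 43465."""
    m = re.fullmatch(r"(?:LDAS-UWM)?(\d+)", job)
    if m is None:
        raise ValueError(f"not a recognisable job name: {job!r}")
    jobid = m.group(1)
    return f"mgr::abortJob {jobid} {jobid}"


print(abort_job_command("LDAS-UWM43465"))  # mgr::abortJob 43465 43465
```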


$Id: ldas_administration.html,v 1.17 2002/10/09 23:42:06 duncan Exp $