09-Jul-2007

Questions when beginning a new job

Here's a list of some of the questions I'd ask when taking on a new OpenVMS systems management job.

  • system disk
    • only one system disk per architecture?
    • shadowed on different controllers?
    • operator log, audit journal, accounting file rolled regularly and off system disk?
    • sysuaf, rightslist off system disk?
    • regularly used shareables installed?
    • pagefiles on separate shadowed disks on different controllers?
    • version control for system startup procedures, modparams, etc?
  • hardware
    • are the systems redundant? Identify and correct all single points of failure (SPOFs).
    • are all boxes/cables labeled? Correctly?
    • configuration documentation/drawings exist?
    • log books for each cluster/node exist and maintained by HP engineers?
    • what's owned/leased?
    • maintenance contracts exist?
    • leasing contracts exist?
    • When was the last hardware inventory performed by HP and/or systems management?
    • spare parts inventory/location? Are all loose boards labeled?
    • RAID level 0/singletons exist? If so, correct. RAID 0+1 or RAID 5 only.
  • software
    • maintenance contracts exist?
    • vendor list/contact numbers/service call routines exist?
    • installed software version number list exists?
    • defragger in place?
    • scheduling system in place?
    • what software has paid maintenance but is not being used?
  • change control
    • documented procedures in place?
    • regulatory requirements (such as SOX) exist?
  • security
    • subsystem ACEs used as opposed to installing images with privileges?
    • standardized ACL structure defined?
    • external audits performed?
    • hardware/software update procedures and windows of opportunity documented?
    • shared UICs exist? Correct.
    • interactive shared accounts exist? Correct.
    • who has privileges?
    • physical security acceptable? two-factor authentication used?
    • who has access to the machine room? spares location? plant? network room?
    • untrusted users with privileged access to the command line? write menus to remove it.
    • what open IP ports are there?
    • minimum password requirements specified (length and lifetime)?
    • password cracker in place?
    • old account disable/delete in place? Manual? Automatic?
    • ssh/scp/sftp in use?
    • "orphaned" proxies exist? remove.
    • wildcard proxies exist? remove.
  • network
    • documentation/diagrams exist?
    • IP and DECnet address registers exist?
    • firewalled? who controls firewall?
    • switch/router configurations backed up regularly?
    • who controls network config?
    • packet prioritization in place? Needed?
  • trouble tickets
    • documented escalation procedures exist?
    • documented SLAs exist? How is meeting SLAs measured?
    • trouble ticket system in place?
  • monitoring
    • disk usage?
    • CPU usage?
    • network usage?
    • system tuned to use all available memory?
    • uptime monitoring?
    • automated alarms for levels exceeded?
    • automated audit reports generated?
    • capacity management/tools (T4, PERFDAT, or better, PCPA) in place and used?
    • shadow set membership monitoring in place?
  • backups
    • documented procedures exist?
    • does the application "quiet point"? Backups dependent on snapshots or shadow breaks?
    • how does tape rotation work?
    • offsite tapes mandatory. Trusted vaulting company? Are they audited/tested regularly?
    • barcoded tapes mandatory
    • how is media tracked (i.e., what media management software is in place)?
    • restores tested regularly?
  • disaster recovery
    • documented plan exists?
    • tested regularly?
  • development
    • development separated from production? How? Physical/logical?
    • what languages are in use? restricted list of languages?
    • bug tracking system in place?
    • programming standards in place?
    • shared libraries used effectively?
    • change controlled?
    • do applications automatically page people when problems occur?
  • communications
    • register of phone numbers/pager numbers exists?
    • blog and/or wiki in place for systems team?
  • queues
    • appropriate print/batch queue security?
    • batch queues set at low priority?
    • printer hardware standards (i.e., approved lists of printers)?
    • printer documentation? where are they? is there a naming standard?
    • DCPS used?
  • misc systems management
    • ntp in place?
    • automatic daylight saving time switch in place?
    • Oracle in use? update C DST offset job required?
    • ISAM files converted regularly?
    • global buffers used? Sensibly?
    • error reporting/analysis tools (such as ISEE and WSEA) in place?
    • how are disks "owned"? disk ownership tracking required?
    • physical disk names only in mount logicals?
    • consolidated directory structure for systems management exists?
    • HP OpenVMS Systems Healthcheck performed and issues addressed?
    • application load balancing correct?
    • disk cluster sizes correct? controller caching effective?
    • application teams understand directory size limitations/file prefix naming?
    • What performance bottlenecks exist now?
    • application teams have transaction analysis documentation?
    • DBAs have database layout documentation?
    • disk I/Os balanced? disk queues nonexistent?
    • pagefaults acceptable?
    • swapping nonexistent?
    • AUTOGEN SAVPARAMS performed automatically each day in peak time?
    • Watchdog software (e.g. to detect the death of required jobs) in place?
    • Idle process killer in place?
  • facilities
    • to-scale machine room floor diagram exists?
    • Air conditioning acceptable? machine room overpressure? fresh air feed acceptable?
    • power conditioning/UPS/generator set, and associated monitoring acceptable?
    • power distribution/redundancy acceptable?
    • cable trays in place for fibre and power?
    • environmental monitoring in place? automated alarms/shutoff? Restart tested and documented?
    • fire suppression acceptable?
    • EPO switch protected from accidental activation?
    • windows for equipment moves/"tiles up" events documented?
  • Console and related
    • Is the DUMP_DEV (or related on IA64) set up correctly?
    • Is the dump file large enough to accept a dump?
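
As an illustration of the "automated alarms for levels exceeded" item under monitoring, here is a minimal sketch of a disk usage threshold check. It is not tied to any particular OpenVMS tool: the DiskReading record, the device names, and the 85% default threshold are all assumptions made for the example, and the readings would have to come from whatever collector is already in place.

```python
from dataclasses import dataclass


@dataclass
class DiskReading:
    """One sample of disk capacity (hypothetical record layout)."""
    device: str
    total_blocks: int
    free_blocks: int


def percent_used(reading):
    """Return the percentage of the disk that is in use."""
    used = reading.total_blocks - reading.free_blocks
    return 100.0 * used / reading.total_blocks


def disks_over_threshold(readings, threshold_pct=85.0):
    """Return (device, percent used) for every disk at or above the threshold.

    The 85% default is an assumed value, not a recommendation.
    Readings with a zero total size are skipped to avoid dividing by zero.
    """
    alarms = []
    for reading in readings:
        if reading.total_blocks <= 0:
            continue
        pct = percent_used(reading)
        if pct >= threshold_pct:
            alarms.append((reading.device, round(pct, 1)))
    return alarms


if __name__ == "__main__":
    # Hypothetical sample data; real figures would come from the site's monitoring feed.
    sample = [
        DiskReading("DSA1", 1000, 100),
        DiskReading("DSA2", 1000, 500),
    ]
    for device, pct in disks_over_threshold(sample):
        print(f"ALARM: {device} is {pct}% full")
```

The same shape (collect readings, compare against a documented limit, raise an alarm) would apply to the CPU and network usage items as well.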
Posted at July 9, 2007 5:08 PM
Comments

A good list. Here are some more thoughts.
Are there proper procedures for username authorisation? Often there is a gap in procedures for removal of usernames.
Is there a proper record of who authorised every existing username and who they belong to and are all their privileges necessary?
Is there a process for regular reviews of user access?

Posted by: Ian Miller at July 9, 2007 8:09 PM

Yes, authorization is always a touchy subject. I had to deal with SOX in my last position. Record keeping and auditing are paramount. All your points have hit my radar before on this.

Posted by: Jim Duff at July 9, 2007 10:24 PM

Interesting read. Spot on, too. This same list also applies when an existing system manager or system management team adopts existing or new OpenVMS systems. The System Lifecycle presentation from the most recent Bootcamp is an orthogonal and system-centric view to the human-centric view considered here. http://labs.hoffmanlabs.com/node/317 Oh, and don't forget to load the AMDS / AvailMan client driver before you need it.

Posted by: Stephen Hoffman at July 10, 2007 4:35 AM

ManohMan ... This questionnaire reminds me of the recent OVMS quality audit I had ... anyway it is a good consolidated list :-)

Posted by: Santhosh at August 8, 2007 9:55 AM
