Questions when beginning a new job
Here's a list of some of the questions I'd ask when taking on a new OpenVMS systems management job.
- system disk
- only one system disk per architecture?
- shadowed on different controllers?
- operator log, audit journal, accounting file rolled regularly and off system disk?
- SYSUAF, RIGHTSLIST off system disk?
- regularly used shareables installed?
- pagefiles on separate shadowed disks on different controllers?
- version control for system startup procedures, modparams, etc?
- hardware
- are the systems redundant? Identify and correct all SPOF.
- are all boxes/cables labeled? Correctly?
- configuration documentation/drawings exist?
- log books for each cluster/node exist and maintained by HP engineers?
- what's owned/leased?
- maintenance contracts exist?
- leasing contracts exist?
- When was the last hardware inventory performed by HP and/or systems management?
- spare parts inventory/location? Are all loose boards labeled?
- RAID level 0/singletons exist? If so, correct. RAID 0+1 or RAID 5 only.
- software
- maintenance contracts exist?
- vendor list/contact numbers/service call routines exist?
- installed software version number list exists?
- defragger in place?
- scheduling system in place?
- what software is having maintenance paid but is not being used?
- change control
- documented procedures in place?
- regulatory requirements (such as SOX) exist?
- security
- subsystem ACEs used as opposed to installing images with privileges?
- standardized ACL structure defined?
- external audits performed?
- hardware/software update procedures and windows of opportunity documented?
- shared UICs exist? Correct.
- interactive shared accounts exist? Correct.
- who has privileges?
- physical security acceptable? two-factor authentication used?
- who has access to the machine room? spares location? plant? network room?
- untrusted users with privileged access to the command line? write menus to remove it.
- what open IP ports are there?
- minimum password requirements specified (length and lifetime)?
- password cracker in place?
- old account disable/delete in place? Manual? Automatic?
- ssh/scp/sftp in use?
- "orphaned" proxies exist? remove.
- wildcard proxies exist? remove.
- network
- documentation/diagrams exist?
- IP and DECnet address registers exist?
- firewalled? who controls firewall?
- switch/router configurations backed up regularly?
- who controls network config?
- packet prioritization in place? Needed?
- trouble tickets
- documented escalation procedures exist?
- documented SLAs exist? How is meeting SLAs measured?
- trouble ticket system in place?
- monitoring
- disk usage?
- CPU usage?
- network usage?
- system tuned to use all available memory?
- uptime monitoring?
- automated alarms for levels exceeded?
- automated audit reports generated?
- capacity management/tools (T4, PERFDAT, or better, PCPA) in place and used?
- shadow set membership monitoring in place?
- backups
- documented procedures exist?
- does the application "quiet point"? Backups dependent on snapshots or shadow breaks?
- how does tape rotation work?
- offsite tapes mandatory. Trusted vaulting company? Are they audited/tested regularly?
- barcoded tapes mandatory
- how is media tracked (i.e., what media management software is in place)?
- restores tested regularly?
- disaster recovery
- documented plan exists?
- tested regularly?
- development
- development separated from production? How? Physical/logical?
- what languages are in use? restricted list of languages?
- bug tracking system in place?
- programming standards in place?
- shared libraries used effectively?
- change controlled?
- do applications automatically page people when problems occur?
- communications
- register of phone numbers/pager numbers exists?
- blog and/or wiki in place for systems team?
- queues
- appropriate print/batch queue security?
- batch queues set at low priority?
- printer hardware standards (i.e., approved lists of printers)?
- printer documentation? where are they? is there a naming standard?
- DCPS used?
- misc systems management
- NTP in place?
- automatic daylight saving time switch in place?
- Oracle in use? update C DST offset job required?
- ISAM files converted regularly?
- global buffers used? Sensibly?
- error reporting/analysis tools (such as ISEE and WSEA) in place?
- how are disks "owned"? disk ownership tracking required?
- physical disk names only in mount logicals?
- consolidated directory structure for systems management exists?
- HP OpenVMS Systems Healthcheck performed and issues addressed?
- application load balancing correct?
- disk cluster sizes correct? controller caching effective?
- application teams understand directory size limitations/file prefix naming?
- What performance bottlenecks exist now?
- application teams have transaction analysis documentation?
- DBAs have database layout documentation?
- disk I/Os balanced? disk queues nonexistent?
- pagefaults acceptable?
- swapping nonexistent?
- AUTOGEN SAVPARAMS performed automatically each day in peak time?
- Watchdog software (e.g. to detect the death of required jobs) in place?
- Idle process killer in place?
- facilities
- to-scale machine room floor diagram exists?
- Air conditioning acceptable? machine room overpressure? fresh air feed acceptable?
- power conditioning/UPS/generator set, and associated monitoring acceptable?
- power distribution/redundancy acceptable?
- cable trays in place for fibre and power?
- environmental monitoring in place? automated alarms/shutoff? Restart tested and documented?
- fire suppression acceptable?
- EPO switch protected from accidental activation?
- windows for equipment moves/"tiles up" events documented?
- Console and related
- Is the DUMP_DEV (or related on IA64) set up correctly?
- Is the dump file large enough to accept a dump?
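
As a rough illustration of the monitoring question about automated alarms when levels are exceeded, here is a minimal Python sketch. It assumes the utilisation figures have already been collected by some other tool. The `Reading` type, the `exceeded` function, the metric names and the 90% default limit are all hypothetical, chosen only to show the idea of per-metric limits with a fallback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One sampled metric. Names and units here are hypothetical."""
    name: str       # e.g. "disk:DSA1" or "cpu:NODEA"
    percent: float  # utilisation, 0 to 100


def exceeded(readings, limits, default_limit=90.0):
    """Return (name, percent, limit) for each reading at or above its limit.

    `limits` maps a metric name to its own limit; any metric without an
    entry falls back to `default_limit` (an assumed value, not a standard).
    """
    alarms = []
    for r in readings:
        limit = limits.get(r.name, default_limit)
        if r.percent >= limit:
            alarms.append((r.name, r.percent, limit))
    return alarms


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [Reading("disk:DSA1", 93.5), Reading("cpu:NODEA", 40.0)]
    for name, pct, limit in exceeded(sample, {"disk:DSA1": 85.0}):
        print(f"ALARM {name}: {pct:.1f}% >= {limit:.1f}%")
```

In practice the output would feed whatever paging or ticketing mechanism the site already uses, and the limits would come from the capacity management data rather than being hard-coded.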
Posted at July 9, 2007 5:08 PM