Job Description

Diverse Agile Solutions is seeking a Site Reliability Engineer (SRE) with experience establishing and maintaining site reliability engineering operations on multi-user High-Performance Computing (HPC) systems, using a variety of configuration management, IT monitoring, and automation tools within a Linux environment (Red Hat, CentOS).

Required Clearance: TS/SCI with Polygraph

Responsibilities:

  • Create a new Nagios alerting database and a new SRE database, and develop an effective, consistent SRE automation protocol
  • Work with XFS/ZFS file systems
  • Use NFS/block storage file system sharing
  • Use automation tools

Required Education and Experience:

Certifications: IAT Level II certification required

Bachelor’s degree in Computer Science or a related field and ten (10) years of demonstrable experience in High-Performance Computing systems administration and support of a large client-server-based IT enterprise.

  • Experience with and/or exposure to automation tools, including Puppet, Salt, Ansible, and Chef. Candidates shall also have experience scripting in Bash, Python, and/or Perl.
  • Experience with or exposure to XFS/ZFS file systems and NFS/block storage file system sharing; SSH, TMUX, PDSH, CLUSH system access; VI, EMACS, AWK/SED, CRON system editing; and Nagios, Ganglia, SNMP information technology monitoring systems.