Site Reliability Engineer, Incident Management
The Site Reliability Engineer, Incident Management is responsible for monitoring and maintaining the entire Qualys infrastructure installed at different data centers. When a system malfunctions, the Site Reliability Engineer, Incident Management troubleshoots, repairs, and gets the system back up as quickly as possible. This can require working at all hours of the day, depending on when the disruption occurs. The role also ensures the maximum possible service availability and performance, and provides support services for Engineering and other technical teams.
ESSENTIAL FUNCTIONS AND RESPONSIBILITIES
- Monitor the performance and capacity of computer systems using a variety of tools, looking for hardware, software, and environmental alerts or malfunctions. When an issue is identified, the Site Reliability Engineer, Incident Management works to determine the cause of the problem.
- Perform basic troubleshooting to isolate problems and take appropriate action to resolve them.
- Ensure timely resolution of trouble tickets.
- When a problem impacts IT services, the Site Reliability Engineer, Incident Management works to triage or troubleshoot the problem, if possible, closely following standard operating procedures. This may include coordinating with third-party vendors, customer contacts, or other IT teams.
- The Site Reliability Engineer, Incident Management must carefully track and document all issues and resolutions in detail in the ticketing and documentation tools. This builds the team's knowledge base and provides a record of the health of the system.
- When problems are too large or complex for quick troubleshooting, the Site Reliability Engineer, Incident Management must escalate the issue to management, other IT resources, or third-party vendors for assistance in reaching a resolution. The Site Reliability Engineer, Incident Management maintains ongoing communication within the team and externally to keep all stakeholders aware of relevant known issues and the steps being taken.
- The Site Reliability Engineer, Incident Management team operates 24x7, 365 days a year.
- One to two years of Network Operations or equivalent experience.
- Must possess strong interpersonal skills and have the ability to interact with all levels of employees in a professional manner.
- As an essential function of this position, the employee must be able to handle high levels of stress satisfactorily and be congenial with other employees and customers at all times.
- The Site Reliability Engineer, Incident Management role operates in a fast-paced environment where critical thinking is essential, including the ability to extrapolate ideas from one situation to another.
- Strict adherence to company policies, confidentiality, and mature judgment must be demonstrated at all times.
- Assigned duties should be performed in a timely and accurate manner.
- Because this individual often represents internal and external customers and colleagues, professionalism is a must.
- Excellent time management and organizational skills, and the ability to handle multiple concurrent tasks and projects with minimal supervision.
- Must deliver objectives both individually and through interaction with others, and must therefore demonstrate an understanding of organizational functions and departments.
- Should have worked with ticketing tools such as ServiceNow, BMC Remedy, ManageEngine, HPSM, etc.
- Should have worked with monitoring tools such as SolarWinds, Nagios, Zenoss, Splunk, etc.
Education and certifications
- An associate degree or technical school education is highly recommended, along with strong knowledge of computer functionality.
- Any technical certification in Cisco, VMware, or Linux will be an added advantage.