Senior Cloud Consultant/Architect - (SRE / Observability)
Grid Dynamics (NASDAQ: GDYN) is looking for a professional who specializes in cloud operations with a focus on DevOps, a strong understanding of SRE practices including observability (logging, tracing, alerting), and a vision for the use of AI in operations. Our ideal candidate is a consultant with a mix of knowledge and skills in software development and cloud platforms, and with experience in advising clients on how to analyze their challenges and then design, build, test, and deploy changes while maintaining a cloud operating model (e.g., DevOps and ITSM processes and tools).
- Seasoned client-facing consultant/architect who has advised IT executives on cloud operating models (e.g., IT processes and tools) and can craft proposals after initial presales meetings
- Experience working closely with production operations, application developers, system, network, middleware, and database administrators to streamline development, operations, and support processes
- Experience in leading DevOps teams, establishing pipelines for cloud and application development, and managing the velocity, quality, and performance of the cloud and the applications.
- Adept at analysis and problem-solving, preferably with a blend of platform, middleware, network, and software development skills
- Very nice to have: experience with consulting methodologies, knowledge management, and service offering development (to assist in building cloud practice offerings from sales through delivery)
Apply consulting and engineering skills to solve operations problems by:
- Defining and driving initiatives to increase the client's overall application development velocity, quality, and availability
- Building tooling needed to improve DevOps and observability of development and operations performance/efficiency
- Enhancing monitoring and management tooling to better detect, diagnose, and correct problems
- Identifying and resolving defects/problems in the cloud or application code during an incident, when applicable
- Teaming with application developers to support pipelines for new features and incident response automation
- Driving the transformation of delivery methods within operational teams such as network, database, system administration, and incident management
- Enabling an AIOps strategy and roadmap to drive more predictive and automated response
- Conducting root cause analysis (RCA) to find and correct the source of issues and outages
- Ideally, a former developer who knows how to support development with DevOps and SRE automation, including troubleshooting application transactions end to end and identifying critical points of failure or bottlenecks
- DevOps/GitOps understanding with a vision for how to automate analysis, assignments, decisions, and actions to support and operate a platform and application
- Cloud-native dashboarding and alerting (at minimum, familiarity with AWS, GCP, and Azure, with depth in at least one)
- Experience with scalable cloud-native architectures and performance tuning.
- Enjoys solving difficult engineering problems, approaches troubleshooting systematically, and is comfortable getting hands-on to guide engineers and operators
- Great communication and planning experience, ideally with a large consultancy background
- Ability to own all or part of an assessment to develop recommendations and a roadmap
- Solid understanding of ITSM and ITIL principles with a focus on Event, Incident, Problem, Change, and Configuration Management, and the ability to lead maturity assessments
- Nice to have: software engineering skills, ideally with experience in Python, Go, and/or Java
- Understanding of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate them going forward
- Experience implementing high availability and disaster recovery infrastructure in the cloud
- Experience with self-healing infrastructure
- Experience keeping infrastructure within business SLAs and SLOs and managing error budgets
MUST HAVE experience:
- Designing and implementing DevOps pipelines (e.g., Jenkins, Tekton) for Infrastructure as Code (e.g., Terraform)
- Leading a team to develop DevOps/IaC automation: Kubernetes, Terraform, Python, GCP/AWS/Azure
- Strong experience with at least one cloud: GCP, AWS, or Azure (ideally many)
Highly desired hands-on:
- Grafana, ELK Stack, Prometheus, Splunk, and cloud-native tools for alerting and logging
- Observability/APM tools (at least one): Dynatrace, BigPanda, Datadog, or New Relic
- Participation in challenging projects, an opportunity for professional development and growth
- Flexible work hours and a dynamic environment
- Friendly cooperative team and atmosphere
- Medical insurance and other benefits
Grid Dynamics is an engineering services company known for transformative, mission-critical cloud solutions for the retail, finance, and technology sectors. We have architected some of the busiest e-commerce services on the Internet and have never had an outage during the peak season. Founded in 2006 and headquartered in San Ramon, California with offices throughout the world, we focus on big data analytics, scalable omnichannel services, DevOps, and cloud enablement.