Site Reliability Engineer

Technology - Cloud Engineering & Operations, Pleasanton, California


Description

Responsibilities:

  • Create and maintain a continuous testing framework that observes, records, and trends real-time availability data for all of our clients
  • Develop and maintain on-premises and cloud capacity plans that ensure we are delivering a BlackLine service that is performant and cost-effective
  • Collaborate with development and other technology teams on requirements definition, capacity planning, and process refinement
  • Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities in existing code to address application availability, performance, observability, efficiency, and security challenges.
  • Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
  • Establish and maintain Key Performance Indicators for the overall health of the service and build tools to exercise and evaluate whether these KPIs are being met.
  • Work cross-functionally with other teams to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
  • Transform discoveries into requests to others or action items for you and your team.
  • Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve.
  • Own and evolve the BlackLine trust site to include real-time availability and performance information
  • Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
  • Assess, test, track, predict, and report on the performance, responsiveness, capacity, and availability of a suite of production applications.
  • Publish performance findings, conclusions, and recommendations
  • Produce second-tier analysis of capacity constraints and performance, and discuss it with the development and infrastructure teams
  • Support integration of performance data into customer experience analytics tools and reporting
  • Ensure application and infrastructure capacity management efforts have verifiable capacity data to support business cases
  • Monitor industry trends and keep abreast of new tools and technologies.
  • Participate in our on-call rotation and conduct incident reviews
  • Other duties as assigned


Requirements:

  • BS or MS in Computer Science (or equivalent diploma and/or certifications) with 3-5 years of related experience.
  • Intermediate to advanced knowledge of at least one of the following programming languages: C#, Visual Basic, PowerShell, Java, Go, Linux Shell, Ruby.
  • Demonstrated history of developing or operating production web applications and a solid understanding of HTTP(S), HTML, JavaScript, CSS, and XML.
  • Knowledge of software development best practices and SDLC.
  • Experience deploying high availability systems and software.
  • Experience with troubleshooting distributed web applications in a production environment.
  • Intermediate-level knowledge of IIS and Windows Server or Linux and Apache.
  • Experience with infrastructure as code and platform as a service.
  • Experience with configuration management tools (e.g., Chef, Ansible, Puppet).
  • Must possess the ability to handle multiple goals concurrently and function in a fast-paced, demanding, ever-changing, high-growth environment
  • Must maintain the highest level of integrity, courtesy and respect while interacting with internal customers, employees and business contacts
  • Excellent oral and written communication skills
  • Ability to interface with internal technical experts using professional interpersonal skills
  • Experience analyzing datasets to draw conclusions and graphing the data that supports those conclusions
  • Exhibit creative problem-solving, logical troubleshooting and analytical skills
  • Basic proficiency in application load balancing methods (F5 LTM, Windows NLB, etc.)
  • Working knowledge of TCP/IP and networking concepts
  • Proficiency with statistical concepts: confidence intervals, hypothesis testing, sampling
  • Understanding of operating system concepts such as CPU, memory, and disk queues, and experience graphing and analyzing these metrics over time
  • Must possess strong organizational skills and be able to work with minimal oversight
  • Ability to understand new technologies quickly and adapt these into daily work and goals


Preferred Requirements:

  • Prior C#, ASP.NET, Ruby, Go or Java development experience, preferably in an agile SaaS environment
  • Significant experience with open source platforms and technologies.
  • Experience with software development processes and methodologies
  • Track record of architecting, developing, and implementing robust, distributed online solutions