Role Description
Givebutter is hiring a Site Reliability Team Lead to oversee the reliability, scalability, and performance of our systems. As a Lead SRE, you will be directly responsible for delivering world-class infrastructure to our users, maturing our operational practices, and leading a team of skilled engineers. You will report directly to our CTO and carry out our infrastructure vision while creating a scalable engineering culture that breeds innovation. You will ensure we deliver excellent user experiences in a timely manner while maintaining top-notch security, design, and performance. You will cultivate a culture of high performance by creating systems that eliminate roadblocks and processes that incentivize excellence, and by being an expert in site reliability engineering. We have already built a great foundation, powering hundreds of millions of donations to more than 10k organizations, and you will take this impact much further.
Why join the Givebutter Engineering team?
Democracy of code - We are a group of engineers that values equal contribution and open discussion of architecture and ideas.
Not overburdened with meetings - Our engineers manage their own calendars and block off time so they can work uninterrupted.
Automated CI/CD - Our builds are reproducible and the pipeline is easy to manage. Shipping to production is hands-off, automated, and consistent. Our engineers are focused on solving problems with code.
Mission-driven, full stop - We work with amazing organizations, non-profits, and charities doing good all over the world.
Responsibilities
Manage and hire in-house SREs and contractor resources.
Handle and prioritize incidents, ensuring timely resolution and effective communication.
Establish and manage key metrics for reliability; set up and maintain alerting systems.
Automate tasks and manage infrastructure using Infrastructure as Code (IaC) tools and techniques.
Ensure application scalability and identify performance bottlenecks to optimize system performance.
Design and implement fault-tolerant and highly available systems to minimize downtime.
Develop, implement, and regularly test disaster recovery plans to ensure business continuity.
Conduct capacity planning to anticipate and manage future infrastructure needs.
Define, measure, and maintain SLOs and SLAs to meet service performance expectations.
Ensure the security of applications through best practices and conduct regular penetration tests to identify and mitigate vulnerabilities.
Requirements
5+ years of experience building and deploying production infrastructure at scale
5+ years of experience working with AWS
Knowledge of PHP
Aware of trends and best practices in SRE and cloud infrastructure
2+ years of experience managing system architecture, ensuring best practices for reliability, performance, and security
Strong technical leadership, mentorship, and communication skills
Experience working for a product-led growth company is beneficial
Experience managing a remote engineering team