Site Reliability Engineer
About the Role
At its heart, this role is about understanding what our people (users, developers, customers) need from our systems and setting up the processes and structures that better meet those needs. By working across organizational boundaries, you will drive transformation and innovation in reliability and efficiency. You will balance the risk of unavailability against the goals of rapid innovation and efficient service operations so that overall happiness is optimized.
Today, our CTO and senior engineers manage our hosting infrastructure. We want to build a team focused on enabling our cross-functional squads to manage their own deploys using the "paved roads" of platform capabilities that this newly formed team will be responsible for building. Early on, a key part of your responsibilities will be to work closely with our CTO, Director of Engineering and Director of Product to understand our business. You will then recommend and implement the tooling, operational practices and infrastructure that will allow our product teams to continue scaling in an intentional manner. As the team grows, your focus will shift toward maintaining the platform, helping teams resolve infrastructure issues, keeping us current with our cloud computing technology, and advising product teams on the best ways to achieve their goals and take ownership of the application code they create.
We are looking for an individual with an automation-first, automate-everything mindset who is familiar with modern, distributed environments and infrastructure-as-code. The ideal candidate will have an ownership mentality and be willing to see issues through to resolution, while also considering which responsibilities should be federated out to product teams. We will ask that you partner with these groups of engineers and product managers to ensure you are not a bottleneck to innovation.
No individual knows everything, so we are seeking a T-shaped engineer for this position. We believe that attitude, communication skills and tenacity will be the defining attributes that contribute to this individual's success.
- Understanding of and experience with the key concepts behind modern, distributed hosting for high-traffic web systems, including DNS, networking, caching, load balancing, etc.
- Experience with one or more development languages (Ruby, Python, etc.) - We use Ruby heavily in our organization so Ruby experience is a plus
- Expertise with AWS
- Experience with Terraform, Ansible
- Containerization and orchestration (ECS, K8s)
- Experience with CDNs - We use Cloudflare heavily
- Configuration management
- Observability and alerting across infrastructure and applications
- Server and application logging
- Shell scripting
- Database experience - We use MySQL and PostgreSQL
- Knowledge of CI/CD tools and pipeline operations
- Agile/Scrum workflows
- Familiarity with code versioning systems. We use Git and GitHub.
- Familiarity with issue management software. We use Jira.
- Lead the planning, design, development, and implementation of platform solutions related to web application deployment, security, scalability, and application health.
- Work within a budget, and help develop forecasts for new budget requirements involving infrastructure.
- Lead integrations, CI/CD pipelines, container orchestrations and virtual machine configuration within an Amazon Web Services (AWS) cloud environment.
- Maintain and improve staging and production environments for the suite of sites that enable the Eezy.com ecosystem. Keep important, revenue-critical systems up and running.
- Exhibit strong critical thinking skills, with the ability to assess and digest a variety of team and company requirements and make the decision that is best for everyone.
- Triage and resolve production issues. You love bugs and edge cases because they point out areas where our systems may be overcomplicated or have gaps. You then work to ensure nobody can break the system in the same way again.
- Lead and conduct root cause analysis for issues and outages.
- Ability to describe infrastructure and systems to both technical and non-technical stakeholders.
- Ability to partner with diverse stakeholder groups - from managers to individual contributors - with the goal of driving key initiatives around code quality, monitoring, security and scalability.
- Work with managers to establish healthy boundaries between developers, test engineers and web system engineers that ensure development teams take ownership of the applications they create and features that they ship. This will involve education and the creation or purchase of tools or processes to that end. You will be asked to participate in vendor evaluation with the goal of helping us to grow our infrastructure ecosystem responsibly in a manner that enables ongoing team growth and scaling.
- Create backup mechanisms for production environments and test disaster recovery mechanisms.
At Eezy, we believe that life is too short to work with jerks. We believe that by empowering teams and providing clear, kind and consistent feedback, we can create an environment that supports our engineering team and embodies our G.A.R.D.E.N. core values.