Site Reliability Engineer (Remote)
Job Description
Do you enjoy working with a highly motivated and talented team to deliver mission-critical software? Our Site Reliability Engineering team is growing to help build deployment pipelines, manage infrastructure, and troubleshoot and enhance the complex cloud-based service behind our enterprise client installations and our SaaS solution.
As a Site Reliability Engineer, you will design, implement, and deploy cloud-based infrastructure on AWS. Our platform is built on current technologies that let us scale and keep the platform portable. The stack includes Amazon Web Services (EKS, EC2, CloudSearch, Elasticsearch, MSK, etc.), microservice-based infrastructure, RDS (MySQL, PostgreSQL), Istio service mesh, Helm, and Terraform. Your focus will be on maximizing system uptime and service stability and on implementing new features to make our platform better. All team members participate in an on-call rotation.
You will build innovative automated solutions and tools to help debug and resolve problems in production and prevent them from recurring. You will also use monitoring data and trends to proactively find system weaknesses and fix them before they cause production issues.
Responsibilities
· Working closely with the internal Project Managers/Implementation team to ensure the success of each client solution.
· Monitoring our installations to ensure there are no issues, and improving our current monitoring/visibility capabilities.
· Writing, updating, and using documentation, including runbooks/playbooks
· Debugging complex problems across the entire stack, and creating solutions or liaising with the team responsible.
· Improving our automation, including infrastructure needs, testing, failover solutions, and failure mitigation
· Developing CI/CD processes to improve cadence
· Creating internal tools (CLIs, CI, APIs) to enable the Software Engineers.
Key Skills and Attributes
· Commercial experience with Software Engineering, Software Development, or system operations/administration.
· Worked with containers (e.g. Docker) and container orchestration (e.g. Kubernetes)
· Excellent communication skills, both verbal and written
· Experience debugging complex problems
· Worked with one or more programming languages (PHP, Go, Python, Node.js)
· Experience with a variety of databases (MySQL, Postgres, Redis, Elasticsearch)
· Worked with Terraform (Infrastructure as Code)
Preferred
· Experience with DevOps engineering or SRE
· Experience with monitoring and observability tools such as Datadog
· Experience automating infrastructure, testing, and deployments using tools like Ansible, Chef, or Terraform, and the ability to explain the Infrastructure as Code paradigm
Candidates are not expected to have all of the above; we're looking for people who have strong experience in a few of these areas and are willing to learn.
We have a FinTech start-up culture that emphasizes transparency, collaboration, and career growth. Employees can create change at scale and have the opportunity to improve and shape our infrastructure.