Job description
We’re looking for a DevOps/Site Reliability Engineer to join our client’s Platform Engineering team. In this role, you’ll build and administer the solutions and infrastructure that ensure our cloud service maintains availability, scalability, performance and
What you’re responsible for:
● Perform and automate system administration services including installation, configuration, maintenance, and disaster recovery.
● Control EC2 instance lifecycle and other AWS resources using Infrastructure-as-Code tools such as Terraform
● Respond to production incidents: troubleshoot, resolve, and document them. (Must be comfortable being on call according to the PagerDuty schedule.)
● Configure, test, deploy, and upgrade software for production EC2 servers in AWS
● Author and recommend settings for applications, operating systems, networks, and cloud services to improve performance, security, and reliability
● Set up and manage authentication, access control, and network security
● Develop plans and perform routine maintenance tasks for infrastructure systems such as patch management and application hotfixes
● Generate vulnerability management reports with all relevant actions and information
● Contribute to internal wiki with technical documentation, manuals and IT policies
● Participate in the design, implementation, and execution of backup and disaster recovery plans for infrastructure solutions
● Work closely with App Devs to build out CI/CD pipelines and workflows.
● Creatively solve new complex problems while trying to keep the solution simple.
● Reduce build, deploy & rollback times while simultaneously reducing risk and exposure.
The qualifications you’ll need:
● Obsessive focus on customer satisfaction and attention to detail in your work
● A strong desire to learn and contribute across a variety of technologies and disciplines
● Love of working in a fast-moving, constantly changing team environment
● Ability to quickly diagnose and solve problems collaboratively
● Proven experience as a System Administrator, Network Administrator or similar role required
● Expert knowledge of Linux
● Understanding of AWS security concepts including VPC, VPN, KMS, and IAM to protect data at-rest and in-transit via encryption highly desirable
● Ability to create scripts in Python, Go, or other modern language required
● Proven understanding of TCP/IP, DNS, DHCP, NAT, routing, and related networking protocols and technologies, as well as security protocols (SSH, HTTPS, IPsec, etc.)
● Experience using Git or a similar code repository required
● Knowledge of workflow tools (e.g., GitHub, Jenkins, cloud builders)
● Experience in 24×7 production operations, preferably supporting a highly available environment for a SaaS or cloud service provider
● Experience with HashiCorp’s Terraform and Packer tools a big plus.
● Strong analytical and problem-solving skills, including the ability to quickly identify trends and patterns, find the root causes of problems, and propose candidate solutions.
● Working knowledge of containers (Docker, Kubernetes, ECR, etc.) a plus.
● Experience with passive evaluations such as compliance audits and active evaluations such as vulnerability assessments a plus
● Able to participate in 24×7 on-call rotation
● Excellent communication skills and willingness to take initiative
Job details