Realize your potential by joining the leading performance-driven advertising company!
As a Site Reliability Engineer on the IT Production team in our TLV Office, you’ll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.
To thrive in this role, you’ll need:
- 7 years of experience as an SRE, DevOps Engineer or System Administrator in a large distributed environment with a focus on Linux operating systems.
- Experience supporting, troubleshooting and scaling large distributed systems in production.
- Deep understanding of the HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
- Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
- Deep understanding of Linux system internals and system performance tuning.
- Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
- Experience programming in at least one of the following languages: Python, Golang, Rust, Ruby, C++, Java.
- Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
- Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
- Experience with containerization technologies (Kubernetes, Docker).
- Deep understanding of networking principles (TCP/IP, DNS, load balancing).
How you’ll make an impact:
As a Site Reliability Engineer, you’ll bring value in the following ways:
- Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
- Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).
- Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
- Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
- Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.
- Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.
Our Tech Stack:
Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.