
Site Reliability Engineer

Taboola, Israel

Posted 25 days ago

Job Description

Realize your potential by joining the leading performance-driven advertising company!

As a Site Reliability Engineer on the IT Production team in our TLV office, you’ll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.

To thrive in this role, you’ll need:

7 years of experience as an SRE, DevOps Engineer, or System Administrator in a large distributed environment, with a focus on Linux operating systems.
Experience supporting, troubleshooting and scaling large distributed systems in production.
Deep understanding of the HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
Deep understanding of Linux system internals and system performance tuning.
Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
Programming experience in at least one of the following languages: Python, Golang, Rust, Ruby, C++, Java.
Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
Experience with containerization technologies (Kubernetes, Docker).
Deep understanding of networking principles (TCP/IP, DNS, load balancing).

How you’ll make an impact:

As a Site Reliability Engineer, you’ll:

Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premises, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.

Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using infrastructure as code (IaC) and scripting languages (e.g., Python, Go, Rust).

Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.

Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.

Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.

Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.

Our Tech Stack:

Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
