Talent.com
Site Reliability Engineer

Taboola, Israel
25 days ago
Job description

Realize your potential by joining the leading performance-driven advertising company!

As a Site Reliability Engineer on the IT Production team in our TLV office, you’ll play a vital role in building robust services and solving infrastructure challenges through automation. You’ll work with cutting-edge technologies and push them to their limits on our mostly on-prem, cloud-like infrastructure.

To thrive in this role, you’ll need:

  • 7 years of experience as an SRE, DevOps Engineer, or System Administrator in a large distributed environment, with a focus on Linux operating systems.
  • Experience supporting, troubleshooting and scaling large distributed systems in production.
  • Deep understanding of the HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.
  • Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).
  • Deep understanding of Linux system internals and system performance tuning.
  • Experience with Configuration Management Tools (Puppet, Ansible, Chef, Terraform).
  • Experience programming in at least one of the following languages (Python, Golang, Rust, Ruby, C++, Java).
  • Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).
  • Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).
  • Experience with containerization technologies (Kubernetes, Docker).
  • Deep understanding of networking principles (TCP/IP, DNS, load balancing).

How you’ll make an impact:

As a Site Reliability Engineer, you’ll:

  • Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.
  • Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).
  • Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.
  • Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.
  • Lead Incident Management: Participate in on-call rotations, lead incident response and conduct root cause analysis to minimize downtime.
  • Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.

Our Tech Stack:

Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
