Lead Site Reliability Engineer

at Movable Ink (Coherent Path), Toronto
Customers don’t experience data; they experience content. Movable Ink activates any data into personalized content in any customer engagement. The world’s most innovative brands rely on Movable Ink to accelerate their marketing performance. Headquartered in New York City, Movable Ink and its more than 500 employees serve a global client base from operations throughout North America, Central America, Europe, Australia, and Japan.

Coherent Path is the email marketing calendar company for top retailers that are seeking to transform their email program into a modern, data-driven channel focused on revenue.

Our machine-learning solution tells retailers which themes and categories to feature in today’s campaigns while continuously learning to inform the campaigns of tomorrow. By creating an optimized email diet that caters to each customer's evolving tastes and moods, Coherent Path’s software helps retailers quickly engage with and cross-sell to customers and promote strategic product categories while reducing email fatigue. Coherent Path has offices in Boston and Toronto.

We are looking for a talented Lead Site Reliability Engineer to take ownership of all things infrastructure and deliver a highly scalable, performant, and available platform that our portfolio of applications can rely on. There’s lots to do and lots to learn, so we hope you are also a fast learner who can grow with Coherent Path as we build the future of email marketing!

Responsibilities

  • Own SLOs/SLIs across all services and applications to provide metrics to the development teams and facilitate continuous improvement
  • Work with the engineering team to improve the CI/CD pipeline, with a focus on successful deployment of services and applications
  • Drive infrastructure improvements and ensure that all infrastructure can be reproduced consistently with Terraform
  • Maintain Incident Playbooks and ensure that a consistent process is followed to guarantee a rapid response
  • Enforce regular infrastructure security audits and drive automation where appropriate
  • Continuously improve user experience as it relates to deployment and delivery
  • Optimize production and lower environments, along with their infrastructure, through monitoring and automation
  • Drive platform management and capacity planning discussions
  • Assist with setup and deployment of new services as needed
  • Relentlessly eliminate false positive alerts
  • Perform application load and scalability testing
  • Participate in an on-call rotation to provide rapid response to critical issues in production

Requirements

  • 2+ years in a lead role
  • 4-7 years in an SRE or related role
  • Intellectual curiosity and a strong desire to learn
  • Problem solving skills, including the ability to disaggregate complex problems and incrementally implement solutions
  • Strong communication skills for leading post-incident reviews and writing client-facing communications
  • A passion to efficiently support always-available applications
  • Able to multitask, prioritize, and manage time efficiently
  • Ability to write and review application code in Python, TypeScript, and JavaScript
  • Experience with Django web framework
  • Experience with configuration management and infrastructure deployment using Terraform
  • Experience with monitoring and visualization tools like Prometheus and Grafana
  • Experience deploying, logging, monitoring, and securing services on cloud providers such as GCP and AWS
  • Experience with containerization and deployment automation tools: Docker, Kubernetes
  • Experience writing, maintaining, optimizing CI/CD pipelines
  • Experience with databases
  • Experience with Linux
  • Experience using Git

Nice to Have

  • Experience setting up an Application Performance Monitoring (APM) tool (New Relic, Datadog, Splunk, Dynatrace, etc.)
  • Ability to write and review application code in Elixir