The Quant DevOps Engineer owns and operates the application engineering platform, cloud infrastructure, and data execution environment that application and modeling workloads depend on.
The role is accountable for the end‑to‑end reliability, scalability, security, performance, and cost efficiency of the platform, supporting large‑scale, data‑intensive and compute‑heavy workloads such as parallel Databricks clusters and high‑memory systems.
This position requires broad and deep technical expertise, strong operational judgment, and the ability to act as a central technical interface between engineering teams, IT operations, security/SecOps, and data users. It is a senior ownership role with significant impact on delivery speed, platform stability, and infrastructure cost.
Platform & Infrastructure Ownership
- End-to-end ownership and delivery accountability for the application execution platform and production model runs (Databricks, servers)
- Day-to-day operational responsibility: availability, incident handling, runtime management, and delivery continuity for business-critical runs, including peak periods such as year-end (YE)
- Manage Azure cloud infrastructure, including:
  - Virtual machines, storage, identity and access management (RBAC)
  - Networking components such as firewalls, peering, and cross‑subscription connectivity
- Lead standardization with central teams across observability, security controls, platform services, and “golden paths,” while keeping delivery running
- Ensure platform reliability, scalability, and long‑term sustainability
CI/CD & Automation
- Design, maintain, and continuously improve CI/CD pipelines using Azure DevOps and related tooling
- Build and evolve automation using scripting and build tools
- Optimize pipeline performance, reliability, and parallel execution to support large‑scale workloads
Container & Runtime Management
- Own Docker image creation, lifecycle management, and governance
- Optimize build processes, caching strategies, and container security
- Support containerized execution environments for compute‑heavy workloads and services
Infrastructure as Code & Configuration Management
- Maintain and evolve Infrastructure as Code using Terraform
- Operate and improve configuration management systems (e.g. SaltStack)
- Reduce configuration drift and improve reproducibility across environments
Observability, Security & Secrets
- Own and operate observability platforms (e.g. ELK, Prometheus, Grafana)
- Ensure meaningful metrics, logs, dashboards, and alerting are in place
- Manage secrets and platform security tooling (e.g. Wiz, Snyk)
- Collaborate closely with Security and SecOps teams on controls, findings, and improvements
Data Platform & Capacity Planning
- Configure and support Databricks usage for application workloads; manage workspace-level configuration and permissions as delegated; partner with the central Databricks lead on global administration and optimization
- Support large‑scale data and compute workloads
- Lead capacity planning for:
  - Highly parallel Databricks clusters (e.g. up to ~10 × 80‑node clusters)
  - Memory‑intensive systems (multi‑terabyte RAM)
  - Data pipelines producing terabytes of data
- Balance performance, reliability, and cost across platform decisions
Operations & Incident Response
- Act as senior escalation point for platform and infrastructure incidents
- Participate in a limited on‑call rotation
- Investigate incidents and execute or coordinate remediation
- Perform manual interventions when automation is insufficient
- Drive post‑incident reviews and platform improvements
Cross‑Team & Organizational Coordination
- Serve as primary technical contact for:
  - IT Operations
  - Security and SecOps
  - Architecture and governance bodies
  - Service management processes
- Coordinate platform‑related work across teams
- Support customer‑facing technical discussions related to platform capabilities and constraints
Experience
- Background in Insurance, Finance, or Scientific / High‑Performance Computing environments
- Strong experience in platform or DevOps engineering within production environments
- Solid expertise in cloud infrastructure, preferably Microsoft Azure
- Hands‑on experience with CI/CD, Infrastructure as Code, and container platforms
- Proven experience operating data‑intensive and compute‑heavy systems
Technical Competencies
- Strong troubleshooting and operational mindset
- Ability to manage and balance competing constraints: cost, performance, security, and reliability
- Deep understanding of platform stability, scalability, and automation
Professional Competencies
- Senior‑level autonomy and decision‑making capability
- Ownership mindset; accountable for outcomes rather than tasks
- Ability to operate as a trusted senior technical interface across teams