Senior Lead SysOps/DevOps Engineer

Integrant Al Qahirah, Egypt

We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while co-creating opportunities with our commercial teams.

The role combines SysOps and HPC depth with DevOps practice. You are expected to spend at least 60% of your time on implementation and technical execution.

What You Will Do

Presales & Business Development
• Partner with sales and solution teams to identify and qualify new opportunities
• Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations
• Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients
• Prepare high-quality technical materials
• Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals

In-Account Delivery: SysOps & DevOps Execution
• Operate directly within client accounts as a senior SysOps/DevOps engineer
• Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on
• Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling
• Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems
• Serve as the senior escalation point for complex operational incidents within accounts

Architecture & Solution Design
• Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments
• Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms
• Recommend and validate technology choices aligned to client scale, budget, and team maturity
• Produce architecture decision records (ADRs), solution blueprints, and technical runbooks

Technical Competencies & Requirements

1. Architecture & System Design
• Design production-grade multi-cluster Kubernetes platforms:
  ◦ RKE2, EKS (AWS), AKS (Azure) at enterprise scale
  ◦ GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools
  ◦ Hybrid cloud + on-premises HPC infrastructure
• Define and document:
  ◦ Workload isolation: namespaces, MIG partitioning, multi-tenancy models
  ◦ Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)
  ◦ Storage: Longhorn, Ceph, distributed and high-throughput file systems

2. Platform Engineering & GitOps Strategy
• Define and enforce platform standards across the delivery lifecycle
• GitOps tooling: ArgoCD, Fleet (declarative cluster management)
• CI/CD pipelines: Azure DevOps, Jenkins (build, test, promote)
• Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible
• Standardize cluster bootstrapping, app deployment lifecycle, and environment promotion (Dev → QA → Prod)

3. AI / GPU Infrastructure Architecture (Priority Competency)
• Design and operate GPU compute platforms at scale:
  ◦ GPU Operator deployment and lifecycle management
  ◦ MIG (Multi-Instance GPU) partitioning for multi-tenant workloads
  ◦ Advanced scheduling: Run:AI, Kubernetes-native GPU scheduling (device plugins)
• Understand AI workload classes and their infrastructure implications:
  ◦ Distributed training workloads (data/model/pipeline parallelism)
  ◦ Inference pipelines: NVIDIA Triton Inference Server, TensorRT optimization
• Align infrastructure to the full AI stack:
  ◦ CUDA stack, cuDNN, NCCL collective communication libraries
  ◦ High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA
  ◦ GPUDirect RDMA / GPUDirect Storage for low-latency data paths

4. Observability & Reliability Engineering
• Define and implement full-stack observability:
  ◦ Metrics: Prometheus, Thanos (long-term retention, multi-cluster)
  ◦ Logs: Loki, Fluent Bit
  ◦ GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems
• Build operational frameworks:
  ◦ SLO / SLA definitions and error budget tracking
  ◦ Alerting strategy: noise reduction, severity routing
  ◦ Incident response playbooks and on-call runbooks

5. Security & Multi-Tenancy Architecture
• Design zero-trust security postures for multi-tenant platforms
• Secret management: HashiCorp Vault, External Secrets Operator
• Identity and access: IAM, RBAC, SSO/OIDC integration
• Network isolation: NetworkPolicy, micro-segmentation, mTLS
• Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement

6. HPC, Data & Storage Architecture (Priority Competency)
• Understand high-performance storage for AI/HPC workloads:
  ◦ GPUDirect Storage: bypassing the CPU for GPU-native I/O
  ◦ Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)
  ◦ Storage tiering, caching strategies, and data lifecycle management
• Size and validate storage architectures against workload I/O profiles

7. Operational Leadership & Linux Systems
• Lead incident response and root cause analysis (RCA) for critical production issues
• Define upgrade strategies, change management procedures, and disaster recovery plans
• Write and maintain runbooks, operational playbooks, and knowledge base content
• Integrate organizational processes, compliance requirements, and security policies into operational frameworks
• Deep Linux expertise:
  ◦ Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)
  ◦ Storage I/O scheduling, NVMe optimization
  ◦ Network stack tuning for RDMA / InfiniBand
  ◦ System performance profiling and bottleneck analysis

Candidate Profile: Who You Are
• You are comfortable running production systems
• You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity
• You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment
• You communicate technical complexity clearly, both to engineers and to C-level stakeholders
• You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations
• You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions
• You thrive in ambiguity and can scope both short POCs and long-horizon platform programs