Job Description:

Observability Engineer / Application Performance Engineer

Client’s Enterprise Data Machine Learning (EDML) employs innovative minds like yours to design and develop software systems that can meet the demands of our ever-growing customer base. Like a startup inside an enterprise, EDML takes a customer-centric approach to building our product so that we can have data-driven conversations with our customers.

As one of the Observability Engineers, you’ll work closely with customers, product management, and other subject matter experts in the technology industry to drive forward solutions that have an immediate impact on the day-to-day ability of data scientists and machine learning engineers to productionize their models. You’ll do this by iteratively improving how we instrument our software systems for observability, which in turn helps enable infrastructure automation, product analysis, and application performance.

Our team culture lets you work on a highly collaborative team of professionals in a horizontal job role. The role calls for excellent writing and communication skills to ensure that instrumentation, practices, processes, and tools are documented and released in a way that helps others understand the stability and functionality of each application.

In this role, you’ll work directly with other SRE and development team members to help create an SDK and API service that offers highly personalized observability, is easy for other developers to onboard and instrument, and makes synthesizing the state of distributed software systems simple.

To do this, you’ll need to draw on your deep knowledge of telemetry and the software development lifecycle, as well as previous experience creating workflow-related software products.

What You'll Do

- Collaborate closely with AI/ML Engineers to identify monitoring data gaps and recommend instrumentation improvements that support building and training production-grade ML models on large-scale datasets to solve various business use cases.

- Work with multiple teams to provide custom New Relic dashboards and alerts that monitor our product usage, application performance, Kubernetes cluster health, and cloud resource utilization in several AWS cloud environments.

- Integrate application instrumentation with CNCF projects like Argo CD and various AWS services for proactive, protective, and predictive monitoring that enables self-healing across different monitoring tools.

- Collaborate with other stakeholders, including ML platform teams and financial services, by participating in agile ceremonies and technical design conversations in a way that drives product direction, with a clear understanding of the product roadmap and potential risks.

- Educate application developers on the best way to understand the runtime state of a product effectively through telemetry.

Basic Qualifications

- 2+ years of working experience as a software developer on projects related to SDKs, APIs, or application performance, or other relevant engineering experience.

- Basic understanding of Kubernetes concepts and of deploying applications onto a cluster.

- Proficiency and hands-on experience with Python and associated frameworks.

- Experience provisioning cloud infrastructure inside an enterprise AWS multi-account cloud environment.

- Deep expertise in instrumentation, customization, and usage of modern monitoring tools such as New Relic, CNCF OpenTelemetry, Prometheus, PagerDuty, and AWS CloudWatch.

- Experience in implementing enterprise-level observability strategies, including best practices on telemetry (events, metrics, logs, traces) for cloud applications and Kubernetes clusters.

Desired Skills/Experience

- Strong working knowledge of modern development technologies and tools such as Argo CD, Git, Terraform, and Jenkins.

- Experience working with end-to-end pipelines using frameworks like Prefect, Streamz, and Apache Airflow is preferred. Building and maintaining various components of an ML Engineering pipeline is a big plus.

- Excellent technical writing, verbal communication, and teamwork skills.

Desired Credentials

- Python Institute Certifications

- New Relic Programmability Certified

- New Relic Full Stack Observability Practitioner

- AWS Certified Solutions Architect

- Certified Kubernetes Application Developer (CKAD) or Certified Kubernetes Administrator (CKA)
