Enterprise Service Reliability and Insights Lead

ID: 2025-3196
Job Locations: US
Category: Information Technology
Type: Regular Full-Time

Overview

DecisionPoint seeks an Enterprise Service Reliability and Insights Lead to oversee enterprise-wide monitoring, observability, and operational intelligence for a large federal and DoD-aligned IT environment. This senior-level role defines the monitoring strategy, manages toolsets, develops dashboards, establishes alerting thresholds, and ensures service reliability through proactive detection and rapid incident identification.

The Enterprise Service Reliability and Insights Lead is responsible for driving visibility into uptime, system performance, service health, and operational risks. This position partners closely with Tier 2 and Tier 3 engineering teams, cloud operations, cybersecurity, and service desk leadership to ensure monitoring aligns with mission needs, SLAs, and enterprise performance objectives. 

This position is fully remote. 

Note: By applying to this position, you acknowledge and consent to having your resume included in an active competitive government contract bid.

Duties & Responsibilities

The Enterprise Service Reliability and Insights Lead will: 

  • Define, implement, and manage the enterprise monitoring and observability strategy. 
  • Oversee monitoring tools, dashboards, agents, log pipelines, and alerting configurations across all environments. 
  • Establish alert thresholds, escalation criteria, and performance indicators that support proactive issue detection. 
  • Ensure monitoring coverage aligns with uptime, performance, and security requirements. 
  • Collaborate with Tier 2 and Tier 3 engineering teams on system health assessments, log analytics, and incident triage. 
  • Lead efforts to correlate events across application, infrastructure, network, and security monitoring tools. 
  • Deliver actionable insights on system reliability, capacity issues, performance bottlenecks, and incident trends. 
  • Support SLA and KPI measurement, reporting, and compliance tracking. 
  • Maintain monitoring documentation, dashboards, service health definitions, and alerting standards. 
  • Partner with cloud, infrastructure, and cybersecurity teams to ensure observability supports mission and compliance needs. 
  • Recommend improvements to monitoring architectures, event correlation, and automation capabilities. 
  • Participate in incident response activities, root cause analysis sessions, and readiness reviews. 
  • Drive continuous improvement initiatives across reliability engineering and service monitoring. 

Qualifications

Clearance Requirement 

Must hold an active Secret clearance, supported by a Tier 3 background investigation. 

 

Education (Required) 

Bachelor’s degree in Information Technology, Cybersecurity, Systems Engineering, or a related technical field. 

 

Experience (Required) 

  • Minimum 10 years of experience in service reliability, monitoring engineering, IT operations, or systems engineering. 
  • Experience designing or managing enterprise monitoring systems and dashboards. 
  • Experience defining SLAs, KPIs, and operational performance measurements. 
  • Experience collaborating with Tier 2 and Tier 3 teams for incident management and problem resolution. 
  • Experience with log analysis, event correlation, and observability platforms. 

 

Technical Knowledge (Required) 

  • Strong understanding of monitoring and observability tools (metrics, logs, traces). 
  • Knowledge of uptime, performance, and reliability engineering practices. 
  • Familiarity with ITIL v4 processes for incident, problem, and change management. 
  • Understanding of alerting strategies, threshold design, and escalation workflows. 
  • Knowledge of DoD or federal IT operational environments. 

Technical Knowledge (Preferred) 

  • Experience with cloud-native monitoring services and distributed systems monitoring. 
  • Experience with APM tools, SIEM integrations, or event correlation engines. 
  • Familiarity with automation scripting or analytics for monitoring enhancement. 

 

Certifications 

Required: 

  • ITIL v4 Foundation 
  • CompTIA Security+ 

Preferred: 

  • Cloud monitoring certifications (AWS, Azure, or similar) 
  • SRE or observability-related certifications 

 

Skills 

  • Strong analytical skills for interpreting system health and service reliability data. 
  • Excellent communication and reporting skills for executive and technical audiences. 
  • Ability to lead cross-functional coordination during performance events and incidents. 
  • High attention to detail with strong documentation habits. 
  • Ability to drive continuous improvement across monitoring, reliability, and availability functions. 

Our Equal Employment Opportunity Policy

  • EEO and Affirmative Action Policy: DecisionPoint Corporation is an Equal Employment Opportunity and Affirmative Action employer. It is the policy of DecisionPoint Corporation to provide equal employment opportunity in accordance with all applicable Equal Employment Opportunity/Affirmative Action laws, directives and regulations to all employees and qualified applicants without regard to race, ethnicity, color, religion, national origin, sex, age, disability status, pregnancy, sexual orientation, gender identity, genetic information, protected veteran status, or any other protected status under Federal, State or Local laws.
  • Pay Transparency Policy: In accordance with Presidential Executive Order 13665, DecisionPoint Corporation will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by the employer, or (c) consistent with the contractor's legal duty to furnish information.
  • Authorization to Share Resume and Personal Information: By expressing your interest and submitting your resume for this position, you authorize DecisionPoint Corporation to share your resume, as well as personal information included on the resume, with its subsidiaries, affiliates and teaming partners for the purpose of considering you for this position and other available positions requiring comparable skills, education and experience. Should DecisionPoint Corporation or its affiliates and teaming partners wish to initiate pre-employment discussions, you will be asked to complete an employment application and related employment documents.
