Enterprise Service Management Office, Incident Management Standards
Purpose
This document is for all university organizations operating or managing IT infrastructure and defines the minimum standard for managing IT Incidents.
Goal of Incident Management
The goal of Incident Management is to minimize the negative impact of IT Incidents by restoring normal service operations as quickly as possible. Resolving disruptions efficiently protects the university’s core operations, including teaching, learning, research, and essential administrative functions, by maintaining continuity and stability across the institution.
Incident Management Standards
Discovery, Logging, and Prioritization
- Technical Support teams must log all IT Incidents in TeamDynamix (TDX) upon discovery.
- Incident records must contain the following minimum information (as defined in the Incident Documentation Requirements) when initially logged:
  - Title
  - Affected Service
  - Description
  - Incident Priority
- Technical Support teams must prioritize all incidents using the campus prioritization scale.
- When multiple users report the same incident, Service Desk or user support teams must create a separate Incident Ticket in TDX for each user report. Then link all related tickets to the primary Incident Record using TDX’s parent/child relationship feature.
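The minimum-information requirement above can be illustrated with a small check. This is a minimal sketch only: the dictionary keys and the `missing_required_fields` helper are hypothetical names chosen for illustration, the sample values are invented, and nothing here uses or represents the TDX API.

```python
# Hypothetical field keys mirroring the four minimum fields listed above.
REQUIRED_FIELDS = ("title", "affected_service", "description", "incident_priority")


def missing_required_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in a draft record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Invented sample data: description is blank and priority is not set.
draft = {
    "title": "Wi-Fi outage",
    "affected_service": "Campus Wi-Fi",
    "description": " ",
}
print(missing_required_fields(draft))  # ['description', 'incident_priority']
```

A check like this could run before a record is saved, so incomplete records are caught at logging time rather than during later reporting.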
Escalation and Work Progress
- Technical Support teams must document escalation protocols for service offerings in a location accessible to the Service Desk.
- The Service Desk must follow the documented escalation protocols when handling reports of a potential incident.
- Technical Support teams must document all incident resolution steps taken in the Incident Record.
- Update the Incident status on the Incident Record as it changes throughout the life of the incident (e.g., In Process, Resolved, Closed).
- If an incident involves suspected security compromise, it must be categorized as a cybersecurity incident and immediately escalated per security procedures.
- Cybersecurity incidents are a subset of incidents, but they require different handling, escalation, and governance (often via a Security or SOC function).
Incident Communications
- Unit Communication Plans Required. Each campus IT unit must maintain a documented Incident Communication Plan that identifies communication roles, approval paths (if applicable), and internal workflows. Unit plans must align with campus Incident Management standards and support timely updates to Status at Illinois or alternative communications channels.
- Initial Notification: For incidents affecting Enterprise IT Service Offerings, the responsible technical team must post an initial notification to Status at Illinois as soon as the incident is detected, even if full details are not yet known. Early awareness is required to set user expectations and reduce duplicate reporting.
- Priority-Based Update Cadence: Technical teams must provide ongoing updates in Status at Illinois according to the established communication and resolution targets for the incident priority level. Updates should continue at the defined intervals until the incident is resolved.
- Resolution Communication: When service is restored, the responsible team must publish a resolution update in Status at Illinois. The resolution notice should clearly state that service has been restored and provide any relevant user guidance (for example, whether any user action is needed to use the affected service).
- Required Content and Audience Clarity: All incident communications must:
  - Identify the affected service by its official service name
  - Describe the issue using clear, non-technical language appropriate for a broad campus audience
  - Provide the expected timing of the next update
  - Avoid internal troubleshooting details that do not aid user understanding
- Communications should be concise, transparent, and written for end users rather than technical staff.
- Incidents that do not affect Enterprise IT Service Offerings or that impact only a small, defined user group must still be communicated through a predetermined, unit-approved channel (for example: unit email lists, Teams channels, local signage, or department websites), rather than posting to Status at Illinois. This standard ensures affected users receive timely information while preserving Status at Illinois for incidents with broader campus impact.
- Cybersecurity incidents may override standard SLAs and workflows due to legal, regulatory, and risk considerations. See "Report a Cybersecurity Incident" (Privacy & Cybersecurity, Office of the Chief Information Officer).
Resolution
- Once service has been restored, update the incident record status and describe how the incident was resolved. Include links to relevant Knowledge Base articles and related records (changes, problems, incidents).
- Communicate that the incident has been resolved according to the incident communications standards.
- Ensure that any user tickets associated with the incident are closed.
Post-Incident Reviews (PIR)
- Conduct PIRs for all critical and high-priority incidents. PIRs support continual improvement.
- Technical Support teams must document the actions taken to resolve the incident, the root cause of the incident (if known), and any lessons learned or improvement opportunities identified during the PIR.
- Track follow-up actions arising from PIRs, including assigned owners and target completion dates, to ensure they are completed.
Incident Record Requirements
The following information must be included in every Incident Record. Units may choose to collect additional information if needed.
| Data Point | Definition | Purpose |
|---|---|---|
| Start Timestamp | The time the incident began. | Average incident duration |
| Discovery Timestamp | The time the incident was discovered. | Average discovery time |
| Response Timestamp | The time of the first action taken. | Average response time |
| Resolution Timestamp | The time that the incident was resolved. | Average resolution time |
| Status | Current lifecycle state (e.g., In Progress, Resolved, Closed). | Tracks incident progression |
| Affected Users | The users impacted by the incident. | User impact metrics and prioritization |
| Responsible | The group or technician currently assigned to the incident. | Identifies responsible parties throughout incident lifecycle |
| Title | Short name to refer to the incident. | Meaningful way to refer to incident |
| Description | Clear summary of symptoms, impact, and relevant context. | Captures vital information about the incident and response |
| Affected Service | The IT service offering experiencing the unplanned interruption or degradation. | Enables service-based reporting and trend analysis |
| Incident Priority | Priority based on the campus Prioritization Scale. | Incident response context and baseline metrics |
| Actions Taken | High-level summary of actions taken to troubleshoot and resolve the incident. | Assists with escalations and incident review |
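As an illustration of how the four timestamps could feed the averages named in the Purpose column, here is a minimal sketch. The interval definitions are assumptions, not part of this standard: discovery time is taken as discovery minus start, response time as response minus discovery, resolution time as resolution minus discovery, and incident duration as resolution minus start. The class and function names are hypothetical, and the sample timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimestamps:
    """The four timestamps required on every Incident Record."""
    start: datetime
    discovery: datetime
    response: datetime
    resolution: datetime

    # Interval definitions below are assumptions made for this sketch.
    def discovery_time(self) -> timedelta:
        return self.discovery - self.start

    def response_time(self) -> timedelta:
        return self.response - self.discovery

    def resolution_time(self) -> timedelta:
        return self.resolution - self.discovery

    def duration(self) -> timedelta:
        return self.resolution - self.start


def average_minutes(deltas) -> float:
    """Average a collection of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# Invented sample records.
records = [
    IncidentTimestamps(
        start=datetime(2024, 3, 1, 9, 0),
        discovery=datetime(2024, 3, 1, 9, 10),
        response=datetime(2024, 3, 1, 9, 15),
        resolution=datetime(2024, 3, 1, 10, 0),
    ),
    IncidentTimestamps(
        start=datetime(2024, 3, 4, 14, 0),
        discovery=datetime(2024, 3, 4, 14, 30),
        response=datetime(2024, 3, 4, 14, 35),
        resolution=datetime(2024, 3, 4, 15, 0),
    ),
]
print(average_minutes(r.discovery_time() for r in records))  # 20.0
print(average_minutes(r.duration() for r in records))        # 60.0
```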
Incident Prioritization Scale
Incidents are prioritized using the campus scale, which allows campus-wide incident management reporting. Incident priority is determined by impact and urgency, not solely by the number of users affected.
| Priority | User Impact | Academic/Business Impact | Examples |
|---|---|---|---|
| Critical | Campus-wide impact | Incident prevents teaching, research, or essential business operations | |
| High | Broad impact (multiple users, departments, or buildings) | Incident significantly interferes with teaching, research, or essential business operations, but work may continue in a degraded state | |
| Medium | Limited impact (single user or small group) | Teaching, research, or business operations are impacted but can continue with a reasonable workaround | |
| Low | Minimal or no immediate user impact | No significant disruption to operations; issue is cosmetic, informational, or deferred maintenance | |
Major Incident Trigger
- Critical incidents will be evaluated for Major Incident activation upon identification.
- The Major Incident Response Plan is initiated when an incident is determined to meet defined criteria for significant business impact, widespread service disruption, or coordinated cross-team response.
- Activation of the Major Incident Response Plan follows the procedures and designated roles outlined in the Major Incident Response Plan documentation.
Key points:
- Single-user incidents are typically Medium unless the issue is minor, cosmetic, or has no meaningful impact.
- Low priority incidents may not impact users immediately but represent minor issues, cosmetic defects, or risks that should be addressed before they cause disruption.
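The prioritization scale and the Major Incident trigger above could be sketched as a lookup. This is an illustration under stated assumptions, not the campus rule: the short impact labels (`"campus-wide"`, `"degraded"`, and so on) are hypothetical shorthand for the table rows, and taking the higher of the two ratings when user impact and academic/business impact disagree is an assumption that this standard does not state. Final priority remains a judgment based on impact and urgency.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical shorthand labels for the "User Impact" column.
USER_IMPACT = {
    "minimal": Priority.LOW,
    "limited": Priority.MEDIUM,
    "broad": Priority.HIGH,
    "campus-wide": Priority.CRITICAL,
}

# Hypothetical shorthand labels for the "Academic/Business Impact" column.
OPERATIONAL_IMPACT = {
    "none": Priority.LOW,
    "workaround": Priority.MEDIUM,
    "degraded": Priority.HIGH,
    "prevented": Priority.CRITICAL,
}


def suggest_priority(user_impact: str, operational_impact: str) -> Priority:
    """Suggest a priority; assumes the higher of the two ratings wins."""
    return max(USER_IMPACT[user_impact], OPERATIONAL_IMPACT[operational_impact])


def needs_major_incident_evaluation(priority: Priority) -> bool:
    """Critical incidents are evaluated for Major Incident activation."""
    return priority == Priority.CRITICAL


# A small group with a reasonable workaround is suggested as Medium.
print(suggest_priority("limited", "workaround").name)  # MEDIUM
# A small group whose essential work is fully blocked rates higher under this assumption.
print(suggest_priority("limited", "prevented").name)   # CRITICAL
```

The second call reflects the point that priority is not determined solely by the number of users affected.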
Communication and Resolution Targets
| Incident Priority | Communication Target (communicate as soon as you have credible confirmation, even if details are minimal) | Resolution Target |
|---|---|---|
| Critical | | 4 hours |
| High | | 8 hours |
| Medium | | 2 business days |
| Low | | 5 business days |
Key Point: Resolution targets apply to time under the control of the assigned support team. Time awaiting vendor or third-party action may be excluded when appropriately documented.
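The resolution targets and the vendor-time exclusion above can be worked through in a short sketch. Several points are assumptions made for illustration and are not defined by this standard: the clock is taken to start at discovery, a business day is treated as Monday through Friday with holidays ignored, and documented vendor or third-party wait time is simply subtracted from the elapsed time. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Resolution targets taken from the table above.
TARGET_HOURS = {"Critical": 4, "High": 8}
TARGET_BUSINESS_DAYS = {"Medium": 2, "Low": 5}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturday and Sunday.

    Holidays are ignored in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def resolution_deadline(priority: str, discovered: datetime) -> datetime:
    """Deadline for a priority, assuming the clock starts at discovery."""
    if priority in TARGET_HOURS:
        return discovered + timedelta(hours=TARGET_HOURS[priority])
    return add_business_days(discovered, TARGET_BUSINESS_DAYS[priority])


def met_target(priority: str, discovered: datetime, resolved: datetime,
               excluded_wait: timedelta = timedelta(0)) -> bool:
    """Check the target after subtracting documented vendor/third-party wait."""
    return resolved - excluded_wait <= resolution_deadline(priority, discovered)


discovered = datetime(2024, 3, 1, 9, 0)  # a Friday
print(resolution_deadline("Critical", discovered))  # 2024-03-01 13:00:00
print(resolution_deadline("Medium", discovered))    # 2024-03-05 09:00:00

# A High incident resolved after 10 hours, 3 of which were spent awaiting a vendor.
resolved = discovered + timedelta(hours=10)
print(met_target("High", discovered, resolved))                              # False
print(met_target("High", discovered, resolved, timedelta(hours=3)))          # True
```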
Measurement and Reporting
The following metrics are required for Incident reporting. Note: In these reports, an Incident refers to the actual service-affecting event, while an Incident Record refers to the ticket created in the ITSM system to track that event. Metrics are calculated based on Incident Records, unless otherwise specified.
Information Technology Infrastructure Library (ITIL) Maturity Model
The ITIL Maturity Model is a tool that organizations can use to objectively and comprehensively assess their service management capabilities and the maturity of their Service Value System.
| Level | Description |
|---|---|
| Level 1 | The practice is not well organized; it is performed as initial/intuitive. It may occasionally or partially achieve its purpose through an incomplete set of activities. |
| Level 2 | The practice systematically achieves its purpose through a basic set of activities supported by specialized resources. |
| Level 3 | The practice is well defined and achieves its purpose in an organized way, using dedicated resources and relying on inputs from other practices that are integrated into a service management system. |
| Level 4 | The practice achieves its purpose in a highly organized way, and its performance is continually measured and assessed in the context of the service management system. |
| Level 5 | The practice is continually improving organizational capabilities associated with its purpose. |
Terminology
Incident
Any unplanned interruption to an IT Service offering or a reduction in the quality of an IT Service. If a service is not operating at the level of performance agreed upon in the service level agreement (SLA) or as defined by the service provider, it constitutes an incident and must be logged.
Incident Record
A documented set of data containing all details and the history of an incident—from initial reporting to resolution and closure—used to manage the incident lifecycle, track progress, and provide data for future improvement.
User
An individual who uses the IT services provided.
Incident Manager
The Incident Manager coordinates the end-to-end handling of Incidents, ensuring effective collaboration, clear communication, and timely resolution. They monitor Incident activity, lead reviews, and drive continual improvement of Incident processes, models, and practices.
Service Desk
First point of contact for users reporting incidents. Responsible for logging, categorizing, and prioritizing incidents, and providing initial diagnosis or resolution where possible.
Technical Support
Provides specialized investigation and resolution for incidents escalated from the Service Desk or detected by technical teams. Responsible for implementing fixes and collaborating with other support groups as needed.
Campus Unit ITSM Liaison
Ensures that campus-wide Incident Management processes are understood, adopted, and consistently followed within their unit, and serves as the point of coordination between the unit and the central ITSM governance group.
Post-Incident Review (PIR)
A post-incident activity that analyzes past incidents to identify trends, root causes, and process improvements.
Incident Prioritization
The process of determining the relative importance of an incident by assessing its impact (scope of damage) and urgency (speed required for resolution).
Business Impact
Evaluates the incident's effect on business processes, revenue, and compliance. It determines how critical the incident is to overall organizational operations.
User Impact
Evaluates the number of users affected by an incident or the severity of the functional limitation for an individual.
Knowledge Base
A central repository containing documentation like FAQs, how-to articles, and troubleshooting guides.
Service Value System (SVS)
The ITIL SVS describes how all the components and activities of the organization unite as a system to enable value co-creation.
