Incident management process, to transform crises into opportunities for continuous improvement
From a jammed printer to an application outage, your IT system experiences many incidents, some more critical than others. Hence the importance of putting in place an effective incident management process.
But how can you ensure that your incident management procedure is effective? What resolution stages should be defined? Is it possible to provide a satisfactory solution for the user, in line with your SLA, and within reasonable timescales?
To help you achieve greater efficiency and consistency, Appvizer explains in this article the principles and steps to follow, based on the ITIL framework, and outlines the benefits to be gained from this working method.
What is IT incident management?
Definition of incident management
Most IT incidents are managed in accordance with the ITIL (Information Technology Infrastructure Library) framework.
This framework, developed in the 1980s by the UK government's Central Computer and Telecommunications Agency (CCTA), which was later absorbed into the Office of Government Commerce, is a set of documents listing the best practices to be applied in the management of IT services on a broad scale. The aim is to provide methodological support for professionals, with a view to continuous improvement.
ITIL covers a number of themes (organisation of the information system, configuration management, change management, etc.), including incident management, which is specified as follows:
An incident is defined as any event which is not part of the standard operation of a service and which causes, or may cause, an interruption or a reduction in the quality of this service.
The different types of incident
The above definition encompasses different types of incident:
- Software or application incidents. Examples include:
  - a program error slowing down the user;
  - application slowdown, etc.
- Hardware incidents. Examples include:
  - a jammed printer;
  - a hard disk that is nearly full, etc.
- Service requests. Examples include:
  - a forgotten password;
  - a request for specific documentation, etc.
Incident management vs problem management
Incident management is often confused with problem management. However, they involve different procedures.
According to ITIL, problem management is used to:
Minimise the negative impact on the company's activities of incidents and problems caused by errors in the IT infrastructure, and prevent the recurrence of incidents induced by these errors.
➡️ In other words, problem management is more proactive, while incident management is more reactive.
However, the two processes work in parallel, with problem management operating through the identification of recurring incidents.
Why is incident management important?
A standardised process for managing your incidents generates numerous benefits for your company 🤩:
- it more quickly reduces the impact, sometimes critical, of incidents on the company and its business;
- it greatly simplifies the procedure, for example by avoiding back-and-forth emails;
- it allows recurring incidents to be identified, enabling the problem management process mentioned above to be deployed;
- it improves the quality of the business knowledge base by setting up databases for handling incidents;
- it provides transparency within the organisation regarding the resolution of incidents;
- it increases user and customer satisfaction, as well as the productivity of everyone in the company.
☝️ Bear in mind that an incident management process goes beyond simply resolving an IT problem. It provides solid support for the company's business functions, by reducing the number of slowdowns or stoppages in activities that would have an impact on turnover.
Example of a 7-step IT incident management procedure
#1 Identifying and recording the incident
To begin with, you need to identify the incident, specifying:
- its name and identification number;
- the identity of the person responsible;
- the date;
- and above all its characteristics (nature, seriousness and impact on operations).
E.g.: a server breakdown affecting several departments will be considered a major incident, whereas a connection problem at a single workstation will be considered less critical.
It is up to the department responsible to record these details on the chosen medium (software, spreadsheet, form, etc.) and to report it to the support teams responsible for dealing with it in accordance with the procedure.
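To make this first step concrete, here is a minimal Python sketch of what an incident record could hold. The field names, the `Severity` levels and the `INC-` numbering scheme are assumptions chosen for illustration; they come neither from ITIL nor from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Severity(Enum):
    """Hypothetical two-level scale; adapt it to your own organisation."""
    MINOR = 1
    MAJOR = 2


_ids = count(1)  # hypothetical sequential numbering of incidents


@dataclass
class IncidentRecord:
    name: str
    reported_by: str   # identity of the person responsible
    nature: str        # e.g. "hardware", "software"
    severity: Severity
    impact: str        # impact on operations, in plain words
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    incident_id: str = field(default_factory=lambda: f"INC-{next(_ids):05d}")


# Made-up example matching the server breakdown mentioned above.
server_down = IncidentRecord(
    name="File server unreachable",
    reported_by="service-desk",
    nature="hardware",
    severity=Severity.MAJOR,
    impact="several departments cannot access shared files",
)
```

Whatever medium you choose, the point is that every incident carries the same fields from the start, so that the following steps have something consistent to work with.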
#2 Incident classification and analysis
The incident is then classified according to the order of priority defined upstream and specific to your organisation, depending for example on the impact on the business and the urgency of the situation.
For example, a network failure could be classified as a "connectivity" incident, with a "high" severity level if it paralyses the entire company.
At the same time, an initial analysis is carried out to determine the possible causes of the incident. Diagnostic tools or even previous experience can be used for this assessment.
☝️ Note that if this is a service request, you must follow the procedure associated with that service.
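The classification by impact and urgency can be sketched as a small priority matrix. The three levels, the scoring rule and the `P1` to `P4` labels below are assumptions for illustration only; your organisation defines its own scale upstream.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label (P1 is handled first)."""
    score = impact + urgency  # ranges from 2 to 6
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


# A network failure paralysing the entire company: high impact, high urgency.
network_failure = priority(Level.HIGH, Level.HIGH)
```

A simple sum treats impact and urgency as equally important. If one of them should weigh more in your context, an explicit lookup table is often easier to discuss with the business than a formula.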
#3 Gathering evidence
The next step is to gather as much evidence as possible. The objective? To understand what happened, when, how and why.
This includes, for example:
- system or application logs;
- screenshots or videos;
- error messages displayed;
- network data or metrics from monitoring tools;
- any other element that can support the technical analysis.
☝️ Do not neglect this stage, as it determines the quality of the diagnosis to come, and therefore the speed of resolution.
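As an illustration of evidence gathering, the sketch below keeps only the log lines recorded around the time of the incident. It assumes each line starts with an ISO 8601 timestamp followed by a space; the function name `logs_around`, the 10-minute window and the sample lines are all invented for the example.

```python
from datetime import datetime, timedelta


def logs_around(lines, incident_time, window_minutes=10):
    """Return the log lines timestamped within the window around the incident."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        stamp, _, _message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if abs(when - incident_time) <= window:
            selected.append(line)
    return selected


# Made-up sample lines, for illustration only.
sample = [
    "2024-05-02T09:41:00 INFO backup finished",
    "2024-05-02T10:02:13 ERROR disk /dev/sda1 at 98% capacity",
    "2024-05-02T10:04:40 CRITICAL database connection refused",
    "2024-05-02T11:30:00 INFO user login",
]
evidence = logs_around(sample, datetime(2024, 5, 2, 10, 5))
```

Here `evidence` holds the two lines logged between 09:55 and 10:15, which is usually a more useful starting point for the diagnosis than the full log file.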
#4 Investigating and diagnosing the incident
All the information relating to the incident is analysed, with the aim of resolving it and restoring the service as quickly as possible. The teams in charge of this work use various methodologies, from log analysis to real-time tests.
👉 E.g.: if a server breaks down, the team will consult the event logs for critical errors or use monitoring tools to check hardware performance.
Be aware that the first level of support is sometimes unable to resolve the incident: this triggers an escalation, i.e. its resolution is transferred to the next level.
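The escalation logic can be sketched as a chain of support levels tried in order. The level names and the rules inside `level_1` and `level_2` are hypothetical; in practice each level is a team, not a function.

```python
from typing import Callable, Optional, Sequence, Tuple

# A handler returns a resolution note, or None when it cannot resolve the incident.
Handler = Callable[[str], Optional[str]]


def resolve_with_escalation(
    description: str, levels: Sequence[Tuple[str, Handler]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each support level in order and stop at the first one that resolves."""
    for level_name, handler in levels:
        resolution = handler(description)
        if resolution is not None:
            return level_name, resolution
    return None, None  # no level could resolve it: manual follow-up is needed


def level_1(description: str) -> Optional[str]:
    # Hypothetical rule: the first level only handles password issues.
    return "password reset" if "password" in description else None


def level_2(description: str) -> Optional[str]:
    # Hypothetical rule: the second level handles server issues.
    return "service restarted" if "server" in description else None


levels = [("L1", level_1), ("L2", level_2)]
handled_by, fix = resolve_with_escalation("mail server not responding", levels)
```

In this example the first level returns nothing for a server issue, so the incident is passed on and `handled_by` ends up as `"L2"`.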
#5 Incident resolution and return to service
Incident resolution takes various forms:
- the incident is repaired immediately: it is resolved and normal operations resume;
- a workaround has been found. Incident management must lead to the rapid restoration of services, so even if the workaround is not perfect, the process is respected as long as it makes the situation "acceptable".
☝️ Note that if the underlying cause of several incidents is unknown, but they seem to share the same origin, it is recommended that a problem management process be initiated. Remember that incident and problem management flows often intersect.
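One simple way to spot incidents that seem to share the same origin is to count them per suspected origin. The sketch below assumes each incident has been tagged with such an origin during diagnosis; the threshold of three and the sample history are arbitrary, invented values.

```python
from collections import Counter


def problem_candidates(incidents, threshold=3):
    """Return the suspected origins shared by at least `threshold` incidents."""
    counts = Counter(origin for _incident_id, origin in incidents)
    return sorted(origin for origin, n in counts.items() if n >= threshold)


# Made-up history of (incident id, suspected origin) pairs.
history = [
    ("INC-101", "vpn-gateway"),
    ("INC-102", "print-server"),
    ("INC-103", "vpn-gateway"),
    ("INC-107", "vpn-gateway"),
    ("INC-110", "print-server"),
]
candidates = problem_candidates(history)
```

With this history, only `vpn-gateway` reaches the threshold and would be handed over to problem management for a root cause analysis.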
#6 Verifying resolution
Once the solution has been applied, you need to make sure that everything is working normally, by checking:
- that the service is operational;
- that users can resume their activities without any inconvenience;
- that no side-effects have been generated.
This stage is crucial for validating the effectiveness of the corrective action. It also avoids "boomerang" incidents, those that return without warning.
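These verification points can be expressed as a small checklist runner. The checks below are placeholders that simply return `True` or `False`; real ones would probe the service, confirm with users and read monitoring data.

```python
def verify_resolution(checks):
    """Run every named check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Hypothetical checks, hard-coded for the example.
checks = {
    "service is operational": lambda: True,
    "users can resume work": lambda: True,
    "no side effects observed": lambda: False,
}
failed = verify_resolution(checks)
```

An empty result means the incident can move on to closure. Here one check fails, so the incident stays open rather than coming back later as a "boomerang".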
#7 Closing the incident
To close an incident properly, the teams in charge of the process carry out a number of actions:
- they record all the details of the incident and the time spent on it. ☝️ This documentation is used to create a searchable history to improve incident management protocols;
- they inform the user of the resolution;
- they ensure that all the details of the solution are clear and legible.
This level of detail reduces the risk of conflict between the various stakeholders.
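The closing actions above can be enforced with a small guard that refuses to close a ticket until the documentation is complete and the user has been informed. The ticket is modelled as a plain dictionary, and the field names are assumptions made for this sketch.

```python
from datetime import datetime

# Hypothetical list of fields that must be filled in before closing.
REQUIRED_FIELDS = ("summary", "resolution", "opened_at", "resolved_at", "user_notified")


def close_incident(ticket: dict) -> dict:
    """Return a closed copy of the ticket, or raise if something is missing."""
    missing = [name for name in REQUIRED_FIELDS if not ticket.get(name)]
    if missing:
        raise ValueError(f"cannot close, missing: {', '.join(missing)}")
    time_spent = ticket["resolved_at"] - ticket["opened_at"]
    return {**ticket, "status": "closed", "time_spent": time_spent}


# Made-up ticket for illustration.
ticket = {
    "summary": "Database connection refused",
    "resolution": "Freed disk space and restarted the database service",
    "opened_at": datetime(2024, 5, 2, 10, 5),
    "resolved_at": datetime(2024, 5, 2, 12, 35),
    "user_notified": True,
}
closed = close_incident(ticket)
```

Because `user_notified` must be truthy, a ticket cannot be closed silently. The computed `time_spent` feeds the searchable history and can be compared against your SLA targets.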
What about the DevOps and SRE incident management process?
In a DevOps or SRE environment, incident management takes on a different dimension. It's no longer just about fixing things quickly: it's about ensuring the ongoing resilience of systems, while maintaining a high level of performance.
Here, you don't "wait for incidents to happen". You anticipate them, you document them, and above all, you learn from them. In other words, every bug becomes an opportunity for improvement.
👉 More concretely, the DevOps/SRE process is based on specific tools and practices:
- proactive monitoring via dashboards and intelligent alerts;
- the use of observability tools (logs, traces, metrics, etc.) to diagnose problems in real time;
- communication and alerting channels (Slack, Teams, PagerDuty, etc.) to coordinate the response;
- the use of runbooks to ensure rapid, stress-free action;
- post-incident reviews to prevent the error from happening again.
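To give an idea of what proactive monitoring can look like, here is a minimal sketch of an alert that fires when the error rate over the most recent requests exceeds a threshold. The class name, window size and threshold are assumptions; real monitoring platforms offer far richer alerting rules.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error ratio over the last `window` requests exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.05):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, failed: bool) -> bool:
        """Store one request outcome and return True if the alert should fire."""
        self.outcomes.append(failed)
        full = len(self.outcomes) == self.outcomes.maxlen
        # Wait for a full window so a single early failure does not trigger it.
        return full and sum(self.outcomes) / len(self.outcomes) > self.threshold


alert = ErrorRateAlert(window=10, threshold=0.2)
# Eight successful requests followed by three failures.
fired = [alert.record(failed) for failed in [False] * 8 + [True] * 3]
```

In this run the alert stays silent until the eleventh request, when three of the last ten requests have failed. The idea is to detect the degradation before users start reporting it.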
So why is it so important to put in place a solid incident management process? Because in a cloud-native environment, interruptions are costly in terms of time, money and reputation. What's more, systems have become increasingly complex and interconnected.
The human factor: a strategic issue in incident management
In most digital environments, incidents are not caused solely by technical failures. According to several studies, the human factor is involved in over 80% of IT incidents. A configuration error, a click on a malicious link, an incorrectly followed procedure: human error remains one of the most fragile links in the operational chain.
As a result, you need to incorporate this parameter into your incident management process. It's not just a question of correcting an error, but of understanding why it happened and how to prevent it from happening again.
👉 Implementing a human and systemic approach makes it possible to:
- strengthen the culture of prevention;
- encourage transparent reporting of errors;
- provide targeted, ongoing training;
- establish a climate of mutual trust.
Technology can fail, but it's often a human being who triggers the alert... or who ignores it. By treating your people as key players, you can transform incident management into a lever for continuous improvement and resilience.
What tools do you need for incident management?
You've got a clearer picture of incident management, but perhaps you're wondering how to put all these recommendations into practice? Can you already see yourself applying your incident management procedure using an Excel spreadsheet or a traditional project management tool?
Fortunately, specific software has been developed to support your teams at every stage of the incident management procedure.
To help you, have a look at our selection ✔️:
- Jira. Developed by Atlassian, the Jira ticketing tool standardises the processing of tickets opened following the reporting of an incident.
  😀 Why Jira?
  - create tickets with a precise level of information (descriptions, severity level, etc.) and follow all the processes required to manage them;
  - easily classify and prioritise bugs, and assign them to the right employee or department;
  - integrate your tickets into a ready-made workflow, or one that can be customised to suit your needs and processes.
- NinjaOne. NinjaOne is a complete IT asset management solution for SMEs, mid-sized companies and large enterprises.
  😀 Why NinjaOne?
  - centrally and proactively supervise your entire IT infrastructure to detect incidents as early as possible;
  - automatically and reliably apply the necessary patches to all your endpoints;
  - store all the standardised and structured documentation relating to your processes within the platform.
- Octopus. Octopus is an ITSM (Information Technology Service Management) tool, i.e. IT service management software.
  😀 Why Octopus?
  - benefit from a tool developed in accordance with ITIL best practices: your teams can apply them naturally without needing to master them perfectly beforehand;
  - easily manage requests from your users, whether for incidents or service requests;
  - improve preventive action thanks to a database that manages all aspects of your information systems' configuration.
- Splunk Enterprise Security. Splunk Enterprise Security is a SIEM (Security Information and Event Management) solution designed to help you strengthen the security of your IT systems and manage incidents.
  😀 Why Splunk Enterprise Security?
  - take advantage of an analytics-driven solution that streamlines cybersecurity-related tasks;
  - get real-time insight through customised dashboards and views;
  - detect incidents more quickly and take preventive action.
What are the key points of IT incident management?
Incident management, as standardised by ITIL, is a procedure that should be incorporated into your information system as quickly as possible, as it promises to provide a clear and rapid response in the event of an incident.
What's more, it gradually leads to a reduction in the number of incidents by feeding into your problem management processes, and hence your preventive actions.
And the good news is that everyone wins when such a working method is put into practice:
- technical teams work more efficiently and transparently;
- users are less affected by bugs and more satisfied with your product;
- the company suffers fewer losses in the event of a critical incident.
Finally, it's worth remembering that good incident management goes hand in hand with the use of relevant tools, which support your process and save your teams precious time.

Currently Editorial Manager, Jennifer Montérémal joined the Appvizer team in 2019. Since then, she's been putting her expertise in web writing, copywriting and SEO to work for the company, with her sights set on reader satisfaction 😀!
A medievalist by training, Jennifer took a short break from fortified castles and other manuscripts to discover her passion for content marketing. She took away from her studies the skills expected of a good copywriter: understanding and analysing the subject, conveying the information, with a real mastery of the pen (without systematically resorting to a certain AI 🤫).
An anecdote about Jennifer? She stood out at Appvizer for her karaoke skills and her boundless knowledge of musical dreck 🎤.