Updated: Dec 27, 2022
This post will explain how to build a diagrammatic flow from log ingestion all the way to case and alert generation. It will also explain the full life-cycle of an incident and how to track and manage incidents to prevent fatigue in the Security Operations Center (SOC). Additionally, it will show how to escalate incidents, manage cases in your IT Service Management (ITSM) tool, and manage the flow back to your Security Orchestration, Automation and Response (SOAR) platform. We will also carve out the different stages of the incident life-cycle, making it easy to track the state of an incident and identify weak points in our workflows.
Throughout the design process of building your SOC, you have to come up with a plan to contain the incoming security events. Otherwise, your analysts will operate in chaos, creating alert fatigue and leaving high-severity events unhandled for far too long. Some of the points we will cover in this post work to address those issues:
Stages of an incident life-cycle
During the stages of the incident life-cycle, you will be able to methodically track security events as they progress through your SOC. Using this structure will help you identify weak points in your workflows and engineering by way of key metrics. There are many ways of categorizing your incident stages; however, the following is the framework that we use and have tested across a variety of environments and security solutions. In this section we will discuss the following stages:
Intake - Alert sources, pre-processing, and suppression rules.
Enrichment - Queries, data normalization, and data reformatting.
Management - Classifications, severities, allowlisting, blocklisting, and auto-resolve.
Notify - Notification schema, phone calls, SMS, email, and ticketing.
Remediate - Automated remediations, communications to various teams, manual remediations, and incident response involvement.
Post-management - Closure notes, close reasons, and the post-mortem report.
The intake process is a rather simple stage; however, it can have a high operational impact. For example, some untuneable alert sources can overwhelm a SOC or your SOAR platform. These security events can be dropped using a pre-processing/suppression rule. Additionally, during this stage you will want to configure which sources you want to absorb security events from.
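As a minimal sketch of such a pre-processing/suppression rule, the following drops events from untuneable sources before they ever become cases. The source and event-type names are invented for illustration, not a vendor schema:

```python
# Hypothetical pre-processing/suppression rule: filter known-noisy security
# events before they are ingested into the SOAR platform.

NOISY_SOURCES = {"legacy-ids"}                    # assumed untuneable alert sources
SUPPRESSED_EVENT_TYPES = {"heartbeat", "self-test"}

def should_ingest(event: dict) -> bool:
    """Return True if the event should be taken in, False if suppressed."""
    if event.get("source") in NOISY_SOURCES:
        return False
    if event.get("type") in SUPPRESSED_EVENT_TYPES:
        return False
    return True

events = [
    {"source": "defender", "type": "malware"},
    {"source": "legacy-ids", "type": "port-scan"},   # dropped: noisy source
    {"source": "defender", "type": "heartbeat"},     # dropped: suppressed type
]
ingested = [e for e in events if should_ingest(e)]
```

In practice the same logic lives in your SOAR platform's pre-processing rules rather than standalone code, but the shape is the same: match on source and type, drop before case creation.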
Market trends show that more and more SOCs are moving away from a monolithic approach. This means you'll see many more organizations relying on end security solutions such as Microsoft Cloud Application Security, Microsoft Defender for Endpoint, or the Cortex platforms (XDR, UBA, XSOAR, PRISMA) and pulling security events directly from the end solution. Pulling all this data into the SIEM along with all of your other logs has proven to be a cumbersome process which many are moving away from.
Additionally, this reduces the reliance on building SIEM and EDR use-cases, as many of those providers, such as the aforementioned ones, spend a huge amount of resources covering those detections for you. However, these benefits also come with downsides. Some of these solutions lack proper tuning capabilities, so you'd either deprecate that alert source and open a gap in your monitoring, or use a SOAR platform that can suppress some of these security events entirely.
Other SOAR functionality allows the engineering team and analysts to build allowlisting rules which operate on indicators and close very specific security events with specific indicator values. We will discuss allowlisting more when we get to incident management. However, the next step is to cover enrichment, since it's a vital part of the rest of the stages.
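An indicator-based allowlisting rule of the kind described above can be sketched as follows. The rule tuples and field names are assumptions for illustration; real platforms express these as configuration rather than code:

```python
# Illustrative allowlist: auto-close events whose classification and indicator
# values match a known-benign combination, instead of dropping them at intake.

ALLOWLIST_RULES = [
    # (classification, indicator field, allowed value) - all hypothetical
    ("Access Anomaly", "user", "svc-backup"),
]

def apply_allowlist(event: dict) -> dict:
    """Close the event if it matches an allowlist rule; otherwise pass through."""
    for classification, field, value in ALLOWLIST_RULES:
        if event.get("classification") == classification and event.get(field) == value:
            event["status"] = "closed"
            event["close_reason"] = "Allowlisted"
            return event
    return event

event = {"classification": "Access Anomaly", "user": "svc-backup", "status": "open"}
apply_allowlist(event)
```

Note that the event still exists after closure, which is what preserves your reporting capabilities, unlike a suppression rule that drops it entirely.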
In another post we discussed our significant incident attributes. In order for your analytical playbooks to be minimally sufficient, you will need to define those attributes and make sure they are enriched into your security event. Enrichment is critical to the effectiveness and efficiency of an analyst. To ensure the correct data is in front of the analyst, and so that your automations can perform various actions, you must normalize the relevant data that completes the security event.
Consequently, ensuring the data is normalized will help you to be vendor agnostic with your technologies. In addition, it is wise as the SOC manager to allow your analysts to communicate with the engineering team on which data they need to assess an incident and how they would like it to be portrayed.
For example, the Access Anomaly classification we spoke about in another post would require specific data surrounding the sign-in activity of the user. Those activities can be the following: device information, browser information, user location, user IP address and whether or not single-factor authentication or multi-factor authentication has been used. With the aforementioned information present in your security event and an analytical workflow designed in your playbook, your SOC will be able to handle thousands of these events effectively.
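The normalization step behind this can be sketched as mapping each vendor's raw sign-in fields onto one common schema. All raw field names below are invented for illustration; the point is that playbooks and layouts only ever see the normalized keys:

```python
# Sketch of vendor-agnostic normalization for Access Anomaly enrichment:
# two hypothetical vendors, one output schema.

def normalize_signin(raw: dict, vendor: str) -> dict:
    """Map a vendor-specific sign-in record onto one common schema."""
    if vendor == "vendor_a":
        return {
            "user": raw["userPrincipalName"],
            "ip": raw["ipAddress"],
            "location": raw["location"],
            "device": raw.get("deviceDetail", "unknown"),
            "mfa": raw["authRequirement"] == "multiFactorAuthentication",
        }
    if vendor == "vendor_b":
        return {
            "user": raw["account"],
            "ip": raw["src_ip"],
            "location": raw["geo"],
            "device": raw.get("endpoint", "unknown"),
            "mfa": raw["factors"] > 1,
        }
    raise ValueError(f"unknown vendor: {vendor}")

signin = normalize_signin(
    {"userPrincipalName": "j.doe", "ipAddress": "203.0.113.7",
     "location": "NL", "authRequirement": "singleFactorAuthentication"},
    "vendor_a",
)
```

Because the playbook only depends on the normalized keys, swapping an identity provider becomes a change to one mapping function rather than to every workflow.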
While information for security events is important, it's also a good idea to provide enrichment on your indicators. For example, keeping an indicator database in your SOAR platform is wise, and some platforms allow for fully customized layouts. By creating an indicator type such as "Account", you can control how a user account's information is displayed to the analyst. Now that you have the correct data, you can manage your security event.
Once your indicators are enriched into your platform your analysts can quickly check any related incidents and examine threat intelligence sources for reputation. Additionally, with a proper war room you can display header information and the email body in HTML format for easy viewing as seen below.
The figure displays the HTML view of the email, making it easy for the analyst to get the context of the email and assess whether it's a true positive or not. However, in many cases this information alone does not suffice; displaying the headers of an email is also important.
Enriching your security events is extremely important for your analysts to make a quick assessment of a possible attack. In platforms like Cortex XSOAR you can not only perform automations but also customize layouts, make data visible, and reformat it as you like. In addition, you will see later how you can embed scripts into a button on the layout to take swift action on incidents. In the next section we will cover the incident management portion of the life-cycle. It's important to note that without the relevant data added and enriched into your security event, the management process becomes more difficult.
The security event management process is critical to helping your analysts to identify which security events to handle first. This is the phase where you will be implementing your handling procedures and escalation policies as defined before in another post. Additionally, it allows you to create allowlisting and blocklisting procedures in your SOC so that you may tune out unwanted noise without ruining your reporting capabilities on specific security events. In some cases, you can actually provide tuning options based on your indicators/classifications and action taken.
For example, in many environments that run an anti-virus, you'll see hundreds of malware alerts coming from the appliance. By applying your incident handling procedures, you can set those security events to low severity and report only. This way you can assign an analyst weekly to check these reports and try to assess why so many malware events are penetrating your defences. Lastly, one of the most effective procedures is security event aggregation. This is also policy based and will keep an enormous number of security events from flooding the SOC. For example, suppose an Access Anomaly event is generated, where the use-case which triggered the first security event is an Impossible Travel activity.
A second alert is generated for Unfamiliar Sign-In properties, both are considered to be Access Anomalies, and both have the same user account. Based on the conditions of same classification and same user account, I can append the second security event to the first one. Typically, we only allow this procedure to take place if the oldest security event is open to prevent a potentially malicious attack from going unnoticed by the SOC.
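The aggregation policy just described can be sketched as follows. The event structure is simplified for illustration; the conditions (same classification, same user, oldest event still open) are the ones from the text:

```python
# Sketch of security event aggregation: append a new alert to an existing
# open event for the same user and classification, otherwise open a new one.

open_events = []  # events the SOC has not yet closed

def ingest(event: dict) -> dict:
    """Return the event the alert ended up in (existing or newly opened)."""
    for existing in open_events:
        if (existing["classification"] == event["classification"]
                and existing["user"] == event["user"]
                and existing["status"] == "open"):
            existing["linked_alerts"].append(event["use_case"])
            return existing
    event["status"] = "open"
    event["linked_alerts"] = [event["use_case"]]
    open_events.append(event)
    return event

first = ingest({"classification": "Access Anomaly", "user": "j.doe",
                "use_case": "Impossible Travel"})
second = ingest({"classification": "Access Anomaly", "user": "j.doe",
                 "use_case": "Unfamiliar Sign-In Properties"})
```

Restricting aggregation to open events is the safety valve: once an event is closed, a new alert for the same user opens a fresh event instead of disappearing into an already-resolved one.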
Applying the correct management procedures will help you filter up to 60% of the noise so that your SOC can apply its efforts to the most critical security events. After the management stage has filtered and sorted the events for your SOC, your analysts and your automations will commence the analysis stage.
Once your security event has been taken in, enriched, and managed, your analysts will handle what remains. Armed with analytical workflows and customized layouts filled with enriched data, your analysts and automations can now work the remaining security events. Typically, you want your most experienced analysts to design workflows so that your more junior analysts can work in a guided mode. This leaves far less room for error on the analyst's side. Additionally, you want to configure at which points of your playbooks you want to run automations.
You can effectively automate up to 99% of all L1 security events, leaving only meaningful work for your team. What I typically recommend is to talk to your experienced analysts, ask for a step-by-step analysis, and draw a diagram. Afterwards, present this diagram to your junior SOC analysts and ask them if they understand each step. Additionally, have them draw layouts of how they want the data displayed for this type of workflow. You always want to create one analytical workflow for each incident classification you develop.
Creating analytical playbooks for those classifications will help ensure a high level of quality in each security event. In some cases you can force an analyst to work through the specified steps. However, it's important to create inflection points in your playbook framework where your analysts can take manual decisions. Another key factor is speeding up the rate at which your analysts can ingest data from the various sources you provide them. One tool we have found to be extremely useful is Polarity. It interconnects data from various sources and reformats it within seconds. This allows you to combine the data sources of your SOAR, SIEM, EDR, threat intelligence, and ITSM tools and merge them into a simplistic, lightweight platform visible to your analysts as an overlay on top of your war room. It will also help them conduct mass lookups of various indicators with minimal effort.
The last step is to speak with the engineering team and check the feasibility of implementing the new playbook and layout. Once you have completed all of these steps and both the SOC and engineering agree to the new playbook and classification, you can start sending your security events into your SOAR platform. Once your L1 analysis has been performed, you have either created a case or closed the event because the analyst has made their determination. At this point you should begin notifying the appropriate teams to confirm and remediate the incident:
Notifying the correct teams is essential to handling the rest of the incident. Involving these teams gives the SOC the appropriate information for remediating and resolving the incident. Using your notification schema, you can take multiple different avenues to contact responders and other appropriate teams. One of the most common escalation methods is the phone call: specific incident responders or personnel are called in turn until someone acknowledges the incident.
The following call procedure can be performed via automation:
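A minimal sketch of such a call procedure is shown below. The on-call names are placeholders, and `place_call`/`send_sms` stand in for whatever telephony or paging integration you use; this is not a real vendor API:

```python
# Hypothetical automated call escalation: walk the on-call list until someone
# acknowledges the incident, then fall back to SMS if nobody answers.

ONCALL = ["analyst-1", "analyst-2", "soc-manager"]  # assumed escalation order

def place_call(person: str, incident_id: str) -> bool:
    """Placeholder telephony call: return True if the callee acknowledged."""
    return person == "analyst-2"   # simulated acknowledgement for the demo

def send_sms(group: str, incident_id: str) -> None:
    """Placeholder SMS fallback to a distribution group."""
    pass

def escalate_by_phone(incident_id: str):
    for person in ONCALL:
        if place_call(person, incident_id):
            return person          # acknowledged; stop the call tree
    send_sms("soc-distribution", incident_id)   # nobody answered
    return None

ack = escalate_by_phone("INC-1042")
```

The important design point is the acknowledgement loop: escalation stops at the first person who takes ownership, so the same incident never pages the whole chain unnecessarily.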
In addition to phone call escalations, you can use SMS, email, and a messaging application. The key component is to create a ticket in your IT service management tool with an appropriate description and the critical indicators. At this point you involve your more experienced analysts to confirm the incident.
During this process they will document each step they take so that the involved teams are up to date on the case. Additionally, it will give the assigned analyst the capability of asking questions to various teams in case they need additional information. Now we will begin discussing the remediation phase.
During the remediation process, we begin with confirmation of the received case. At this point we have several teams involved in a single tool cross-collaborating to resolve the incident. Additionally, to expedite the remediation process it's also important to make all the necessary tools available. Typically, we like to embed certain tooling into our SOAR platform. This allows our analysts to do everything from a single platform and drastically reduce the amount of time it takes to perform an action and resolve a case. With well-defined processes for each classification and the appropriate individuals being added to your communication channel, your analysts are able to perform a wide array of actions all from the same tool.
From the figure shown next, you can see in our Cortex XSOAR instance that the analyst is able to search for similar emails across the enterprise. These capabilities hook back into the Microsoft Defender for O365 P2 license. The following capabilities are all possible from your war room:
1. Blocking Senders
2. Blocking Domains
3. Password Reset
4. Detecting if a malicious URL was clicked
5. Email hunting by specific parameters (sender address, attachments, and others)
6. Email soft deletion
Once you have performed a look up of all emails sent from a particular sender, you can then take mass action against the phishing campaign by activating a script with a button. This will wipe out all emails from that sender across an organization. It's extremely fast and efficient for your analysts to work out of a single tool to take all actions.
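The button-driven mass action can be sketched as follows. Here `hunt_emails` and `soft_delete` are stand-ins for your mail-security integration's hunting and deletion commands, and the mailbox is a toy in-memory list; this is not the real Defender for O365 API:

```python
# Illustrative body of a layout button script: hunt all messages from a
# confirmed-malicious sender, then soft delete every hit across the org.

MAILBOX = [
    {"id": 1, "sender": "evil@example.com", "deleted": False},
    {"id": 2, "sender": "colleague@example.com", "deleted": False},
    {"id": 3, "sender": "evil@example.com", "deleted": False},
]

def hunt_emails(sender: str):
    """Stand-in for an email-hunting command filtered by sender address."""
    return [m for m in MAILBOX if m["sender"] == sender and not m["deleted"]]

def soft_delete(message: dict) -> None:
    """Stand-in for a soft-delete command (recoverable deletion)."""
    message["deleted"] = True

def eradicate_campaign(sender: str) -> int:
    """Wipe out all emails from one sender; return how many were removed."""
    hits = hunt_emails(sender)
    for message in hits:
        soft_delete(message)
    return len(hits)

removed = eradicate_campaign("evil@example.com")
```

Soft deletion rather than hard deletion is the safer default here: if the campaign verdict turns out to be wrong, the messages can still be restored.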
With this setup our mean time to resolve a phishing campaign is approximately 15 minutes. Appropriate centralization allows for mass eradication of threats with very limited effort. Once you are done with the remediation phase of the incident response life-cycle, you are able to document the outcome of your incident for audit and reporting purposes.
Incident post-management will vary based on your target operating model. Typically, you should train on a set of close reasons for your analysts. At this moment you are trying to analyze the following attributes in your security event process:
1. Effectiveness of your defence tooling (how many attacks blocked/quarantined).
2. Effectiveness of your use-cases and detections.
3. How long it took for your analyst to conclude a case, from the ingestion of a security event to the closing of the case.
4. How effective the communications between the teams were and whether they led to a swift and accurate conclusion.
False Positive - Incorrect Alert Logic
The security event was generated due to incorrect detection logic.
True Positive - Non-Malicious
The activity was determined to be a true positive, but it was authorized activity.
True Positive - Malicious
The activity was determined to be a true positive with malicious intent.
Not enough information was given to conclude the case.
It's vital to ensure that your analysts are entering the correct close codes. Additionally, your SOAR playbooks can make determinations and set close reasons based on certain conditions. Once you have these properly defined, you can easily run a search to see how your Malware Post-Compromise playbook is performing. For example, if you notice that 90% of the cases you have raised are false positives, you may want to determine whether you should correct something in your processes.
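Turning those close codes into a per-playbook quality metric is a short computation. The case records below are invented sample data; the close reasons are the ones defined above:

```python
# Minimal example: compute the false positive rate for one playbook from
# the close reasons recorded on its cases.
from collections import Counter

closed_cases = [
    {"playbook": "Malware Post-Compromise", "close_reason": "False Positive - Incorrect Alert Logic"},
    {"playbook": "Malware Post-Compromise", "close_reason": "True Positive - Malicious"},
    {"playbook": "Malware Post-Compromise", "close_reason": "False Positive - Incorrect Alert Logic"},
    {"playbook": "Phishing", "close_reason": "True Positive - Malicious"},
]

def false_positive_rate(cases, playbook: str) -> float:
    """Share of this playbook's closed cases that were false positives."""
    reasons = Counter(c["close_reason"] for c in cases if c["playbook"] == playbook)
    total = sum(reasons.values())
    fps = sum(n for reason, n in reasons.items() if reason.startswith("False Positive"))
    return fps / total if total else 0.0

rate = false_positive_rate(closed_cases, "Malware Post-Compromise")
```

Run weekly per classification, a metric like this is what turns close codes from paperwork into the feedback signal the next paragraphs describe.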
Finally, after you are done setting up your close codes, it's important to use them to feed back into the SOC and continuously improve your processes and tooling. This part is extremely important, and it's necessary to guide the analyst into the same corrective workflow each time a false positive case is raised or there are efficiency gains to be made. In the next section we will discuss continuous improvement in detail.
Continuous improvement is vital if you want to have an ever-evolving operation. Playbooks are excellent when it comes to guiding your analysts into the correct workflows for analysis, remediation, and incident management. In addition to those, you can also apply continuous improvement processes at the end of every incident. For example, if you raised a case and determined it to be a false positive, you can nearly always create an allowlisting rule or tune a use-case. You can even add these workflows to the end of your playbooks to ensure that your analysts follow the same steps each time.
Another extremely effective way to ensure that your use-cases are continuously improving is to connect your SOAR platform to the ITSM tool of choice for your detection engineering team. Each time a false positive occurs due to a faulty use-case a simple click of a button will feed this information back directly to the detection engineering team. Lastly, if the event type is not tuneable, you can rely on your allowlisting rules to suppress security events based on specific indicators and classifications.
The latter allows you to continue reporting on the events without completely tuning them out. This can be beneficial for many reasons. For example, a lot of administrators use PsExec, a tool for remote execution; it is also often used by threat actors to execute malicious executables on remote systems. However, if a specific administrator is expected to use it on a certain system each time, you can create an allowlisting rule for this without eliminating it from your recurring reports.
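A "report only" allowlist for the PsExec example can be sketched as follows. The admin/host pair is a hypothetical expected combination; the key behaviour is that every event is logged for reporting even when it is auto-closed:

```python
# Sketch of a report-only allowlist: expected admin/host PsExec pairs are
# auto-closed but still counted in recurring reports, not tuned out entirely.

EXPECTED_PSEXEC = {("admin.jane", "srv-fileshare-01")}  # assumed benign pair

report_log = []  # feeds the recurring report, allowlisted or not

def handle_psexec_event(event: dict) -> str:
    report_log.append(event)  # always report, even when allowlisted
    if (event["user"], event["host"]) in EXPECTED_PSEXEC:
        return "auto-closed"
    return "escalate"

outcome = handle_psexec_event({"user": "admin.jane", "host": "srv-fileshare-01"})
```

Contrast this with a suppression rule at intake: there the event never exists, so it can never appear in a report; here it exists, is closed, and still counts.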
In addition to the process displayed in the previous figure, you can also set up your SOAR engineering team to run in sprints. If you give your analysts the ability to raise feature requests, they can be reviewed during backlog grooming calls and prioritised in the next sprint. It's extremely important to treat your analysts as the customer of all engineering teams. By listening closely to your frontline analysts, you will continuously improve the SOC in the most effective manner.
With all of your processes set in place, it's going to be vital to start tracking performance indicators. From here we will measure the target operating model, the classifications, and the overall incident response life-cycle. This will help us determine where in our process we are struggling the most with time, efficiency, and error.
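One of those indicators, mean time to resolve, can be sketched as a direct computation over case timestamps. The case data below is invented sample data; the measurement window (ingestion of the security event to case closure) is the one defined earlier in this post:

```python
# Sketch of a core SOC performance indicator: mean time to resolve (MTTR),
# measured from security event ingestion to case closure.
from datetime import datetime, timedelta

cases = [
    {"ingested": datetime(2022, 12, 1, 9, 0),  "closed": datetime(2022, 12, 1, 9, 15)},
    {"ingested": datetime(2022, 12, 1, 10, 0), "closed": datetime(2022, 12, 1, 10, 45)},
]

def mean_time_to_resolve(cases) -> timedelta:
    """Average of (closed - ingested) across all closed cases."""
    durations = [c["closed"] - c["ingested"] for c in cases]
    return sum(durations, timedelta()) / len(durations)

mttr = mean_time_to_resolve(cases)
```

Sliced per classification or per playbook, the same calculation shows exactly which stage of the life-cycle is costing the most time.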