Detection Engineering the SOC: Building a SOAR Workflow

Cybersec Café #30 - 08/27/2024

Aug 27, 2024

Welcome back to the third and final article in the short series: Engineering the SOC. So far, we’ve taken a use case through detection creation, and then through the process of creating an Incident Response Playbook. The purpose of this series has been to show the exact thought process of a Security Engineer taking a use case through the Detection Lifecycle. The last phase for our use case is to design an automated workflow in order to lower the ticket triage time and, in turn, fight off alert fatigue once and for all!

In case you haven’t read the first two articles in the series, you can view the table of contents below. This post marks the finale, so I’d highly recommend checking them out before you continue.

Writing a Detection Rule
Designing an Incident Response Playbook
Building a SOAR workflow

We were given a use case for our organization where some privileged users will need to access the AWS Console without using MFA. Although this will generally be standard behavior, since this design decision was made purposefully for privileged users to access special accounts, it could also point to malicious activity, such as lateral movement to one of these privileged roles, or even leaked credentials. Due to the critical permissions this kind of access entails, the effects of a malicious attack could be devastating to the organization. So, it was essential to set up a detection in order to alert the SOC team any time this activity occurs.

After the detection was set up, we took the time to document the Incident Response Playbook. We created artifacts such as how to triage the alert, saved queries to investigate, links to resources, steps to verify activity, and escalation instructions in the case of a security incident. The purpose of an IR Playbook is to have documented, easy-to-follow processes for the SOC Analysts.

After setting up the IR Playbook, it became apparent how manual a process alert investigation is. In this specific use case, there are many resources to parse through, several queries to run in order to investigate, and other actions the Analyst must take to verify the activity. There must be a better way to handle this… right?

Enter the SOAR

A SOAR is a platform designed for automation around security alerts. SOAR stands for Security Orchestration, Automation, and Response. Even though there are other tools out there designed to help automate processes, SOAR tools are designed specifically for security use cases. They generally come prepackaged with support for the popular applications Analysts use to triage alerts, as well as integrations that fit seamlessly with the prominent SIEMs on the market (which are the general source of alerts). Some popular examples of SOAR platforms are:

Splunk SOAR
Tines
Shuffler
Swimlane
Custom AWS Step Functions — more on this in the future ;)

Even though each of these products has its pros and cons, the thought process behind designing a workflow will remain the same no matter the platform chosen.

The Workflow

Let’s start designing our workflow. For reference, here’s a refresher on how the log looks for the detection we wrote:

{
"awsRegion":"us-west-2",
"consoleLogin":"Success",
"eventName":"ConsoleLogin",
"eventSource":"signin.amazonaws.com",
"eventTime":"2054-03-14T23:12:12Z",
"eventType":"AwsConsoleSignIn",
"ip":"8.12.54.2",
"mfaAuthenticated":false,
"mfaUsed":"No",
"userIdentity":"RCXCybersecCafe"
}

Before we get started, it may also help to take a quick glance at the IR Playbook from the second article to jog your memory on our triage process.

Now that we understand the triage process and have made a quick note of the different fields available in the log, we can start designing the workflow.
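As a quick illustration of noting the fields we have available, here is a minimal Python sketch that pulls the interesting values out of the sample log above. The LoginArtifacts class and the extract_artifacts function are hypothetical names made up for this example, and a real SOAR platform would normally expose these fields through its own alert object instead.

```python
import json
from dataclasses import dataclass

# The sample log from this article, copied as a JSON string.
SAMPLE_LOG = """
{
  "awsRegion": "us-west-2",
  "consoleLogin": "Success",
  "eventName": "ConsoleLogin",
  "eventSource": "signin.amazonaws.com",
  "eventTime": "2054-03-14T23:12:12Z",
  "eventType": "AwsConsoleSignIn",
  "ip": "8.12.54.2",
  "mfaAuthenticated": false,
  "mfaUsed": "No",
  "userIdentity": "RCXCybersecCafe"
}
"""


@dataclass(frozen=True)
class LoginArtifacts:
    """Hypothetical container for the fields the workflow cares about."""
    user: str
    ip: str
    region: str
    event_time: str
    mfa_used: bool
    login_succeeded: bool


def extract_artifacts(raw_log: str) -> LoginArtifacts:
    """Parse the raw log and keep only the artifacts of interest."""
    event = json.loads(raw_log)
    return LoginArtifacts(
        user=event["userIdentity"],
        ip=event["ip"],
        region=event["awsRegion"],
        event_time=event["eventTime"],
        mfa_used=bool(event.get("mfaAuthenticated", False)),
        login_succeeded=event.get("consoleLogin") == "Success",
    )


artifacts = extract_artifacts(SAMPLE_LOG)
print(artifacts)
```

The user and the IP are the two values that nearly every later step in the workflow keys on.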

Some things to consider when designing the workflow are:

What are the artifacts of interest in the log?
What do we want to know about the user involved?
What do we need to know about the environment in the alert?
What information is needed to mark the activity as not malicious?

With these thoughts in mind, let’s start designing out the workflow. Here are some ideas on what we can implement:

IP Automation: The first steps of the workflow should be to analyze the IP. Here are some common resources we can leverage to do so.

VirusTotal API: Utilize the API to analyze the IP, and then show the results in the ticket.
Talos Intelligence IP: Create a quick link that opens the Talos Intelligence page for the IP.
SIEM IP: Create a quick link to a SIEM query showing recent activity from the IP.
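The three IP steps above could be sketched roughly like this in Python. The VirusTotal request follows the public v3 API (the ip_addresses endpoint with an x-apikey header), but treat the rest as assumptions: the Talos URL pattern should be checked against the live site, and the SIEM base URL and query parameters are placeholders, since every SIEM builds its search links differently.

```python
import json
import urllib.parse
import urllib.request

VT_IP_URL = "https://www.virustotal.com/api/v3/ip_addresses/{ip}"


def build_vt_request(ip: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the VirusTotal v3 IP report request."""
    return urllib.request.Request(
        VT_IP_URL.format(ip=ip), headers={"x-apikey": api_key}
    )


def fetch_vt_report(ip: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON report."""
    request = build_vt_request(ip, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_vt_report(report: dict) -> str:
    """Turn the per-engine verdict counts into one line for the ticket."""
    stats = (
        report.get("data", {})
        .get("attributes", {})
        .get("last_analysis_stats", {})
    )
    total = sum(stats.values())
    return (
        f"VirusTotal: {stats.get('malicious', 0)} malicious, "
        f"{stats.get('suspicious', 0)} suspicious out of {total} engines"
    )


def talos_link(ip: str) -> str:
    """Quick link to the Talos reputation page (URL pattern is an assumption)."""
    query = urllib.parse.urlencode({"search": ip})
    return "https://talosintelligence.com/reputation_center/lookup?" + query


def siem_link(ip: str, base_url: str = "https://siem.example.com/search") -> str:
    """Quick link to a SIEM search; base URL and parameters are placeholders."""
    query = urllib.parse.urlencode({"query": f'ip="{ip}"', "range": "30d"})
    return f"{base_url}?{query}"


# Illustrative response fragment, not real VirusTotal data.
example_report = {
    "data": {
        "attributes": {
            "last_analysis_stats": {
                "malicious": 2,
                "suspicious": 1,
                "harmless": 60,
                "undetected": 20,
            }
        }
    }
}
print(summarize_vt_report(example_report))
print(talos_link("8.12.54.2"))
print(siem_link("8.12.54.2"))
```

Keeping the request builder separate from the network call makes the summary logic easy to test without an API key.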

Login Queries: The next phase will be queries looking for recent logins. These can be classified in the following five queries, and can generally be queried for in the SIEM.

Logins for the IP: Find the counts of the users who logged in from the IP in the last 30–60 days.
Logins by the User: Find the login counts of the IPs the user has logged in from in the last 30–60 days.
Most Recent Successful User Logins: Find the 10 most recent successful logins the user has had.
Most Recent Successful non-MFA AWS Logins: Find the 10 most recent logins and the associated users.
Most Common Users non-MFA AWS Logins: Find the most common users to log in without MFA over the last 30–60 days.
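In practice these five lookups would be written in your SIEM's own query language. To show the logic without tying it to one product, here is a sketch that runs the same aggregations over an in-memory list of events shaped like the sample log. The function names and the extra events are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def parse_time(ts: str) -> datetime:
    """Parse the eventTime format used in the sample log (UTC, 'Z' suffix)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def successful_logins(events, now, days):
    """Successful console logins inside the lookback window."""
    cutoff = now - timedelta(days=days)
    return [
        e for e in events
        if e["eventName"] == "ConsoleLogin"
        and e["consoleLogin"] == "Success"
        and parse_time(e["eventTime"]) >= cutoff
    ]


def newest_first(logins, limit):
    return sorted(logins, key=lambda e: parse_time(e["eventTime"]), reverse=True)[:limit]


def users_for_ip(logins, ip):
    """Logins for the IP: how often each user logged in from it."""
    return Counter(e["userIdentity"] for e in logins if e["ip"] == ip)


def ips_for_user(logins, user):
    """Logins by the User: how often the user logged in from each IP."""
    return Counter(e["ip"] for e in logins if e["userIdentity"] == user)


def recent_logins_for_user(logins, user, limit=10):
    return newest_first([e for e in logins if e["userIdentity"] == user], limit)


def recent_non_mfa_logins(logins, limit=10):
    return newest_first([e for e in logins if not e["mfaAuthenticated"]], limit)


def top_non_mfa_users(logins, limit=10):
    users = Counter(e["userIdentity"] for e in logins if not e["mfaAuthenticated"])
    return users.most_common(limit)


def login(time, ip, user, mfa, result="Success"):
    """Helper that builds an illustrative event in the sample log's shape."""
    return {"eventName": "ConsoleLogin", "consoleLogin": result, "eventTime": time,
            "ip": ip, "userIdentity": user, "mfaAuthenticated": mfa}


# Illustrative events only; the first one mirrors the sample log.
EVENTS = [
    login("2054-03-14T23:12:12Z", "8.12.54.2", "RCXCybersecCafe", False),
    login("2054-03-10T09:00:00Z", "8.12.54.2", "RCXCybersecCafe", False),
    login("2054-03-01T09:00:00Z", "10.0.0.5", "example-admin", True),
    login("2054-03-12T09:00:00Z", "8.12.54.2", "example-admin", False, result="Failure"),
]
NOW = parse_time("2054-03-15T00:00:00Z")
LOGINS = successful_logins(EVENTS, NOW, days=30)

print(users_for_ip(LOGINS, "8.12.54.2"))
print(ips_for_user(LOGINS, "RCXCybersecCafe"))
print(top_non_mfa_users(LOGINS))
```

If the user shows up regularly from the same IP without MFA, that history alone goes a long way toward marking the alert as expected behavior.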

Action Count: These last queries will look for common actions the user has taken. This will give context around the alert, and allow the Analyst to immediately see if any malicious actions may have been performed recently. We can get a good overview of these actions with the following two queries, also generally queried for in the SIEM.

Recent User Actions: Find the 10 most recent actions performed by the user, and the time the actions were performed at.
User Action Counts: Find the 5–10 most common actions the user has performed, and the number of times they’ve been performed over the past 2–7 days.
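The two action queries follow the same pattern. Here is a small in-memory sketch, again standing in for what would really be SIEM queries. The function names and the example events are assumptions for illustration only.

```python
from collections import Counter


def actions_for_user(events, user, since=""):
    """Events for one user, optionally limited to those at or after `since`.

    Timestamps in one fixed ISO 8601 UTC format compare correctly as strings.
    """
    return [
        e for e in events
        if e["userIdentity"] == user and e["eventTime"] >= since
    ]


def recent_user_actions(events, user, limit=10):
    """Recent User Actions: newest actions with the time they were performed."""
    mine = sorted(
        actions_for_user(events, user),
        key=lambda e: e["eventTime"],
        reverse=True,
    )
    return [(e["eventTime"], e["eventName"]) for e in mine[:limit]]


def user_action_counts(events, user, since, limit=10):
    """User Action Counts: most common actions inside the lookback window."""
    names = Counter(e["eventName"] for e in actions_for_user(events, user, since))
    return names.most_common(limit)


def action(time, name, user="RCXCybersecCafe"):
    return {"userIdentity": user, "eventTime": time, "eventName": name}


# Illustrative events only.
ACTIONS = [
    action("2054-03-14T23:12:12Z", "ConsoleLogin"),
    action("2054-03-14T23:15:00Z", "ListBuckets"),
    action("2054-03-14T23:16:30Z", "GetObject"),
    action("2054-03-14T23:17:10Z", "GetObject"),
    action("2054-03-01T08:00:00Z", "GetObject"),  # older than the 7-day window
]

print(recent_user_actions(ACTIONS, "RCXCybersecCafe", limit=3))
print(user_action_counts(ACTIONS, "RCXCybersecCafe", since="2054-03-08T00:00:00Z"))
```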

Putting it all Together

If we take these automation steps and design them in our SOAR product, the workflow may look a little something like this:

As you can see, we run quite a few things in parallel to improve performance. We could also design the workflow with each query running back-to-back, but this would delay retrieving the information, and could mean losing precious time to triage in the event of a severe incident. So, it’s important to optimize workflows as much as possible.
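Most SOAR platforms handle this fan-out for you, but the idea is easy to sketch with Python's standard library. In the example below, run_enrichments is a hypothetical helper, and the steps are stubs that just sleep to stand in for slow API calls and SIEM queries.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_enrichments(
    steps: dict[str, Callable[[], object]], max_workers: int = 8
) -> dict[str, object]:
    """Run every enrichment step in parallel and collect results by name.

    A failing step is recorded as an error string so that one broken
    integration does not hide the results of the others.
    """
    results: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(step) for name, step in steps.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = f"error: {exc}"
    return results


def slow_lookup(label: str, delay: float = 0.2) -> Callable[[], str]:
    """Stub that simulates a slow API call or SIEM query."""
    def step() -> str:
        time.sleep(delay)
        return f"{label} done"
    return step


def failing_step() -> str:
    raise RuntimeError("SIEM unavailable")


STEPS = {
    "virustotal": slow_lookup("virustotal"),
    "logins_for_ip": slow_lookup("logins_for_ip"),
    "recent_actions": failing_step,
}

started = time.perf_counter()
RESULTS = run_enrichments(STEPS)
elapsed = time.perf_counter() - started

print(RESULTS)
print(f"elapsed: {elapsed:.2f}s")
```

Run back-to-back, the two stubbed lookups would take roughly the sum of their delays. Run in parallel, the total is closer to the slowest single step.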

You now might be thinking, what is the exact impact of creating this automation flow? Well, let’s take a look at what a sample ticket may look like for our use case before and after the automation flow.

Wow! The automation workflow has added so much context around this alert for our Analyst. They now have everything they need to get an idea of what is happening in the alert, all in one spot. In some cases, it can give enough context to close tickets immediately without any further manual investigation. This is the goal we are trying to achieve for our detection suite. In conjunction with the IR Playbook, the Analyst will also have quick links to any resources they would need to investigate further.
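To make the before and after concrete, here is a rough sketch of how the workflow's output could be written into the ticket body. The render_ticket function, the layout, and the context values are all placeholders for illustration; your ticketing system and SOAR will have their own formats.

```python
def render_ticket(alert: dict, context: dict[str, str]) -> str:
    """Render a plain-text ticket body, adding enrichment results if present."""
    lines = [
        "AWS Console login without MFA",
        f"User: {alert['userIdentity']}",
        f"Source IP: {alert['ip']}",
        f"Region: {alert['awsRegion']}",
        f"Time: {alert['eventTime']}",
    ]
    if context:
        lines.append("")
        lines.append("Automated context")
        lines.extend(f"- {title}: {summary}" for title, summary in context.items())
    return "\n".join(lines)


# Fields taken from the sample log in this article.
ALERT = {
    "userIdentity": "RCXCybersecCafe",
    "ip": "8.12.54.2",
    "awsRegion": "us-west-2",
    "eventTime": "2054-03-14T23:12:12Z",
}

# Placeholder summaries standing in for real workflow output.
CONTEXT = {
    "VirusTotal": "0 malicious, 0 suspicious",
    "Logins for the IP (30 days)": "RCXCybersecCafe x2",
    "Most common user actions (7 days)": "GetObject x2, ListBuckets x1",
}

before = render_ticket(ALERT, {})
after = render_ticket(ALERT, CONTEXT)
print(before)
print()
print(after)
```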

We’ve created a triage process that is not only much quicker now, but effortless for an Analyst.

Next Steps

Well, that’s it — we’ve done it! We’ve taken a use case and built a detection around it. We took that detection and documented the triage and Incident Response process. Then we automated the triage process to save time and fight off alert fatigue. Congratulations! That means you’ve taken a detection through the Detection Lifecycle from start to… finish?

Sort of, but not quite.

As mentioned in a previous blog, the Detection Lifecycle is an iterative process. It’s a never-ending cycle of improvement in your detection suite.

So, what does that mean is next then?

After a detection has been in production for some time, it may need some tweaking. This can come in the form of logic changes, exceptions, or severity adjustments.
The IR Playbook may need tweaks on the initial triage, verification steps, or escalation processes. It is important to continue to keep this documentation up to date.
The automation workflow may need some upgrades or adjustments. It’s possible you may find some queries more or less useful than others, or may have ideas for upgrades.

These are all steps that will just come in time, and become regular processes as your team begins to build out their detection suite. But for now, we can say we’ve successfully built out a detection from start to deployment.

Securely Yours,

The Cybersec Cafe

Just a heads up, The Cybersec Cafe's got a pretty cool weekly cadence.

Every week, expect to dive into the hacker’s mindset in our Methodology Walkthroughs or explore Deep Dive articles on various cybersecurity topics.

. . .

Oh, and if you want even more content and updates, hop over to Ryan G. Cox on Twitter/X or my Website. Can't wait to keep sharing and learning together!
