If you’ve ever worked in a SOC, you know the true effects of alert fatigue.
If you don’t scale your detection and SOAR suites correctly, the influx of alerts gets exhausting fast.
On top of that, nearly 70% of alerts are either ignored or not given the proper attention when triaging.
Eventually, two things generally end up happening:
A mass influx of alerts is left to a small team, leaving them exhausted and unable to work on other projects.
Low-level analysts or members of other IT departments are assigned to triage tickets, leaving them overwhelmed because they don’t feel comfortable triaging alerts.
What if there was a simple solution to both of these problems?
Well, there is … Enter the AI SOC Analyst.
This AI SOC Analyst has a simple job: take an alert and the triage artifacts from a SOAR workflow, then provide an analysis and recommendation back to the analyst.
Simple enough to start, but as with everything in the SOC - what’s the best way to scale this?
Here’s how I did it.
If you want to view the code for this project setup as a FlaskApp, you can fork the repository here: https://github.com/rcx23/ai-soc-analyst
Setting up the Model
First things first, let’s set up our model.
Since my current team already has a contract with OpenAI, I ventured over to our Organization instance and started to dig into researching Assistants.
Turns out they’re pretty simple: Assistants are essentially trainable models for specific use cases. I can train the model using my own test data and then interact with the assistant directly via API.
The first step is to provide the Assistant with a high-level prompt on what it will be doing. I’d suggest something like this:
You're a security expert tasked with being a SOC Analyst. You'll triage tickets that come in and decipher if the activity is malicious or not.
You will evaluate the event along with the provided SOAR context to return a summary of the analysis made, a recommendation statement on next steps, an overall severity rating from 1-10, a confidence rating of your analysis from 1-10, and a list of artifacts that may be of interest from the investigation.
At this specific time, I decided to use the gpt-4o model for this use case.
Before moving on to the next step, make sure to set the Response format to JSON so we can easily interact with this model via API.
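To make the setup concrete, here is a minimal sketch of what that Assistant configuration could look like if created via the API instead of the web console. The assistant name and the exact instructions string are assumptions for illustration, not the author’s exact values:

```python
# Hypothetical Assistant configuration matching the setup described above.
# The name and instructions text are illustrative assumptions.
ASSISTANT_CONFIG = {
    "name": "AI SOC Analyst",
    "model": "gpt-4o",
    "instructions": (
        "You're a security expert tasked with being a SOC Analyst. "
        "You'll triage tickets that come in and decipher if the activity "
        "is malicious or not."
    ),
    # Forces the Assistant to return valid JSON, so responses can be
    # parsed programmatically via the API
    "response_format": {"type": "json_object"},
}

# With the openai package installed and an API key, the Assistant could
# be created like this:
# from openai import OpenAI
# client = OpenAI(api_key="...")
# assistant = client.beta.assistants.create(**ASSISTANT_CONFIG)
```

Setting `response_format` to `json_object` is what lets the calling code `json.loads` the reply without scraping prose.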
Training the Model
In order to train the model, we need data so the Assistant can learn what actual Analysts are doing when they triage alerts.
Currently tickets filter into Jira and the SOAR of choice is a custom solution that I created using AWS Step Functions and Lambdas (more on this soon).
I’ve also set up analytics in Jira that Analysts have been filling out when completing tickets for the past year. Not only has this been helpful in creating a Data-Driven Detection Lifecycle, but it will be helpful in training the model too:
Classification: True/False Positive, Confirmed/Expected Activity
Resource Used: List of different tools used when triaging the ticket
Reason: A written reason for the decision made on the ticket
I wrote a simple script that would marry my Jira data and SOAR workflow data together since they currently live in separate places (the SOAR filters over to Slack for better observability). Combining this data will allow the model to see exactly what the Analyst saw while triaging the ticket and correlate with the decisions being made based on the artifacts given.
I exported my last 6 months of tickets and SOAR workflows and fed the json output to the model.
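The merge script itself can be quite small. Here is a sketch of how Jira analytics and SOAR workflow output could be joined on a shared ticket ID; the field names (`ticket_id`, `classification`, `resources_used`, `reason`, `artifacts`) are assumptions for illustration, not the author’s actual schema:

```python
import json

def merge_training_data(jira_tickets, soar_runs):
    """Join Jira analytics rows with SOAR workflow output on ticket ID.

    Field names here are hypothetical; adapt them to your own exports.
    """
    # Index SOAR runs by ticket for O(1) lookups during the join
    soar_by_ticket = {run["ticket_id"]: run for run in soar_runs}
    merged = []
    for ticket in jira_tickets:
        soar = soar_by_ticket.get(ticket["ticket_id"], {})
        merged.append({
            "ticket_id": ticket["ticket_id"],
            "classification": ticket.get("classification"),
            "resources_used": ticket.get("resources_used", []),
            "reason": ticket.get("reason"),
            "soar_artifacts": soar.get("artifacts", []),
        })
    return merged

def export_training_json(jira_tickets, soar_runs, path):
    """Write the merged records as the JSON file fed to the model."""
    with open(path, "w") as f:
        json.dump(merge_training_data(jira_tickets, soar_runs), f, indent=2)
```

The point of the join is that each record shows the model both what the Analyst saw (the SOAR artifacts) and what they decided (classification and reason).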
Basic training is now finished!
Testing the Model
After feeding your training data to the model, it’s going to need some basic calibration.
Feed it a few tickets and see what kind of output it comes back with.
Now is the time to teach the model to return exactly what you’re looking for. Focus on concise explanations and eliminate any fluff from the analysis.
The idea here is to provide the analyst with 2-3 sentences with all the information they need to understand the alert and decide on next steps.
Once you feel it’s in a good place, we can move on to the implementation.
Implementing the Model
There are two ways I’d consider implementing this model, and it really depends on your current architecture.
For the way my infrastructure is set up, I decided to spin up an AWS Lambda to implement my logic in. This works for me because it can easily be called as a step in my SOAR Step Functions.
However, I think the easiest way to spin up an MVP would be a Flask app.
You could spin up an endpoint with a few lines of code, implement a quick Python method, and be ready to deploy in an hour or two. I wouldn’t even worry about authentication until you’re ready to push it up to prod; just focus on the basics for now.
Either way, here’s how I would handle implementation in Python:
Setup a prompt with base instructions. This will tell the model how to return the JSON object:
When you receive an input and output from an alert, you'll return a JSON object with the following keys:
summary (str): A brief summary of the analysis done on the alert and any items that stand out as suspicious, remove suspicion, or are unknown. Keep this less than 300 characters.
recommendation (str): A recommendation whether it is safe to close the ticket, investigate more into the ticket (outcome is unknown based on the data given), or escalate to a security incident (confidence is high this is a security risk) with a reasoning on why based on what was found in the alert or triage context. Keep this recommendation brief and to the point.
classification (str): A recommended classification for Jira (True Positive, False Positive, Expected Activity, Confirmed Activity)
severity (int): A score of 1-10 on how severe the files shared are in relation to company intellectual property (10 being very high risk, sensitive items)
confidence (int): A score 1-10 on the probability this is a true positive (10 very positive)
artifacts (list): Evidence that is a list of the potentially problematic/malicious artifacts found in the alert or triage context
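Since the model is only instructed (not guaranteed) to follow this shape, it can be worth validating the returned object before passing it downstream. A sketch of a lightweight guard matching the keys listed above, not part of the author’s code:

```python
# Expected shape of the Assistant's JSON reply, per the key list above.
EXPECTED_SCHEMA = {
    "summary": str,
    "recommendation": str,
    "classification": str,
    "severity": int,
    "confidence": int,
    "artifacts": list,
}

def validate_analysis(analysis: dict) -> list:
    """Return a list of problems; an empty list means the object looks well-formed."""
    problems = []
    for key, expected_type in EXPECTED_SCHEMA.items():
        if key not in analysis:
            problems.append(f"missing key: {key}")
        elif not isinstance(analysis[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    # Both scores are defined as 1-10 ratings
    for key in ("severity", "confidence"):
        if isinstance(analysis.get(key), int) and not 1 <= analysis[key] <= 10:
            problems.append(f"{key} out of 1-10 range")
    return problems
```

A failed validation could trigger a retry or fall back to routing the ticket to a human untouched.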
Construct the alert-based content. This will be an additional prompt including the data from the alert along with the SOAR workflow output artifacts.
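The exact prompt structure is up to you; one minimal way to assemble it (the field layout here is an assumption, not the author’s format) is to serialize both pieces into a single string:

```python
import json

def build_content_prompt(alert: dict, soar_output: dict) -> str:
    """Combine raw alert fields and SOAR workflow artifacts into one
    prompt string. The layout is an illustrative assumption."""
    return (
        "Triage the following alert.\n\n"
        f"Alert:\n{json.dumps(alert, indent=2)}\n\n"
        f"SOAR triage context:\n{json.dumps(soar_output, indent=2)}"
    )
```

Dumping the raw JSON keeps the model’s view identical to what the Analyst would see in the ticket and the SOAR output.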
Then make a call to your OpenAI Assistant via the API. Here’s a simple way I’d do it:
import json
import time

from openai import OpenAI

def openai_api(log_type, detection_id):
    try:
        client = OpenAI(api_key=openai_dart_api_key)
        my_thread = client.beta.threads.create()
        content_prompt = "Prompt with alert and SOAR info"
        # Add the alert + SOAR context as a user message on the thread
        client.beta.threads.messages.create(
            thread_id=my_thread.id,
            role="user",
            content=content_prompt
        )
        # The Assistant's base instructions apply to the run
        dart_assistant_run = client.beta.threads.runs.create(
            thread_id=my_thread.id,
            assistant_id=openai_dart_assistant_id
        )
        # Poll until the run leaves the queued/in_progress states
        while dart_assistant_run.status in ("queued", "in_progress"):
            time.sleep(1)
            dart_assistant_run = client.beta.threads.runs.retrieve(
                thread_id=my_thread.id,
                run_id=dart_assistant_run.id
            )
        if dart_assistant_run.status == "completed":
            all_messages = client.beta.threads.messages.list(
                thread_id=my_thread.id
            )
            # The newest message holds the Assistant's JSON analysis
            analysis = all_messages.data[0].content[0].text.value
            return json.loads(analysis)
        return "An error occurred while analyzing the alert"
    except Exception:
        return "An Exception occurred in DART Assistant Analyze"
Hook up this method to your FlaskApp and you’ll have an AI Analyst ready to go!
Scaling the Model
As I began to use the model, I realized I needed a couple more things to make this more reliable and viable as a solution.
Problem 1
The output wasn’t as specific as I wanted for certain detections with more specific triage processes.
Now, if you’ve worked in a SOC before, you’ll know that each LogType has specific actions necessary to triage effectively, and each detection may have one or two specific actions of its own.
For example, let’s take Okta. Every Okta alert will include IP activity and a user’s alternateID, so any Okta alert that comes in needs queries run against recent Okta log activity for that user and IP. Then there are detection-specific activities. Say an MFA removal alert comes in: we may be more interested in comparing the event’s user agent against the user agents that account commonly uses.
In order to solve this use case, I realized I would have to edit my script slightly.
I needed to add something to dynamically set the triage instructions based on the LogType and Detection.
So, my script ended up looking something like this:
import json
import time

from openai import OpenAI

def openai_api(log_type, detection_id):
    try:
        client = OpenAI(api_key=openai_dart_api_key)
        my_thread = client.beta.threads.create()
        content_prompt = "Prompt with alert and SOAR info"
        # Build LogType- and detection-specific triage instructions
        full_assistant_instructions = construct_instructions(log_type, detection_id)
        client.beta.threads.messages.create(
            thread_id=my_thread.id,
            role="user",
            content=content_prompt
        )
        dart_assistant_run = client.beta.threads.runs.create(
            thread_id=my_thread.id,
            assistant_id=openai_dart_assistant_id,
            instructions=full_assistant_instructions
        )
        # Poll until the run leaves the queued/in_progress states
        while dart_assistant_run.status in ("queued", "in_progress"):
            time.sleep(1)
            dart_assistant_run = client.beta.threads.runs.retrieve(
                thread_id=my_thread.id,
                run_id=dart_assistant_run.id
            )
        if dart_assistant_run.status == "completed":
            all_messages = client.beta.threads.messages.list(
                thread_id=my_thread.id
            )
            analysis = all_messages.data[0].content[0].text.value
            return json.loads(analysis)
        return "An error occurred while analyzing the alert"
    except Exception:
        return "An Exception occurred in DART Assistant Analyze"
Hard to see the difference? That’s okay, because it’s very slight.
I added a construct_instructions method that will take information from the alert and create logtype and detection specific triage instructions.
To build this method in a scalable way, have a high-level construct_instructions that dispatches to separate methods sorted by LogType, and inside each LogType method, add logic for each detection.
While this isn’t needed for every detection, it now provides a nice level of optional customization that can be added whenever needed.
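Using the Okta example from earlier, the dispatcher could be sketched like this. The detection IDs, the handler names, and the instruction text are all hypothetical stand-ins:

```python
# Base prompt shared by every alert; the wording is an illustrative assumption.
BASE_INSTRUCTIONS = (
    "You're a security expert tasked with being a SOC Analyst. "
    "Triage the alert using the provided SOAR context."
)

def okta_instructions(detection_id):
    # Detection-specific additions; keys are hypothetical detection IDs
    detection_extras = {
        "okta_mfa_removed": (
            "Compare the event's user agent against the user agents "
            "this account commonly uses."
        ),
    }
    base = (
        "For Okta alerts, review recent Okta log activity for the "
        "user's alternateId and the source IP. "
    )
    return base + detection_extras.get(detection_id, "")

# High-level dispatch table: one handler per LogType
LOGTYPE_HANDLERS = {
    "okta": okta_instructions,
}

def construct_instructions(log_type, detection_id):
    """Base prompt plus optional LogType/detection-specific triage steps."""
    handler = LOGTYPE_HANDLERS.get(log_type)
    if handler is None:
        return BASE_INSTRUCTIONS
    return BASE_INSTRUCTIONS + "\n" + handler(detection_id)
```

LogTypes without a handler simply fall back to the base instructions, which is what makes the customization optional.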
Problem 2
I needed a way to provide feedback to the model based on its responses.
Not everything was great, and it would be a massive pain to have to manually input the alert, SOAR artifacts, and AI response every time I didn’t like a response.
Now, there is probably a way to hook up an API call where a user interacts with the ticket and feedback is automatically sent back to the model, but we’re focused on quick implementation for now.
The solution I settled on is a field in the Jira ticket that sections feedback into three main groups:
Unrelated
Needs Improvement
Almost there
This field is optional for Analysts, and they’ll only need to fill it out if there’s a problem and room for improvement.
We can then reuse our script for training to mass import training data for the model without adding additional manual toil for analysts during triage.
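Collecting those flagged tickets for retraining is a small filtering step on top of the export. A sketch, with hypothetical field names (`ai_feedback`, `ai_response`, `soar_artifacts`):

```python
# The three feedback groups from the optional Jira field above
FEEDBACK_VALUES = {"Unrelated", "Needs Improvement", "Almost there"}

def collect_feedback_examples(tickets):
    """Keep only tickets where an Analyst flagged the AI response,
    packaged as new training records. Field names are assumptions."""
    examples = []
    for ticket in tickets:
        feedback = ticket.get("ai_feedback")
        if feedback in FEEDBACK_VALUES:
            examples.append({
                "alert": ticket.get("alert"),
                "soar_artifacts": ticket.get("soar_artifacts", []),
                "ai_response": ticket.get("ai_response"),
                "feedback": feedback,
            })
    return examples
```

Tickets where the field was left blank are skipped, so only tickets an Analyst deliberately flagged make it into the next training batch.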
Impact
We’ve measured that the triage time for tickets is down approximately 12% since implementing the assistant.
I believe this isn’t higher for two reasons:
We already have extremely intricate SOAR workflows that make a majority of tickets a breeze to triage to begin with.
We consistently alter our detections to get rid of false positives.
I think the bigger win from this project is the feedback received from the various IT departments. They report feeling better equipped to handle the triage of these tickets, something that they didn’t have confidence in before.
Next Steps
It got me thinking - this would be an amazing SaaS product.
I think a product like this could be implemented in two ways:
Easy Way: Make a module that can be hit by API (similar to this) and implemented at the end of every SOAR workflow.
Better but Harder Way: Do away with the SOAR completely. Create a system where all you have to do is add your API keys to the AI Analyst and it will handle triage all by itself.
I believe there is enough of a pain point in Security Operations for a product like this, and I think there’s plenty of opportunity here to rethink how the SOC functions.
I’ll admit the implementation I’ve shared today is a hacked-together solution. While it works and you can implement it today, there’s much more to build before it slots seamlessly into any environment.
Securely Yours,
The Cybersec Cafe
Just a heads up, The Cybersec Cafe's got a pretty cool weekly cadence.
Every week, expect to dive into the hacker’s mindset in our Methodology Walkthroughs or explore Deep Dive articles on various cybersecurity topics.
. . .
Oh, and if you want even more content and updates, hop over to Ryan G. Cox on Twitter/X or my Website. Can't wait to keep sharing and learning together!