Tuesday, October 25, 2016

Should You Build a SOC?


There is a contingent of consultants and educators in the cybersecurity industry who proclaim that the litmus test for a mature information security program is the presence and maturity of a dedicated, in-house Security Operations Center. Their message is that if you have a SOC, you have arrived. You are doing it right. You are the mature security organization. That has many others wondering, "Is it time for us to move in that direction and attain that level of program maturity?"

I've been there, done that. I worked for over a decade at an MSSP whose SOC was world-class and served as a model for our customers, partners, and other interested parties. We also provided SOC build and maturity consulting services to help organizations reach what we had attained. We were an early SOC and had the luxury of maturing ahead of the industry, which let us truly lead the way. We went through several significant periods of re-design (in form and function) as the threat landscape and technology scene changed. We also had the opportunity to replicate our work several times, which forced us to review end-to-end what we were doing, why, and what should change. I was one of those consultant-educator-practitioners who carried that message of maturity into the industry. I have also since been part of two organizations that didn't have a SOC but were considering building one. In fact, one of those two organizations claimed they did have a SOC, called themselves a SOC, and even used email aliases with the term "soc" in them, but when I arrived, it turned out they didn't have any of the fundamentals (except the email address) that represent a SOC. They didn't even have a room. The security organization recognized the value, but didn't know how to actually build or operate one. Hence my mission when I joined. The organization I'm currently at is considering this question as well. Is it the right time in our story to make the investment and step up the CMMI ladder to the next level?

If you are considering the question of building a SOC today, in 2016 or beyond, please keep reading. I may have some surprises in store for you.

The premise behind the modern Security Operations Center, or SOC, is to enable common awareness of the security state of the enterprise, with ample staff ready and trained, supported by carefully instrumented technology and defined processes, so that you can pounce on the right security issues as they arise with consistency and expediency. SOCs represent the embodiment of a full and mature implementation of the NIST 800-61 standard: a center of excellence purpose-built to enable full incident response lifecycle management in an intentional way, tailored to the organization in which it lives. SOC staffing models seek to maintain sufficient staff to handle numerous incidents at the same time, based on the expectation that numerous incidents will happen on a daily basis. Staffing models are also designed to ensure teams can sift through the massive volume of data SOCs consume, to effectively triage and identify the issues that require action. As a security leader with a SOC, you will know that no matter the volume of security issues, nor the specific person on shift to handle them, you have a place to effectively coordinate and manage incident response in a consistent and professional manner. Your people will be aware, accessible, and equipped. You can have the cyber equivalent of NASA's mission control center. You can walk into a room and have immediate situational awareness regarding your organization's cybersecurity posture.

The Operations Center mindset isn't new or unique to cybersecurity, and in theory it makes a lot of sense to have that single point of awareness, visibility, and coordination, especially given the risks facing organizations today. Additionally, more and more organizations expect high standards from their IT teams, and what better way to ensure quality than with highly defined processes, commonly trained staff, and a reinforced physical and logical structure? It works for call centers, right? That all sounds great, doesn't it? Realistically though, what does it take to build a SOC, and can you do it yourself?

The reality is, if you are an organization that has decided the full, mature SOC model is something you'd like to implement and operate, then you don't have one today. That means your existing team wasn't able to, or chose not to, operate in that highly structured manner. In my experience, that's most likely because they haven't seen it before, haven't been resourced accordingly, or don't agree it's for them (it doesn't match their culture or their company's culture). Regardless of the reason, they aren't operating in that SOC paradigm, and in my experience, they won't be able to build it for you. There are two primary challenges facing them: the day-to-day work they already have to do, and the practical knowledge they need to guide their actions from their current operating state into that final SOC model you desire. Quite simply, they won't be able to build the roadmap and won't be able to execute against it. They are busy, and they won't know how to build the SOC. You will need help. You'll need a dedicated team with experience both building and operating a SOC, because it's a complete mindset shift, it takes a ton of work, and unless you've lived the value, you won't appreciate it; a lot of what it takes to build a SOC may not make sense up front. The SOC build journey is extremely expensive, and it takes a lot of time. At least one year.

Hire any of the big consulting firms with the mission to build you a mature security operations program, and they will ship you a small army of experienced consultants wrapped in a nicely structured package with a one-year roadmap to deliver said capability. This build team will run in parallel with your current team, and may engage or partner with your existing staff depending upon the maturity of what you already have. The team of consultants will probably contain one person focused on building out your technology layer, predominantly your log and event analysis capability. Another will be focused on developing use cases for threat detection, and the processes and playbooks that define how to leverage the technology and what to do when an alert is triggered. Another consultant will be focused on the people story: finding you the right talent, building a staffing plan, training plan, retention policy, etc. Finally, you'll have a PM as your go-to person who will orchestrate this madness in a very structured manner to build this new function for you. It will be great, and expensive. You see, it takes a dedicated team a long time to build a fully operationalized and mature security operations practice. I know because I've lived it.

However, is that feasible, realistic, or even relevant to the majority of InfoSec programs and their parent organizations today? Having been there and done that, I can tell you with confidence that you probably don't need that highly structured, mature SOC that sounds so appealing. Yes, the message I now share with the industry has changed, because times have changed. Let's take a look at some of the primary selling points of having that fully developed and mature Security Operations Center:

  • Centralized (and physical) orchestration of all things InfoSec, namely communication, monitoring, and incident response coordination
  • A room that facilitates common and continuous awareness of the state of security for the monitored organization, staffed with personnel who are ready and quick to respond to security issues
  • Around the clock, 24x7x365 staff performing active monitoring, ready to detect and manage any issue that should arise
  • Highly structured processes and procedures that enable consistency and efficiency in service delivery to the organization
  • Specifically configured technology that supports visibility, awareness, and execution
  • Accountability and validation that Analysts are doing what they need to
  • Rapid and personal communication among InfoSec staff, and Operations staff in particular, to facilitate detection, analysis, and response actions
  • Dedicated facilities built and secured for use by the Security team
  • Controlled access, separating sensitive information from the common employee community or from visitors

In a sentence, the SOC represents a place of focus; a place where a team can assemble with all the right equipment to perform a function in a consistent and coordinated manner. But do you need a room for that, do you still need the same equipment, and do you need to invest the level of effort required to ensure that highly automated and repeatable experience? How many times will your staff need to coordinate together to deliver that same repeatable experience?

Having staff at the ready, armed with the tools and processes to manage that alert as soon as it pops, sounds great, right? The assumption behind the modern SOC is that you need those resources at the ready because you are under constant attack by sophisticated adversaries who can bypass your controls and will break in. You need the SOC so that you can rapidly detect these problems and act to mitigate them before they become a major issue. In fact, you'll still hear the terms "worm" and "outbreak" used in SOC circles because that's the old-school world and problem statement SOCs were created to solve: rapidly stopping the expansion of a threat before it could reach catastrophic levels. To do that, the story says you need 24x7x365 coverage and tons of data to correlate into actionable events, plus awesome dashboards that track trends and status, plus flashing lights that "sound" the alarm when something interesting happens, followed by automated orchestration that creates tickets for Analysts, pre-populated with data elements gathered from multiple different sources to enrich the ticket with attributes that will help answer the questions on your Analysts' minds. It's all about speed; you have to out-pace the attacker. If you are really good, you might even have pre-selected playbooks or IR actions presented and ready for your Analysts to use. Then your team springs into action, performing initial triage, coordinating next steps, performing an initial assessment, carving out action items...and away they go, racing through the incident response lifecycle while their adversary on the other end of the globe races through the kill chain to reach their objectives before they can be cut off.
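
(As a side note, here's a rough Python sketch of what that "automated orchestration" step amounts to: an alert fires, context is gathered from several sources, and a ticket is pre-populated for the on-shift Analyst. The data sources, field names, and ticket shape are all hypothetical illustrations, not any particular SIEM or SOAR product.)

```python
# Hypothetical sketch of alert enrichment: pre-populate a ticket with attributes
# an Analyst would otherwise have to look up by hand. All names are illustrative.

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EnrichedTicket:
    alert_id: str
    host: str
    summary: str
    context: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def enrich_alert(alert: dict, asset_db: dict, threat_intel: dict) -> EnrichedTicket:
    """Combine the raw alert with asset and intel context into one ticket."""
    host = alert.get("host", "unknown")
    indicator = alert.get("indicator", "")
    context = {
        "asset_owner": asset_db.get(host, {}).get("owner", "unassigned"),
        "asset_criticality": asset_db.get(host, {}).get("criticality", "unknown"),
        "intel_match": threat_intel.get(indicator, "no known association"),
    }
    return EnrichedTicket(
        alert_id=alert["id"],
        host=host,
        summary=f"{alert.get('rule', 'correlation rule')} fired on {host}",
        context=context,
    )


if __name__ == "__main__":
    ticket = enrich_alert(
        {"id": "A-1001", "host": "wkstn-042", "rule": "Suspicious DNS", "indicator": "bad.example.com"},
        asset_db={"wkstn-042": {"owner": "jdoe", "criticality": "low"}},
        threat_intel={"bad.example.com": "known C2 domain"},
    )
    print(ticket)
```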

That's the way it works, right? We still race attackers and worms through our networks, right? This is a live game of cat and mouse, isn't it? Well, let's talk about that.

I haven't seen a worm or virus outbreak in years. The closest example might be a phishing campaign or targeted attack that spreads select malware across multiple assets within an environment, but those are few and very far between. It turns out that most malware we face on a daily basis is highly automated and well known in terms of identifiable characteristics. The data points you need to look for to validate a malware infection, from a detection and data analysis perspective, are actually few and easy to find. Attacks are also highly automated and usually set into motion without direct supervision by the operators. That's not always the case, but I'm speaking to an 80/20 split (80% of the time vs. the 20% of exceptions to the rule). Pick your malware and delivery methodology. Pick your modern campaign. What your Ops team will encounter on a daily basis is most likely crimeware, delivered through broad and not necessarily targeted phishing campaigns where you are one of many. Your employees may receive emails with malicious attachments or URLs, or they may stumble upon a web exploit kit after having been redirected from their favorite news website that happened to be poisoned with malvertising. Phishing campaigns (by far the most common threat or attack we experience today) are a constant wave. 50-100 malicious emails a day is a likely number, but with modern technology, only the first 1-5 will actually get through. The eventual malware that drops (if the attack is successful) will automatically begin performing its defined functions, which often include local system profiling, immediate data theft, and check-ins with its command and control server for further instructions. Many of these malware infections, after immediately posting the data they were designed to steal, then sit, waiting for future instructions from their master, which may come days, weeks, or months later.

The most popular malware experience of 2016, crypto/ransomware, performs its damage immediately upon infection. There is no race. Once installed, it's game over, and the last time I checked, cryptoware doesn't have worm-like properties. The race is actually one of prevention in the first place, but I'm getting ahead of myself. Let's first look at the infection race.

What is the average time from infection to action by an adversary? Before you can decide on the resources and overall success criteria for your SOC, you need to understand the adversary you are up against and how they operate. If you plan to equip yourself to win the detection and remediation race, you had better understand how fast you need to be able to go. In a recent investigation I completed, an adversary, having successfully brute-forced their way into a server where they created a local account with admin privileges, left the compromised server untouched for weeks. They attacked, established their foothold, then left. When they came back, they simply did so to validate their access and to install a few preferred user tools (including the Firefox web browser). They went silent again for a period of time, and came back about a month later to install more tools. That race was minimally weeks, if not months, long, and the actual impact was nothing more than an annoyance.

This follows another incident I investigated about 2 years ago where an adversary compromised a publicly facing web server through a true 0-day, dropped a local web shell, used it to enumerate and understand the files on the target system, then left it alone for 6 months until we found it. Sure, they exploited a server and gained root access, but they were apparently staging themselves for a later action. Again, I'm speaking to the 80/20 rule here.

In another recent example that shows the variations in the attacker race, I helped investigate an incident where an AWS console admin API key was accidentally published to a public GitHub repository. It sat there for a while before an adversary noticed it. When they acted on that information, they did so very quickly, using it to spin up a bitcoin mining operation on unauthorized AWS resources. The damage to the business? None, really.

Your ready-to-go SOC team will likely be able to detect the stages of infection, and will likely be instrumented (due to the general noise and low likelihood of requiring action) to monitor for indicators of compromise. At best, they will be able to respond, validate, and perform some level of mitigation within hours. That means they may be able to prevent that infected system from being used for further outcomes down the road, but it's unlikely that they will be able to prevent the initial data theft (browser information or locally stored data), nor will they be able to prevent cryptomalware from taking its effect. In the true 0-day example where the adversary was active from the moment of installation, by the time they completed their enumeration of local data and potential extraction, a well-defined SOC team would still be in the initial triage phase. You see, the automated attack will always beat the reactive SOC, and the manual attack (on average) likely doesn't require mitigation within a few hours...it can probably wait. However, if the malware or methods your adversary is using are that easily identifiable, and if the infection context is so easy to validate that you believe you have a fighting chance, then why didn't you automate prevention in the first place?

Ah, there's the root question, and that question flies in the face of the traditional SOC argument. Does automated threat prevention work, and if so, what does that mean for the modern SOC built to chase IOCs and handle multiple intrusions at the same time? Conventional wisdom in the industry says no, prevention doesn't work. We tried back in 2000. Conventional wisdom says you will be infiltrated and you can't prevent it from happening. That's true of a small number of potential scenarios: true 0-days where you were targeted, or truly crafted infiltrations by a nation state that developed tactics unique to you. Or true for the first wave of phishing emails that come from a new campaign. However, for the vast majority of issues your operations team and organization are likely to face, I say yes, they can be prevented. Relatively easily. At least that has been my recent experience.

This especially proved itself over the past year at my present place of employment, where we embrace the prevention story 100%, where sandboxing is king, and where we actively build what we know back into our products to enable prevention next time, not just detection. In our world, the incident situation that led to manual remediation efforts becomes not the basis for a new SIEM detection use case, but rather a candidate for research and prevention in our core inspection technology. If we missed it, then we treat that as a bug in our prevention stack, and we work to fix it. It's built into everything we do: prevention works, prevention first. If you can define it, you can prevent it. It's actually quite simple when you leverage the right tools.

When I first joined my present company, I didn't believe the prevention story. I thought it was interesting and had potential, but I didn't believe it. I pulled in my extensive list of IOCs that had been experienced by myself or others, pulled all our relevant logs into one massive searchable repository, wrote out my top 50 threat scenarios, including the data attributes and analysis logic that supported them, and went to work hunting for all the infections I thought I should find. My prior years of experience led me to assume that we should expect to handle about 5-10 endpoint infections per day. We were seeing about 1 per week. As I started drilling into my logs, I quickly identified what I expected to find: emails containing suspicious attachments and URLs, web browsing sessions that looked strange, funky DNS requests from internal hosts, endpoints making connections to known malicious websites, active content and files downloaded by employees that were marked as suspicious, outbound connections to IPs on known C2 lists, etc. All of these were indicators of attack, some of potential compromise. Then, as I continued my investigation, looking for attributes from the endpoints that would validate impact, well, I found...nothing. This continued for weeks, and aside from the 1-2 positive threats found per week (which all turned out to be greyware), my assumptions about prevention not working were shattered. That darn sandbox. Worse, I was also hard at work creating processes and playbooks, assuming we needed a highly defined structure to ensure repetition and efficiency in IR. I was gearing up for a dozen infections a day. I found, on average, 1-2 per week.
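
For flavor, the hunting exercise boiled down to logic like the sketch below. The file formats (a flat list of C2 domains, a CSV export of DNS logs) and column names are hypothetical stand-ins, not our actual tooling; it's the shape of the exercise, not the product.

```python
# A minimal sketch of IOC hunting over a log repository, with hypothetical inputs:
# a text file of known-bad domains and a CSV export of DNS logs.

import csv
from pathlib import Path


def load_iocs(path: str) -> set[str]:
    """One indicator (e.g., a C2 domain) per line; comments start with '#'."""
    return {
        line.strip().lower()
        for line in Path(path).read_text().splitlines()
        if line.strip() and not line.startswith("#")
    }


def hunt_dns_logs(log_path: str, ioc_domains: set[str]) -> list[dict]:
    """Return DNS log rows whose queried domain matches a known-bad indicator."""
    hits = []
    with open(log_path, newline="") as fh:
        for row in csv.DictReader(fh):  # expects columns: timestamp, src_host, query
            if row["query"].strip(".").lower() in ioc_domains:
                hits.append(row)
    return hits


if __name__ == "__main__":
    # Paths below are hypothetical; point them at your own exports.
    if Path("c2_domains.txt").exists() and Path("dns_logs.csv").exists():
        iocs = load_iocs("c2_domains.txt")
        for hit in hunt_dns_logs("dns_logs.csv", iocs):
            print(f"{hit['timestamp']} {hit['src_host']} -> {hit['query']}")
```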

You see, we do two fundamental things right. First, every file retrieved from or delivered via the Internet, we analyze offline with static and dynamic sandbox technology that also compares characteristics (file identifiers as well as behaviors) with other known threats. When the verdict comes back that the file is malicious, our technology prevents the download or delivery, or prevents the local execution on the endpoint. Second, we don't just focus on the installation phase of the kill chain; we take it up one notch and also identify and prevent signs of exposure to exploit code or sites that behave like web exploit kits. For email, we proactively analyze links delivered to employees and update our prevention tools behind the scenes based on the conclusions of that analysis, so by the time the employee could click, we have dynamically updated our block list and prevented access. We inspect everything that could lead to an infection.
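
To illustrate the control decision (not any vendor's actual API), here's a minimal sketch of that verdict-driven flow. The sandbox interface, verdict cache, and block-list handling are hypothetical placeholders; the point is that the verdict drives a block/allow decision automatically.

```python
# Simplified sketch of verdict-driven prevention: sandbox the file, then act on
# the verdict. The "sandbox" here is a stub standing in for real analysis.

import hashlib

KNOWN_VERDICTS: dict[str, str] = {}  # sha256 -> "malicious" | "benign"


def sandbox_verdict(file_bytes: bytes) -> str:
    """Stand-in for static + dynamic sandbox analysis returning a verdict."""
    sha256 = hashlib.sha256(file_bytes).hexdigest()
    if sha256 in KNOWN_VERDICTS:
        return KNOWN_VERDICTS[sha256]
    verdict = "benign"  # placeholder: a real sandbox would detonate and observe
    KNOWN_VERDICTS[sha256] = verdict
    return verdict


def allow_delivery(file_bytes: bytes) -> bool:
    """Block delivery or execution when the sandbox says the file is malicious."""
    return sandbox_verdict(file_bytes) != "malicious"


def update_url_blocklist(url: str, verdict: str, blocklist: set[str]) -> None:
    """Analyze links proactively and push blocks before the user can click."""
    if verdict == "malicious":
        blocklist.add(url)


if __name__ == "__main__":
    blocklist: set[str] = set()
    update_url_blocklist("http://bad.example/invoice", "malicious", blocklist)
    print(allow_delivery(b"%PDF-1.7 ..."), blocklist)
```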

It actually works. Better yet, we log a lot of detail along the way, providing Analysts visibility into the sequence of events we detect leading up to the prevention decisions. However, as I was coming to the realization that prevention actually does work, investigating these various triggers and alerts kept me in a logic loop. Here are a few examples.

Let's say an employee browsed to a suspicious website that was allowed by policy for some reason. While there, they were redirected to a website that contained a malicious ad that redirected the browser to a web exploit kit. Great, that's a candidate for a further look from an event analysis perspective. We have plenty of solutions from industry that can identify web exploit kit behavior as it happens. In fact, there are open source solutions and rules that enable you to do that. But better yet, we just sandbox everything anyway. The obvious next question in the WEK investigation process is, "Was the endpoint then served a file?" If yes, then we would need to analyze it to determine if it's malicious. If not, then we would need to monitor the endpoint for new activity out of the normal. What do you as an Analyst do to validate that? You try to re-create the exploit experience or try to grab a copy of any files that were transferred to the exposed endpoint, then you'll probably send them through a sandbox or up to VirusTotal, etc. for analysis to find out whether they're malicious. Here's the deal: our network sandbox technology automatically grabs a copy of every downloaded or served file and runs it for further analysis, automating the validation and prevention process. The very work I as an Analyst was preparing to do, our technology already did. At the same time, our endpoint protection technology monitors every process that attempts to execute locally, also sending it up to the sandbox for further analysis. So...why not just focus my efforts on monitoring the results of the sandbox analysis, since that's what I need in the end - some validation that a malicious process was delivered or is running? Well, because if our sandbox could detect it, it would have, and our control technology would have prevented it based on that sandbox verdict. See my logic loop?

In another example, let's say an employee receives an email that has a malicious URL, and my technology detects that, but still delivers the email because it took some time to analyze the site or because policy allowed it for some other reason. The employee might click, right? Well, maybe, but once the verdict on the URL is decided, the technology automatically implements a block or permit decision. What if the URL used some sort of cloaking technology to evade sandboxing and the verdict comes back as "benign" or "unknown," so we don't prevent the employee's click? Well, fine; if the website serves a malicious file to the browser as a result of the click, my network stack will grab that file and send it off to the sandbox for analysis. What if that failed? Then I would traditionally look for signs that a file or process was dropped on the endpoint, or that some suspicious new traffic or behavior was observed from the endpoint following the URL exposure. Oh, but my endpoint solution is already monitoring all local processes and sandboxing them as well.

See where this is going? We sandbox potentially weaponized items like URLs and files on their way in. Then we sandbox any content that transits our perimeter. Then we sandbox every process that attempts to execute on endpoints. That's delivery, exploit, and installation prevention. Better yet, it's not IOC-dependent, because we perform unique analysis every time.

In another example, let's say I'm looking for IOCs on the network - specifically at network communications that resemble known threats. If I find some, I'll need to get access to the endpoint, find the offending process, and analyze or validate it in my sandbox, right? Well, it turns out my endpoint solution already continuously monitors all new processes as they start and performs analysis on the fly, including submitting the process for sandbox analysis. It also looks at the local actions performed by the process to determine how closely those resemble malicious actions, to help influence the prevention decision. So I don't need to look for IOCs on my network, because I'm constantly monitoring every process that tries to run on an endpoint...and if it looks malicious, I'm preventing it from executing.
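
Conceptually, the endpoint side reduces to something like the sketch below. The hooks and verdict service are hypothetical; real endpoint agents do this inside drivers and cloud services, not a Python script.

```python
# Toy sketch of endpoint-side prevention: hash each new process image, ask the
# (stubbed) verdict service, and block execution on a malicious verdict.

import hashlib
from pathlib import Path


def verdict_for(sha256: str) -> str:
    """Stand-in for a cloud sandbox lookup/submission keyed by file hash."""
    return "unknown"  # placeholder: "malicious", "benign", or "unknown"


def should_block_execution(image_path: str) -> bool:
    """Block when the verdict is malicious; allow (and keep watching) otherwise."""
    sha256 = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return verdict_for(sha256) == "malicious"


if __name__ == "__main__":
    sample = Path("/usr/bin/python3")  # stand-in for a newly started process image
    if sample.exists():
        action = "BLOCK" if should_block_execution(str(sample)) else "allow"
        print(f"{action}: {sample}")
```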

OK, so what about an exposed web server that has a publicly facing form and a data input validation vulnerability that allows a remote attacker to upload a file which is then remotely activated through a crafted URL? Well, my network appliance is going to send a copy of that uploaded file to the sandbox for analysis, while the endpoint solution is going to monitor local execution for signs of malicious activity. Done.

See my point? The sandbox and endpoint solutions that are now available have actually changed the game. Quite effectively. If you are preventing the adversary from delivering their weapons and preventing the weapons from running, then you can significantly reduce the number of investigations and incidents in your environment, thus eliminating the race condition your SOC is gearing up to win. You can defeat the adversary before they even have a chance.

So what does that multi-million dollar SOC with all its IOC detections and correlation capabilities and workflow automations do now? It idles.

You might ask, "Well, what about DDoS, malware-less intrusions, and insider threats?" Yep, those are still a concern...but enough to warrant a 24x7x365 SOC? DDoS is solved via off-premises solutions like Akamai (Prolexic) and others. Malware-less intrusions and insiders are still a concern, but again, thinking of the actual value in rapid detection and response, does it actually gain you much? Anything? I'm not convinced it does. It takes a lot for me to say that, because 6 years ago you would have found me in my employer's RSA Conference vendor booth, selling customers on the rapid detection and response story, touting our time to detect and accuracy of detection capabilities. As of today, I'm just not seeing the value anymore.

Granted, there are still several other security scenarios you may encounter.

What about employees who accidentally post sensitive information on the Internet? Yeah, that's a problem that must be mitigated, but you aren't going to detect it with data feeds and SIEMs (at least you don't need that level of complexity for that detection).

What about employees who bring in their own infected laptops and plug them into the corporate network? OK, well, we still prevent known C2 calls (again, based on our own analysis of malware we've seen, plus all the malware samples our product vendor has seen). Even still, what are the odds that the adversary would be attempting to remotely control that device while it's present on your network? Probably not going to happen.

What about stolen property (laptops, servers, tablets, smartphones, etc.)? Yeah, that's still an issue, but you don't detect stolen property with your SOC. You might respond to reports of stolen property, in which case you'll simply file a police report, assess potential damages, and try to perform a remote wipe of the device or data if you have an MDM solution.

You see, my point is, I believe the security industry has actually solved the primary problem so effectively that the SOCs of yesterday no longer apply. The SOC concept was designed to enable defenders to detect and respond to intrusions faster than the adversary could operate to leverage them. Today, however, we can simply eliminate the intrusions rather than build compensating processes around them. Rather than putting millions of dollars into that SOC gear, put millions of dollars into prevention through solutions like those from Palo Alto Networks, CrowdStrike, or others focused on the prevention story end-to-end. My experience is that your Ops team will have incidents that are few and far between, which nullifies the value statement of that SOC room. You don't need central comms; you can have central ticketing and chat for rapid engagement when needed. You don't need a room to facilitate common awareness, because dashboards are presented by applications, and applications can be securely accessed remotely. You don't need to rapidly respond to issues because, in all likelihood, those actors sophisticated enough that you couldn't prevent them are going to out-pace and out-wit you anyway. You won't need highly structured and repeatable processes for a team of Analysts to use, because your incidents will be so few and far between that it's not worth the ROI or resources to build all that pre-planning. You can wing it each time with the right seasoned people. Finally, you don't need that structure, because having a physical room to operate in 1) wastes corporate real estate, and 2) limits you to the talent readily available in your immediate area. In this industry, where we have 0% unemployment and virtual connectedness through solutions like Zoom, chat, and others, placing physical boundaries around your security team simply hinders your ability to capture and retain the talent you need to be successful.

Still convinced you need a SOC? I'm not, and again, that's saying a lot given that 13 years of my career was dependent upon selling SOC services.

