This multi-part blog series explores how advanced network-traffic analytics transformed the Department of Defense’s approach to cybersecurity operations, creating a far more effective methodology for protecting many of our nation’s most sensitive networks.
Today we’ll cover a realistic (but, to be clear, completely made-up) scenario of how the DoD conducted an incident investigation before deploying an advanced network-traffic analytics solution…
It is the morning of Monday, 23 September 2013, and Susan, the CISO for a large DoD agency, is starting her day at the office. In her inbox she finds an email notification from one of her cyber intelligence analysts describing a previously unseen zero-day attack campaign. As Susan reads through the report, she realizes that her organization fits the target group for this campaign, and she starts to worry: “Are we a victim of this attack and don’t even know it yet?” The organization hasn’t had any security issues recently, but this zero-day vulnerability wouldn’t be detected by their current security tools (firewall, SIEM, IPS, endpoint protection).
Acting on this worry, Susan forwards the threat alert email to her most senior network security analyst, Jim. She tells Jim that their organization fits within the campaign’s target group and that she wants to know if there is any evidence of an intrusion based on the provided attack details. She says this is high priority and she needs answers ASAP. Jim is already overwhelmed by alerts, receiving an average of 20 critical alerts per day that need investigation. This alert, however, comes directly from the CISO and needs immediate attention. He hopes the investigation won’t take long so he can get back to his other critical alerts.
After reviewing the campaign details, Jim thinks about how his existing security infrastructure might help in this situation. He has the following technologies in place:
- Leading Intrusion Detection System (IDS)
- Leading Security Information and Event Management (SIEM) platform
- Leading security analytics and forensics package
- Sampled NetFlow collector
- Layered DNS
- HTTP web proxy
Unfortunately, since this is a zero-day vulnerability, his automated signature-based IDS will not be of any help. His SIEM might provide some clues and evidence, but only if the attackers were noisy or haven’t covered their tracks.
In response to Susan’s email, Jim starts working on the investigation. Looking at the intelligence analyst’s alert, he realizes the following:
- This is a zero-day in Internet Explorer that drops previously unseen malware, so all signature-based tools (anti-virus, anti-malware, IDS, IPS, and endpoint protection) will be ineffective.
- The traffic is HTTP over port 443; both the protocol and the port are allowed by the organization’s firewalls.
- If the attack succeeds, the attacker gains remote code execution with the same rights as the current user, and can then enumerate the network and move laterally.
Since Jim’s team doesn’t have any advanced network-analysis tools, they will have to manually collect and analyze information to look for signs of intrusion. Knowing this will be a lot of work, he brings in his team member Mike, and they start digging.
Jim and Mike start with their IDS. They know it only detects known threats and likely won’t help in this case, but they search it anyway for any suspicious activity during the campaign timeframe. They find nothing, but at least they failed quickly and can move on. Next they turn to their SIEM to search the aggregated logs and events from the last month for anything related to the campaign activity. This is a slow, manual process: they are essentially combing through all of their historic log and event data for small snippets of activity that might match the attack profile. Worse, advanced attackers, the kind who execute previously unseen zero-day attacks, very commonly alter or delete logs and tamper with event generators to cover their tracks, so the SIEM’s output is at best a partial and untrusted view of historic activity. Jim and Mike iteratively investigate the SIEM logs and try to validate attacker behaviors, but it takes a great deal of effort and time. Ultimately they gather what they can and move on to other investigation activities.
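To make the grind concrete, here is a minimal sketch of that brute-force IOC search, assuming the SIEM’s aggregated logs have been exported to plain-text files under a siem_export/ directory. The indicator values are illustrative placeholders (documentation-range IP, made-up domain), not real campaign IOCs:

```python
import re
from pathlib import Path

# Illustrative indicators only -- real IOCs would come straight from the
# analyst's threat report. 203.0.113.0/24 is a documentation-only range.
IOCS = [
    r"203\.0\.113\.17",           # suspected C2 address (placeholder)
    r"update-cdn\.example\.com",  # suspected staging domain (placeholder)
    r"img20130823\.jpg",          # dropper resource named in the alert
]
pattern = re.compile("|".join(IOCS), re.IGNORECASE)

# Walk the exported SIEM logs and print every line that mentions an IOC.
for log_file in Path("siem_export").glob("*.log"):
    with log_file.open(errors="ignore") as fh:
        for lineno, line in enumerate(fh, 1):
            if pattern.search(line):
                print(f"{log_file.name}:{lineno}: {line.rstrip()}")
```

Simple to write, but against a month of aggregated logs the run time, and the manual review of every hit, is where the hours go.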
With the SIEM results in hand, Jim and Mike pull the HTTP web proxy logs. The malware was delivered via Internet Explorer, so the proxy should have recorded the attack traffic. They access the HTTP proxy and perform a text search for URIs containing specific IP addresses, domain names, and resources (such as “img20130823.jpg”). They export the results, copy them back to the workstation, and fold them into the rest of the collected data, planning to join them with the NetFlow and DNS records once those are gathered. This step takes Jim 5 hours to complete.
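The proxy search can preserve structure for that later correlation rather than just dumping matching lines. A sketch assuming a Squid-style space-delimited access log, where field 0 is the epoch timestamp, field 2 is the client IP, and field 6 is the requested URL; a real deployment would need to match its own log layout:

```python
import csv
import re

# Same placeholder indicators as before; adjust to the actual alert.
IOC_URI = re.compile(r"203\.0\.113\.17|update-cdn\.example\.com|img20130823\.jpg", re.I)

with open("access.log", errors="ignore") as fh, \
     open("proxy_hits.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["epoch", "client_ip", "url"])  # header for later joins
    for line in fh:
        fields = line.split()
        # Assumed Squid native layout: 0=timestamp, 2=client IP, 6=URL.
        if len(fields) > 6 and IOC_URI.search(fields[6]):
            writer.writerow([fields[0], fields[2], fields[6]])
```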
Next Jim and Mike turn to their security analytics and forensics solution, which collects packet captures at network choke points and reconstructs them for forensic analysis. Unfortunately, they don’t yet know which traffic to examine; they are still trying to discover the malicious activity in the first place. They agree to come back to the forensics tool once they have specific sessions to analyze in greater detail.
From there, Jim and Mike move to their sampled NetFlow collection device. Using the IP addresses found in the previous steps as search terms, they log into their Cisco NetFlow collector and look for all traffic related to those IPs over the past month. Unfortunately, the NetFlow sampling rate is set to 1 out of every 100 packets to reduce consumption of router CPU resources, so this information will only provide a partial view of the traffic crossing the router. The result of their search is bad news: a large number of flows include the known-bad IP addresses. They export the result set as a flat file and copy it back to a workstation.
They pull the exported NetFlow results into an Excel spreadsheet and begin filtering, querying, and looking at traffic intersections to get a better understanding of the network traffic. It is tedious, time-consuming work because there is a lot of data.
Once they are familiar with the NetFlow data, they construct an initial timeline of when each of these IP addresses was first seen and when it was last contacted by a machine within the organization. This step takes Jim and Mike 4 hours to complete due to the tedious nature of the work.
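The first-seen/last-seen timeline itself is a simple aggregation once the export is machine-readable. A sketch using pandas, with assumed column names (start_time, src_addr, dst_addr) standing in for whatever the collector’s flat-file export actually uses, and with the caveat that 1-in-100 sampling means these bounds cover only the packets that happened to be sampled:

```python
import pandas as pd

BAD_IPS = {"203.0.113.17", "198.51.100.9"}  # placeholder indicator addresses

# Column names are assumptions; a real export's fields depend on the collector.
flows = pd.read_csv("netflow_export.csv", parse_dates=["start_time"])
hits = flows[flows["src_addr"].isin(BAD_IPS) | flows["dst_addr"].isin(BAD_IPS)]

# Label each flow with whichever endpoint is the known-bad address, then
# compute first-seen / last-seen per indicator.
hits = hits.assign(
    bad_ip=hits["src_addr"].where(hits["src_addr"].isin(BAD_IPS), hits["dst_addr"])
)
timeline = hits.groupby("bad_ip")["start_time"].agg(first_seen="min", last_seen="max")
print(timeline)
```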
The next step in the investigation is to examine the logs from the organization’s DNS servers. Jim knows in advance that this won’t capture all DNS traffic on the network, because some clients set their own DNS servers (e.g., OpenDNS or Google DNS) and the organization does not force clients through its own resolvers. The logs still need to be examined, though. Jim logs into the multiple primary and secondary DNS servers, performs text searches (e.g., grep) for the IP addresses and domain names, and finds a large number of results, which he exports to a local workstation. With this many records associated with the IP addresses and domains, Excel can no longer handle the analysis, so they abandon it and create a custom MySQL database, building a table for each data source and loading in the data so that they can run queries against it.
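The database work looks roughly like this; sqlite3 stands in here for their MySQL instance, and the table schemas and CSV layouts are assumptions for illustration:

```python
import csv
import sqlite3

# One table per data source, then query across them -- the same pattern
# Jim and Mike use with MySQL.
con = sqlite3.connect("investigation.db")
con.execute("CREATE TABLE IF NOT EXISTS dns_requests (ts TEXT, client_ip TEXT, qname TEXT)")
con.execute("CREATE TABLE IF NOT EXISTS netflow (ts TEXT, src_addr TEXT, dst_addr TEXT)")

def load(table, path):
    """Bulk-load a headered CSV export into the named table."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    placeholders = ",".join("?" * len(rows[0]))
    con.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows[1:])

load("dns_requests", "dns_export.csv")   # assumed columns: ts, client_ip, qname
load("netflow", "netflow_export.csv")    # assumed columns: ts, src_addr, dst_addr
con.commit()
```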
As they start to look at the DNS logs, they realize that the organization’s DNS servers only log requests; responses aren’t logged because the logging level was left at its default. This is typical for large organizations that want to maximize their request cache and aren’t as interested in retaining response data. It means Jim and Mike see only one side of each DNS conversation and are missing what their DNS servers actually returned. Working with the incomplete DNS logs and the sampled NetFlow together, they eventually connect some dots about which clients were communicating with which servers, which helps with the timeline and incident reconstruction. They aren’t fully confident that they’re seeing everything, given the NetFlow sampling rate and the use of external DNS servers. This step takes Jim 7 hours to complete due to the creation of the MySQL database and the queries required for the analysis.
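With both sources in one database, the dot-connecting becomes a join. Continuing with the tables and placeholder indicators from the previous sketch, a query like this surfaces internal clients that both looked up the suspicious domain and appeared in flows to the known-bad address:

```python
import sqlite3

con = sqlite3.connect("investigation.db")

# Which internal clients asked our DNS servers about the suspicious domain
# AND showed up in (sampled) flows to the known-bad address?
query = """
    SELECT d.client_ip,
           MIN(d.ts) AS first_lookup,
           MIN(f.ts) AS first_flow
    FROM dns_requests AS d
    JOIN netflow AS f ON f.src_addr = d.client_ip
    WHERE d.qname LIKE '%update-cdn.example.com%'
      AND f.dst_addr = '203.0.113.17'
    GROUP BY d.client_ip
    ORDER BY first_lookup
"""
for client_ip, first_lookup, first_flow in con.execute(query):
    print(client_ip, first_lookup, first_flow)
```

Even then, the result is only as complete as the inputs: request-only DNS logs and 1-in-100 sampled flows leave real gaps.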
At this point, at the beginning of Tuesday, 24 September, Susan is starting to get anxious about the pace of the investigation. Since Jim is not confident that he has identified all the compromised hosts on the network, he and Mike investigate individual hosts for suspicious activity and signs of lateral movement. They log onto hosts known to be compromised and run a series of console-level commands to look for suspicious network connections, processes, scheduled tasks, and users. They manually correlate all of this information and use it to identify other hosts that have potentially been compromised. This is slow work, and it still doesn’t guarantee that they’ve found everything the attacker has done within the network.
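The host-level sweep is at least scriptable. A minimal triage sketch that runs the same Windows console commands an analyst would type by hand and saves each command’s output for later correlation, assuming it runs locally, with sufficient privileges, on each suspect host:

```python
import subprocess
from datetime import datetime, timezone

# Standard Windows console commands for a quick look at each host.
COMMANDS = {
    "connections": ["netstat", "-ano"],                # connections with owning PIDs
    "processes": ["tasklist", "/v"],                   # verbose process list
    "scheduled_tasks": ["schtasks", "/query", "/fo", "LIST"],
    "users": ["net", "user"],                          # local accounts
}

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
for name, cmd in COMMANDS.items():
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(f"triage_{name}_{stamp}.txt", "w") as fh:
        fh.write(result.stdout)
```

Collecting the output is the easy part; manually correlating it across dozens of hosts is what makes this step so slow.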
Now that Jim and Mike have gathered the logs, centralized them, and built a custom MySQL database for analysis, they can start their final analysis and collaboration. They construct the best timeline they can, mapping out which machines communicated with which servers, and create a report laying out their findings and conclusions. Their evidence consists of whatever logs they were able to collect. Reviewing and finalizing the findings takes a full day.
Jim and Mike took 3 days to complete their investigation. Their findings were based on incomplete information, because their data sources did not log enough to paint a complete picture of what happened on the network. And since this zero-day attack did not trip any monitoring or alerting on their existing tools, there was no intelligence to gather from those tools; the threat got past all of their existing defenses. The output of their efforts was a best-effort timeline of events, a report containing everything they could find, a recommendation for creating a new signature based on the analyst’s intelligence, a list of external actors that connected to their network, and a list of impacted machines that may or may not be exhaustive. Jim does not feel confident about this output, and he is frustrated that it took so long to gather and analyze the information. He hopes he didn’t miss anything major. After 3 days of investigation, Susan gets this information and the team moves on to containment and remediation, hoping that will clean up any compromised machines.
In our next installment of this series, we’ll take a look at the same process, but after the deployment of an advanced network-traffic analytics solution.