Log file analysis is critical in modern IT operations, development, and security. Yet it is difficult and manually intensive, and unlike many other cognitive aspects of working on software (addressed by everything from IDEs to coding guides to APIs and libraries), the human searching, reading, and inferential thinking that log file analysis requires is hard to automate. Perhaps most importantly, the information it produces is difficult for non-technical decision-makers to use or even appreciate.
From discussions with system administrators and data scientists in the field, we’ve heard a range of common issues arising in modern businesses facing both the difficulty and the necessity of log file analysis. The most common of these crystallize into two issues:
- Prospect-based risk: Management is reluctant to invest in log file analysis until the prospect of some exigent circumstance forces their hand — usually a site crash, data breach or IT crisis of some kind.
- Data accessibility for decision-makers: Business intelligence is fundamentally useless in the long run if it’s not made accessible and comprehensible to non-technical decision-makers.
The clearest examples of prospect-based risk assessment and decision-maker data accessibility occur in IT security.
Log Analysis in Security
Security is a hot-button issue today, and for good reason: In the United States, the average cost of a data security breach in 2014 was $5.85 million, and major breaches occur constantly. Despite high public awareness of the issue, in many organizations afflicted by high-profile security breaches, early-warning signs surfaced through log file analysis and were reported to non-technical decision-makers, who then either did nothing or did the wrong thing.
As December 2013’s high-profile data breach at a major North American retailer illustrates, this is not always entirely the fault of the decision-makers. Prior to the breach, this retailer was actually considered a good example of doing security right — it had a specialized IT security chief, it had built a sophisticated security monitoring operation, and it employed a specialized staff running a business intelligence solution in a discrete physical location. In fact, just three months prior to the breach, it reported full compliance with Payment Card Industry (PCI) standards for security and information retention.
On December 19, 2013, however, a massive data breach was revealed involving the financial and personal information of 110 million of its customers — better than one-third of the total number of men, women and children in America. Forty million debit and credit card numbers were compromised, 70 million customer records were lost, $71 million in lost revenue was expected along with $61 million in related expenses, a CEO was fired over a data breach for the first time in history, and four civil actions and more than 80 class action lawsuits were filed across a variety of jurisdictions, in addition to active investigations by the FTC, the Department of Justice and the US Secret Service. The total cost of the breach, when all is said and done, will easily run into the billions.
As the Senate report on the breach shows, there are two points in the breach “kill chain” where it would have been possible to disrupt the process based on simple log file analysis:
- First, during the critical period when malicious binaries were installed: the installation log files were captured by an in-place monitoring solution, which raised a high-priority alert entitled “malware.binary” that was then ignored by security decision-makers accustomed to a constant stream of false positives.
- Second, at the data exfiltration stage — since this retail chain had no stores in Russia, the range of Russian IPs to which data was exfiltrated would easily have been noted and flagged by an anomaly detection solution working against web server logs.
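The second point above can be sketched in a few lines of code. Below is a minimal, illustrative example of geography-based anomaly detection against connection logs: the log format, the toy `ip_to_country` lookup, and the expected-country list are all hypothetical stand-ins (a real deployment would use an actual GeoIP database and real log schemas).

```python
# Sketch: flag outbound transfers to countries where the business has
# no presence. All names, formats, and the GeoIP mapping are illustrative.

EXPECTED_COUNTRIES = {"US", "CA"}  # countries where this retailer operates


def ip_to_country(ip):
    """Hypothetical GeoIP lookup: maps an IP's first octet to a country.

    A real system would query a GeoIP database here.
    """
    first_octet = int(ip.split(".")[0])
    return {203: "RU"}.get(first_octet, "US")  # toy mapping for the sketch


def flag_anomalous_transfers(log_lines):
    """Return transfers whose destination falls outside expected countries.

    Each log line is assumed to be "dest_ip byte_count".
    """
    anomalies = []
    for line in log_lines:
        dest_ip, byte_count = line.split()[0], int(line.split()[1])
        country = ip_to_country(dest_ip)
        if country not in EXPECTED_COUNTRIES:
            anomalies.append((dest_ip, country, byte_count))
    return anomalies


logs = ["10.0.0.5 1200", "203.0.113.9 8589934592"]  # "dest_ip bytes"
print(flag_anomalous_transfers(logs))  # only the 203.x transfer is flagged
```

Even a rule this crude, run nightly against web server logs, would have surfaced the Russian destination IPs in the case above.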
Nowadays it’s very popular for technical bloggers to knock the businesspeople involved in breaches as bad decision-makers making poor assessments of risk, but this masks a harder truth: for those of us who make technical intelligence systems, bad decisions in data breaches also represent our failure to properly guide decision-makers to the choices that would save their businesses.
The simplest fix along these lines is smart defaults. Design can preclude the possibility of inaction by simply not presenting it as the default option. For instance, consider the two following alerts:
ANOMALY: 8 gigabytes of CRM data outbound to IP addresses with prefixes 17.101.12.xxx. Do something? (note lack of human detail and the implied default choice of doing nothing)
WARNING: Large volume of CRM data outbound to IP addresses of suspected national origin: RUSSIA. Potential data exfiltration event. Do you want to A) investigate, B) flag for further inspection by peer analyst, or C) raise an alarm? (note human detail and that doing nothing is not an option)
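The second alert's design can be encoded directly in an alerting API. Here is a minimal sketch, with hypothetical class and method names, of an alert object that enumerates the allowed responses and rejects anything else, so that "do nothing" is simply not expressible:

```python
# Sketch: an alert whose API offers no silent "dismiss" path. The caller
# must pick one of the enumerated responses. All names are illustrative.

from enum import Enum


class Response(Enum):
    INVESTIGATE = "A"          # investigate now
    FLAG_FOR_PEER = "B"        # flag for inspection by a peer analyst
    RAISE_ALARM = "C"          # raise an organization-wide alarm


class ExfiltrationAlert:
    def __init__(self, message):
        self.message = message
        self.response = None   # unanswered until an explicit choice is made

    def respond(self, choice):
        # Only an explicit, enumerated choice is accepted; there is no
        # default and no "do nothing" option.
        if not isinstance(choice, Response):
            raise ValueError("an explicit enumerated response is required")
        self.response = choice
        return self.response


alert = ExfiltrationAlert(
    "Large volume of CRM data outbound to suspected Russian IPs."
)
alert.respond(Response.INVESTIGATE)
print(alert.response)  # Response.INVESTIGATE
```

The point of the design is in the type system, not the wording: an unanswered alert stays visibly unanswered, and an invalid answer is an error rather than a shrug.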
A second, easily implemented solution for security installations is “common sense” anomaly detection. In this case, the rewriting of a system binary was an unusually detailed and brazen move — a high-priority anomaly that should have clearly signaled a sophisticated attack.
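A “common sense” rule of this kind can be a one-liner. The sketch below (with illustrative paths and event format) escalates any write to a system binary directory to top priority, regardless of how a malware classifier scored it:

```python
# Sketch: escalate any event that rewrites a file under a system binary
# directory, no matter what priority the upstream classifier assigned.
# Paths and the event dictionary format are illustrative assumptions.

SYSTEM_BINARY_DIRS = ("/usr/bin/", "/usr/sbin/", "/bin/", "/sbin/")


def escalate(event):
    """Return the event's priority, raised to 'critical' for binary rewrites."""
    if event["action"] == "write" and event["path"].startswith(SYSTEM_BINARY_DIRS):
        return "critical"
    return event.get("priority", "low")


event = {"action": "write", "path": "/usr/bin/pos_service", "priority": "medium"}
print(escalate(event))  # critical: a rewritten system binary is never routine
```

The value of a rule like this is precisely that it ignores the classifier’s confidence: some events are anomalous by definition, and no stream of false positives should be allowed to drown them out.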
Finally, a professional, user-oriented IT security system that signals a danger this clear and intrusive by saying things like “malware.binary” is like a fire alarm in a crowded venue that says “event.combustive” or an alarm clock that wakes you up by announcing “alarm.event” — it is not only unhelpful, it’s also irresponsible. It is fundamentally wrong to blame users in these cases for not understanding this type of alert.
There’s an old saying in IT: Garbage In, Garbage Out, or GIGO — give a computer a set of incorrect or meaningless inputs and you’ll get meaningless output. In machine data intelligence and log file analysis, we’d like to propose a more positive but perhaps less catchy variant, GIGO: Good information In, Good decisions Out. When you enlighten and empower everyone in an organization by offering accessible, real-time intelligence and intelligently constrained and defaulted choices, the result is a corporate culture of decision responsibility and continual self-improvement. We’ll discuss how these benefits accrue in our next article, on log file analysis in operations.
If you’re working on similar problems of data accessibility, log file analysis and security, sign up for our beta or just contact us – we’d like to talk.