Part of CTSC's mission is to help educate the NSF community about tools and processes related to cybersecurity. For example, our software assurance team offers tutorials on static analysis tools and provides benchmark datasets (code) for testing those tools. In this article, we describe tools (Python modules) and a benchmark dataset for analyzing authentication data. The tools are general enough, however, that they could apply to other types of cybersecurity data, e.g., network traffic or more general data flows.
I recently had the pleasure of attending the SIAM Workshop on Network Science, where I presented our poster on the analysis of a rather large authentication¹ dataset. The public dataset was made available by Los Alamos National Laboratory (LANL) and represents over 700 million anonymized authentication events over a nine-month period.
Our poster submission demonstrated the use of Python to analyze and visualize the data. Since our scripts rely on various Python modules not found in the standard library, we recommended using the Anaconda Python distribution (3.x), which includes those modules (and a lot more). One key module, which we used for some of the network analysis, was NetworkX. Another, which we used to plot results, was matplotlib. We also demonstrated how one could use the IPython Notebook in a browser.
An authentication event is represented as a simple entry, "time,user,computer", where "time" is the offset in seconds from the start of the collection period, and "user" and "computer" are anonymized entries with unique numeric identifiers (e.g., U214,C148). We preprocessed the dataset to generate two files: one containing just the time values, the other representing the user-computer information as a global, static graph. This type of graph, with two disjoint sets of nodes (users and computers) and edges only between the two sets, is known as a bipartite graph. Since the second file, containing the graph, took about 8 hours to generate, we made it publicly available in case others want to experiment. (Generating the first file, with only time values, took just a few minutes using one of our scripts.)
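As an illustration of the preprocessing idea, here is a minimal sketch that splits "time,user,computer" records into a list of time values and a set of unique user-computer edges. It is not the script from our repository; the function name `split_events` and the four sample records are assumptions made up for this example.

```python
import csv
import io


def split_events(lines):
    """Split 'time,user,computer' records into time values and unique edges."""
    times = []
    edges = set()  # unique (user, computer) pairs: edges of the static graph
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        time_str, user, computer = row
        times.append(int(time_str))
        edges.add((user, computer))
    return times, edges


# Hypothetical sample records, in the same layout as the dataset entries.
sample = io.StringIO("1,U1,C1\n1,U2,C1\n5,U1,C2\n9,U1,C1\n")
times, edges = split_events(sample)

# The two node sets of the bipartite graph.
users = {u for u, _ in edges}
computers = {c for _, c in edges}
```

Keeping the edges in a set collapses repeated logins by the same user on the same computer into a single edge, which is what makes the resulting graph static. For a file of this size, one would likely stream the input line by line (as above) rather than load it all into memory.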
Our first step was to perform a sanity check on the time values for the authentication events. Fig. 1 is a histogram of all events over the nine-month period. Using the matplotlib module, we can interactively select a region to zoom into and see general daily and weekly usage patterns. The script that generates this histogram is parameterized so that a user can produce finer (or coarser) plots.
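The binning behind such a histogram can be sketched with the standard library alone. This is only an illustration under assumed names (`bin_events`, `bin_seconds`) and made-up time values, not our actual plotting script; the bin width plays the role of the parameter that controls how fine or coarse the plot is.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def bin_events(times, bin_seconds=SECONDS_PER_DAY):
    """Count events per fixed-width time bin; returns one count per bin."""
    counts = Counter(t // bin_seconds for t in times)
    n_bins = max(counts) + 1 if counts else 0
    # Fill in empty bins so gaps in activity stay visible.
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical time offsets in seconds, spanning four days.
times = [10, 20, 86400, 90000, 172799, 259200]

daily = bin_events(times)                     # one bin per day
hourly = bin_events(times, bin_seconds=3600)  # finer view of the same data
```

The resulting list of counts could then be handed to a plotting call such as matplotlib's `bar` to draw the histogram and zoom in interactively.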
Next, we use the NetworkX module to plot the graph and zoom in on particular nodes that appear to be hubs in the network. In the following two figures, User nodes are colored red and Computer nodes are colored white. Fig. 2 shows C148 as a hub with numerous User nodes connected to it. Fig. 3, in contrast, shows U12 connecting to numerous computers. With more information about the authentication events, we might be able to determine that certain User hubs are, for example, just the result of system administrators performing maintenance. On the other hand, such hubs may indicate questionable user behavior.
In addition to visually inspecting the graph, we can programmatically analyze it to discover certain features, e.g., hubs (nodes with unusually many connections) or connected components. These techniques can be found in our poster and scripts.
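A minimal sketch of this kind of programmatic analysis with NetworkX follows. The node names C148 and U12 echo the hubs mentioned above, but the edge list is toy data invented for illustration, and the helper `top_hubs` is a hypothetical name rather than a function from our scripts.

```python
import networkx as nx

# Toy (user, computer) edges, invented for illustration only.
edges = [
    ("U12", "C1"), ("U12", "C2"), ("U12", "C3"), ("U12", "C148"),
    ("U7", "C148"), ("U8", "C148"), ("U9", "C148"),
    ("U99", "C500"),
]

G = nx.Graph()
# Tag each node with the set it belongs to (0 = users, 1 = computers).
G.add_nodes_from({u for u, _ in edges}, bipartite=0)
G.add_nodes_from({c for _, c in edges}, bipartite=1)
G.add_edges_from(edges)


def top_hubs(graph, n=2):
    """Return the n highest-degree nodes as (node, degree) pairs."""
    # Sort by descending degree, breaking ties by node name.
    return sorted(graph.degree(), key=lambda pair: (-pair[1], pair[0]))[:n]


hubs = top_hubs(G)

# Connected components, largest first.
components = sorted(nx.connected_components(G), key=len, reverse=True)
```

Ranking by degree is only one simple notion of a hub. On the full graph, one would probably also separate User hubs from Computer hubs using the `bipartite` node attribute, since the two have different interpretations.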
According to LANL's Aric Hagberg, there will likely be another dataset coming sometime this year that will have more metadata.
Our abstract, poster, Python scripts, and additional documentation can be found at https://github.com/rheiland/authpy.
We welcome your comments.
1. Authentication, in this context, is the process of verifying the identity of a person connecting to, e.g. logging into, a computer.
 A. Hagberg, A. Kent, N. Lemons, and J. Neil. Credential hopping in authentication graphs. In 2014 International Conference on Signal-Image Technology Internet-Based Systems (SITIS). IEEE Computer Society, Nov. 2014.
 A. D. Kent, L. M. Liebrock, and J. C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network. Computers & Security, 48:150-166, 2015.