Part of CTSC's mission is to help educate the NSF community about tools and processes related
to cybersecurity. For example, our software assurance team offers tutorials on static analysis tools
and provides benchmark datasets (code) for testing those tools. In this article, we describe tools
(Python modules) and a benchmark dataset for analyzing authentication data. The tools are
general enough, however, to apply to other types of cybersecurity data, e.g., network traffic
or more general data flows.
I recently had the pleasure of attending the SIAM
Workshop on Network Science, where I presented our poster on the analysis of a rather large
authentication¹ dataset. The public dataset was made available by Los Alamos
National Laboratory (LANL) and represents over 700 million anonymized authentication events
over a nine-month period.[1][2]
Our poster submission demonstrated the use of Python to
analyze and visualize the data. Since our scripts rely on several Python modules not found in
the standard library, we recommend using the Anaconda
Python distribution (3.x), which includes those modules (and a lot more). One key module,
which we used to perform some of the network analysis, is
NetworkX. Another module, used to plot results, is
matplotlib. We also demonstrated how one could use the
IPython Notebook in a browser.
An authentication event is represented as a simple entry: "time,user,computer", where "time"
is the offset in seconds from the start of the dataset, and "user" and "computer" are anonymized
entries with unique numeric identifiers (e.g., U214,C148).
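As a minimal sketch of reading one such entry, the snippet below splits a line into its three fields. The names `AuthEvent` and `parse_event` are hypothetical (they are not taken from our scripts), and the time value in the sample line is made up for illustration.

```python
from typing import NamedTuple


class AuthEvent(NamedTuple):
    """One authentication event (hypothetical helper type)."""
    time: int      # seconds offset from the start of the dataset
    user: str      # anonymized user identifier, e.g. "U214"
    computer: str  # anonymized computer identifier, e.g. "C148"


def parse_event(line: str) -> AuthEvent:
    """Parse a single "time,user,computer" entry."""
    time_str, user, computer = line.strip().split(",")
    return AuthEvent(int(time_str), user, computer)


# Illustrative entry; the time value 1 is invented for this example.
event = parse_event("1,U214,C148\n")
print(event.user, event.computer)
```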
We preprocessed the dataset to generate two files: one containing just the time values, another
representing the user-computer information as a global, static graph. This type of graph, with
two disjoint sets of nodes (users and computers), is known as a bipartite graph. Since the second
file, containing the graph, took about 8 hours to generate, we
made it publicly available in case others wanted to
experiment. (Generating the first file, with only time values,
just took a few minutes using one of our scripts.)
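The preprocessing idea can be sketched as follows: collect the time values in one list and add each user-computer pair as an edge of a NetworkX graph. This is an illustrative, in-memory version run on a handful of made-up entries; it is not the script we used on the full dataset, and the function name `build_static_graph` and the `bipartite` attribute values are assumptions made for this example.

```python
import networkx as nx


def build_static_graph(lines):
    """Split "time,user,computer" entries into time values and a static graph.

    Hypothetical helper for illustration only.
    """
    times = []
    graph = nx.Graph()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        time_str, user, computer = line.split(",")
        times.append(int(time_str))
        # Tag each node with the set it belongs to (users vs. computers).
        graph.add_node(user, bipartite="user")
        graph.add_node(computer, bipartite="computer")
        # Repeated user-computer pairs collapse into a single edge.
        graph.add_edge(user, computer)
    return times, graph


# Made-up sample entries in the dataset's format.
sample = ["1,U214,C148", "2,U12,C148", "2,U12,C7", "5,U214,C148"]
times, graph = build_static_graph(sample)
print(len(times), graph.number_of_nodes(), graph.number_of_edges())
```

Because the graph is static, the time of each event is discarded from the graph itself; only the fact that a user authenticated to a computer at least once is kept.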
Our first step was to perform a sanity check on the time values for the authentication events.
Fig. 1 is a histogram plot of all
events over the nine-month period. Using the matplotlib module, we can interactively select a
region to zoom into and see general daily and weekly usage patterns. The script to generate this
histogram is parameterized so that a user can see more detailed (or coarse) plots.
Fig. 1: A histogram, over time, of all authentication events (top); zooming into a two-week
window (bottom)
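One way such a parameterized histogram might be sketched is shown here: event times are counted per bin, and the bin width controls how detailed or coarse the plot is. The function names, the default bin width of one day, and the sample time values are assumptions for this example and do not reproduce our actual script.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60


def bin_event_times(times, bin_seconds=SECONDS_PER_DAY):
    """Count events per bin; a smaller bin_seconds gives a more detailed view."""
    counts = Counter(t // bin_seconds for t in times)
    return dict(sorted(counts.items()))


def plot_histogram(times, bin_seconds=SECONDS_PER_DAY):
    """Draw the binned counts with matplotlib (imported here so that the
    binning function above works without it)."""
    import matplotlib.pyplot as plt

    counts = bin_event_times(times, bin_seconds)
    starts = [b * bin_seconds for b in counts]
    plt.bar(starts, list(counts.values()), width=bin_seconds, align="edge")
    plt.xlabel("time (seconds from start)")
    plt.ylabel("authentication events")
    plt.show()


# Made-up time values: two events on day 0, one on day 1, one on day 2.
daily = bin_event_times([10, 20, 90000, 200000])
print(daily)
```

Calling `plot_histogram(times, bin_seconds=3600)` would show the same data with hourly bins; the interactive zoom comes from the matplotlib window's toolbar.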
Next, we use the NetworkX module to plot the graph and zoom in on particular nodes
that seem to be hubs in the network.
In the following two figures, the User nodes are colored red and Computer nodes are colored white.
Fig. 2 shows C148 as a hub with numerous User nodes connected to it. Fig. 3, in contrast, shows
U12 connecting to numerous computers. With more information about the
authentication events, we might be able to determine that certain User hubs were, for example,
just the result of system administrators performing maintenance. On the other hand, such a hub
may be an indication of questionable user behavior.
Fig. 2: Node C148 as a hub.
Fig. 3: Node U12 as a hub.
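A view like the hub figures could be approximated by extracting a hub's immediate neighborhood and coloring nodes by type. The sketch below uses NetworkX's `ego_graph`; the helper names and the tiny sample graph are hypothetical, and telling users from computers by the "U" prefix is an assumption based on the identifier format.

```python
import networkx as nx


def hub_neighborhood(graph, hub):
    """Return the subgraph containing the hub and its direct neighbors."""
    return nx.ego_graph(graph, hub, radius=1)


def node_colors(graph):
    """Red for User nodes, white for Computer nodes (by identifier prefix)."""
    return ["red" if str(node).startswith("U") else "white"
            for node in graph.nodes()]


def draw_hub(graph, hub):
    """Draw the hub's neighborhood (needs matplotlib, imported lazily)."""
    import matplotlib.pyplot as plt

    sub = hub_neighborhood(graph, hub)
    nx.draw(sub, node_color=node_colors(sub), edgecolors="black",
            with_labels=True)
    plt.show()


# Made-up graph: two users share C148, and a third user uses another computer.
graph = nx.Graph([("U1", "C148"), ("U2", "C148"), ("U3", "C5")])
sub = hub_neighborhood(graph, "C148")
print(sorted(sub.nodes()))
```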
In addition to visually inspecting the graph, we can programmatically analyze it to discover
certain features, e.g., hubs or connected components. These techniques can be found in our
poster and scripts.
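As a rough illustration of this kind of programmatic analysis (not the exact code from our scripts), the sketch below ranks nodes by degree to find candidate hubs and uses NetworkX's `connected_components` to list the components. The helper `top_hubs` and the small sample graph are made up for this example.

```python
import networkx as nx


def top_hubs(graph, n=3):
    """Return the n nodes with the highest degree as (node, degree) pairs."""
    return sorted(graph.degree(), key=lambda pair: pair[1], reverse=True)[:n]


# Made-up user-computer edges forming three separate components.
graph = nx.Graph()
graph.add_edges_from([
    ("U1", "C148"), ("U2", "C148"), ("U3", "C148"),  # C148 is a computer hub
    ("U12", "C1"), ("U12", "C2"),                    # U12 is a user hub
    ("U9", "C9"),                                    # isolated pair
])

hubs = top_hubs(graph, n=2)
components = list(nx.connected_components(graph))
print(hubs)
print(len(components))
```

On a graph of this type, a high-degree Computer node is a machine that many users authenticate to, while a high-degree User node is an account that touches many machines.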
Discussing results with LANL's Hagberg (left)
According to LANL's Aric Hagberg, there will likely be another dataset coming sometime this year
that will have more metadata.
Our abstract, poster, Python scripts, and additional documentation can be found at https://github.com/rheiland/authpy.
We welcome your comments.
1. Authentication, in this context, is the process of verifying the
identity of a person connecting to, e.g. logging into, a computer.
[1] A. Hagberg, A. Kent, N. Lemons, and J. Neil. Credential
hopping in authentication graphs. In 2014 International Conference
on Signal-Image Technology and Internet-Based Systems
(SITIS). IEEE Computer Society, Nov. 2014.
[2] A. D. Kent, L. M. Liebrock, and J. C. Neil. Authentication
graphs: Analyzing user behavior within an enterprise network.
Computers & Security, 48:150-166, 2015.