Trusted CI recently received the following query from Chester Langin and are sharing his question and our answer with his permission:
As a security person, can you tell me the advantages and disadvantages of allowing more than one user on a cluster node at a time? I ask because we just moved from Rocks/SGE to OpenHPC/SLURM. Our old cluster allowed multiple users per node so, with 20 cores as an example, users with jobs running 8, 8, and 4 cores could all be running on the same compute node. This provides high efficiency. Our new cluster apparently restricts this so if the first user runs a job with, say, 8 cores, nobody else can use that same node and 12 cores are not being used. So, our users will be noticing that jobs will be backing up in the queue.
Should we configure SLURM to allow multiple users per node? Do you have a recommendation? Can you give me pros and cons?
This is a classic example of a risk/reward trade-off. As you note in your question, allowing only a single user per node has the downside of lower efficiency. So what do you gain?
Allowing multiple users per node carries risk because user accounts do not isolate users from each other as strongly as separate nodes do. Bugs in the underlying system (and in the hypervisor, if we're talking about virtual machines), misconfigurations of the operating system, and errors in setting file permissions can allow information, potentially including sensitive information and credentials, to leak between users on the same node. Some examples include CVE-15566, CVE-2017-5715, CVE-2017-4924. Additionally, we've seen two recent cases in our software assessments where file system permissions were set too permissively, allowing users to see each other's data.
Hence you gain some risk reduction. We assume you can estimate the cost of the efficiency reduction in terms of lost CPU time, but how do you estimate the benefit of the risk reduction so you can compare the two?
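To make the efficiency side of the comparison concrete, here is a minimal sketch of the arithmetic using the numbers from the question (20-core nodes and jobs of 8, 8, and 4 cores). The function name `node_usage` and the first-fit packing are our own illustrative assumptions, not a model of how SLURM actually schedules; the sketch also ignores memory, job duration, and queue policy.

```python
def node_usage(job_cores, cores_per_node, shared):
    """Return (nodes_used, idle_fraction) for a set of single-node jobs.

    Hypothetical helper for illustration only. With shared=True, jobs are
    packed onto nodes first-fit; with shared=False, each job gets a whole
    node to itself (one user per node).
    """
    for job in job_cores:
        if job > cores_per_node:
            raise ValueError("this sketch only handles jobs that fit on one node")

    if shared:
        free = []  # free cores remaining on each node in use
        for job in job_cores:
            for i, remaining in enumerate(free):
                if remaining >= job:
                    free[i] -= job  # place the job on the first node with room
                    break
            else:
                free.append(cores_per_node - job)  # start a new node
        nodes_used = len(free)
    else:
        nodes_used = len(job_cores)

    allocated = nodes_used * cores_per_node
    idle_fraction = 1 - sum(job_cores) / allocated
    return nodes_used, idle_fraction


jobs = [8, 8, 4]
print(node_usage(jobs, 20, shared=True))   # one node, no idle cores
print(node_usage(jobs, 20, shared=False))  # three nodes, two thirds of cores idle
```

In this example, exclusive nodes leave 40 of the 60 allocated cores idle, while sharing leaves none idle. Real workloads will fall somewhere in between, so it is worth running the same calculation against your own job mix.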
Unfortunately, quantifying this trade-off isn't trivial - it's a judgement call. Some questions to ask to determine which path makes sense for your system involve gauging the consequences of the security risks:
- How big and diverse is your user community? If your users are all from a collaborating community or within the same institution, the consequences of data leakage could be lower. But if you have users who are competing research groups or companies, the stakes could be higher.
- What type of data does your system handle? Is it regulated data or other sensitive data that would increase the impact of the risks in question?
- How you handle an incident can greatly impact its consequences. How poised are you to handle an incident if it occurs? Do you have an incident response plan in place that you regularly exercise?
- What is the risk tolerance of your stakeholders? Are you expected to squeeze every ounce of performance out of the system or is reputation considered more important? Is there any recent history related to security incidents that may impact this?