Thursday, October 24, 2024

Limiting the Data Shuffle

I wanted to write a blog post about a simple, old, but super powerful Linux utility. In computational materials science, or any computational science, remote computing resources are essential for carrying out high-performance computing. In many cases your calculations will generate a large number of files, or files that are large in size. The question always becomes where to access the data, remotely or locally, and where to perform the analysis. In general, the best practice is to do as much as you can remotely to avoid the overhead of shuttling data back and forth. The catch is that some analysis or data wrangling is best done in an environment you control or where you have elevated privileges.

In the scenario where you can't VPN or mount a remote filesystem, it might seem that there isn't a path forward! But indeed there is, at least for Linux³. The small but handy SSHFS utility offers a seamless solution by mounting remote directories directly into your local filesystem, allowing secure, integrated access without constant file transfers¹. It piggybacks on SSH and FUSE, both of which are mainstays of Linux.

What is SSHFS?

SSHFS is a FUSE (Filesystem in Userspace)-based filesystem client [1] that uses SSH for securely mounting and interacting with remote directories on your local machine. Unlike traditional file transfer tools such as scp or sftp, SSHFS allows you to work with remote files as if they were stored locally, enabling easy access to simulation outputs or datasets on a remote server without manual transfers [2]. For me in particular this is very useful, as I have been using TensorDock a lot recently for the ML/AI and computational materials science projects I do in my personal time. The issue is -- more because I'm a bit lazy -- that I don't want to set up the VM instance with all the software I use for analysis, plotting, and so on. I'd much rather mount the remote directory directly into my local filesystem and access the files from there. So how does it work? Pretty well so far.

Benefits of SSHFS

  • Seamless Integration: Direct access to remote files within the local filesystem enables the use of local applications, such as data visualization or code editors.
  • Security: SSHFS tunnels everything over SSH, so transfers are encrypted in transit; that said, I haven't audited how secure the full setup really is.
  • Efficiency: Reduces the need for manual file transfers, allowing a smoother research workflow.

Setting Up SSHFS

To begin, install SSHFS on your system:

# On Ubuntu/Debian systems
sudo apt install sshfs

Create a mount point and use the following syntax to mount:

mkdir ~/remote_folder_mounted_locally
sshfs username@remote-server:/remote/folder ~/remote_folder_mounted_locally

That's pretty easy. You can check whether it's mounted with the mount command (e.g., mount | grep sshfs), and the cleanest way to unmount is fusermount -u ~/remote_folder_mounted_locally.

Performance Optimization

To make SSHFS more efficient for my scientific workflows, which produce either a lot of files or very large files, I try to optimize the mount flags for speed and stability:

Caching Options

  • cache=yes: Enables file and directory caching to improve access speeds.
  • kernel_cache and auto_cache: Allow system-level caching and automatic cache updates.

Transfer Settings

  • readahead=32768: Sets the read-ahead buffer size to 32KB, which is used by SSHFS to pre-fetch file data. This can improve throughput by reducing the number of read operations needed, especially beneficial when accessing large files or directories with many files.
  • big_writes: Supports larger write sizes, enhancing transfer speeds.

Stability Enhancements

  • reconnect: Automatically re-establishes the connection if interrupted.
  • ServerAliveInterval=15: Keeps the connection alive with periodic pings.

An optimized mount command might look like this:

sshfs username@remote-server:/remote/path ~/remote-calc \
    -o cache=yes \
    -o kernel_cache \
    -o auto_cache \
    -o readahead=32768 \
    -o big_writes \
    -o reconnect \
    -o ServerAliveInterval=15

I haven't benchmarked how these different settings affect the performance of SSHFS for the type and size of files I produce, but so far it seems to work pretty well. One thing to keep in mind is that you will want to use SSH keys to make the connection more seamless (and secure). You can set this up like so:

ssh-keygen -t ed25519 -C "your_email@example.com"
ssh-copy-id username@remote-server

Limitations and Considerations

While SSHFS is powerful, it does have limitations:

  • File Size Constraints: SSHFS seems to be slower with files larger than 500MB, making it less optimal for very large datasets.
  • Latency Sensitivity: Performance is highly dependent on network quality; high latency or low bandwidth can noticeably slow file operations⁴.
  • Concurrent Access: File locking support is limited, so SSHFS may not handle concurrent access well, which could lead to data inconsistencies.

Performance Comparison

Based on my experience over the years, here is how different transfer methods compare:

Transfer Method    Small Files (<10MB)    Large Files (>500MB)    Concurrent Access
SSHFS              Excellent              Fair                    Limited
SCP                Good                   Good                    N/A
Tar over SSH       Good                   Excellent               N/A

This table reflects SSHFS's performance strengths, especially with small files, while acknowledging its limitations in concurrent access and handling very large files. SCP and tar over SSH generally outperform SSHFS for single large files or batched file transfers due to reduced latency concerns.

Final Thoughts

For me, SSHFS is becoming a valuable tool in my personal workflows, providing local-like access to remote files and supporting efficient data analysis. While it has limitations with large files and concurrent editing, the integration it offers is straightforward and simple.

Footnotes


  1. SSHFS is particularly helpful when VPN access is unavailable or when working across multiple remote systems. 

  2. Performance optimizations referenced are based on SSHFS version 3.7.0. 

  3. There probably are variants of SSHFS or steps that let you work with Windows and macOS. 

  4. When I grab servers in Estonia I get disconnected or experience latency issues more regularly. 

References

  1. FUSE: Filesystem in Userspace, Wikipedia.
  2. SSHFS utility, GitHub Repo.


Reuse and Attribution

Tuesday, October 15, 2024

Making KMC Cool

If you've ever tried to simulate a deposition process for film growth, you'll quickly find that a lot of different events take place. Moreover, these events occur at different timescales. In a deposition process, for example, the particles/species are arriving at the surface at a much faster timescale than, say, surface diffusion. To put it in context, say we have a 1 mm$^2$ area, and our particle flux is $10^{21} \frac{\#}{\text{m}^2 \cdot \text{s}}$, while the diffusion coefficient is $10^{-4} \frac{\text{cm}^2}{\text{s}}$. If we assume some desired number of particles to deposit to create a film on this area, then the timescales to see events are something like microseconds for deposition and thousands of seconds for diffusion.

These timescales are vastly different and hence will need to be handled differently. In the deposition process, we could treat it directly using atomistic simulations, but it turns out that for most practical fluxes, the timescales are just out of reach. To solve this problem, we can turn to a technique called kinetic Monte Carlo (KMC). For diffusion, it's clear that we need to use KMC or solve mass transport equations.

Kinetic Monte Carlo (KMC)

Most people are familiar with Monte Carlo methods. These are stochastic approaches to solving problems where, instead of computing the solutions directly, we sample from probable events. The probability is determined by the probability mass or density function. Monte Carlo methods can be used to solve all kinds of problems, for example performing integration. Say we have a function we want to integrate but don't know a deterministic algorithm or analytical solution. We can use Monte Carlo to randomly sample the area under the curve and estimate a definite integral. The ultimate accuracy depends on how much sampling you decide to perform.

Monte Carlo methods let us perform random sampling to estimate the behavior of a system or process; if we think back to our integral example, we can think of the system or process being described by a high-dimensional probability density or mass function that we are trying to integrate to get the expected behavior (i.e., value).
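To make the integration example concrete, here is a minimal sketch of Monte Carlo integration in Python. The function name and the choice of $\int_0^1 x^2\,dx = 1/3$ as the test integrand are my own for illustration; the accuracy improves with the number of samples, as noted above.

```python
import random

def mc_integrate(f, a, b, n=100_000, rng=None):
    """Estimate the definite integral of f over [a, b] by averaging
    f at uniformly sampled points and scaling by the interval width."""
    rng = rng or random.Random()
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# Example: the integral of x^2 on [0, 1] is exactly 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, rng=random.Random(42))
```

With 100,000 samples the estimate typically lands within about 0.1% of the exact value; halving the error requires roughly four times the samples, since the error shrinks as $1/\sqrt{n}$.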

So what does it mean when we add the prefix "kinetic" to Monte Carlo? It simply designates that the process has a time component to the evolution. Think of a chemical reaction or diffusive dynamics. In KMC, the events that are described are derived from rates, $\nu$, which are usually in inverse time units. The probability of an event is determined by the rate, and this in turn informs the elapsed time. To expand, let's look at the math.

Say we have some event, $\Omega_i$, that corresponds to a diffusive hop. There is some energy required to overcome for the diffusive hop to occur, given as $E_{a}^i$. We can describe the rate as $\nu_i = \nu_0^i \exp\left(-\frac{E_{a}^i}{k_B T}\right)$, where $k_B$ is the Boltzmann constant, $T$ is the temperature, and $\nu_0^i$ is the prefactor (attempt frequency) for event $i$. So now assume we have $N$ events that describe some diffusive hop to a different site/location. We can describe the probability of any one event occurring by:

\begin{equation} P(\Omega_i; E_a^i, \nu_0^i) = \frac{\nu_0^i \exp\left(-\dfrac{E_{a}^i}{k_B T}\right)}{\sum\limits_{j=1}^N \nu_0^j \exp\left(-\dfrac{E_{a}^j}{k_B T}\right)} \label{eq:probability} \end{equation}

I intentionally include that the probability is parameterized based on selecting the activation energy and the prefactor. However, we could easily consider this as a conditional probability if we have some knowledge of the range of values for these terms.

Thus, you have a way to sample from the event catalog now. So how do you define time based on the sampling? If an event $\Omega_i$ is selected, then we can calculate the time increment associated with this event. In kinetic Monte Carlo, we generate a second random number to calculate the time increment. Specifically, the time increment $\Delta t$ is given by:

\begin{equation} \Delta t = -\dfrac{1}{R_N} \ln r \end{equation}

where $r$ is a uniform random number between 0 and 1, and $R_N$ is the cumulative rate for all events/processes, i.e., $R_N = \sum\limits_{j=1}^N \nu_j$. This time increment reflects the stochastic time elapsed before the occurrence of event $\Omega_i$.
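The event-selection and time-increment formulas above can be sketched as a single rejection-free KMC step. The barrier values, prefactors, and temperature below are hypothetical placeholders, not parameters from any particular system.

```python
import math
import random

K_B = 8.617333e-5  # Boltzmann constant in eV/K

def kmc_step(barriers, prefactors, T, rng):
    """One rejection-free KMC step: compute Arrhenius rates nu_i,
    pick event i with probability nu_i / R_N by inverting the
    cumulative rate distribution, then draw dt = -ln(r) / R_N."""
    rates = [nu0 * math.exp(-Ea / (K_B * T))
             for Ea, nu0 in zip(barriers, prefactors)]
    R_N = sum(rates)
    # Select the event whose cumulative rate first exceeds a random target.
    target = rng.random() * R_N
    cumulative = 0.0
    for i, rate in enumerate(rates):
        cumulative += rate
        if target < cumulative:
            break
    # Second random number gives the stochastic elapsed time.
    dt = -math.log(rng.random()) / R_N
    return i, dt

rng = random.Random(0)
event, dt = kmc_step([0.5, 0.6, 0.8], [1e13, 1e13, 1e13], T=300.0, rng=rng)
```

Note how the lowest-barrier event dominates the sampling at 300 K: a 0.1 eV difference in $E_a$ changes the rate by a factor of roughly 50, which is exactly the sampling-imbalance problem raised in challenge 1 below.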

It's a pretty simple formalism but very powerful to simulate kinetic processes at the right timescales. The challenge is always two-fold:

  1. Treating a global event table with all process timescales and rates can lead to the need for very large sampling trials. In other words, if we treat the deposition and diffusive processes on the same event table, diffusive events will have a lower chance of being sampled. The easiest way to address this is simply by decoupling them and having two KMC times simultaneously evolve.

  2. A priori knowledge of what should happen, i.e., this atom deposits at this site and reaction $\alpha$ occurs if it has species A but does reaction $\beta$ if it has species B. For well-understood systems that have simple species, this is manageable, but it should be clear that the combinatorics of events for multi-species deposition and diffusion becomes prohibitive, and more importantly, we have many model parameters to decide on.

Lattice KMC

One way to simplify the KMC approach and to simulate crystalline materials is to use a lattice formalism. This means we can describe lattice points where events can occur and thus it somewhat restricts what can happen. In particular, diffusion hops are constrained to occur between neighboring lattice sites. This works well for crystalline materials with a single basis, but one may need to fall back on mesh methods to treat more complicated crystals.
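A minimal sketch of the lattice idea, assuming a single particle on a 1D periodic lattice with one uniform hop barrier (real lattice KMC tracks many particles with site-dependent barriers; the numeric values here are placeholders):

```python
import math
import random

def lattice_kmc_1d(n_sites, steps, barrier=0.5, nu0=1e13, T=300.0, seed=0):
    """Single particle hopping between neighboring sites on a periodic
    1D lattice. Both hop directions share one Arrhenius rate, so each
    step has cumulative rate R_N = 2 * rate and the usual KMC clock."""
    k_B = 8.617333e-5  # eV/K
    rng = random.Random(seed)
    rate = nu0 * math.exp(-barrier / (k_B * T))
    site, time = 0, 0.0
    for _ in range(steps):
        R_N = 2.0 * rate
        # With equal rates, event selection reduces to a fair coin flip.
        site = (site + (1 if rng.random() < 0.5 else -1)) % n_sites
        time += -math.log(rng.random()) / R_N
    return site, time

final_site, elapsed = lattice_kmc_1d(n_sites=100, steps=1000)
```

Even this toy version shows the payoff: the simulated clock advances by the physical waiting time between hops, not by a fixed MD-style timestep, which is how KMC reaches the long timescales discussed earlier.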

🧠 Thinking out Loud

I haven't really thought too much about meshing approaches, but the idea would be to define nested lattices where each basis point is defined. One could then define the events that are allowed within a lattice and between lattices. Take a Cesium Chloride crystal; we would have two interpenetrating simple cubic lattices.

Some of the challenges with lattice KMC are that you can't treat amorphous materials and grain boundaries. And it gets complicated for multiple species in my view.

You're also still plagued with the fact that you need to enumerate all the events and assign the rates. Maybe not too difficult for a monoatomic species, where you can find a good amount of information for the surface deposition/diffusion and bulk diffusion.

On-the-Fly Off-Lattice KMC

First, imagine we remove the lattice constraint and let points (representing atoms or molecules) occupy any point in space. Then, say we have a way to determine for a species the value of $E_a^i$ for some motion it could have in space. Think of a surface atom hopping on top of another surface atom. If you could find and enumerate all the possible motions such an atom could have, you would come up with a list of activation energies and therefore rates. Thus, you would have cataloged an event table you could then sample from to perform KMC. This is on-the-fly KMC.

Basically, you need to have a model/description for the energy barrier associated with different dynamical processes. We do this through the potential energy surface (PES) model, which is the functional surface describing, for a given configuration of atoms in space, what is the potential energy and the associated gradients (i.e., forces).

Once you have a way to describe the PES, then we need a methodology to sample different pathways (e.g., reaction coordinates) to determine what pathways through space an atom(s) could take and the associated energetic barrier. Then, through transition state theory, we can determine the rate of each pathway; that is, we get an Arrhenius expression for the rate of a reaction coordinate:

\begin{equation} \nu_i = \nu_0 \exp\left(-\dfrac{E_{a}^i}{k_B T}\right) \end{equation}
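To see how sharply this rate depends on temperature, here is a small numeric check of the Arrhenius expression; the 0.5 eV barrier and $10^{13}\,\text{s}^{-1}$ attempt frequency are hypothetical but typical orders of magnitude for surface diffusion.

```python
import math

K_B = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_rate(E_a, T, nu0=1e13):
    """Transition-state-theory rate: E_a in eV, T in K, nu0 in 1/s."""
    return nu0 * math.exp(-E_a / (K_B * T))

# Doubling the temperature for a 0.5 eV barrier boosts the rate
# by several orders of magnitude.
r300 = arrhenius_rate(0.5, 300.0)
r600 = arrhenius_rate(0.5, 600.0)
```

This exponential sensitivity is why small errors in the PES-derived barriers translate into large errors in KMC timescales.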

Challenges with Transition State Theory

The two major hurdles with transition state theory are:

  1. You need to have an accurate PES description for the chemical system you're trying to describe. This usually requires force-field or interatomic potential development (see my posts discussing such topics).

  2. The pathway along each reaction coordinate either needs an intelligent way to search around an initial point or requires an initial and final point to be connected. The former are called single-ended methods, and techniques like the dimer method or growing string method are used. These essentially find the saddle points of the PES, and thus the final state is just the "downhill" path from the saddle point. The other method is dual-ended, where the initial and final coordinates are known, but possible paths are not. We then can find the possible barriers and usually only tabulate the one that is the lowest.

Most of the effort is always centered around proper PES description and setting up the TST calculations, which can be truly challenging. This is why we probably don't see many on-the-fly KMC codes in practice. The only one that I came across was EON from the Henkelman group at UT Austin [1]. It uses a client-server approach where the server runs the TST calculations and then hands the event table back to the KMC client.

Putting It All Together

In practice, I think the outline for an on-the-fly off-lattice KMC code that simulates deposition and diffusion would be something like:

flowchart TD
    Start{Start Simulation} --> Init[Initialize Deposition Events]
    Init --> OuterLoop{Max steps?}
    OuterLoop -->|No| Deposition([Sample deposition event])
    Deposition --> InnerKMC[Perform inner KMC loop]
    InnerKMC --> Update[Update KMC times]
    Update --> Stats[(Store Statistics)]
    Stats --> OuterLoop
    OuterLoop -->|Yes| End(((Terminate)))
    subgraph InnerKMC [Inner KMC Loop]
        TST(Transition State Calculation)
        TST --> SampleTS(Sample from transition state table)
    end

As this blog was more a discussion on KMC, just so I can jot down my thoughts, I'll leave it at this for now. If you want to read more on lattice KMC, you can read the chapter in the book by LeSar on it; it's a pretty good starting point [2].

References

[1] S.T. Chill, M. Welborn, R. Terrell, L. Zhang, J.-C. Berthet, A. Pedersen, H. Jónsson, G. Henkelman, EON: software for long time simulations of atomic scale systems, Modelling Simul. Mater. Sci. Eng. 22 (2014) 055002. DOI.

[2] R. LeSar, Introduction to Computational Materials Science: Fundamentals to Applications, Cambridge University Press, Cambridge; New York, 2013. DOI
