I had to write a blog post about a simple, old, but super powerful Linux utility. In computational materials science, as in any computational science, remote computing resources are essential for high-performance computing. In many cases your calculations will generate a large number of files, very large files, or both. The question then becomes where to access the data, remotely or locally, and where to perform the analysis. In general, the best practice is to do as much as you can remotely to avoid the overhead of shuttling data back and forth. The catch is that some analysis or data wrangling is best done in an environment you control or where you have elevated privileges.
In the scenario where you can't VPN in or mount a remote filesystem, it might seem that there isn't a path forward. But there is, at least on Linux (see footnote 3). The small but handy SSHFS utility offers a seamless solution by mounting remote directories directly into your local filesystem, allowing secure, integrated access without constant file transfers (see footnote 1). It piggybacks on SSH and FUSE, both of which are mainstays on Linux.
What is SSHFS?
SSHFS is a FUSE (Filesystem in Userspace)-based filesystem client [1] that uses SSH to securely mount remote directories on your local machine. Unlike traditional file transfer tools such as scp or sftp, SSHFS lets you work with remote files as if they were stored locally, so you can get at simulation outputs or datasets on a remote server without manual transfers [2]. For me this is particularly useful because I have been using TensorDock a lot recently for the ML/AI and computational materials science projects I do in my personal time. The issue, mostly because I'm a little bit lazy, is that I don't want to set up the VM instance with all the software I use for analysis, plotting, etc. I would much rather mount the remote directory directly into my local filesystem and access the files there. So how well does it work? Pretty well so far.
Benefits of SSHFS
- Seamless Integration: Direct access to remote files within the local filesystem lets you use local applications, such as data visualization tools or code editors.
- Security: SSHFS runs over SSH (it uses the SFTP subsystem), so data in transit is encrypted the same way as a regular SSH session. I haven't looked into its security beyond that.
- Efficiency: Reduces the need for manual file transfers, allowing a smoother research workflow.
Setting Up SSHFS
To begin, install SSHFS on your system:
# On Ubuntu/Debian systems
sudo apt install sshfs
Create a mount point and use the following syntax to mount:
mkdir ~/remote_folder_mounted_locally
sshfs username@remote-server:/remote/folder ~/remote_folder_mounted_locally
That's pretty easy. You can check whether it's mounted with the mount command, and the best way to unmount is fusermount -u followed by the mount point (on systems that ship only FUSE 3 the command may be fusermount3 -u).
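The mount check and the unmount step can be wrapped in a small helper. This is a minimal sketch, assuming a Linux system where /proc/mounts is available; the function names is_mounted and safe_unmount are made up for illustration, and the mount point has to be given as an absolute path without a trailing slash.

```shell
# Hypothetical helpers, not part of SSHFS itself.

# Succeed if the directory is listed as a mount point.
# The optional second argument (a mount table file) defaults to /proc/mounts
# and exists mainly so the function can be tried against a sample file.
is_mounted() {
    local target="$1" table="${2:-/proc/mounts}"
    # Field 2 of each line is the mount point. /proc/mounts encodes spaces
    # in paths as \040, so this simple match assumes paths without spaces.
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$table"
}

# Unmount only if the directory is currently mounted.
safe_unmount() {
    if is_mounted "$1"; then
        fusermount -u "$1"
    else
        echo "$1 is not mounted" >&2
        return 1
    fi
}
```

With these defined, safe_unmount "$HOME/remote_folder_mounted_locally" unmounts the directory from the example above, or prints a message if nothing is mounted there.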
Performance Optimization
To make SSHFS more efficient for my scientific workflows, which produce a lot of files or very large files, I try to tune the mount options for speed and stability (see footnote 2). Option names differ between SSHFS and FUSE versions, so check man sshfs for what your build accepts:
Caching Options
- cache=yes: Enables file and directory caching to improve access speeds.
- kernel_cache and auto_cache: Allow system-level caching and automatic cache updates.
Transfer Settings
- readahead=32768: Sets the read-ahead buffer size to 32KB, which is used by SSHFS to pre-fetch file data. This can improve throughput by reducing the number of read operations needed, especially beneficial when accessing large files or directories with many files.
- big_writes: Supports larger write sizes, enhancing transfer speeds. In FUSE 3 large writes are always enabled, so newer builds may not need (or accept) this flag.
Stability Enhancements
- reconnect: Automatically re-establishes the connection if interrupted.
- ServerAliveInterval=15: Keeps the connection alive with periodic pings.
An optimized mount command might look like this:
sshfs username@remote-server:/remote/path ~/remote-calc \
-o cache=yes \
-o kernel_cache \
-o auto_cache \
-o readahead=32768 \
-o big_writes \
-o reconnect \
-o ServerAliveInterval=15
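The same options can also be passed as a single comma-separated -o list, which is easier to keep in a script. The wrapper below is a sketch under that assumption; mount_remote and the SSHFS_BIN override are hypothetical names introduced here, and whether each option is accepted still depends on your SSHFS/FUSE version.

```shell
# Hypothetical wrapper around sshfs; not part of SSHFS itself.
# Usage: mount_remote user@host:/remote/path /local/mountpoint opt1 [opt2 ...]
mount_remote() {
    if [ "$#" -lt 3 ]; then
        echo "usage: mount_remote REMOTE MOUNTPOINT OPTION..." >&2
        return 2
    fi
    local remote="$1" mountpoint="$2" opts
    shift 2
    # Join the remaining arguments with commas: a b c -> a,b,c
    opts=$(IFS=,; printf '%s' "$*")
    # Create the mount point if it does not exist yet.
    mkdir -p "$mountpoint" || return 1
    # Setting SSHFS_BIN to "echo" previews the arguments without mounting.
    "${SSHFS_BIN:-sshfs}" "$remote" "$mountpoint" -o "$opts"
}
```

For example, mount_remote username@remote-server:/remote/path ~/remote-calc reconnect ServerAliveInterval=15 mounts with just the two stability options, which is a reasonable starting point before adding the caching flags one at a time.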
I haven't benchmarked how these different settings affect the performance of SSHFS for the type and size of files I produce, but they seem to work pretty well. One thing to keep in mind is that you will want to use SSH keys to make the connection more seamless (and secure). You can set them up like this:
ssh-keygen -t ed25519 -C "your_email@example.com"
ssh-copy-id username@remote-server
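Before mounting, it can help to confirm that key-based login actually works, since a password prompt in the middle of a script is easy to miss. A minimal sketch, assuming OpenSSH; key_login_works and the SSH_BIN override are hypothetical names used only for this example.

```shell
# Hypothetical check; not part of SSHFS or OpenSSH.
# Succeeds only if the host accepts a non-interactive login.
key_login_works() {
    local host="$1"
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout=5 gives up after 5 seconds if the host is unreachable.
    # The remote command "true" does nothing and exits with status 0.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

A typical use is key_login_works username@remote-server && echo "keys OK". If the key has a passphrase, this only succeeds while an agent such as ssh-agent is holding the unlocked key.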
Limitations and Considerations
While SSHFS is powerful, it has limitations:
- File Size Constraints: SSHFS seems to be slower with files larger than 500MB, making it less optimal for very large datasets.
- Latency Sensitivity: The performance is highly dependent on network quality; high latency or low bandwidth can slow down file operations (see footnote 4).
- Concurrent Access: File locking support is limited, so SSHFS may not handle concurrent access well, which could lead to data inconsistencies.
Performance Comparison
Based on my experience over the years, this is how I would compare the different transfer methods:
Transfer Method | Small Files (<10MB) | Large Files (>500MB) | Concurrent Access |
---|---|---|---|
SSHFS | Excellent | Fair | Limited |
SCP | Good | Good | N/A |
Tar over SSH | Good | Excellent | N/A |
This table reflects SSHFS's strengths, especially with small files, while acknowledging its limitations with concurrent access and very large files. SCP and tar over SSH generally outperform SSHFS for single large files or batched file transfers because they stream the data in one pass rather than issuing many small requests that each pay the network round-trip latency.
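For reference, "tar over SSH" in the table means streaming a tar archive through the SSH connection and unpacking it on the other side, so a whole directory moves as one stream. A minimal sketch; tar_pull and the TAR_PULL_SSH override are hypothetical names, and the remote path is assumed to contain no single quotes.

```shell
# Hypothetical helper: copy a remote directory to a local one with tar over SSH.
# Usage: tar_pull user@host /remote/dir /local/dir
tar_pull() {
    local host="$1" remote_dir="$2" local_dir="$3"
    mkdir -p "$local_dir" || return 1
    # Remote side: pack the directory and write a gzip-compressed archive to stdout.
    # Local side: read the archive from stdin and unpack it into local_dir.
    "${TAR_PULL_SSH:-ssh}" "$host" "tar -C '$remote_dir' -czf - ." |
        tar -C "$local_dir" -xzf -
}
```

For example, tar_pull username@remote-server /remote/folder ~/results pulls the whole folder in one go, which is the case where I would reach for this instead of an SSHFS mount.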
Final Thoughts
For me, SSHFS is becoming a valuable tool for my personal workflows, providing local-like access to remote files and supporting efficient data analysis. While it has limitations with large files and concurrent editing, the integration it offers is straightforward and simple.
Footnotes
1. SSHFS is particularly helpful when VPN access is unavailable or when working across multiple remote systems.
2. Performance optimizations referenced are based on SSHFS version 3.7.0.
3. There are probably variants of SSHFS or steps that let you work with Windows and macOS.
4. When I grab servers in Estonia I get disconnected or experience latency issues more regularly.
References
1. FUSE: Filesystem in Userspace, Wikipedia.
2. SSHFS utility, GitHub Repo.