Files are often already present on several servers before a job is created, and Resilio Connect Agents are expected to synchronize them quickly and reliably.
Agents perform a lot of complex logical operations in order to do the following:
- minimize data transfer across the network;
- make the data available on all Agents as soon as possible;
- allow end users to manage the data at their own discretion as soon as possible.
In the pre-seeded scenario, the Resilio Agent faces two nearly opposite challenges:
1. Do not transfer matching pieces of files; transfer only those that differ. To make this possible, Agents on all computers need to check their local files, compute the hash of each file piece, exchange this information with each other (merge the folder tree), and decide which piece to transfer and from which Agent (the one with the newest version). This takes a while, especially if the job includes many Agents with many files. It also requires a lot of disk reads, which are usually slow on HDDs and network drives;
2. Bring the system into balance ASAP and eliminate all those background activities.
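The piece-level comparison described in the first challenge can be illustrated with a minimal Python sketch. It is not Resilio's actual implementation: the SHA-256 hash, the tiny piece size, and the function names `piece_hashes` and `pieces_to_transfer` are assumptions chosen only for illustration.

```python
import hashlib

PIECE_SIZE = 4  # hypothetical, tiny piece size so the demo is easy to follow


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list[str]:
    """Split data into fixed-size pieces and return the digest of each piece."""
    return [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def pieces_to_transfer(source: bytes, target: bytes,
                       piece_size: int = PIECE_SIZE) -> list[int]:
    """Return indices of the pieces the target must fetch to match the source."""
    src = piece_hashes(source, piece_size)
    dst = piece_hashes(target, piece_size)
    # A piece is transferred if the target lacks it or holds a different hash.
    return [i for i, h in enumerate(src) if i >= len(dst) or dst[i] != h]


newest = b"AAAABBBBCCCCDDDD"   # version held by the Agent with the newest file
stale = b"AAAAXXXXCCCCDDDD"    # pre-seeded copy on another Agent
print(pieces_to_transfer(newest, stale))  # [1]
```

Only the second piece differs, so only that piece would need to cross the network. The cost is that both sides must read and hash every piece first, which is the disk-bound work mentioned above.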
Resilio Connect provides flexible, fine-grained configuration for these cases. This guide goes through each of the possible tweaks in detail so that a Resilio admin can set up the most suitable final configuration. Change them with caution, as each configuration has its pros and cons.
Basically, the two above-mentioned challenges boil down to two big blocks of settings - 1) whether a file needs to be synced or not, and 2) whether to hash files or not.
1. To sync or not to sync?
When making this decision, by default the Agent looks at four attributes of a file: creation timestamp, modification timestamp, size, and file permissions. Checking these is a quick operation, so if all the attributes match on all computers, Agents assume that nothing needs to be synced. If at least one of them doesn't match, Agents prepare for data transfer.
To remove the creation timestamp from the equation, add the custom parameter transfer_job_exact_ctime_timestamps with value 3 to the Agent profile.
To remove file permissions from the equation, disable "Synchronize NTFS permissions" or "Synchronize Posix permissions" in the Job profile.
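The attribute check can be pictured with the following Python sketch. It is a simplified model, not the Agent's code: the `FileAttrs` type and the `compare_ctime` / `compare_permissions` flags are hypothetical stand-ins for the effect of the custom parameter and the permission settings named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileAttrs:
    ctime: int        # creation timestamp
    mtime: int        # modification timestamp
    size: int         # size in bytes
    permissions: int  # e.g. POSIX mode bits


def needs_sync(local: FileAttrs, remote: FileAttrs,
               compare_ctime: bool = True,
               compare_permissions: bool = True) -> bool:
    """Return True if any attribute that is being compared differs."""
    if local.mtime != remote.mtime or local.size != remote.size:
        return True
    if compare_ctime and local.ctime != remote.ctime:
        return True
    if compare_permissions and local.permissions != remote.permissions:
        return True
    return False


a = FileAttrs(ctime=100, mtime=200, size=1024, permissions=0o644)
b = FileAttrs(ctime=150, mtime=200, size=1024, permissions=0o644)

print(needs_sync(a, b))                       # True: creation timestamps differ
print(needs_sync(a, b, compare_ctime=False))  # False: ctime is ignored
```

In this model, pre-seeded copies whose creation timestamps differ only because they were copied at different times stop triggering a sync once the creation timestamp is excluded.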
Once the Agents decide that a file needs to be synced, they follow the "Disable differential sync" parameter in the profile:
- Yes (default value) - the whole file is simply synced across the network.
- No - Agents check the file pieces and calculate their hashes to discover the changed pieces, then sync only those.
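To make the trade-off between the two values concrete, here is a rough Python estimate of how many bytes cross the network in each case. The function `bytes_to_send`, the piece size, and the file size are illustrative assumptions, not values taken from Resilio Connect.

```python
def bytes_to_send(file_size: int, piece_size: int,
                  changed_pieces: list[int],
                  disable_differential_sync: bool) -> int:
    """Estimate the payload for one file that was flagged as needing sync."""
    if disable_differential_sync:
        # "Yes": the whole file is sent, no piece hashing is needed.
        return file_size
    # "No": only changed pieces are sent; the last piece may be shorter.
    total = 0
    for index in changed_pieces:
        start = index * piece_size
        total += min(piece_size, file_size - start)
    return total


# Hypothetical 950-byte file split into 100-byte pieces; pieces 2 and 9 changed.
print(bytes_to_send(950, 100, [2, 9], disable_differential_sync=True))   # 950
print(bytes_to_send(950, 100, [2, 9], disable_differential_sync=False))  # 150
```

The saving in the second case is paid for with the disk reads needed to hash the pieces on every Agent involved.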
2. To hash or not to hash?
By default, Agents do not hash a file unless it is requested to be synced. While this might seem optimal, it has a significant drawback: when a lot of files need to be synced, Agents sync and hash at the same time, which looks like slow data replication. This behavior is defined by the custom parameter lazy_indexing in the Agent Profile.
- With this value set to true (the default), the Agent does not hash files when it discovers them; it hashes a file only on an upload request or when another Agent explicitly asks for the file to be hashed.
- If this value is set to false, the Agent hashes a file the moment it discovers the file or a change to it.
Hashing speed depends heavily on disk speed. It is generally advisable to enable hashing of files, even though hashing may take a long time.
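The difference between the two lazy_indexing values can be modeled with a small Python class. This is a toy model under stated assumptions: `AgentModel`, its method names, and the use of SHA-256 over the whole file are hypothetical and only mirror the behavior described in the two bullets.

```python
import hashlib


class AgentModel:
    """Toy model of when an Agent hashes a file, depending on lazy_indexing."""

    def __init__(self, lazy_indexing: bool = True):
        self.lazy_indexing = lazy_indexing
        self.files: dict[str, bytes] = {}
        self.hashes: dict[str, str] = {}

    def _hash(self, path: str) -> str:
        # Hash once and cache the result until the file changes again.
        if path not in self.hashes:
            self.hashes[path] = hashlib.sha256(self.files[path]).hexdigest()
        return self.hashes[path]

    def on_file_discovered(self, path: str, content: bytes) -> None:
        self.files[path] = content
        self.hashes.pop(path, None)  # any previously stored hash is stale
        if not self.lazy_indexing:
            self._hash(path)         # eager: hash at discovery time

    def on_upload_request(self, path: str) -> str:
        return self._hash(path)      # lazy: hash only when it is needed


lazy = AgentModel(lazy_indexing=True)
lazy.on_file_discovered("a.bin", b"data")
print("a.bin" in lazy.hashes)   # False: nothing hashed yet
lazy.on_upload_request("a.bin")
print("a.bin" in lazy.hashes)   # True: hashed on demand

eager = AgentModel(lazy_indexing=False)
eager.on_file_discovered("a.bin", b"data")
print("a.bin" in eager.hashes)  # True: hashed at discovery
```

With the lazy setting, the hashing work lands in the middle of the transfer, which is why replication can appear slow when many files need to be synced at once.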
Another custom parameter in the Agent Profile is transfer_job_force_owner_to_hash_file. It works similarly to lazy_indexing, but applies only to the file's owner. The file's owner is the Agent that has the latest file version or where the file originated. If the parameter is set to true, the file's owner hashes the files regardless of whether they need to be synced. This may greatly reduce the job's overall progress and performance, though. With lazy_indexing enabled, only the file's owner will know the hash of a file, so if any remote Agent needs to download the file, the file's owner will be the only source. Another side effect is that if a cloud Agent is the owner of the file, that Agent will need to download the file in order to hash it.
Setting this parameter to false may improve the job's performance, but with lazy_indexing enabled, no Agent will have the files hashed, which will result in a full re-download of files if anything changes. This can be especially inefficient if cloud storage is involved.
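The interaction of the two parameters can be summarized as a small decision function. This Python sketch only restates the combinations described above for files that do not currently need syncing; the function name `agents_holding_hash` and the returned labels are illustrative, not part of the product.

```python
from itertools import product


def agents_holding_hash(lazy_indexing: bool, force_owner_to_hash: bool) -> str:
    """Which Agents know a file's hash before any sync is requested."""
    if not lazy_indexing:
        # Every Agent hashes files as soon as it discovers them.
        return "all agents"
    if force_owner_to_hash:
        # Only the owner hashes in advance, so it is the only source.
        return "file owner only"
    # Nobody hashes in advance; a later change means a full re-download.
    return "no agent"


for lazy, force in product([True, False], repeat=2):
    print(f"lazy_indexing={lazy!s:5}  "
          f"transfer_job_force_owner_to_hash_file={force!s:5}  "
          f"-> {agents_holding_hash(lazy, force)}")
```

Reading the printed combinations side by side makes it easier to see which setting pair leaves a single source for downloads and which leaves no hashes at all.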
If either hashing option is enabled, it's advisable to add the custom parameter prioritize_initial_indexing_mode with value 15 to the Agent Profile. This forces Agents to hash the files first and only then merge the folder tree, rather than doing both in parallel.
Below are a few sample configurations to illustrate the idea:
Disable differential sync = Yes
This is the default behavior. It is optimal if the attributes of most files match: nothing will be synced.
However, if anything changes, only the file's owner will have the file's hash, which slows down data replication - all files will be redownloaded.
Disable differential sync = No
Advised configuration if files' attributes don't match. This configuration forces all Agents to hash all files in advance, even before they start merging folder trees. Only the file difference will be synced. As a side effect, the job won't report any transfer activity until the Agents have hashed the files.
Disable differential sync = No
This is similar to sample #1. The difference is that if files change, only the changed part will be synced.
Disable differential sync = No
This is the fastest configuration for bringing the job into balance. If files differ, only the changed part will be synced.
No Agent will hash the matching files. However, this is also a great disadvantage if files are moved around in the job folder - this will result in a full file redownload.
1) Archive. If files' timestamps don't perfectly match, ensure there's plenty of space for the Archive folder. If there isn't, and there's no way to add more, disable Archive entirely in the Agent profile (mind the other drawbacks of a disabled Archive, and re-enable it once the job folders are in balance).
2) File permissions. If modified timestamps match, file permissions will get randomly synced between all read-write peers. To avoid this, select a Reference Agent in the job.
3) Time required to get job folders in balance. If you go with the "hash all in advance" scenario, it may take a long time.