I think the mechanism by which the network sections split and merge is quite fascinating and underappreciated, so I wrote this up. Hopefully the elegance of the implemented algorithm can be better appreciated by those who may otherwise not read code etc.
The code for splits and merges is quite difficult to interpret but I think I’ve got it right. If there’s any corrections on the technical details I’d be very glad to know so I can incorporate the changes.
Splitting and Merging Sections
Splitting and merging sections on the safe network is an important and basic part of the network functionality. It determines which nodes are responsible for your data and how that responsibility changes over time. It’s quite an elegant and fascinating mechanism and deserves some explanation.
Why Split And Merge
The safe network consists of many vaults. 8 or more vaults can combine to form a cluster of nodes known as a ‘section’ which becomes responsible for any data allocated to it by the network.
The structure of the section that form the safe network is constantly changing. New vaults will periodically join a section (making the section larger) and existing vaults periodically leave the section (making the section smaller).
Very large sections are detrimental to the network because they create a centralisation risk to the data they’re responsible for.
Very small sections are also detrimental because there’s less redundancy of data and a greater risk of data loss.
So to achieve a balance, large sections can split into two smaller sections and small sections can merge with another to form a larger section.
This splitting and merging mechanism ensures data remains secure and efficiently distributed on the network.
There are only a couple of important numbers to keep in mind.
The minimum number of vaults in a section is 8.
The minimum number of vaults to form two new sections is 11 per new section.
Sections need a way to organize and refer to themselves and each other. This is done by using the common leading digits in the vault names (ie the ‘prefix’) as the name for the section.
Consider two vaults, one with name starting with
001100... and the other starting with
These vaults both start with
0 so could be part of a section identified by that prefix, ie
But they could never be part of
section 1 since the start of their name doesn’t match that section prefix.
Similarly, these vaults both start with
001 so could be part of a section identified by that prefix, ie
Vaults try to join the ‘longest’ prefix they can match to, since this leads to vaults being as evenly distributed through the network as possible, and can only be part of one section at a time.
The section prefix determines how they should split or merge, and allows sections to refer to each other during these processes.
If a section can safely split into two new sections it will do so.
The rules for this are very simple:
- Extend the current section prefix into two new hypothetical prefixes by adding a
0in one case and a
1in the other
- Simulate splitting the current vaults in the section into these two new hypothetical prefixes
- If both new sections would have at least 11 members, do the split
- Otherwise, don’t split and stay as a single section
For example, consider
section 0 with 25 vaults.
Say 10 of these vaults have a name starting with
And the remaining 15 of these vaults have a name starting with
Because the hypothetical
00 prefix would only have 10 vaults nothing further happens (ie 11 vaults are required before splitting).
But then a new vault with name starting
00 joins the section, taking the total to 26.
This means the hypothetical prefix
00 would now have 11 vaults, so a split should happen.
The vaults organize between themselves to split into
section 00 with 11 vaults and
section 01 with 15 vaults.
The data is also split between these two new sections, so each section has about half the amount of data to look after.
Once the sections have split,
section 0 no longer exists! Any vault trying to join
section 0 will have to try instead joining
section 00 or
section 01 instead, depending on the specific name of the new vault.
This splitting can continue as nodes join the network, gradually extending the prefix.
Vaults may leave the network at any time. This means sections may end up being so small they can no longer provide sufficient security for the data they’re managing. If this happens, the section needs to merge with another.
The rules for merging are also simple:
- If the section size more than 8, don’t try to merge
- If the section size is less than 8, try to merge with the sibling section
There’s some interesting situations to consider, but let’s look at the simplest merge case first:
section 000 is stable with 8 vaults. But then a vault leaves, reducing it to 7 vaults. This section needs to merge with another section.
section 000 looks for the sibling section, ie
section 001 (which is created at the same time as
section 000 when
section 00 split).
section 001 exists,
section 000 will try to merge with it and become
section 00. This is the simplest merge case.
section 001 doesn’t exist (maybe it already split into
section 0010 and
section 000 merges with all children of
section 001, ie with
section 0010 and
section 0011. This sounds drastic but in reality the imbalance between the number of children siblings have is never too great.
It’s worth noting that
section 00 (the section it merges into) cannot exist on the network since any node trying to join
section 00 would instead prefer to join the longer prefix of
section 000 or
section 001 (or any of their children sections). So the merge can always shorten the prefix without conflicting on another section which already might have that prefix, since no such shorter-prefix-section can exist.
Merging is definitely more complex than splitting, but the fundamental decisions are still quite simple. Hopefully someone closer to the code can clarify what happens when the simplest merge situation can’t be performed.
One of the neat things about merging and splitting is splitting needs more than 11 vaults for each new section but merging needs less than 8. So even if a vault or two leaves immediately after a split it won’t cause a merge. This buffer makes it hard to induce lots of churn just when a section has split by trying to make it merge.
Section Size Distribution
One of the implicit features of this mechansim is sections have no upper limit on their size.
Most sections will have about 8-16 vaults, but some will contain many more. It all depends on the names of vaults joining the network and whether it leads to splitting or not.
For example, if
section 001 has 50 nodes all starting with
001000... there can be no split (even though the section is very large!) because it can only split into hypothetical
section 0010; there are no nodes to form hypothetical sibling
section 0011. So sometimes large sections will be formed.
The network tends to exhibit a Poisson Distribution of section sizes with the most common size being 12.
There are some more details which are still being finalised while datachains are being developed.
Large sections will be able to relocate young nodes if a split is desired but not imminent.
Quorum decisions are not formed by the entire section, mainly by excluding the youngest nodes. This means large sections do not necessarily reach quorum decisions any slower than small sections.
Simulating the split and merge of nodes shows some interesting information about the algorithm.
The following comes from simulating a network of 100K nodes (1M joins and 900K departures).
the simulation does not cap section sizes (ie doesn’t account for the as-yet-undeveloped feature where young nodes in large sections are relocated).
nodes are added until the network is 100K in size, then has a one-joins-one-leaves mechanism for the next 900K nodes.
Section Size Distribution
The final distribution of section sizes in the network is shown below:
size count percent ------------------ 8 163 2.365 9 345 5.006 10 538 7.806 11 746 10.824 12 794 11.521 13 746 10.824 14 604 8.764 15 578 8.387 16 494 7.168 17 373 5.412 18 344 4.991 19 283 4.106 20 231 3.352 21 170 2.467 22 131 1.901 23 112 1.625 24 81 1.175 25 53 0.769 26 36 0.522 27 25 0.363 28 20 0.290 29 14 0.203 30 7 0.102 33 2 0.029 34 1 0.015 35 1 0.015
This distribution does change slightly depending on the initial conditions of the simulation, but overall the distribution is quite consistent.
52 was the largest section size throughout the simulation
35 was the largest section size at the end of the simulation
30 sibling nodes were involved in the largest merge
2 sibling sections were involved in the largest merge
The simulation code is at github.com/iancoleman/safe_network_simulations. The simulation is deterministic, so the same seed will always produce the same simulation which is a handy thing.
There’s more detail about this in RFC 0037 - Disjoint Groups, which does seem to provide some clarity around the merging questions, ie “No two sections must be comparable”.
The splitting and merging logic is remarkably simple but produces some fairly complex behaviour. The main points I feel are not obvious from the code or RFC documents and worth highlighting are
splits happen when two valid sections can be formed, not simply if the section is overall large in size.
merges may include many sections from the sibling side of the merge.
further development is happening to reduce the occurrence of large sections using a relocation mechanism based on vault ageing.
Discussion can be found at https://safenetforum.org/t/explaining-group-splits-and-merges/18383