Disco 0.4.2
Disco 0.4.2 is finally here, just in time for summer data crunching!
The highlight of the new release is a new garbage collector (GC) and re-replicator (RR) for DDFS. The previous GC/RR followed a conservative strategy, and terminated early on faults that the implementation could not easily handle. This meant that GC often did not run to completion in large and/or busy clusters. The new GC/RR tries to handle and recover from more such faults, and hence has a higher chance of running to completion. It also computes some basic statistics during its run, which are presented in the UI.
The new GC implementation also allowed us to implement the scheduled removal of a node from DDFS. Earlier, one could simply remove a node from the DDFS cluster, but there was no indication of when GC/RR had replaced the replicas on that node with new copies on other nodes. Combined with the old GC/RR often not running to completion, this meant that it could take a long time before DDFS data availability was restored to the advertised number of replicas. With this release, one can mark a node for removal from DDFS using a "DDFS blacklist", either proactively or retroactively (i.e. after a node has died), and receive an indication in the UI when its data has been completely replicated. Note that it might take several runs of the new GC/RR before a node is safe to remove. Since runs are scheduled at intervals of one day by default, this might take several days. Having multiple nodes in the DDFS blacklist at the same time is supported but has not been as well tested, so it is not advisable.
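To make the "safe for removal" condition concrete, here is a minimal sketch in Python. It is not Disco's actual GC/RR code: the function name, the dictionary layout, and the node and blob names are all hypothetical, and the only assumption carried over from the text is that a node is safe to remove once every blob still has the advertised number of replicas on nodes outside the blacklist.

```python
from typing import Dict, Set


def safe_to_remove(blob_replicas: Dict[str, Set[str]],
                   blacklist: Set[str],
                   min_replicas: int) -> bool:
    """Return True if every blob has at least `min_replicas` copies
    on nodes that are NOT in the blacklist.

    `blob_replicas` maps a (hypothetical) blob name to the set of
    nodes currently holding a replica of it.
    """
    for nodes in blob_replicas.values():
        usable = nodes - blacklist  # replicas that survive the removal
        if len(usable) < min_replicas:
            return False
    return True


# Illustrative data only: two blobs, advertised replica count of 2.
replicas = {
    "blob-a": {"node1", "node2", "node3"},
    "blob-b": {"node1", "node2"},
}

# blob-b would be left with a single usable copy, so node1 is not yet safe.
print(safe_to_remove(replicas, {"node1"}, 2))

# After a re-replication run copies blob-b to node3, the check passes.
replicas["blob-b"].add("node3")
print(safe_to_remove(replicas, {"node1"}, 2))
```

Under this model, each GC/RR run that creates missing copies moves the cluster closer to passing the check, which is why several runs may be needed before the UI reports a blacklisted node as fully replicated.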
If you have been having issues with the GC not deleting stale DDFS data, you should consider upgrading to this release. If you are understandably wary of the new GC unexpectedly deleting your valuable data, you can consider using the PARANOID mode, which just renames files instead of deleting them. However, please note that because data is not actually deleted in this mode, and because the new GC/RR is more likely to re-replicate data when needed, your DDFS disk usage might increase substantially.
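The rename-instead-of-delete idea can be sketched as follows. This is an illustration of the general technique only, not Disco's implementation: the function name and the "trash." prefix are hypothetical, and the real naming scheme used by PARANOID mode may differ.

```python
import os
from typing import Optional


def remove_stale(path: str, paranoid: bool,
                 prefix: str = "trash.") -> Optional[str]:
    """Dispose of a stale file.

    In paranoid mode the file is renamed within its own directory
    (hypothetical "trash." prefix) and the new path is returned, so
    the data stays on disk and can be recovered by renaming it back.
    Otherwise the file is deleted and None is returned.
    """
    if paranoid:
        directory, name = os.path.split(path)
        target = os.path.join(directory, prefix + name)
        os.rename(path, target)  # data is kept, only the name changes
        return target
    os.remove(path)  # data is gone for good
    return None
```

Because the renamed files keep occupying disk space until someone removes them by hand, a sketch like this also shows where the extra disk usage in this mode comes from.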
As mentioned in the release notes, this release also includes some bug fixes and other smaller improvements.