BOINC System Restart


The storage failure recovery process has been completed and we have resumed computation on the new storage system.



Background

On March 1st, we encountered a major network file system failure (read our initial Hardware Recovery Update article and our follow-up update thread for more detail). The data center was able to locate and lend us an alternate system that could recognize our RAID keys once the disks were moved over from the failed system. After rebuilding the RAID on the stopgap storage system and recovering the state of all databases and files as of the time of the failure, we transferred all data to a new storage system, which performs better because it uses fast SSDs instead of the spinning HDDs in the old system.

BOINC System has restarted

As of March 31st, all data has been transferred to the new system and we began running the network file system from our new storage, which provides much improved performance.

We have adjusted the completion timeouts for work units (WUs), and on April 2nd we restarted BOINC: we are now receiving processed WUs and making OPN1 and MCM work units available for processing. New SCC work units will be sent out soon.

Now that BOINC functionality is back, we will return to fixing the issues on our Comprehensive Issue List. As always, we encourage anyone with bugs to report to share them here instead of creating new threads. We are still monitoring the system, but in the best-case scenario we will be able to solve all of these issues and prepare for the complete WCG restart.

Active project updates

During the storage system recovery, we posted research updates from the ARP, MCM and SCC teams. We have more news from the SCC team coming shortly. The HSTB team remains on the science pause.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team