Rosetta Checkpointing

Message boards : Number crunching : Rosetta Checkpointing


Robert Gammon

Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 54742 - Posted: 29 Jul 2008, 23:33:36 UTC

I have an XP laptop that runs BOINC. It's old and somewhat unreliable, but I cannot afford to replace it now.

The power cord is frayed and MUST stay in a PARTICULAR position for the machine to stay powered up. This means that if the machine gets bumped or moved, we get a sudden, unexpected power failure. This is no different from a power failure caused by lightning, just LOTS more frequent. In addition, the laptop has to be moved to reach a location with internet access (wireless access only).

For the intentional power-downs, I have done an orderly shutdown of BOINC before shutting down the machine. In both cases (orderly shutdown and unexpected power loss), Rosetta REPEATS the WU from 0.0%, almost regardless of how far along the WU is.

It was suggested by the moderator that Suspending the project before intentional power downs would go a long way to solving the problem.

Well, I just Suspended the project, told BOINC to shut down, then powered down the computer using XP's Shutdown command. I moved the computer to get wireless inet access, checked a few things, and did an XP Shutdown again.

When I powered back up, restarted BOINC, and Resumed Rosetta, the Rosetta 5.98 WU that was at 92+% complete, reset back to 0.0%. This is very frustrating!!!
Paul

Joined: 29 Oct 05
Posts: 193
Credit: 66,423,759
RAC: 9,705
Message 54826 - Posted: 2 Aug 2008, 10:01:06 UTC - in response to Message 54742.  

Thank you for crunching R@H.

This issue sounds frustrating and, unfortunately, there is likely little that can be done about it.

Some work units save checkpoints more often than others. In many cases, the checkpoint is at the end of a model. If you are 99% complete with a model and your machine restarts, you will restart the work unit at 0%.
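
To illustrate why progress can fall back to 0%, here is a minimal Python sketch of a task that only saves a checkpoint when a whole model finishes. It is a toy simulation under stated assumptions, not Rosetta's actual code: the names `run_work_unit`, `load_progress`, the JSON checkpoint file, and the `crash_at` parameter are all hypothetical.

```python
import json
import os


def load_progress(checkpoint_path):
    """Return the number of models recorded as finished (0 if no checkpoint)."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["models_done"]


def run_work_unit(checkpoint_path, n_models, steps_per_model, crash_at=None):
    """Simulate a work unit that checkpoints only at the end of each model.

    crash_at (hypothetical): number of steps to run in this session before
    a simulated power loss; None means run to completion.
    """
    done = load_progress(checkpoint_path)  # resume from the last checkpoint
    steps_run = 0
    for model in range(done, n_models):
        for _ in range(steps_per_model):
            if crash_at is not None and steps_run == crash_at:
                # Work done inside the current model is lost here.
                raise RuntimeError("simulated power loss")
            steps_run += 1
        # The model is finished: write the checkpoint atomically.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"models_done": model + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return n_models
```

In this sketch, a power loss 99 steps into a 100-step first model leaves no checkpoint on disk, so the next run starts that model from the beginning. A power loss during the second model resumes from the end of the first.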

Maybe someone else has better ideas.

Sorry, and thanks for crunching R@H.
Thx!

Paul

Adam Gajdacs (Mr. Fusion)

Joined: 26 Nov 05
Posts: 13
Credit: 2,819,688
RAC: 2,580
Message 54828 - Posted: 2 Aug 2008, 11:11:49 UTC

Suspending BOINC or a project only temporarily stops it from running (and thus from using CPU time); it does nothing else.

What would probably work though is:
- set the project/BOINC preferences to leave applications in memory when preempted
- do not exit BOINC when you want to turn the computer off
- instead of shutting down your system, use hibernate (which is usually the preferred method for laptops anyway). Hibernation saves a snapshot of the system memory to disk (it needs at least as much free hard disk space as you have physical memory), including the state of running processes, so they are restored to exactly the same state the next time you power up the system. Workunits, checkpointed or not, should then continue processing from the point where they were when you hibernated the system
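
The disk space condition mentioned in the last step can be checked with a short Python sketch. It simply applies the rule of thumb from the list above (free disk space at least equal to physical memory). The function names are hypothetical, and the amount of RAM is passed in by the caller, because the Python standard library has no portable call for reading it.

```python
import shutil


def enough_space_to_hibernate(free_disk_bytes, ram_bytes):
    """Rule of thumb from the post: free disk space must be >= physical RAM."""
    return free_disk_bytes >= ram_bytes


def check_drive(path, ram_bytes):
    """Apply the rule to the drive containing `path` (e.g. the system drive)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return enough_space_to_hibernate(usage.free, ram_bytes)
```

For example, `check_drive("C:\\", 512 * 1024**2)` would apply the rule to a laptop with 512 MB of RAM; the drive letter and memory size here are assumptions for illustration only.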




©2024 University of Washington
https://www.bakerlab.org