Message boards : Number crunching : Issue with checkpointing.
Author | Message |
---|---|
Aegis Maelstrom Send message Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
Hi there, I am reporting a checkpointing problem that, unfortunately, I have seen before. Work Unit: lr5_combine_smooth_torsion_it06_A_rlbd_1cg5_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_15145_49 It was computed on a portable version of good old BOINC 5.10.45 prepared by my team (BOINC@Poland). Sorry, I can't use a non-portable version, but this build has been heavily used before. The WU has obvious problems with checkpointing. It ran on one computer for almost 3 hours and completed 8 models; the progress was at 4x.xx%. After a restart on another computer, the graphics app showed Model 0, Step 0. The progress suddenly dropped to around 25%, and Model 0, Step 25 is now being computed. It looks like all the work has been wasted. The stderr.txt file shows logs of two runs of this Work Unit: one in the morning and one right now (in the evening). See: [2009-10- 9 6:29:47:] :: BOINC:: Initializing ... ok. [2009-10- 9 6:29:47:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev32257.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lr5_combine_smooth_torsion_it06_A.zip Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_1cg5.out.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ...
BOINC:: Worker startup. Starting watchdog... Watchdog active. Fullatom mode .. # cpu_run_time_pref: 21600 Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. Fullatom mode .. [2009-10- 9 22:16: 1:] :: BOINC:: Initializing ... ok. [2009-10- 9 22:16: 1:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev32257.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lr5_combine_smooth_torsion_it06_A.zip Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_1cg5.out.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Fullatom mode .. # cpu_run_time_pref: 21600 I am pretty sure I have seen this bug before, so it is probably not specific to this particular WU. Can anyone confirm the issue and suggest a solution? In a few days I will see the results of this WU, i.e. how many models were crunched and how many headers with the number of results appear (see the known bug with multiple headers in the result file). Have a nice weekend and keep rocking. Best from Warsaw. :) a.m. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Checkpoints are taken at the end of each model at the very minimum. So, after initializing, your graphics should have shown the 8 models. Is it possible the client was unable to write the checkpoint to disk for any reason? What is your setting for "write to disk at most... seconds"? The double header in the outfile was a different issue from what you are describing here. Yes, if you restart a task, you will see some of the "starting up" type of messages more than once. The other issue was where the actual result summary, showing the number of models etc., appeared more than once. Rosetta Moderator: Mod.Sense |
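To illustrate the "checkpoint at the end of each model" behaviour described above, here is a minimal sketch in Python. It is not Rosetta's or BOINC's actual code: the function names (`save_checkpoint`, `load_checkpoint`, `run_models`), the JSON file format, and the `stop_after` parameter are all hypothetical, chosen only to show the general idea of saving progress after each model and resuming from it on restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it.

    The rename replaces the old checkpoint in one step, so a crash during
    the write leaves the previous checkpoint intact. (Hypothetical helper.)
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    finally:
        # Only true if the write or rename failed: drop the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no usable checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (OSError, ValueError):
        return {"models_done": 0}


def run_models(path, total_models, stop_after=None):
    """Compute models one by one, checkpointing after each completed model.

    stop_after simulates the client being shut down after that many models
    in the current session. (Assumption for the demo only.)
    """
    state = load_checkpoint(path)
    done_this_session = 0
    while state["models_done"] < total_models:
        if stop_after is not None and done_this_session >= stop_after:
            break
        # ... the real per-model computation would go here ...
        state["models_done"] += 1
        done_this_session += 1
        save_checkpoint(path, state)
    return state["models_done"]
```

In this sketch, a session stopped after 8 models resumes at model 8 as long as the checkpoint file is present and readable. If the file could not be written, or was not carried over to the second computer, `load_checkpoint` falls back to zero models, which would be consistent with the "Model 0, Step 0" restart reported above.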
©2025 University of Washington
https://www.bakerlab.org