Rosetta WUs restart after BOINC restart

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63124 - Posted: 2 Sep 2009, 17:08:54 UTC
There's a thread "After Restart BOINC begins with 0 %" but it's 2+ years old so...
Two days ago my i7 Vista Home Premium 64 PC arrived. I installed BOINC 6.6.36 with GPUGRID and Rosetta. GPUGRID behaves. Rosetta does not. Each time BOINC restarts, all eight Rosetta threads restart from 0% progress.
In preferences, "Write to disk at most every" was set to 1800 seconds. I changed it to 30 seconds and did a BOINC Update. I rebooted and opened BOINC as soon as it came up. I watched each thread change from its pre-boot 15% to 0%.
Any thoughts? Thanks, Tom

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 63125 - Posted: 2 Sep 2009, 18:53:53 UTC Last modified: 2 Sep 2009, 18:56:56 UTC
Any time a task is closed, it will restart from its last checkpoint. If the task has not yet reached a checkpoint, it will therefore restart from the beginning. Some tasks are able to checkpoint frequently, and others cannot. But generally, every 10 or 20 minutes, a task will attempt to write a checkpoint if your "write at most..." setting will allow it.
If you are running with the default runtime preference, the 15% progress into the 3 hour runtime is only 27 minutes. So you must have a collection of tasks that are not checkpointing as often as is common. Over time, your machine will work through various lengths of tasks and the cores will no longer be in sync, where they all start a new task at basically the same time.
If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring.
[edit] I should also mention that when I referred above to a task closing, I meant to explain that this can occur when a task is suspended to run another project, or by user control if you are not keeping tasks in memory. It also occurs when BOINC shuts down, or when you turn off your computer.
Rosetta Moderator: Mod.Sense
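
The arithmetic in the post above (15% of a 3 hour runtime is 27 minutes) can be sketched in a few lines of Python. This is only an illustration: the function names `elapsed_minutes` and `minutes_at_risk` are hypothetical, and the sketch assumes that the progress percentage grows linearly with the runtime preference, which may not hold for every task.

```python
def elapsed_minutes(progress_percent, runtime_pref_minutes):
    """Approximate run time implied by a progress percentage.

    Assumption: progress is linear in the runtime preference.
    """
    return progress_percent * runtime_pref_minutes / 100


def minutes_at_risk(elapsed, last_checkpoint=None):
    """Work that would be repeated if the task were closed right now.

    last_checkpoint is the run time (in minutes) at which the most recent
    checkpoint was written; None means no checkpoint has been written yet,
    so the task would restart from the beginning.
    """
    if last_checkpoint is None:
        return elapsed
    return max(0, elapsed - last_checkpoint)


# 15% into the default 3 hour (180 minute) runtime
print(elapsed_minutes(15, 180))  # 27.0
```

With no checkpoint written, `minutes_at_risk(27.0)` is the full 27 minutes, which matches a task dropping from 15% back to 0%.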

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63132 - Posted: 3 Sep 2009, 4:49:25 UTC - in response to Message 63125.
"If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring."
Thanks very much for responding. I did the above, being very careful to ensure the XML file was right. When I did Advanced / Read config file, BOINC went to sleep for many seconds, then "BOINC Connection Failed - BOINC manager is not able to connect to a BOINC client." I retried a couple of times; same. I removed the XML and BOINC restarted. I watched in frustration as each thread backed off 30% of progress. BOINC is clearly checkpointing the progress regularly, but the WU status checkpointing is much less frequent.
I poked around the TXT files in the data directory and found in stdoutgui.txt many instances of "Error: can't open file 'C:\Program Files\BOINC\\RebootPending.txt". Note the pair of backslashes.
There are many instances of entries like this:
TRACE [3088]: CAN'T FIND RESULT lr5_A_seq_score12_ss1.7_rlbd_1hz6_IGNORE_THE_REST_DECOY_14635_2026_0
Hmmmmm... Tom

mikey Joined: 5 Jan 06 Posts: 1898 Credit: 12,860,414 RAC: 5,224	Message 63137 - Posted: 3 Sep 2009, 9:39:27 UTC - in response to Message 63124.
"There's a thread 'After Restart BOINC begins with 0 %' but it's 2+ years old so... Two days ago my i7 Vista Home Premium 64 PC arrived. I installed BOINC 6.6.36 with GPUGRID and Rosetta. GPUGRID behaves. Rosetta does not. Each time BOINC restarts, all eight Rosetta threads restart from 0% progress. In preferences, 'Write to disk at most every' was set to 1800 seconds. I changed it to 30 seconds and did a BOINC Update. I rebooted and opened BOINC as soon as it came up. I watched each thread change from its pre-boot 15% to 0%. Any thoughts? Thanks, Tom"
Just one... try changing the setting under Your Account, computing preferences, and then this line:
Leave applications in memory while suspended? (suspended applications will consume swap space if 'yes') yes
If it is not set to yes, change it to yes and see if the units pick up where they left off when they stop and then restart. This uses your memory to remember where they were and not your hard drive. Maybe the hard drive settings are funky right now, and if this fixes it, then it will isolate the problem.

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63139 - Posted: 3 Sep 2009, 10:33:32 UTC - in response to Message 63137.
"try changing the setting under Your Account, computing preferences and then this line: Leave applications in memory while suspended? (suspended applications will consume swap space if 'yes') yes"
Thanks for the input but I already have that as "yes".
Tom

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 63143 - Posted: 3 Sep 2009, 15:37:28 UTC
tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled?
Perhaps you could post the file you made, since it seems to have caused a new problem.
Rosetta Moderator: Mod.Sense

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63144 - Posted: 3 Sep 2009, 15:58:21 UTC - in response to Message 63143. Last modified: 3 Sep 2009, 15:59:00 UTC
"tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled?"
Yep! See below.
"Perhaps you could post the file you made, since it seems to have caused a new problem."
Here it is, Tom:
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <state_debug>0</state_debug>
    <task_debug>0</task_debug>
    <file_xfer_debug>0</file_xfer_debug>
    <sched_op_debug>0</sched_op_debug>
    <http_debug>0</http_debug>
    <work_fetch_debug>0</work_fetch_debug>
    <unparsed_xml>0</unparsed_xml>
    <proxy_debug>0</proxy_debug>
    <time_debug>0</time_debug>
    <http_xfer_debug>0</http_xfer_debug>
    <benchmark_debug>0</benchmark_debug>
    <poll_debug>0</poll_debug>
    <guirpc_debug>0</guirpc_debug>
    <scrsave_debug>0</scrsave_debug>
    <rr_simulation>0</rr_simulation>
    <cpu_sched>0</cpu_sched>
    <cpu_sched_debug>0</cpu_sched_debug>
    <app_msg_send>0</app_msg_send>
    <app_msg_receive>0</app_msg_receive>
    <mem_usage_debug>0</mem_usage_debug>
    <network_status_debug>0</network_status_debug>
    <checkpoint_debug>1</checkpoint_debug>
    <coproc_debug>0</coproc_debug>
    <dcf_debug>0</dcf_debug>
    <debt_debug>0</debt_debug>
    <statefile_debug>0</statefile_debug>
    <slot_debug>0</slot_debug>
  </log_flags>
  <options>
    <alt_platform>platform_name</alt_platform>
    <data_dir>/path/to/dir</data_dir>
    <disallow_attach>0|1</disallow_attach>
    <dont_contact_ref_site>
    <dont_check_file_sizes>0|1</dont_check_file_sizes>
    <exclusive_app>important.exe</exclusive_app>
    <force_auth>basic | digest | gss-negotiate | ntlm</force_auth>
    <http_1_0>0|1</http_1_0>
    <max_file_xfers>N</max_file_xfers>
    <max_file_xfers_per_project>N</max_file_xfers_per_project>
    <max_stderr_file_size>size_in_bytes</max_stderr_file_size>
    <max_stdout_file_size>size_in_bytes</max_stdout_file_size>
    <ncpus>N</ncpus>
    <no_alt_platform>0|1</no_alt_platform>
    <no_gpus>0|1</no_gpus>
    <no_priority_change>0</no_priority_change>
    <os_random_only>0|1</os_random_only>
    <report_results_immediately>0|1</report_results_immediately>
    <run_apps_manually>0|1</run_apps_manually>
    <save_stats_days>N</save_stats_days>
    <simple_gui_only>0|1</simple_gui_only>
    <start_apps_manually>0|1</start_apps_manually>
    <start_delay>N</start_delay>
    <suppress_net_info>0|1</suppress_net_info>
    <use_all_gpus>0|1</use_all_gpus>
    <use_certs>0|1</use_certs>
    <use_certs_only>0|1</use_certs_only>
    <zero_debts>0|1</zero_debts>
  </options>
</cc_config>

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 63149 - Posted: 3 Sep 2009, 22:37:27 UTC
At a minimum, your options section is wrong. It probably should just be removed. You have not filled in the desired values there. The zero and the one are the choices for that field, not a valid value when combined. The fewer lines in the file, the better. Anything not shown will default.
Try something like this:
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
  <options>
    <max_file_xfers>1</max_file_xfers>
  </options>
</cc_config>
Rosetta Moderator: Mod.Sense
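
A quick pre-check along the lines of the advice above can be sketched in Python before asking BOINC to read the file. This is a rough sketch under stated assumptions, not BOINC's own validation: the function name `check_cc_config` is hypothetical, it assumes every entry under <log_flags> should be 0 or 1, and it treats any option value containing "|" as template text that was never filled in.

```python
import xml.etree.ElementTree as ET


def check_cc_config(xml_text):
    """Return a list of human-readable problems found in a cc_config file."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as err:
        # Covers unclosed tags, mismatched tags and similar XML errors.
        return ["not well-formed XML: %s" % err]

    problems = []
    if root.tag != "cc_config":
        problems.append("root element is <%s>, expected <cc_config>" % root.tag)

    # Assumption: log flags are simple on/off switches.
    for flag in root.findall("./log_flags/*"):
        value = (flag.text or "").strip()
        if value not in ("0", "1"):
            problems.append("<%s> has value %r, expected 0 or 1" % (flag.tag, value))

    # Assumption: a "|" means the list of choices was copied verbatim.
    for option in root.findall("./options/*"):
        value = (option.text or "").strip()
        if "|" in value:
            problems.append("<%s> still holds template text %r" % (option.tag, value))
    return problems


minimal = """
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
  <options>
    <max_file_xfers>1</max_file_xfers>
  </options>
</cc_config>
"""
print(check_cc_config(minimal))  # []
```

A file that still contains an entry such as <http_1_0>0|1</http_1_0>, or a tag that is opened but never closed, would come back with at least one problem listed.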

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63153 - Posted: 4 Sep 2009, 6:10:02 UTC - in response to Message 63149.
"At a minimum, your options section is wrong. Probably should just be removed. You have not filled in the desired values there. The zero and the one are the choices for that field, not a valid value when combined. The less lines in the file, the better. Anything not shown will default."
Oops... That's what happens when a technical innocent is let loose!!
"Try something like this: <cc_config> <log_flags> <task>1</task> <file_xfer>1</file_xfer> <sched_ops>1</sched_ops> <checkpoint_debug>1</checkpoint_debug> </log_flags> <options> <max_file_xfers>1</max_file_xfers> </options> </cc_config>"
OK. Did that, and the "Read config file" bit. Rebooted. When BOINC came up, again I watched the progress % back off. Here are the before/after percentages for the eight threads:
Before  After
80      35
66      29
64      29
50      22
47      21
4       4
0       0
0       0
The one GPUGRID WU running resumed where it left off.
Here are some checkpoints:
04/09/2009 07:25:51 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:26:31 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:27:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:27:43 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:28:29 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:28:30 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:29:09 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:29:09 rosetta@home Reporting 1 completed tasks, requesting new tasks for CPU
04/09/2009 07:29:11 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:29:19 rosetta@home Scheduler request completed: got 1 new tasks
04/09/2009 07:29:21 rosetta@home Started download of boinc_rb1_1opd.pdb
04/09/2009 07:29:23 rosetta@home Finished download of boinc_rb1_1opd.pdb
04/09/2009 07:29:23 rosetta@home Started download of lr10_1opd.out.zip
04/09/2009 07:29:29 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2iiy_IGNORE_THE_REST_DECOY_14634_1664_0 checkpointed
04/09/2009 07:29:35 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:29:55 rosetta@home Finished download of lr10_1opd.out.zip
04/09/2009 07:30:01 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:30:36 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:31:08 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:31:43 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:32:30 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:33:11 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:33:21 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:33:21 rosetta@home Requesting new tasks for CPU
04/09/2009 07:33:30 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2iiy_IGNORE_THE_REST_DECOY_14634_1664_0 checkpointed
04/09/2009 07:33:31 rosetta@home Scheduler request completed: got 1 new tasks
04/09/2009 07:33:33 rosetta@home Started download of lr5_1o5u.out.zip
04/09/2009 07:33:35 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:33:44 rosetta@home Finished download of lr5_1o5u.out.zip
04/09/2009 07:34:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:34:36 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:35:09 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:35:44 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:36:31 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:37:08 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:37:21 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:37:37 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:37:38 rosetta@home Sending scheduler request: To fetch work.
Tom
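
To see how often each task is checkpointing, the [checkpoint_debug] entries in a log like the one above can be grouped by task name. The following Python is a minimal sketch, assuming the day/month/year timestamp layout shown in this log; the function name `checkpoint_intervals` and the regular expression are illustrative choices, not part of BOINC.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines such as:
# 04/09/2009 07:25:51 rosetta@home [checkpoint_debug] result <task> checkpointed
CHECKPOINT_LINE = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \S+ "
    r"\[checkpoint_debug\] result (\S+) checkpointed"
)


def checkpoint_intervals(log_text):
    """Map each task name to the seconds between its successive checkpoints."""
    stamps_by_task = defaultdict(list)
    for stamp, task in CHECKPOINT_LINE.findall(log_text):
        # Assumption: timestamps are day/month/year, as in the log above.
        stamps_by_task[task].append(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"))
    return {
        task: [(later - earlier).total_seconds()
               for earlier, later in zip(stamps, stamps[1:])]
        for task, stamps in stamps_by_task.items()
    }


sample = """
04/09/2009 07:25:51 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:28:29 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:29:09 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:30:01 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:34:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:37:08 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
"""

for task, gaps in checkpoint_intervals(sample).items():
    print(task, gaps)
```

On these sample lines, the Rosetta task checkpoints roughly every four minutes (250 and 246 seconds apart) and the GPUGRID task after 519 seconds. Lines that are not checkpoint messages are simply ignored.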

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63154 - Posted: 4 Sep 2009, 6:28:28 UTC - in response to Message 63153.
"Here are the before/after percentages for the eight threads: 80 35 66 29 64 29 50 22 47 21 4 4 0 0 0 0"
Something strange is going on here. Within 20 minutes of the reboot reported in my previous post I returned seven "success" WUs! Seems that progress % bears no relationship to reality. At the moment, about 35 minutes after the reboot, all eight threads are showing between 12% and 16% progress...
Tom

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 63155 - Posted: 4 Sep 2009, 6:42:31 UTC
I've had some finish short of my run time since we went over to this new 1.97 app. All of these are from my different rigs after restarting in the mornings. Some tasks go for a few minutes, others half an hour or so. As you can see, none have hit the 100 model limit, or whatever it is now; there was still time to do more.
# cpu_run_time_pref: 14400
Fullatom mode ..
======================================================
DONE :: 20 starting structures 8211.84 cpu seconds
This process generated 20 decoys from 20 attempts
======================================================
======================================================
DONE :: 17 starting structures 9754.09 cpu seconds
This process generated 17 decoys from 17 attempts
======================================================
======================================================
DONE :: 22 starting structures 7997.82 cpu seconds
This process generated 22 decoys from 22 attempts
======================================================

tomba Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0	Message 63156 - Posted: 4 Sep 2009, 6:55:58 UTC - in response to Message 63155.
Pete, your post went right over this technical innocent's head!!
Tom

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 63158 - Posted: 4 Sep 2009, 7:57:25 UTC
Hi tomba. Sorry about that, I posted to the wrong thread; it's been a long day! I've copied and moved it to the right one, I hope.