Message boards : Number crunching : Rosetta WUs restart after BOINC restart
Author | Message |
---|---|
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
There's a thread "After Restart BOINC begins with 0 %" but it's 2+ years old so... Two days ago my i7 Vista Home Premium 64 PC arrived. I installed BOINC 6.6.36 with GPUGRID and Rosetta. GPUGRID behaves. Rosetta does not. Each time BOINC restarts, all eight Rosetta threads restart from 0% progress. In preferences, "Write to disk at most every" was set to 1800 seconds. I changed it to 30 seconds and did a BOINC Update. I rebooted and opened BOINC as soon as it came up. I watched each thread change from its pre-boot 15% to 0%. Any thoughts? Thanks, Tom |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Any time a task is closed, it will restart from its last checkpoint. If the task has not yet reached a checkpoint, it will therefore restart from the beginning. Some tasks are able to checkpoint frequently, and others cannot. But generally, every 10 or 20 minutes, a task will attempt to write a checkpoint if your "write at most..." setting will allow it. If you are running with the default runtime preference, the 15% progress into the 3 hour runtime is only 27 minutes. So you must have a collection of tasks that are not checkpointing as often as is common. Over time, your machine will work through various lengths of tasks and the cores will no longer be in sync, where they all start a new task at basically the same time. If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring. [edit] I should also mention that when I referred above to a task closing, I meant to explain that this can occur when a task is suspended to run another project, or by user control if you are not keeping tasks in memory. It also occurs when BOINC shuts down, or when you turn off your computer. Rosetta Moderator: Mod.Sense |
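The 27-minute figure above is just the progress fraction multiplied by the target runtime. A minimal sketch of that arithmetic, assuming progress grows linearly with run time (the function name and the linearity are assumptions for illustration, not something BOINC or Rosetta guarantees):

```python
def elapsed_minutes(progress_fraction, runtime_hours=3.0):
    """Estimate the minutes of work behind a progress fraction.

    Assumes progress is linear in run time, which is only an approximation.
    runtime_hours defaults to the 3 hour default runtime preference
    mentioned in the post above.
    """
    return progress_fraction * runtime_hours * 60.0


# 15% of the default 3 hour runtime: about 27 minutes of work
print(round(elapsed_minutes(0.15), 1))
```

So a task sitting at 15% that has not yet written a checkpoint stands to lose a little under half an hour of work when it is closed.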
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring. Thanks very much for responding. I did the above, being very careful to ensure the XML file was right. When I did Advanced / Read config file, BOINC went to sleep for many seconds then "BOINC Connection Failed - BOINC manager is not able to connect to a BOINC client." I retried a couple of times; same. I removed the XML and BOINC restarted. I watched in frustration as each thread backed off 30% of progress. BOINC is clearly checkpointing the progress regularly, but the WU status checkpointing is much less frequent. I poked around the TXT files in the data directory and found in stdoutgui.txt many instances of "Error: can't open file 'C:\Program Files\BOINC\\RebootPending.txt'". Note the pair of backslashes. There are many instances of entries like this: TRACE [3088]: CAN'T FIND RESULT lr5_A_seq_score12_ss1.7_rlbd_1hz6_IGNORE_THE_REST_DECOY_14635_2026_0 Hmmmmm... Tom |
mikey Send message Joined: 5 Jan 06 Posts: 1896 Credit: 9,387,844 RAC: 9,807 |
There's a thread "After Restart BOINC begins with 0 %" but it's 2+ years old so... just one...try changing the setting under Your Account, computing preferences and then this line: Leave applications in memory while suspended? (suspended applications will consume swap space if 'yes') yes If it is not set to yes change it to yes and see if the units pick up where they left off when they stop and then restart. This uses your memory to remember where they were and not your hard drive. Maybe the hard drive settings are funky right now and if this fixes it then it will isolate the problem. |
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
try changing the setting under Your Account, computing preferences and then this line: Thanks for the input but I already have that as "yes". Tom |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled? Perhaps you could post the file you made, since it seems to have caused a new problem. Rosetta Moderator: Mod.Sense |
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled? Yep! See below. Perhaps you could post the file you made, since it seems to have caused a new problem. Here it is, Tom: <cc_config> <log_flags> <task>1</task> <file_xfer>1</file_xfer> <sched_ops>1</sched_ops> <state_debug>0</state_debug> <task_debug>0</task_debug> <file_xfer_debug>0</file_xfer_debug> <sched_op_debug>0</sched_op_debug> <http_debug>0</http_debug> <work_fetch_debug>0</work_fetch_debug> <unparsed_xml>0</unparsed_xml> <proxy_debug>0</proxy_debug> <time_debug>0</time_debug> <http_xfer_debug>0</http_xfer_debug> <benchmark_debug>0</benchmark_debug> <poll_debug>0</poll_debug> <guirpc_debug>0</guirpc_debug> <scrsave_debug>0</scrsave_debug> <rr_simulation>0</rr_simulation> <cpu_sched>0</cpu_sched> <cpu_sched_debug>0</cpu_sched_debug> <app_msg_send>0</app_msg_send> <app_msg_receive>0</app_msg_receive> <mem_usage_debug>0</mem_usage_debug> <network_status_debug>0</network_status_debug> <checkpoint_debug>1</checkpoint_debug> <coproc_debug>0</coproc_debug> <dcf_debug>0</dcf_debug> <debt_debug>0</debt_debug> <statefile_debug>0</statefile_debug> <slot_debug>0</slot_debug> </log_flags> <options> <alt_platform>platform_name</alt_platform> <data_dir>/path/to/dir</data_dir> <disallow_attach>0|1</disallow_attach> <dont_contact_ref_site> <dont_check_file_sizes>0|1</dont_check_file_sizes> <exclusive_app>important.exe</exclusive_app> <force_auth>basic | digest | gss-negotiate | ntlm</force_auth> <http_1_0>0|1</http_1_0> <max_file_xfers>N</max_file_xfers> <max_file_xfers_per_project>N</max_file_xfers_per_project> <max_stderr_file_size>size_in_bytes</max_stderr_file_size> <max_stdout_file_size>size_in_bytes</max_stdout_file_size> <ncpus>N</ncpus> <no_alt_platform>0|1</no_alt_platform> <no_gpus>0|1</no_gpus> 
<no_priority_change>0</no_priority_change> <os_random_only>0|1</os_random_only> <report_results_immediately>0|1</report_results_immediately> <run_apps_manually>0|1</run_apps_manually> <save_stats_days>N</save_stats_days> <simple_gui_only>0|1</simple_gui_only> <start_apps_manually>0|1</start_apps_manually> <start_delay>N</start_delay> <suppress_net_info>0|1</suppress_net_info> <use_all_gpus>0|1</use_all_gpus> <use_certs>0|1</use_certs> <use_certs_only>0|1</use_certs_only> <zero_debts>0|1</zero_debts> </options> </cc_config> |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
At a minimum, your options section is wrong. Probably should just be removed. You have not filled in the desired values there. The zero and the one are the choices for that field, not a valid value when combined. The fewer lines in the file, the better. Anything not shown will default. Try something like this: <cc_config> <log_flags> <task>1</task> <file_xfer>1</file_xfer> <sched_ops>1</sched_ops> <checkpoint_debug>1</checkpoint_debug> </log_flags> <options> <max_file_xfers>1</max_file_xfers> </options> </cc_config> Rosetta Moderator: Mod.Sense |
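For anyone who wants to sanity-check a cc_config before asking BOINC to read it, here is a rough sketch in Python. It only checks that the file is well-formed XML and that no setting still holds a documentation placeholder such as `0|1` or `N`. The function name `check_cc_config` and the placeholder list are my own assumptions for illustration; this does not reproduce BOINC's own parser or its rules.

```python
import xml.etree.ElementTree as ET

# Documentation placeholders copied from the template posted earlier in this
# thread. This list is a heuristic for illustration, not a BOINC rule.
PLACEHOLDER_WORDS = {"N", "platform_name", "/path/to/dir", "size_in_bytes"}


def looks_like_placeholder(value):
    """True if the value looks like 'pick one of these' rather than a choice."""
    return "|" in value or value in PLACEHOLDER_WORDS


def check_cc_config(xml_text):
    """Return a list of human-readable problems found in a cc_config document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Unclosed or mismatched tags end up here.
        return ["not well-formed XML: %s" % exc]

    problems = []
    if root.tag != "cc_config":
        problems.append("root element is <%s>, expected <cc_config>" % root.tag)
    for section in root:          # e.g. <log_flags>, <options>
        for setting in section:   # e.g. <checkpoint_debug>
            value = (setting.text or "").strip()
            if looks_like_placeholder(value):
                problems.append(
                    "<%s> still holds placeholder %r" % (setting.tag, value)
                )
    return problems


MINIMAL = """
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
  <options>
    <max_file_xfers>1</max_file_xfers>
  </options>
</cc_config>
"""

print(check_cc_config(MINIMAL))
```

Running it against the minimal file suggested above should report no problems, while a file that still contains entries like `<ncpus>N</ncpus>` or a tag with no closing partner gets flagged.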
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
At a minimum, your options section is wrong. Probably should just be removed. You have not filled in the desired values there. The zero and the one are the choices for that field, not a valid value when combined. The fewer lines in the file, the better. Anything not shown will default. Oops... That's what happens when a technical innocent is let loose!! Try something like this: OK. Did that, and the "Read config file" bit. Rebooted. When BOINC came up, again I watched the progress % back off. Here are the before/after percentages for the eight threads: |
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
Here are the before/after percentages for the eight threads: Something strange is going on here. Within 20 minutes of the reboot reported in my previous post I returned seven "success" WUs! Seems that progress % bears no relationship to reality. At the moment, about 35 minutes after the reboot, all eight threads are showing between 12% and 16% progress... Tom |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I've had some finish short of my run time since we went over to this new 1.97 app. All of these are from my different rigs after restarting in the mornings. Some tasks go for a few minutes, others half an hour or so. As you can see, none have hit the 100 model limit (or whatever it is now), so there was still time to do more. # cpu_run_time_pref: 14400 Fullatom mode .. ====================================================== DONE :: 20 starting structures 8211.84 cpu seconds This process generated 20 decoys from 20 attempts ====================================================== ====================================================== DONE :: 17 starting structures 9754.09 cpu seconds This process generated 17 decoys from 17 attempts ====================================================== ====================================================== DONE :: 22 starting structures 7997.82 cpu seconds This process generated 22 decoys from 22 attempts ====================================================== |
tomba Send message Joined: 29 May 06 Posts: 43 Credit: 1,558,972 RAC: 0 |
Pete, Your post went right over this technical innocent's head!! Tom |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi tomba. Sorry about that, I posted to the wrong thread. It's been a long day! I've copied and moved it to the right one, I hope. |
©2025 University of Washington
https://www.bakerlab.org