Rosetta WUs restart after BOINC restart

Message boards : Number crunching : Rosetta WUs restart after BOINC restart


tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63124 - Posted: 2 Sep 2009, 17:08:54 UTC

There's a thread "After Restart BOINC begins with 0 %" but it's 2+ years old so...

Two days ago my i7 Vista Home Premium 64 PC arrived.

I installed BOINC 6.6.36 with GPUGRID and Rosetta. GPUGRID behaves. Rosetta does not.

Each time BOINC restarts, all eight Rosetta threads restart from 0% progress.

In preferences, "Write to disk at most every" was set to 1800 seconds. I changed it to 30 seconds and did a BOINC Update. I rebooted and opened BOINC as soon as it came up. I watched each thread change from its pre-boot 15% to 0%.

Any thoughts?

Thanks, Tom

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63125 - Posted: 2 Sep 2009, 18:53:53 UTC
Last modified: 2 Sep 2009, 18:56:56 UTC

Any time a task is closed, it will restart from its last checkpoint. If the task has not yet reached a checkpoint, it will therefore restart from the beginning. Some tasks are able to checkpoint frequently, and others cannot. But generally, every 10 or 20 minutes, a task will attempt to write a checkpoint if your "Write to disk at most every" setting allows it.

If you are running with the default runtime preference, 15% progress into the 3-hour runtime is only 27 minutes. So you must have a collection of tasks that are not checkpointing as often as is common. Over time, your machine will work through tasks of various lengths, and the cores will drift out of sync instead of all starting a new task at basically the same time.
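
The 27-minute figure follows directly from the numbers above. As a minimal sketch of the arithmetic (the function name `elapsed_minutes` is made up for illustration, and it assumes progress grows linearly with runtime, which nothing in this thread guarantees):

```python
def elapsed_minutes(runtime_hours, progress_percent):
    """Minutes of runtime implied by a progress percentage, assuming linear progress."""
    return runtime_hours * 60 * progress_percent / 100

# Default 3-hour runtime preference, 15% progress as reported in the first post.
print(elapsed_minutes(3, 15))
```

With checkpoints normally attempted every 10 or 20 minutes, a task that far along would usually have at least one checkpoint to fall back on.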

If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring.

[edit]
I should also mention that when I referred above to a task closing, I meant to explain that this can occur when a task is suspended to run another project, or by user control if you are not keeping tasks in memory. It also occurs when BOINC shuts down, or when you turn off your computer.
Rosetta Moderator: Mod.Sense
tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63132 - Posted: 3 Sep 2009, 4:49:25 UTC - in response to Message 63125.  

If you would like to see messages recorded when checkpoints are taken, you can enable the checkpoint_debug setting in your cc_config as described here. That way you can see how frequently they are occurring.


Thanks very much for responding.

I did the above, being very careful to ensure the XML file was right. When I did Advanced / Read config file, BOINC went to sleep for many seconds then "BOINC Connection Failed - BOINC manager is not able to connect to a BOINC client."

I retried a couple of times; same. I removed the XML and BOINC restarted. I watched in frustration as each thread backed off 30% of progress.

BOINC is clearly checkpointing the progress regularly, but the WU status checkpointing is much less frequent.

I poked around the TXT files in the data directory and found in stdoutgui.txt many instances of "Error: can't open file 'C:\Program Files\BOINC\\RebootPending.txt'". Note the pair of backslashes.

There are many instances of entries like this:

TRACE [3088]: CAN'T FIND RESULT lr5_A_seq_score12_ss1.7_rlbd_1hz6_IGNORE_THE_REST_DECOY_14635_2026_0

Hmmmmm...

Tom




mikey
Joined: 5 Jan 06
Posts: 1896
Credit: 9,387,844
RAC: 9,807
Message 63137 - Posted: 3 Sep 2009, 9:39:27 UTC - in response to Message 63124.  

There's a thread "After Restart BOINC begins with 0 %" but it's 2+ years old so...

Two days ago my i7 Vista Home Premium 64 PC arrived.

I installed BOINC 6.6.36 with GPUGRID and Rosetta. GPUGRID behaves. Rosetta does not.

Each time BOINC restarts, all eight Rosetta threads restart from 0% progress.

In preferences, "Write to disk at most every" was set to 1800 seconds. I changed it to 30 seconds and did a BOINC Update. I rebooted and opened BOINC as soon as it came up. I watched each thread change from its pre-boot 15% to 0%.

Any thoughts? Thanks, Tom


Just one: try changing the setting under Your Account, computing preferences, and then this line:
Leave applications in memory while suspended?
(suspended applications will consume swap space if 'yes') yes

If it is not set to yes, change it to yes and see whether the units pick up where they left off when they stop and then restart. This uses your memory, not your hard drive, to remember where they were. Maybe the hard drive settings are funky right now; if this fixes it, that will isolate the problem.
tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63139 - Posted: 3 Sep 2009, 10:33:32 UTC - in response to Message 63137.  

try changing the setting under Your Account, computing preferences and then this line:
Leave applications in memory while suspended?
(suspended applications will consume swap space if 'yes') yes

Thanks for the input but I already have that as "yes".

Tom

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63143 - Posted: 3 Sep 2009, 15:37:28 UTC

tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled? Perhaps you could post the file you made, since it seems to have caused a new problem.
Rosetta Moderator: Mod.Sense
tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63144 - Posted: 3 Sep 2009, 15:58:21 UTC - in response to Message 63143.  
Last modified: 3 Sep 2009, 15:59:00 UTC

tomba, the cc_config is a little odd in that it has some required and some optional values. The first three options, <task>, <file_xfer> and <sched_ops> are enabled by default and should always be enabled. Does your file have these three all enabled?


Yep! See below.

Perhaps you could post the file you made, since it seems to have caused a new problem.


Here it is, Tom:

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<state_debug>0</state_debug>
<task_debug>0</task_debug>
<file_xfer_debug>0</file_xfer_debug>
<sched_op_debug>0</sched_op_debug>
<http_debug>0</http_debug>
<work_fetch_debug>0</work_fetch_debug>
<unparsed_xml>0</unparsed_xml>
<proxy_debug>0</proxy_debug>
<time_debug>0</time_debug>
<http_xfer_debug>0</http_xfer_debug>
<benchmark_debug>0</benchmark_debug>
<poll_debug>0</poll_debug>
<guirpc_debug>0</guirpc_debug>
<scrsave_debug>0</scrsave_debug>
<rr_simulation>0</rr_simulation>
<cpu_sched>0</cpu_sched>
<cpu_sched_debug>0</cpu_sched_debug>
<app_msg_send>0</app_msg_send>
<app_msg_receive>0</app_msg_receive>
<mem_usage_debug>0</mem_usage_debug>
<network_status_debug>0</network_status_debug>
<checkpoint_debug>1</checkpoint_debug>
<coproc_debug>0</coproc_debug>
<dcf_debug>0</dcf_debug>
<debt_debug>0</debt_debug>
<statefile_debug>0</statefile_debug>
<slot_debug>0</slot_debug>
</log_flags>
<options>
<alt_platform>platform_name</alt_platform>
<data_dir>/path/to/dir</data_dir>
<disallow_attach>0|1</disallow_attach>
<dont_contact_ref_site>
<dont_check_file_sizes>0|1</dont_check_file_sizes>
<exclusive_app>important.exe</exclusive_app>
<force_auth>basic | digest | gss-negotiate | ntlm</force_auth>
<http_1_0>0|1</http_1_0>
<max_file_xfers>N</max_file_xfers>
<max_file_xfers_per_project>N</max_file_xfers_per_project>
<max_stderr_file_size>size_in_bytes</max_stderr_file_size>
<max_stdout_file_size>size_in_bytes</max_stdout_file_size>
<ncpus>N</ncpus>
<no_alt_platform>0|1</no_alt_platform>
<no_gpus>0|1</no_gpus>
<no_priority_change>0</no_priority_change>
<os_random_only>0|1</os_random_only>
<report_results_immediately>0|1</report_results_immediately>
<run_apps_manually>0|1</run_apps_manually>
<save_stats_days>N</save_stats_days>
<simple_gui_only>0|1</simple_gui_only>
<start_apps_manually>0|1</start_apps_manually>
<start_delay>N</start_delay>
<suppress_net_info>0|1</suppress_net_info>
<use_all_gpus>0|1</use_all_gpus>
<use_certs>0|1</use_certs>
<use_certs_only>0|1</use_certs_only>
<zero_debts>0|1</zero_debts>
</options>
</cc_config>
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63149 - Posted: 3 Sep 2009, 22:37:27 UTC

At a minimum, your options section is wrong. It should probably just be removed. You have not filled in the desired values there: "0|1" lists the two choices for a field and is not itself a valid value. The fewer lines in the file, the better. Anything not shown will take its default.

Try something like this:

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<checkpoint_debug>1</checkpoint_debug>
</log_flags>
<options>
<max_file_xfers>1</max_file_xfers>
</options>
</cc_config>
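
Before asking BOINC to read a hand-edited cc_config, it may help to check the file outside BOINC first. The following is only a sketch: `check_cc_config` and the sample strings are hypothetical names made up here, the placeholder test is a rough heuristic for template text such as "0|1" or "N", and it uses Python's standard XML parser, which may be stricter or looser than BOINC's own. The file posted earlier would fail both checks, since it contains "0|1" values and an opening <dont_contact_ref_site> with no closing tag.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic (an assumption, not a BOINC rule): values containing "|" or a bare "N"
# look like template text copied from documentation rather than real settings.
PLACEHOLDER = re.compile(r"\||^N$")

def check_cc_config(text):
    """Return a list of human-readable problems found in a cc_config XML string."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    problems = []
    if root.tag != "cc_config":
        problems.append(f"root element is <{root.tag}>, expected <cc_config>")
    for section in root:          # <log_flags>, <options>
        for item in section:      # individual settings
            value = (item.text or "").strip()
            if PLACEHOLDER.search(value):
                problems.append(f"<{item.tag}> still holds a placeholder: {value!r}")
    return problems

GOOD = """<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<checkpoint_debug>1</checkpoint_debug>
</log_flags>
<options>
<max_file_xfers>1</max_file_xfers>
</options>
</cc_config>"""

BAD_VALUE = "<cc_config><options><no_gpus>0|1</no_gpus></options></cc_config>"
UNCLOSED = "<cc_config><options><dont_contact_ref_site></options></cc_config>"

for name, sample in [("GOOD", GOOD), ("BAD_VALUE", BAD_VALUE), ("UNCLOSED", UNCLOSED)]:
    print(name, check_cc_config(sample))
```

To check a real file, read its contents into a string and pass that to `check_cc_config`; an empty list means neither check found anything, not that BOINC is guaranteed to accept the file.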
Rosetta Moderator: Mod.Sense
tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63153 - Posted: 4 Sep 2009, 6:10:02 UTC - in response to Message 63149.  

At a minimum, your options section is wrong. It should probably just be removed. You have not filled in the desired values there: "0|1" lists the two choices for a field and is not itself a valid value. The fewer lines in the file, the better. Anything not shown will take its default.

Oops... That's what happens when a technical innocent is let loose!!

Try something like this:

<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
<checkpoint_debug>1</checkpoint_debug>
</log_flags>
<options>
<max_file_xfers>1</max_file_xfers>
</options>
</cc_config>

OK. Did that, and the "Read config file" bit. Rebooted. When BOINC came up, again I watched the progress % back off. Here are the before/after percentages for the eight threads:

  • 80% → 35%
  • 66% → 29%
  • 64% → 29%
  • 50% → 22%
  • 47% → 21%
  • 4% → 4%
  • 0% → 0%
  • 0% → 0%


The one GPUGRID WU running resumed where it left off.

Here are some checkpoints:

04/09/2009 07:25:51 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:26:31 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:27:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:27:43 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:28:29 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:28:30 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:29:09 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:29:09 rosetta@home Reporting 1 completed tasks, requesting new tasks for CPU
04/09/2009 07:29:11 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:29:19 rosetta@home Scheduler request completed: got 1 new tasks
04/09/2009 07:29:21 rosetta@home Started download of boinc_rb1_1opd.pdb
04/09/2009 07:29:23 rosetta@home Finished download of boinc_rb1_1opd.pdb
04/09/2009 07:29:23 rosetta@home Started download of lr10_1opd.out.zip
04/09/2009 07:29:29 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2iiy_IGNORE_THE_REST_DECOY_14634_1664_0 checkpointed
04/09/2009 07:29:35 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:29:55 rosetta@home Finished download of lr10_1opd.out.zip
04/09/2009 07:30:01 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:30:36 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:31:08 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:31:43 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:32:30 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:33:11 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:33:21 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:33:21 rosetta@home Requesting new tasks for CPU
04/09/2009 07:33:30 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2iiy_IGNORE_THE_REST_DECOY_14634_1664_0 checkpointed
04/09/2009 07:33:31 rosetta@home Scheduler request completed: got 1 new tasks
04/09/2009 07:33:33 rosetta@home Started download of lr5_1o5u.out.zip
04/09/2009 07:33:35 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:33:44 rosetta@home Finished download of lr5_1o5u.out.zip
04/09/2009 07:34:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:34:36 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_2ib0_IGNORE_THE_REST_DECOY_14634_1659_0 checkpointed
04/09/2009 07:35:09 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1elw_IGNORE_THE_REST_DECOY_14633_2419_0 checkpointed
04/09/2009 07:35:44 rosetta@home [checkpoint_debug] result lr8_A_seq_score12_ss1.7_rlbd_1wit_IGNORE_THE_REST_DECOY_14637_2401_0 checkpointed
04/09/2009 07:36:31 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1nps_IGNORE_THE_REST_DECOY_14633_1674_0 checkpointed
04/09/2009 07:37:08 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:37:21 rosetta@home [checkpoint_debug] result symm_lr8_seq_score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_14636_2449_0 checkpointed
04/09/2009 07:37:37 rosetta@home [checkpoint_debug] result lr13_A_seq_score12_ss1.7_rlbd_1c7k_IGNORE_THE_REST_DECOY_14634_2410_1 checkpointed
04/09/2009 07:37:38 rosetta@home Sending scheduler request: To fetch work.
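
The log above can be turned into per-task checkpoint intervals. This is a rough sketch, not a BOINC tool: the function name `checkpoint_intervals` and the regular expression are assumptions based only on the line format visible in this post (dd/mm/yyyy timestamps, project name, then the [checkpoint_debug] message).

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern inferred from the log lines in this thread; other BOINC versions may differ.
LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (\S+) \[checkpoint_debug\] result (\S+) checkpointed$"
)

def checkpoint_intervals(lines):
    """Map each result name to the seconds between its consecutive checkpoints."""
    times = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # scheduler and download messages simply do not match
            stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
            times[match.group(3)].append(stamp)
    return {
        name: [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        for name, stamps in times.items()
    }

# A few lines copied from the log above.
SAMPLE = """\
04/09/2009 07:25:51 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:28:29 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
04/09/2009 07:29:09 rosetta@home Sending scheduler request: To fetch work.
04/09/2009 07:30:01 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:34:07 rosetta@home [checkpoint_debug] result lr10_A_seq_score12_ss1.7_rlbd_1o73_IGNORE_THE_REST_DECOY_14633_1658_0 checkpointed
04/09/2009 07:37:08 GPUGRID [checkpoint_debug] result p1265000-OTTO_r_pYEEI_0209-1-10-RND5758_0 checkpointed
""".splitlines()

for name, gaps in checkpoint_intervals(SAMPLE).items():
    print(name, gaps)
```

On these sample lines the Rosetta task checkpoints roughly every four minutes (250 and 246 seconds apart), so checkpoints are being recorded regularly even though the displayed progress dropped after the reboot.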

Tom



tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63154 - Posted: 4 Sep 2009, 6:28:28 UTC - in response to Message 63153.  

Here are the before/after percentages for the eight threads:

  • 80% → 35%
  • 66% → 29%
  • 64% → 29%
  • 50% → 22%
  • 47% → 21%
  • 4% → 4%
  • 0% → 0%
  • 0% → 0%


Something strange is going on here. Within 20 minutes of the reboot reported in my previous post I returned seven "success" WUs! Seems that progress % bears no relationship to reality. At the moment, about 35 minutes after the reboot, all eight threads are showing between 12% and 16% progress...
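
One way to look at the before/after figures is the ratio of each pair. A quick sketch using the five tasks that had made substantial progress (the variable names are made up for illustration):

```python
# (before, after) progress percentages from the list above, tasks with real progress only.
pairs = [(80, 35), (66, 29), (64, 29), (50, 22), (47, 21)]

ratios = [after / before for before, after in pairs]
for (before, after), ratio in zip(pairs, ratios):
    print(f"{before}% -> {after}%  ratio {ratio:.2f}")
```

Every ratio comes out between 0.43 and 0.46. That uniformity might mean the percentage was rescaled after the restart rather than each task losing a different amount of work, but that reading is only a guess; nothing in this thread confirms how the progress figure is computed.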

Tom

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 63155 - Posted: 4 Sep 2009, 6:42:31 UTC

I've had some tasks finish short of my run time since we went over to the new 1.97 app. All of these are from my different rigs after restarting in the mornings.

Some tasks go for a few minutes, others for half an hour or so. As you can see, none have hit the 100 model limit (or whatever it is now), and there was still time to do more.

# cpu_run_time_pref: 14400
Fullatom mode ..
======================================================
DONE :: 20 starting structures 8211.84 cpu seconds
This process generated 20 decoys from 20 attempts
======================================================

======================================================
DONE :: 17 starting structures 9754.09 cpu seconds
This process generated 17 decoys from 17 attempts
======================================================

======================================================
DONE :: 22 starting structures 7997.82 cpu seconds
This process generated 22 decoys from 22 attempts
======================================================

tomba

Joined: 29 May 06
Posts: 43
Credit: 1,558,972
RAC: 0
Message 63156 - Posted: 4 Sep 2009, 6:55:58 UTC - in response to Message 63155.  

Pete,

Your post went right over this technical innocent's head!!

Tom
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 63158 - Posted: 4 Sep 2009, 7:57:25 UTC

Hi tomba.

Sorry about that, I posted to the wrong thread. It's been a long day!

I've copied and moved it to the right one, I hope.






©2025 University of Washington
https://www.bakerlab.org