minirosetta 2.03

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 64920 - Posted: 11 Jan 2010, 20:41:48 UTC

This died last night, same as others.

homopt4.t328_.t328_.IGNORE_THE_REST.S_00002_0000009_0_0_0_0001.pdb_00004.pdb_00002.pdb.JOB_16816_14_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282247578

ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000002_07.pdb_00001.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish



Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 64923 - Posted: 11 Jan 2010, 23:20:01 UTC - in response to Message 64905.  

Hi again,

David Kim and I have tracked down this problem and I'm going to test a fix to it in the upcoming release. The problem was that per-decoy checkpointing was not on in this batch of simulations. When I mentioned that these protocols do not need checkpointing I only meant within-trajectory checkpointing.
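The difference between per-decoy and within-trajectory checkpointing can be sketched roughly as follows. This is only an illustration under stated assumptions, not Rosetta's or BOINC's actual code: `run_decoys`, `make_decoy`, and the JSON state file are hypothetical names. Progress is saved once per finished decoy, so an interruption costs at most the one trajectory in flight.

```python
import json
import os


def run_decoys(total, state_path, make_decoy):
    """Generate `total` decoys, saving progress after each finished decoy.

    Hypothetical sketch: `make_decoy(index)` stands in for one short
    trajectory. Nothing is saved *inside* a trajectory.
    """
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            # Resume from the last decoy that was completely finished.
            done = json.load(fh)["decoys_done"]

    for index in range(done, total):
        make_decoy(index)  # if interrupted here, only this decoy is lost

        # Write to a temporary file and rename it, so that a crash in the
        # middle of the write never leaves a truncated checkpoint behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"decoys_done": index + 1}, fh)
        os.replace(tmp_path, state_path)

    return total - done  # number of decoys computed in this run
```

If the per-decoy save is switched off, as described for this batch, `done` is always 0 after a restart and every previously finished decoy has to be recomputed.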

For the time being, I've stopped sending out this type of simulation, though over the next few days your computers might still work on them as quite a few have already been sent out. To assure you, the results of these simulations are certainly useful to us and in most cases credit will be allocated correctly.

Thanks a lot for sending specific comments that allowed us to figure this out!

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.

It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly.
For example, for WUs named "ha_notyr_...", "CPU time at last checkpoint" stays at "---" (none) even after several hours of computing. If I restart or shut down the computer (or only the BOINC client) while such a WU is running, all results are lost and after the restart the computation starts from the very beginning.
Here are examples of such tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=308985993
https://boinc.bakerlab.org/rosetta/result.php?resultid=309233711

And this is how they look in BOINC Manager:


Some other tasks report checkpoints but apparently do not really make them (or, more likely, make them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT suppose that before exiting BOINC I look at "Show graphics" and it shows that 38 models have already been calculated. After the restart the model count starts again from 0, 1, 2 and so on. And then fewer models are reported to the server than had been calculated before the restart (for example only 20), apparently only those computed after the last restart.

It is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down when I sleep (because of the noise).
Besides, the same thing most likely happens when BOINC switches automatically between projects, if it runs several projects at once (for example R@H and E@H).

For now I have worked around it by setting a small "CPU target time" (currently 2 hours instead of the 10 I used earlier) to keep possible losses of useful calculations to a minimum.
But I think this is only a partial fix and far from the best solution...

P.S.
I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3.



P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 64925 - Posted: 12 Jan 2010, 5:15:01 UTC
Last modified: 12 Jan 2010, 5:47:01 UTC

I don't know if this is a Validator problem or the task. Any ideas?

Edit/ It ran for over 4hrs non-stop to finish.

9gbnnotyr_3gbn_2hxm_9Jan2010_16880_35_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282555003

# cpu_run_time_pref: 14400
======================================================
DONE :: 27 starting structures 15475.2 cpu seconds
This process generated 27 decoys from 27 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

Server state Over
Outcome Validate error
Client state Done
CPU time (sec) 15,475.75
Sid Celery
Joined: 11 Feb 08
Posts: 2126
Credit: 41,253,494
RAC: 7,932
Message 64926 - Posted: 12 Jan 2010, 7:34:40 UTC

More of the previously reported errors here on what's usually a very reliable error-free machine.

One new odd one though, relating to credits rather than anything else:

9gbnnotyr_3gbn_2p8g_9Jan2010_16860_4_0
Outcome Success
Client state Done
Exit status 0 (0x0)

<core_client_version>6.10.18</core_client_version>

# cpu_run_time_pref: 28800
======================================================
DONE :: 10 starting structures 28348.9 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Claimed credit 133.53157877111
Granted credit 1.50234135281901
application version 2.03

Generally my granted credit on this W7 laptop is close to the claimed credit, with the occasional one being 30% less or 50% more, but 99% less seems very odd. Any ideas, or just a one-off?
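As a quick check of the "99% less" figure, the shortfall can be computed from the two credit values quoted above. This is a minimal sketch; the function name `shortfall_percent` is made up for illustration.

```python
claimed = 133.53157877111    # "Claimed credit" from the task page above
granted = 1.50234135281901   # "Granted credit" from the task page above


def shortfall_percent(claimed_credit, granted_credit):
    """Percentage by which granted credit falls below claimed credit."""
    return (1.0 - granted_credit / claimed_credit) * 100.0


print(f"{shortfall_percent(claimed, granted):.1f}% below the claim")  # 98.9% below the claim
```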
Admin
Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 64927 - Posted: 12 Jan 2010, 8:22:10 UTC

I've been having a slight issue with WUs freezing on my computer for the last few days. Most seem fine, but there have been a few odd ones lately. Most are Homopt WUs, and now I'm having problems with boinc_filtered_loopbuild_threading (the 2nd one that has frozen). They seem to get to 20-70%, then for some reason stop at some point and just tick away with the process sitting idle. I've chosen to manually abort these. Has anyone else been having this issue? Also, when I go to "show graphics" the graphics window freezes, which forces me to kill the process. I don't want to keep aborting these WUs, but there doesn't seem to be anything else I can do... Anyone want to help me out?
Admin
Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 64937 - Posted: 12 Jan 2010, 19:29:24 UTC

Just downloaded 2 more WUs of the same type and they are stuck at 8 and 9%... Can anyone tell me what's going on?
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64938 - Posted: 12 Jan 2010, 20:19:12 UTC

Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it.

Do you spot any pattern in the WU names that are working vs those hanging up? Your profile looks like you are running Win7.
Rosetta Moderator: Mod.Sense
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 64940 - Posted: 12 Jan 2010, 20:25:17 UTC - in response to Message 64938.  

Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it.

Do you spot any pattern in the WU names that are working vs those hanging up? Your profile looks like you are running Win7.


Admin, I see you have one computer and it's running Windows 7. I've had identical problems with R@h tasks running under this OS, some of which I've reported above. There seems to be no common pattern to the tasks that have to be aborted: given two tasks with names identical apart from the digits at the end, one may complete successfully while the other has to be aborted. For the tasks I've looked at, though, it always seems that the task then gets completed successfully by a wingman running under a different OS.
Admin
Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 64941 - Posted: 12 Jan 2010, 20:29:53 UTC
Last modified: 12 Jan 2010, 20:31:06 UTC

It's really random, so I can't quite say which ones work, but I've listed the ones that don't work for me above. If you can check the tasks I've aborted, those are the WUs that have been faulty. It's been happening more and more over the past few days, so I've aborted the last 2 bad ones and won't get any more for now until the issue is looked at. Homopt and boinc_filtered_loopbuild_threading seem to be the biggest issues for me, and they have been the only ones I've been getting. Anything else you need to know? Yes, I'm running Windows 7 RC right now.
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 64942 - Posted: 13 Jan 2010, 7:01:16 UTC

This lasted about 11sec.

t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282963623

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>

Wed 13 Jan 2010 16:27:59 EST|rosetta@home|Output file t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0_0 for task absent

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 26,101,436
RAC: 16,911
Message 64946 - Posted: 13 Jan 2010, 13:40:04 UTC - in response to Message 64905.  

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.


I am not worried about possibly losing one unfinished model: in these tasks the models are small, so the loss really would be no more than several minutes of CPU time.
But what about possibly losing all the models calculated before the shutdown (or a reboot of the computer or of the BOINC client)? Judging from the screenshot (posted above), this type of WU makes no checkpoints at all during the whole computation.
Or are the results of finished (completely calculated) models saved some other way (not through the checkpoint mechanism), with checkpoints needed only to save intermediate results within one model?
And BOINC simply does not know about it and therefore shows no "CPU time at last checkpoint"?
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 26,101,436
RAC: 16,911
Message 64947 - Posted: 13 Jan 2010, 14:12:59 UTC - in response to Message 64923.  

Hi again,

David Kim and I have tracked down this problem and I'm going to test a fix to it in the upcoming release. The problem was that per-decoy checkpointing was not on in this batch of simulations. When I mentioned that these protocols do not need checkpointing I only meant within-trajectory checkpointing.

For the time being, I've stopped sending out this type of simulation, though over the next few days your computers might still work on them as quite a few have already been sent out. To assure you, the results of these simulations are certainly useful to us and in most cases credit will be allocated correctly.

Thanks a lot for sending specific comments that allowed us to figure this out!

Oh, I wrote the previous post before I had read this one.
I am glad to hear that the problem has been located. It is always pleasant to squash one more bug in software. :)
(In my main job I work with programming in general, and with testing and debugging in particular. The projects are much simpler than scientific ones, but the programming has much in common.)
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 26,101,436
RAC: 16,911
Message 64948 - Posted: 13 Jan 2010, 15:17:46 UTC - in response to Message 64926.  

More of the previously reported errors here on what's usually a very reliable error-free machine.

One new odd one though, relating to credits rather than anything else:

9gbnnotyr_3gbn_2p8g_9Jan2010_16860_4_0
Outcome Success
Client state Done
Exit status 0 (0x0)

<core_client_version>6.10.18</core_client_version>

# cpu_run_time_pref: 28800
======================================================
DONE :: 10 starting structures 28348.9 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Claimed credit 133.53157877111
Granted credit 1.50234135281901
application version 2.03

Generally my granted credit on this W7 laptop is close to the claimed credit, with the occasional one being 30% less or 50% more, but 99% less seems very odd. Any ideas, or just a one-off?


I think the problem here is also with saving the results of the calculations. Your computer reported "10 decoys", which is very few for this type of WU.
For comparison, here is the result of a similar WU on my processor: https://boinc.bakerlab.org/rosetta/result.php?resultid=309983219
As you can see, my processor calculated "96 decoys" in 7138.77 cpu seconds.
And your result: 10 decoys in 28348.9 cpu seconds, despite a more powerful processor.
The credits seem to be calculated correctly:
15.15 Cr for 96 decoys (my result)
1.5 Cr for 10 decoys (your result)
I.e. about 0.15 Cr per decoy in both cases.

So I think the problem is on your computer's side rather than on the server.
If the computer has no serious problem capable of slowing the calculations down many times over (for example heavy swapping), then most likely your computer calculated many more "decoys", but most of them were lost for some reason and only 10 were included in the report.
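The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the two posts; the `Result` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    label: str
    decoys: int
    cpu_seconds: float
    granted_credit: float

    @property
    def credit_per_decoy(self):
        return self.granted_credit / self.decoys

    @property
    def seconds_per_decoy(self):
        return self.cpu_seconds / self.decoys


results = [
    Result("Mad_Max", 96, 7138.77, 15.15),
    Result("Sid Celery", 10, 28348.9, 1.50234135281901),
]

for r in results:
    print(f"{r.label}: {r.credit_per_decoy:.3f} Cr/decoy, "
          f"{r.seconds_per_decoy:.0f} s/decoy")
```

The credit per decoy comes out nearly the same for both results, while the CPU time per reported decoy differs by well over an order of magnitude, which is what points to lost decoys rather than a credit calculation error.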
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 26,101,436
RAC: 16,911
Message 64983 - Posted: 14 Jan 2010, 22:44:50 UTC
Last modified: 14 Jan 2010, 23:06:32 UTC

From my observations (after I ran into a similar problem, I watched the Rosetta application's disk writes for some time), most WUs wrote checkpoints much more often, about once every 1-2 minutes (I think according to the BOINC setting, which defaults to 60 seconds).
The exceptions were two types of WU: one did not write checkpoints at all (as you noted, this problem has already been located and a fix for it should be included in the new version, Rosetta mini 2.05), and another wrote checkpoints as usual but for some reason could not use them after a restart (or did not try at all).
If a job of the 2nd type comes my way again, I will try to catch it.
I think an indirect sign of such tasks should be a bad ratio between "claimed credit" and "granted credit".


I think that I have caught the second checkpoint bug. This time it is not "between small models" but "inside one big one".
Here is one such task: https://boinc.bakerlab.org/rosetta/result.php?resultid=310448366
As you can see, the ratio between "Claimed credit" and "Granted credit" is very bad, which indirectly points to a problem (too few useful results for that much CPU time). Tasks that were never interrupted while running usually show a much better ratio on my computer.

And now, how this job ran on my computer: it was carried out in 3 stages with 2 restarts between them (the 1st was shutting the computer down for the night; the 2nd was me deliberately restarting BOINC as a test).
At the end of the first stage (before the 1st restart) CPU time was about 2.5 hours, the progress was ~88%, and "show graphics" showed 1 model and a lot of steps (several thousand).

The next day, at start, the progress percentage immediately dropped to ~47%. I think it was actually reset to zero, and BOINC simply calculated it as 2:49 hours (CPU time already used) divided by 6 hours (the maximum allowed time = target CPU time x 3 = 6 h). "Show graphics" showed the following:
http://s004.radikal.ru/i206/1001/e5/15254410b960.jpg
http://s005.radikal.ru/i210/1001/d5/a235df07123e.jpg
It looks as though computing started from the very beginning.
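The ~47% figure is consistent with the guess made above, that progress after the restart was derived from CPU time used divided by three times the target CPU time. A minimal sketch of that arithmetic follows; the "target x 3" rule is the poster's assumption, not something confirmed by the project, and `progress_percent` is a made-up name.

```python
def progress_percent(cpu_seconds, target_cpu_hours, multiplier=3):
    """Progress as CPU time used over the assumed maximum allowed time."""
    max_seconds = target_cpu_hours * multiplier * 3600
    return min(100.0, cpu_seconds / max_seconds * 100.0)


used = 2 * 3600 + 49 * 60  # 2:49 hours of CPU time already used
print(f"{progress_percent(used, target_cpu_hours=2):.1f}%")  # 46.9%
```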

After two hours of computing I restarted BOINC a 2nd time (Exit on the tray icon). After the start, "show graphics" looks like this:
http://i069.radikal.ru/1001/1f/f431840cb759.jpg
Again the counting of models and steps starts from 0...

The task log (stderr out) does contain a record of reading a checkpoint, but only one, although the job was interrupted and restarted twice. Besides, the working folder held many more checkpoint-related files.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 65039 - Posted: 19 Jan 2010, 0:25:07 UTC
Last modified: 19 Jan 2010, 0:25:29 UTC

There seems to be a common theme going on with these tasks:

homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000001_0_0_10089.pdb_00002.pdb_00006.pdb.JOB_16819_18_1
https://boinc.bakerlab.org/rosetta/result.php?resultid=309699920

homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000022_0_0_0_0077.pdb_00001.pdb_00001.pdb.JOB_16816_16_1
https://boinc.bakerlab.org/rosetta/result.php?resultid=309816167

homopt4.t322_.t322_.IGNORE_THE_REST.S_00006_0000023_0_0_00034.pdb_00008.pdb_00006.pdb.JOB_16815_24_1
https://boinc.bakerlab.org/rosetta/result.php?resultid=309824444

They all died immediately due to:
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 65051 - Posted: 21 Jan 2010, 11:43:21 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=310017128
homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0


Outcome Client error
Client state Compute error
Exit status -177 (0xffffff4f)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E

Engaging BOINC Windows Runtime Debugger...

BOINC Windows Runtime Debugger Version 6.5.0

Dump Timestamp : 01/21/10 00:25:36
LoadLibraryA( E:xxxxx: GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 65052 - Posted: 21 Jan 2010, 11:44:29 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=310017144
t308__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_69_0

Outcome Client error
Client state Compute error
Exit status -177 (0xffffff4f)

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
]]>
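The two forms of the exit status shown in these reports, -177 and 0xffffff4f, are the same 32-bit value read as signed and as unsigned. A small sketch of the conversion (the helper names are made up for illustration):

```python
def as_unsigned32(code):
    """Reinterpret a signed 32-bit exit status as unsigned."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as signed (two's complement)."""
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(as_unsigned32(-177)))   # 0xffffff4f
print(as_signed32(0xFFFFFF4F))    # -177
```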



©2024 University of Washington
https://www.bakerlab.org