MiniRosetta 3.17 Problems.

Message boards : Number crunching : MiniRosetta 3.17 Problems.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71576 - Posted: 7 Nov 2011, 0:35:32 UTC
Last modified: 7 Nov 2011, 0:42:15 UTC

Looks like the 3.14 problem with workunits that stop using any CPU time at all but don't tell BOINC that they're finished isn't fully fixed.

Does appear to be less frequent, though.

Rosetta Mini 3.17
T0552_boinc_alignment_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_34966_22
CPU time at last checkpoint 01:17:50
CPU time 01:17:51
Elapsed time 25:00:05
Estimated time remaining 60:12:19
Fraction done 10.594%
Max RAM usage 95 MB
Working set size 546.09 MB

No longer using any CPU time, but still claims to be running.
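
The symptom described above (elapsed time keeps growing while CPU time stays put) can be checked mechanically. A minimal sketch, assuming two readings of a task's timers taken some minutes apart; the `Sample` class, the `is_stalled` function, and the ten-minute threshold are illustrative assumptions, not part of BOINC or minirosetta:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a task's timers, in seconds (hypothetical structure)."""
    elapsed: float  # wall-clock time the task has been "running"
    cpu: float      # CPU time the task has actually consumed


def is_stalled(earlier: Sample, later: Sample, min_wall: float = 600.0) -> bool:
    """Return True if wall-clock time advanced by at least min_wall seconds
    while CPU time advanced by less than one second."""
    wall_delta = later.elapsed - earlier.elapsed
    cpu_delta = later.cpu - earlier.cpu
    return wall_delta >= min_wall and cpu_delta < 1.0
```

With the figures from this report (CPU time stuck at 01:17:51 while elapsed time reached 25:00:05), any two readings taken ten or more minutes apart during the stall would be flagged.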

64-bit Vista SP2 with 8 GB; BOINC allowed to use 40%

11/3/2011 1:42:40 AM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/3/2011 1:42:40 AM | | log flags: file_xfer, sched_ops, task
11/3/2011 1:42:40 AM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/3/2011 1:42:40 AM | | Data directory: C:\ProgramData\BOINC
11/3/2011 1:42:40 AM | | Running under account Bobby
11/3/2011 1:42:40 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/3/2011 1:42:40 AM | | Processor: 6.00 MB cache
11/3/2011 1:42:40 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/3/2011 1:42:40 AM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/3/2011 1:42:40 AM | | Memory: 8.00 GB physical, 15.66 GB virtual
11/3/2011 1:42:40 AM | | Disk: 919.67 GB total, 555.16 GB free
11/3/2011 1:42:40 AM | | Local time is UTC -5 hours
11/3/2011 1:42:40 AM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)

Selected workunit length 12 hours.

Restarting BOINC lost all but 01:19:52 of the elapsed time.

I'll give the workunit one more chance to restart properly; if that isn't adequate, I'll put Rosetta@Home on No new tasks again until the next minirosetta version is ready.

I have not seen such a problem with the RALPH@Home 3.18 workunits (6 hour length selected), so I'll continue to run those.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71577 - Posted: 7 Nov 2011, 3:32:59 UTC - in response to Message 71576.  

Now finished, returned, and in Pending status.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71578 - Posted: 7 Nov 2011, 19:55:56 UTC

The same no-longer-using-CPU-time problem is also present in another workunit.

T0538_boinc_rosetta_cm_medal_ss_v2_cmiles_IGNORE_THE_REST_34758_10367
CPU time at last checkpoint 02:06:31
CPU time 02:07:46
Elapsed time 03:11:44
Fraction done 16.687%

BOINC Manager claims it is running, but Windows Task Manager says it is using no CPU time at all.

11/6/2011 6:23:11 PM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/6/2011 6:23:11 PM | | log flags: file_xfer, sched_ops, task
11/6/2011 6:23:11 PM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/6/2011 6:23:11 PM | | Data directory: C:\ProgramData\BOINC
11/6/2011 6:23:11 PM | | Running under account Bobby
11/6/2011 6:23:11 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/6/2011 6:23:11 PM | | Processor: 6.00 MB cache
11/6/2011 6:23:11 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/6/2011 6:23:11 PM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/6/2011 6:23:11 PM | | Memory: 8.00 GB physical, 15.80 GB virtual
11/6/2011 6:23:11 PM | | Disk: 919.67 GB total, 527.06 GB free
11/6/2011 6:23:11 PM | | Local time is UTC -6 hours
11/6/2011 6:23:11 PM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)

I'm about to restart BOINC to give that workunit another chance to restart properly, but I've already set No new tasks for Rosetta@home on that computer.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71579 - Posted: 7 Nov 2011, 20:04:16 UTC - in response to Message 71578.  

The restart made that workunit return quickly, with 99 decoys done; now in a pending state.

Could that mean that 3.17 has trouble doing something reasonable after it finishes 99 decoys? Some of the previous versions of minirosetta did.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71583 - Posted: 9 Nov 2011, 22:26:20 UTC

Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before.

Rosetta Mini 3.17
T0540_boinc_medal_split_medal_free_tex_IGNORE_THE_REST_34737_19403
shows as Running, but not using any CPU time at all
CPU time at last checkpoint 03:45:25
CPU time 03:49:40
Elapsed time 16:21:27
Estimated time remaining 24:00:10
Fraction done 31.719%
Working set size 518.83 MB

Selected workunit length 12 hours

11/9/2011 2:57:59 AM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/9/2011 2:57:59 AM | | log flags: file_xfer, sched_ops, task
11/9/2011 2:57:59 AM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/9/2011 2:57:59 AM | | Data directory: C:\ProgramData\BOINC
11/9/2011 2:57:59 AM | | Running under account Bobby
11/9/2011 2:57:59 AM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz [Family 6 Model 42 Stepping 7]
11/9/2011 2:57:59 AM | | Processor: 256.00 KB cache
11/9/2011 2:57:59 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 syscall nx lm vmx smx tm2 popcnt aes pbe
11/9/2011 2:57:59 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
11/9/2011 2:57:59 AM | | Memory: 15.98 GB physical, 31.96 GB virtual
11/9/2011 2:57:59 AM | | Disk: 136.03 GB total, 70.70 GB free
11/9/2011 2:57:59 AM | | Local time is UTC -6 hours
11/9/2011 2:57:59 AM | | NVIDIA GPU 0: GeForce GT 440 (driver version 28562, CUDA version 4010, compute capability 2.1, 1536MB, 228 GFLOPS peak)

64-bit Windows 7 Professional SP1
16 GB memory
Another HP computer - h8-1070t

Uncommon enough on this computer that I'll restart BOINC to give that workunit a chance to finish properly.

Around a dozen other BOINC projects are enabled, as on the computer where I saw this problem before.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 71584 - Posted: 10 Nov 2011, 1:27:35 UTC - in response to Message 71583.  

> Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before.


I was hoping this problem would go away with the recent update, but apparently not. It seems to happen only on Windows 7, and only on tasks whose names begin with Txxx (xxx = 3-digit number).

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71585 - Posted: 10 Nov 2011, 2:00:18 UTC - in response to Message 71584.  

>> Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before.
>
> I was hoping this problem would go away with the recent update, but apparently not. It seems to happen only on Windows 7, and only on tasks whose names begin with Txxx (xxx = 3-digit number).


I've also seen it on Windows Vista. Task names usually begin with T0xxx (xxx = 3-digit number).

If you have access to the source code, look for a section that is used almost exclusively by that series of workunits and that runs soon after a checkpoint. That section appears to need some targeted debugging output enabled.
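
The T0xxx naming pattern mentioned above can be used to pick out the suspect tasks from a list of task names. A minimal sketch, assuming the pattern is exactly "T0", three digits, then an underscore; the regular expression and the `is_t0_task` helper are illustrative assumptions, and the names are ones reported in this thread:

```python
import re

# Assumed pattern: "T0" followed by three digits and an underscore.
T0_TASK = re.compile(r"^T0\d{3}_")


def is_t0_task(name: str) -> bool:
    """Return True if the task name starts with the T0xxx_ prefix."""
    return T0_TASK.match(name) is not None


names = [
    "T0552_boinc_alignment_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_34966_22",
    "T0538_boinc_rosetta_cm_medal_ss_v2_cmiles_IGNORE_THE_REST_34758_10367",
    "T0540_boinc_medal_split_medal_free_tex_IGNORE_THE_REST_34737_19403",
    "rlx_ds_decoys_1vie_SAVE_ALL_OUT_35479_404_0",
]

# Keep only the names that match the suspect series.
suspect = [name for name in names if is_t0_task(name)]
```

Here `suspect` holds the three T0552, T0538, and T0540 names; the `rlx_ds_decoys` task is filtered out.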

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 71589 - Posted: 10 Nov 2011, 22:39:16 UTC - in response to Message 71585.  



> I've also seen it on Windows Vista. Task names usually begin with T0xxx (xxx = 3-digit number).
>
> If you have access to the source code, look for a section that is used almost exclusively by that series of workunits and that runs soon after a checkpoint. That section appears to need some targeted debugging output enabled.


You're right: it's tasks starting with T0xxx.

I'm not a dev and don't have access to the source code. The fact that it's not easily reproducible and not necessarily a problem with R@h but perhaps with BOINC must make it hard to track down.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71590 - Posted: 11 Nov 2011, 0:07:36 UTC - in response to Message 71589.  



>> I've also seen it on Windows Vista. Task names usually begin with T0xxx (xxx = 3-digit number).
>>
>> If you have access to the source code, look for a section that is used almost exclusively by that series of workunits and that runs soon after a checkpoint. That section appears to need some targeted debugging output enabled.
>
> You're right: it's tasks starting with T0xxx.
>
> I'm not a dev and don't have access to the source code. The fact that it's not easily reproducible and not necessarily a problem with R@h but perhaps with BOINC must make it hard to track down.


I suspect that it's specific to R@h, since I don't see it on any of the dozen or more other BOINC projects those two computers are connected to. Also, when it occurs, CPU time use stops within a few minutes of the last checkpoint of the affected R@h workunit.

I suppose it could be some section of BOINC that none of the other projects happen to use, though.
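
The "within a few minutes of the last checkpoint" observation can be checked against the three stalled workunits reported in this thread. A minimal sketch; the `to_seconds` helper and the short task labels are illustrative, while the times are copied from the reports above:

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (CPU time at last checkpoint, CPU time when the task stopped using CPU)
reports = {
    "T0552": ("01:17:50", "01:17:51"),
    "T0538": ("02:06:31", "02:07:46"),
    "T0540": ("03:45:25", "03:49:40"),
}

# CPU seconds consumed between the last checkpoint and the stall.
gaps = {
    task: to_seconds(cpu) - to_seconds(checkpoint)
    for task, (checkpoint, cpu) in reports.items()
}
```

The gaps come out to 1, 75, and 255 seconds of CPU time, so all three stalls began less than five minutes after a checkpoint.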

Gary
Joined: 28 Oct 11
Posts: 1
Credit: 35,145
RAC: 0
Message 71608 - Posted: 15 Nov 2011, 6:49:00 UTC

Hello,

I am very new to these forums, so I apologize if this is the wrong place to post this, but I have been seeing errors on every single one of the projects that my computer has completed. When I look at my tasks, nearly all of them that have finished show:

Server state: Over
Outcome: Client error
Client state: New

And then it will show me some claimed credit, but no granted credit. Once I click on it, however, I see that in some cases I WAS granted credit, but none of this seems to be reflected in any of my statistics. I am also doing other projects, which are all working fine. The reason I posted here is that when I searched for some key phrases from my error, I noticed that others were seeing the following:

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\src\core\pose\symmetry\util.cc line: 740
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

This error isn't present in every case, but it is in a fair few. Anyone know what's wrong? I've got two computers working on projects, but I can't tell if both are having this problem yet, since the other one hasn't finished anything yet. Thanks!

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 71616 - Posted: 16 Nov 2011, 6:25:43 UTC - in response to Message 71608.  

> Hello,
>
> I am very new to these forums, so I apologize if this is the wrong place to post this, but I have been seeing errors on every single one of the projects that my computer has completed. When I look at my tasks, nearly all of them that have finished show:
>
> Server state: Over
> Outcome: Client error
> Client state: New
>
> And then it will show me some claimed credit, but no granted credit. Once I click on it, however, I see that in some cases I WAS granted credit, but none of this seems to be reflected in any of my statistics. I am also doing other projects, which are all working fine. The reason I posted here is that when I searched for some key phrases from my error, I noticed that others were seeing the following:
>
> ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
> ERROR:: Exit from: ..\..\src\core\pose\symmetry\util.cc line: 740
> BOINC:: Error reading and gzipping output datafile: default.out
> called boinc_finish
>
> This error isn't present in every case, but it is in a fair few. Anyone know what's wrong? I've got two computers working on projects, but I can't tell if both are having this problem yet, since the other one hasn't finished anything yet. Thanks!


A partial answer: Credit granted manually by the project shows up in only one of the two places. Automatically granted credit shows up in both.

Also, the following line is a normal result after any error has prevented generation of one of the output files:

BOINC:: Error reading and gzipping output datafile: default.out

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71627 - Posted: 21 Nov 2011, 3:28:46 UTC

Hi.

I got a validate error on this after 28 minutes. Is it because of the task or the validator?

rlx_ds_decoys_1vie_SAVE_ALL_OUT_35479_404_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=424127535

Validate error__Done__CPU time (sec) 1,673.61

# cpu_run_time_pref: 14400
======================================================
DONE :: 99 starting structures 1672.76 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
BOINC :: WS_max 0



Rocco Moretti
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 71628 - Posted: 21 Nov 2011, 19:02:24 UTC - in response to Message 71627.  

> Hi.
>
> I got a validate error on this after 28 minutes. Is it because of the task or the validator?
>
> ======================================================
> DONE :: 99 starting structures 1672.76 cpu seconds
> This process generated 99 decoys from 99 attempts
> ======================================================


The "99 decoys" bit is a hint at what's likely causing the validator error. There's currently a check on the number of decoys being returned, the thought being that having too many decoys being returned in a short time period is likely indicative of an error. I believe the current limit is something like 100 structures per hour.

Usually we try to arrange things so that decoys are produced at a more reasonable pace, but occasionally a quickly processed structure gets through and the faster computers run up against the number-of-decoys limit. (So get a slower computer and you won't have this issue ;) There's been some informal discussion about raising the limit, but the thought is that a large number of quickly produced results isn't the best use of resources: BOINC works most efficiently with computation-heavy, communication-light tasks, and rapidly produced decoys flip that around. The real solution is for us not to send out such jobs in the first place.
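
To put numbers on the explanation above: the reported task produced 99 decoys in 1672.76 CPU seconds. A minimal sketch of the rate calculation, assuming the limit really is 100 structures per hour (stated above only as "something like") and assuming the rate is measured against CPU time; the constant and the function name are illustrative, not taken from the validator:

```python
# Assumed value; the post above gives it only approximately.
ASSUMED_LIMIT_PER_HOUR = 100.0


def decoys_per_hour(decoys: int, cpu_seconds: float) -> float:
    """Scale a decoy count over cpu_seconds to an hourly rate."""
    return decoys * 3600.0 / cpu_seconds


# Figures from the task's output: 99 decoys in 1672.76 CPU seconds.
rate = decoys_per_hour(99, 1672.76)
over_limit = rate > ASSUMED_LIMIT_PER_HOUR
```

The rate works out to roughly 213 decoys per hour, a little over twice the assumed limit, which is consistent with the validate error.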

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71629 - Posted: 21 Nov 2011, 20:47:03 UTC

Hi Rocco.

Get a slower computer, you say? Let me think a minute... No, I don't think so. :p

So it is your fault. ;) Thanks for letting us know.

No problem.




P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71631 - Posted: 22 Nov 2011, 21:06:27 UTC

Hi.

This one errored out after 29 minutes.

Aug20_needle_11start_h2tail_latA_left_SAVE_ALL_OUT__35349_110923

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=424411874

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage1 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_1 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_3 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_4 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_5 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_6 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_7 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_8 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_9 ... success!

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: src/core/pose/symmetry/util.cc line: 740
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>
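
Judging only from the expression printed in the log, the failed check requires the determinant of a matrix named `coordsys_rot` to be within 1e-6 of 1, which is the condition a proper rotation matrix satisfies. A minimal sketch of that kind of check, assuming a 3x3 matrix; the `det3` and `has_unit_determinant` helpers are illustrative and are not taken from the Rosetta source:

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows of three numbers."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def has_unit_determinant(m, tol=1e-6):
    """Mirror the logged condition: abs(det - 1.0) < tol."""
    return abs(det3(m) - 1.0) < tol


# A 90-degree rotation about the z axis passes the check.
rotation_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# A mirror through the xy plane has determinant -1 and fails it.
reflection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
```

A matrix that includes a reflection or any scaling would fail such a check; whether that is what happened in this workunit cannot be determined from the log alone.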


©2024 University of Washington
https://www.bakerlab.org