Message boards : Number crunching : MiniRosetta 3.17 Problems.
Author | Message |
---|---|
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
Looks like the 3.14 problem, where workunits stop using any CPU time at all but don't tell BOINC that they're finished, isn't fully fixed. It does appear to be less frequent, though.
Rosetta Mini 3.17
T0552_boinc_alignment_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_34966_22
CPU time at last checkpoint 01:17:50
CPU time 01:17:51
Elapsed time 25:00:05
Estimated time remaining 60:12:19
Fraction done 10.594%
Max RAM usage 95 MB
Working set size 546.09 MB
No longer using any CPU time, but still claims to be running.
64-bit Vista SP2 with 8 GB; BOINC allowed to use 40%
11/3/2011 1:42:40 AM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/3/2011 1:42:40 AM | | log flags: file_xfer, sched_ops, task
11/3/2011 1:42:40 AM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/3/2011 1:42:40 AM | | Data directory: C:\ProgramData\BOINC
11/3/2011 1:42:40 AM | | Running under account Bobby
11/3/2011 1:42:40 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/3/2011 1:42:40 AM | | Processor: 6.00 MB cache
11/3/2011 1:42:40 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/3/2011 1:42:40 AM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/3/2011 1:42:40 AM | | Memory: 8.00 GB physical, 15.66 GB virtual
11/3/2011 1:42:40 AM | | Disk: 919.67 GB total, 555.16 GB free
11/3/2011 1:42:40 AM | | Local time is UTC -5 hours
11/3/2011 1:42:40 AM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)
Selected workunit length 12 hours. Restarting BOINC lost all but 01:19:52 of the elapsed time.
I'll give the workunit one more chance to restart properly; if that isn't adequate, I'll put Rosetta@Home on No new tasks again until the next minirosetta version is ready. I have not seen such a problem with the RALPH@Home 3.18 workunits (6 hour length selected), so I'll continue to run those. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
Now finished, returned, and in Pending status. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
The same no-longer-using-CPU-time problem is also present in another workunit.
T0538_boinc_rosetta_cm_medal_ss_v2_cmiles_IGNORE_THE_REST_34758_10367
CPU time at last checkpoint 02:06:31
CPU time 02:07:46
Elapsed time 03:11:44
Fraction done 16.687%
BOINC Manager claims it is running, but Windows Task Manager says it is using no CPU time at all.
11/6/2011 6:23:11 PM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/6/2011 6:23:11 PM | | log flags: file_xfer, sched_ops, task
11/6/2011 6:23:11 PM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/6/2011 6:23:11 PM | | Data directory: C:\ProgramData\BOINC
11/6/2011 6:23:11 PM | | Running under account Bobby
11/6/2011 6:23:11 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/6/2011 6:23:11 PM | | Processor: 6.00 MB cache
11/6/2011 6:23:11 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/6/2011 6:23:11 PM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/6/2011 6:23:11 PM | | Memory: 8.00 GB physical, 15.80 GB virtual
11/6/2011 6:23:11 PM | | Disk: 919.67 GB total, 527.06 GB free
11/6/2011 6:23:11 PM | | Local time is UTC -6 hours
11/6/2011 6:23:11 PM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)
I'm about to restart BOINC to give that workunit another chance to restart properly, but I've already set No new tasks for Rosetta@home on that computer. |
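For anyone who wants to catch this symptom without watching Task Manager, here is a minimal sketch of the check being done by eye above: a task that still reports "Running" but whose CPU time has not advanced between two observations. The `TaskSample` type, its field names, and the thresholds are all hypothetical; how the samples are collected from BOINC is left out because the thread does not describe it.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One observation of a task. All field names are hypothetical."""
    wall_clock_s: float  # seconds since some fixed reference when sampled
    cpu_time_s: float    # cumulative CPU time the task reported
    state: str           # e.g. "Running", as shown by the manager


def appears_stalled(earlier: TaskSample, later: TaskSample,
                    min_interval_s: float = 600.0,
                    min_cpu_gain_s: float = 1.0) -> bool:
    """Flag a task that claims to run but gained almost no CPU time."""
    interval = later.wall_clock_s - earlier.wall_clock_s
    # Too short an interval, or a task that is not running, proves nothing.
    if later.state != "Running" or interval < min_interval_s:
        return False
    return (later.cpu_time_s - earlier.cpu_time_s) < min_cpu_gain_s


# Hypothetical samples taken 30 minutes apart; 7666 s is 02:07:46.
before = TaskSample(0.0, 7666.0, "Running")
after = TaskSample(1800.0, 7666.0, "Running")
print(appears_stalled(before, after))
```

Two samples are used rather than a single CPU-versus-elapsed comparison because a CPU usage limit (such as the 40% setting mentioned earlier) legitimately makes elapsed time run ahead of CPU time.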
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
The restart made that workunit return quickly, with 99 decoys done; now in a pending state. Could that mean that 3.17 has trouble doing something reasonable after it finishes 99 decoys? Some of the previous versions of minirosetta did. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before.
Rosetta Mini 3.17
T0540_boinc_medal_split_medal_free_tex_IGNORE_THE_REST_34737_19403
Shows as Running, but not using any CPU time at all.
CPU time at last checkpoint 03:45:25
CPU time 03:49:40
Elapsed time 16:21:27
Estimated time remaining 24:00:10
Fraction done 31.719%
Working set size 518.83 MB
Selected workunit length 12 hours
11/9/2011 2:57:59 AM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/9/2011 2:57:59 AM | | log flags: file_xfer, sched_ops, task
11/9/2011 2:57:59 AM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/9/2011 2:57:59 AM | | Data directory: C:\ProgramData\BOINC
11/9/2011 2:57:59 AM | | Running under account Bobby
11/9/2011 2:57:59 AM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz [Family 6 Model 42 Stepping 7]
11/9/2011 2:57:59 AM | | Processor: 256.00 KB cache
11/9/2011 2:57:59 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 syscall nx lm vmx smx tm2 popcnt aes pbe
11/9/2011 2:57:59 AM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
11/9/2011 2:57:59 AM | | Memory: 15.98 GB physical, 31.96 GB virtual
11/9/2011 2:57:59 AM | | Disk: 136.03 GB total, 70.70 GB free
11/9/2011 2:57:59 AM | | Local time is UTC -6 hours
11/9/2011 2:57:59 AM | | NVIDIA GPU 0: GeForce GT 440 (driver version 28562, CUDA version 4010, compute capability 2.1, 1536MB, 228 GFLOPS peak)
64-bit Windows 7 Professional SP1, 16 GB memory
Another HP computer - h8-1070t
Uncommon enough on this computer that I'll restart BOINC to give that workunit a chance to finish properly. Around a dozen more BOINC projects enabled, like the computer where I saw this problem before. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before. I was hoping this problem would go away with the recent update but apparently not. It seems to happen on W7 and on tasks whose name begins with Txxx (xxx = 3 digit number) only. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
Another workunit gone into no-CPU time. This time, on a computer where I haven't seen this before. I've also seen it on Windows Vista. Task names usually begin with T0xxx (xxx = 3 digit number). If you have access to the source code, look for a section used for little other than that series of workunits, and rather soon after a checkpoint. It appears to need some debugging specific to that section enabled. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
You're right: it's tasks starting with T0xxx. I'm not a dev and don't have access to the source code. The fact that it's not easily reproducible and not necessarily a problem with R@h but perhaps with BOINC must make it hard to track down. |
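As a rough illustration of the "T0xxx" pattern discussed above, the sketch below filters task names with a regular expression. The pattern is only an assumption drawn from the names posted in this thread (`T0` followed by three digits and an underscore); it is not an official naming rule.

```python
import re

# Assumed pattern: "T0", three digits, then an underscore (e.g. T0552_...).
T0_TASK = re.compile(r"^T0\d{3}_")


def is_t0_task(task_name: str) -> bool:
    """Return True when the task name starts with the T0xxx prefix."""
    return T0_TASK.match(task_name) is not None


# Task names copied from posts in this thread.
names = [
    "T0552_boinc_alignment_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_34966_22",
    "T0538_boinc_rosetta_cm_medal_ss_v2_cmiles_IGNORE_THE_REST_34758_10367",
    "T0540_boinc_medal_split_medal_free_tex_IGNORE_THE_REST_34737_19403",
    "rlx_ds_decoys_1vie_SAVE_ALL_OUT_35479_404_0",
]
t0_names = [name for name in names if is_t0_task(name)]
print(len(t0_names), "of", len(names), "names match")
```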
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
I suspect that it's specific to R@h, since I don't see it on any of the dozen or more other BOINC projects those two computers are connected to. Also, when it occurs, CPU time use ends within a few minutes of the last checkpoint of the R@h workunit it affects. I suppose it could be some section of BOINC that none of the other projects happen to use, though. |
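The "within a few minutes of the last checkpoint" observation can be checked against the figures posted earlier in the thread. This is a small worked calculation, not anything taken from BOINC itself; the helper name `parse_hms` is made up for the example.

```python
def parse_hms(text: str) -> int:
    """Convert 'HH:MM:SS' (hours may exceed 24) into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (task prefix, CPU time at last checkpoint, CPU time) as posted in the thread.
reports = [
    ("T0552", "01:17:50", "01:17:51"),
    ("T0538", "02:06:31", "02:07:46"),
    ("T0540", "03:45:25", "03:49:40"),
]

# CPU seconds used after the last checkpoint, before the task went idle.
gaps = {name: parse_hms(cpu) - parse_hms(chk) for name, chk, cpu in reports}
for name, gap in gaps.items():
    print(f"{name}: {gap} s of CPU time after the last checkpoint")
```

All three stalls happened less than five minutes of CPU time after a checkpoint, which is consistent with the suspicion above but does not by itself show where the fault lies.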
Gary Send message Joined: 28 Oct 11 Posts: 1 Credit: 35,145 RAC: 0 |
Hello, I am very new to these forums, so I apologize if this is the wrong place to post this, but I have been seeing errors on every single one of the projects that my computer has completed. When I look at my tasks, nearly all of them that have finished show:
Server state: Over
Outcome: Client error
Client state: New
And then it will show me some claimed credit, but no granted credit. Once I click on it, however, I see that in some cases I WAS granted credit, but none of this seems to be reflected in any of my statistics. I am also doing other projects, which are all working fine. The reason I posted here is that when I searched for some key phrases from my error, I noticed that others were seeing the following:
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 740
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
This error isn't present in every case, but it is in a fair few. Anyone know what's wrong? I've got two computers working on projects, but I can't tell if all are having this problem yet, since the other one hasn't finished anything yet. Thanks! |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,150 |
Hello, A partial answer: Manually granted credit (by the project) does not show up in both places, only one of them. Automatically granted credit shows up in both places. Also, the following line is a normal result after any error has prevented generation of one of the output files: BOINC:: Error reading and gzipping output datafile: default.out |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I got a validate error on this after 28 minutes. Is it because of the task or the validator?
rlx_ds_decoys_1vie_SAVE_ALL_OUT_35479_404_0
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=424127535
Validate error__Done__CPU time (sec) 1,673.61
# cpu_run_time_pref: 14400
======================================================
DONE :: 99 starting structures 1672.76 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
BOINC :: WS_max 0 |
Rocco Moretti Send message Joined: 18 May 10 Posts: 66 Credit: 585,745 RAC: 0 |
Hi. The "99 decoys" bit is a hint at what's likely causing the validator error. There's currently a check on the number of decoys being returned, the thought being that too many decoys returned in a short time period likely indicates an error. I believe the current limit is something like 100 structures per hour. Usually we try to arrange things so that decoys are produced at a more reasonable pace, but occasionally a quickly processed structure gets through and the faster computers run up against the decoy-count limit. (So get a slower computer and you won't have this issue ;) ) There's been some informal discussion about raising the limit, but the thought is that a large number of quickly produced results isn't the best use of resources, as BOINC works most efficiently with computation-heavy/communication-light tasks, and rapidly produced decoys flip that around. The real solution is for us not to send out such jobs in the first place. |
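To make the arithmetic concrete, here is a back-of-the-envelope check using the numbers from the failed task above (99 decoys in 1672.76 CPU seconds). The limit of 100 per hour is only the "something like" figure quoted in this post, and the function is an illustration, not the validator's actual code.

```python
APPROX_LIMIT_PER_HOUR = 100.0  # assumed from the post; the real value may differ


def decoys_per_hour(decoys: int, cpu_seconds: float) -> float:
    """Average decoy production rate, scaled to one hour of CPU time."""
    return decoys * 3600.0 / cpu_seconds


rate = decoys_per_hour(99, 1672.76)
over_limit = rate > APPROX_LIMIT_PER_HOUR
print(f"{rate:.1f} decoys/hour, over the assumed limit: {over_limit}")
```

At roughly twice the assumed limit, the task would trip such a check even though every decoy was computed successfully.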
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi Rocco. Get a slower computer, you say? Let me think a minute... no, I don't think so. :p So it is your fault, ;) thanks for letting us know. No problem. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. This one errored out after 29 minutes.
Aug20_needle_11start_h2tail_latA_left_SAVE_ALL_OUT__35349_110923
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=424411874
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage1 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_1 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_3 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_4 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_5 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_6 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_7 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_8 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_9 ... success!
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: src/core/pose/symmetry/util.cc line: 740
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
</stderr_txt> ]]> |
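For readers wondering what the failed assertion means: the message text says the determinant of a matrix named `coordsys_rot` must be within 1e-6 of 1.0, which is the usual test that a 3x3 matrix is a proper rotation (no scaling, no mirroring). The sketch below reproduces only that numerical condition; the Rosetta source is not shown in this thread, so the function names and the sample matrices are assumptions made for illustration.

```python
import math

Matrix3 = list[list[float]]


def det3(m: Matrix3) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def passes_rotation_check(m: Matrix3, tol: float = 1e-6) -> bool:
    """Same condition as the assertion text: |det - 1.0| < tol."""
    return abs(det3(m) - 1.0) < tol


angle = math.radians(30.0)
rotation_z = [[math.cos(angle), -math.sin(angle), 0.0],
              [math.sin(angle), math.cos(angle), 0.0],
              [0.0, 0.0, 1.0]]
mirror_x = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(passes_rotation_check(rotation_z))  # a proper rotation
print(passes_rotation_check(mirror_x))    # a reflection, determinant -1
```

A reflection or a slightly distorted coordinate frame fails the check, which is one way such an assertion can fire; the thread does not establish which case applied to these workunits.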
©2024 University of Washington
https://www.bakerlab.org