Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
Alec Rosa Send message Joined: 11 Nov 08 Posts: 18 Credit: 2,635 RAC: 0 |
read this. It was posted in another new thread by peter leman. Thank you! It worked out after rebooting the computer. The 'slots' disappeared and, with them, the lock file. Now to see if the error won't happen again after I resume Rosetta's tasks. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
read this. It was posted in another new thread by peter leman. glad to help, but also thanks to peter leman for creating the original thread with the lockfile topic. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
|
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
4 lockfiles <--- see discussion below Yes sorry, I didn't see that until later. Strange that all the errors appear to be on the loopbuild models and yet some of these are not affected. In the past few days I have had BOINC stop completely, which in the past has meant that a model has jammed up the works. Restart BOINC and everything starts working again with no apparent model failure. Strange. |
Rob Lilley Send message Joined: 11 Jan 06 Posts: 11 Credit: 133,120 RAC: 0 |
This Minirosetta v1.40 Work Unit is another one that won't suspend and continues running when pre-empted by a QMC WU. Didn't do that, just suspended all other projects then restarted the computer. Turns out it was probably a bad WU anyway, as it worked for a while, then came to a dead stop and wouldn't restart. After I tried restarting BOINC, it errored out and came up with the lockfile error others are experiencing, as you will see here. Ah well, some other poor sap will get that nasty WU now :( |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I think the "rule" for loopbuild is to not do anything to it or it crashes and burns badly. |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
rochester.. have a look at this morning's discussion down below on lockfile issues. It will save you more errors and loss of credit. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=189333988 |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I have another task that doesn't want to stop when preempted; its time and percentage are still ticking up, and it is currently running. 1lis__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1lis_-_4768_2176_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=191579804 pete. |
Stacey Baird Send message Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
Probable Problem 11/26/2008 9:15:00 PM|rosetta@home|Restarting task 1acf__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1acf_-_4768_1359_0 using minirosetta version 140 The above is stuck on 00.9:59.00, nine minutes 59 seconds remaining. CPU time of more than five hours increases but time remaining never decreases. Should I abort? Hmmm, as I read farther below, others are having the same problem. Good Luck |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 23 |
210279108 NAN in HBonding. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
FalconFly Send message Joined: 11 Jan 08 Posts: 23 Credit: 2,163,056 RAC: 0 |
I'm seeing a significantly above-average number of failures, which result in the shutdown/crash of BOINC (MiniRosetta 1.40). It happens across all my Linux Systems with no determinable pattern (64bit BOINC V5.10.45) and naturally results in loss of computing power (need to restart BOINC or the System for ease of purpose). Otherwise, I repeatedly see above-average numbers of WorkUnits stuck at a certain percentage, with the MiniRosetta task either failed or using 0% CPU power, effectively blocking a CPU core each. That also requires a BOINC restart to kick the affected WorkUnits into action again. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I'm seeing a significantly above average failures, which result in the shutdown/crash of BOINC (MiniRosetta 1.40). for the team to know what is going on, please post your affected work units links in your next message. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2127 Credit: 41,266,340 RAC: 8,573 |
read this. It was posted in another new thread by peter leman. Thanks for highlighting Peter's message on this subject, Greg. I've closed all apps, ended the MiniRosetta processes, deleted the files and am about to do a re-boot. Fingers crossed. I promise to report back soon. |
FalconFly Send message Joined: 11 Jan 08 Posts: 23 Credit: 2,163,056 RAC: 0 |
for the team to know what is going on, please post your affected work units links in your next message. This is going to be a tedious task, as the WorkUnits (most of them) complete normally after the deadlock is solved. And after BOINC has crashed, I have no way of telling which WorkUnit may have caused it, since I'm looking at up to 8 WorkUnits per Host, which will all restart normally when re-launching BOINC. For now I'm afraid I'm best off with just solving the deadlocks; I had to do that ~8 times today already. (The only real solution I'd see is to run BOINC in debug mode to find out whether it is BOINC crashing or the MiniRosetta Client failing, which I'm very hesitant to do on 24 active production Systems running 24/7 at full speed - sounds like loads of work :p ) Anyway, I haven't seen any such behaviour on my 32bit Win32 Systems so far; only my Linux Systems seem randomly affected. -- edit -- Oh, forgot: How does Rosetta react to undervolting of CPUs? Most of my Systems run with reduced Vcore, tested stable with Prime95 plus a small safety buffer, and have 100% validation on other Projects (Einstein, MalariaControl, SETI, LHC). I'm very careful before I blame anything on a Project Client when I'm not running hardware 100% to its specifications. |
Alec Rosa Send message Joined: 11 Nov 08 Posts: 18 Credit: 2,635 RAC: 0 |
read this. It was posted in another new thread by peter leman. Now the update: The boot was no more than a (short-lived) temporary solution. It all happened again. I believe the error occurs when Rosetta tasks are paused (for the BOINC client to switch to other projects) and, when they start again, it all goes to crap (and this is the technical term). These were the tasks. I let them be processed till the end: https://boinc.bakerlab.org/rosetta/result.php?resultid=209770604 https://boinc.bakerlab.org/rosetta/result.php?resultid=209817742 More came by; I aborted them when they started the usual (afore transcribed) 'you may need to reset the project'. The Rosetta project is now suspended again until a solution to this is 'Revealed' to me. |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=210070351 https://boinc.bakerlab.org/rosetta/result.php?resultid=210070348 https://boinc.bakerlab.org/rosetta/result.php?resultid=209966564 https://boinc.bakerlab.org/rosetta/result.php?resultid=208462198 https://boinc.bakerlab.org/rosetta/result.php?resultid=209224858 https://boinc.bakerlab.org/rosetta/result.php?resultid=209224828 |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2127 Credit: 41,266,340 RAC: 8,573 |
read this. It was posted in another new thread by peter leman. Sorry, no good whatsoever - possibly worse. 1 success, 6 failures. Of the 4 that errored out before I aborted them: 210309372 210290406 Can't acquire lockfile - exiting Outcome Client error Client state Compute error Exit status -226 (0xffffff1e) <core_client_version>6.2.19</core_client_version> <![CDATA[ <message>too many exit(0)s</message> And Outcome Client error Client state Compute error Exit status 1 (0x1) <core_client_version>6.2.19</core_client_version> <![CDATA[ <message>Incorrect function. (0x1) - exit code 1 (0x1)</message> 210317441 <stderr_txt> # cpu_run_time_pref: 7200 recovering checkpoint of tag S_1VYHA_5_00000001 with id abrelax_rg_state Loops::add_loop error -- overlapping loop regions existing loop begin/end: 92/124 new loop begin/end: 124/191 ERROR:: Exit from: ....srcprotocolsloopsLoopClass.cc line: 233 called boinc_finish </stderr_txt> 210318343 <stderr_txt> recovering checkpoint of tag S_1BE9A_3_00000001 with id abrelax_rg_state Loops::add_loop error -- overlapping loop regions existing loop begin/end: 1/20 new loop begin/end: 20/31 ERROR:: Exit from: ....srcprotocolsloopsLoopClass.cc line: 233 called boinc_finish </stderr_txt> Not sure what those last two were about tbh, but they fell over quick enough. Any more ideas, anyone? |
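
For anyone puzzling over the last two stderr dumps and the -226 exit status in the post above, here is a rough Python sketch of the arithmetic involved. It is not Rosetta's code: the real check in LoopClass.cc is not shown in this thread, and the sketch assumes that a loop's begin and end residues are both inclusive, which would explain why 92/124 collides with 124/191. The function names `loops_overlap` and `to_signed32` are made up for illustration.

```python
def loops_overlap(existing, new):
    """Return True if two (begin, end) residue ranges share a residue.

    Assumption: both end points are inclusive. This is a guess based on
    the error text, not something confirmed by the Rosetta source.
    """
    (b1, e1), (b2, e2) = existing, new
    return b1 <= e2 and b2 <= e1


def to_signed32(value):
    """Interpret an unsigned 32-bit value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# Loop pairs quoted in the two stderr dumps.
print(loops_overlap((92, 124), (124, 191)))  # residue 124 is in both
print(loops_overlap((1, 20), (20, 31)))      # residue 20 is in both
# Hypothetical pair that would not collide under the same assumption.
print(loops_overlap((1, 20), (21, 31)))

# The exit status reported as "-226 (0xffffff1e)".
print(to_signed32(0xFFFFFF1E))
```

If the inclusive-end assumption holds, both failing tasks restored a checkpoint whose new loop starts on the very residue where the existing loop ends. The last line only shows that -226 and 0xffffff1e are the same 32-bit value written two ways.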
©2024 University of Washington
https://www.bakerlab.org