Message boards : Number crunching : Common Denominator?: compute errors and zero cpu usage
Author | Message |
---|---|
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I awoke this morning to a prompt saying "boinc needs to connect to the internet". Strangely, I'm on an ADSL line, so the internet should always be available. I have had this several times since moving to ADSL years ago and know that I usually just need to reset the modem. What this means is that my internet connection went down during the night. I think this is the common denominator to what I describe next and to many of the problems sparsely reported at Rosetta. I noticed two things before resetting the modem. First, many of my hosts had consecutive "computation errors" (listed below by host). Second, two of the 5 machines running Linux showed ZERO cpu usage although the task page showed wus as "running". Running killall -9 boinc and restarting boinc fixed this (after resetting the modem). I have seen many reports of the "zero cpu use" bug and have even seen it once myself. It's tough to track down freakish occurrences. This report mainly applies to Linux users, as my one Windows machine seemed to work flawlessly last night. The "zero cpu usage" bug might be affecting Windows users but I'm not sure of that. I wonder what process is calling for the internet, not getting it, and then causing "computation errors". Or perhaps, once that call fails, it yields computation errors on successive wus until the internet returns. This has to have something to do with the loss of communications. I don't think it is solely a 5.91 issue, as I have seen it before with earlier versions of the Rosetta app and there have been reports of this by others prior to 5.91. This may well be a "Boinc" problem as well. The reason I'm certain this is an app and/or Boinc problem is that 5.91 ran well up until last night, failed last night, and has continued to run well since I reset the modem. 
For my AMD64 3700 (hostid=692481) these are the computation errors: resultid=128342834, resultid=128342219. For my AMD64 X2 4800 (hostid=692483) this is the computation error: resultid=128614143. For my AMD64 X2 6000 (hostid=699377) these are the computation errors: resultid=128626741, resultid=128632201, resultid=128647762, resultid=128660399. For my AMD64 X2 5200 (hostid=586640) these are the computation errors: resultid=128344775, resultid=128344731. The AMD64 2800 didn't have any errors, and my Windows Mobile AMD64 3700 didn't either. However, this many consecutive errors, spread across this many hosts, is truly unique in my memory. If any of you out there also think your computation errors and/or zero cpu usage problems might have to do with internet failure, please keep an eye out for this and report cases as they come. thanks tony |
BitSpit Send message Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
I remember reading about this a couple of years ago. This is a known BOINC problem. From what I remember, its net code uses blocking calls. The problem is that those blocking calls prevent BOINC from communicating with any running app. If I'm remembering correctly, the BOINC devs have said that it's broken by design and that they don't want to put in the effort required to fix it. Edit: Here's a good explanation of the problem. |
Luuklag Send message Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
well, from what i read in your post, i think the problem is this: the rosetta app errored out, not boinc. that made the wu's fail. and when rosetta errors out, it is the rosetta app itself, not boinc (which has already been given the ok to use the inet), that wants to connect to the project servers and upload a bug report. so it asks for permission to use your inet. |
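For readers unfamiliar with the "blocking calls" BitSpit mentions, here is a minimal Python sketch of the general idea. It is not BOINC's code: the function names, the local "silent server", and the heartbeat counter are all hypothetical stand-ins. It only illustrates how a read on a dead connection can hold up a single-threaded loop, and how a socket timeout hands control back so the loop can keep talking to a running app.

```python
import socket

def make_silent_server():
    """Listen on a local port but never reply, imitating a link that went dead."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    srv.listen(8)
    return srv

def fetch_with_timeout(port, timeout_s):
    """Try to read from the silent server; give up after timeout_s seconds."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as conn:
            conn.recv(1024)      # without a timeout this read would never return
            return "got data"
    except socket.timeout:
        return "timed out"

def client_loop(port, rounds, timeout_s):
    """Hypothetical main loop: one network attempt, then one heartbeat, per round."""
    heartbeats = 0
    for _ in range(rounds):
        fetch_with_timeout(port, timeout_s)
        heartbeats += 1          # stands in for messaging the science app
    return heartbeats

if __name__ == "__main__":
    server = make_silent_server()
    port = server.getsockname()[1]
    print(fetch_with_timeout(port, 0.2))
    print(client_loop(port, rounds=3, timeout_s=0.2))
    server.close()
```

With a timeout, each round costs at most `timeout_s` seconds before the heartbeat goes out. Without one, the first `recv` would never return and the heartbeat would never run, which is the kind of stall described above. Whether this is what actually happens inside the BOINC client is an assumption based on the posts in this thread.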
©2024 University of Washington
https://www.bakerlab.org