Having problems with client and compute errors.

Message boards : Number crunching : Having problems with client and compute errors.

NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59062 - Posted: 27 Jan 2009, 17:03:33 UTC

I don't know what I need to do other than back it off to stock speed, but it runs Prime95 for hours with no problems. The machine I am having problems with is a Q9550 quad core running at 3.4 GHz right now. I had it running at 3.7 GHz on Sunday through the night, and on Monday I started getting the compute errors. When I got home last night I added the second hard drive I was waiting on, set up a RAID 0 array and reinstalled the OS, which is Windows 7.

Once the OS was installed I got BOINC up and running, and I had the CPU running at 3.4 GHz all night. I didn't have any problems until today, when for some reason it sent in a couple more compute error work units. Besides setting it to run at stock speeds, is there anything else that may be causing the problem? Could the RAID 0 have an effect on BOINC or Rosetta?

I don't understand how the system can bench fine on Prime95 but error all the time on Rosetta. Could my problem be the memory instead? I have the voltage on it set to auto, which is 1.8 V, but if I am not mistaken it is spec'd for 2.1 V. It is 2 x 2 GB OCZ Platinum.

Thanks in advance for any help.
Dagorath

Joined: 20 Apr 06
Posts: 32
Credit: 29,176
RAC: 0
Message 59064 - Posted: 27 Jan 2009, 18:58:37 UTC - in response to Message 59062.  

When you change two things at the same time (for example the OC and the RAID), you make it difficult to find the cause of the problem by process of elimination. Leave the RAID as it is and don't change ANYTHING else but the OC. Put the clocks and voltages back to stock. Ignore compute errors on any tasks that started crunching while it was OC'd, because they may have been tainted by the OC before they errored. Let it run on stock settings for several days, not just several hours. If you still get compute errors on BOINC tasks, leave it at stock speeds and tweak other things one at a time until it runs right. It could be the RAID, BOINC settings, a dirty power supply or 100 other things, but don't make the mistake of thinking that if it still gives errors at stock speeds then the problem can't be the OC. Once it has run right at stock speed for several weeks, try a conservative OC for at least a week. If that works, try a little more OC.

Prime95 is only 1 test and lots of people think it's not as rigorous/difficult a test as many BOINC projects are. Passing prime95 doesn't mean it will pass other tests.


BOINC FAQ Service
Official BOINC wiki
Installing BOINC on Linux
NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59065 - Posted: 27 Jan 2009, 19:01:45 UTC - in response to Message 59064.  

I agree with your point. I am going to set it back to stock speeds and see how things go. I copied part of one of the errored work units and will paste it here. Maybe someone can tell from it whether it was the CPU, the memory or something else.

Task ID 223951384
Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_93299_0
Workunit 204102693
Created 26 Jan 2009 23:11:36 UTC
Sent 26 Jan 2009 23:14:04 UTC
Received 27 Jan 2009 16:29:34 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 989975
Report deadline 5 Feb 2009 23:14:04 UTC
CPU time 6529.999
stderr out
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-27 8:45: 5:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
Starting work on structure: _00002


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7730B19B read attempt to address 0x8C5799DC

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 01/27/09 10:34:03
Install Directory :
Data Directory : C:\BOINC_DATA
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
LoadLibraryA( C:\BOINC_DATA\dbghelp.dll ): GetLastError = 126
Loaded Library : dbghelp.dll
LoadLibraryA( C:\BOINC_DATA\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\BOINC_DATA\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\BOINC_DATA\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\BOINC_DATA\slots\2;C:\BOINC_DATA\projects\boinc.bakerlab.org_rosetta;srv*C:\BOINC_DATA\projects\boinc.bakerlab.org_rosetta\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\BOINC_DATA\projects\boinc.bakerlab.org_rosetta\symbols*https://boinc.bakerlab.org/rosetta/symstore;srv*C:\BOINC_DATA\projects\boinc.bakerlab.org_rosetta\symbols*http://boinc.berkeley.edu/symstore


ModLoad: 00400000 00724000 C:\BOINC_DATA\projects\boinc.bakerlab.org_rosetta\minirosetta_1.54_windows_x86_64.exe (-nosymbols- Symbols Loaded)
Linked PDB Filename : D:boinc_buildminirosetta_windowsminiVisual StudioBoincReleaseminirosetta_1.54_windows_intelx86.pdb
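
For anyone who wants to compare several failed tasks, here is a minimal Python sketch that pulls the fields out of the "Reason:" line in the stderr output above. The regular expression and the `parse_exception` helper are assumptions made for illustration, based only on the one line shown in this post; other BOINC debugger versions may word the line differently.

```python
import re

# Line copied from the stderr output in this post.
line = ("Reason: Access Violation (0xc0000005) at address 0x7730B19B "
        "read attempt to address 0x8C5799DC")

# Hypothetical pattern, written to match the wording of the line above.
PATTERN = re.compile(
    r"Reason: (?P<reason>.+?) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"at address (?P<pc>0x[0-9a-fA-F]+) "
    r"(?P<op>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+)"
)


def parse_exception(text):
    """Return the exception fields as a dict, or None if no line matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "reason": match.group("reason"),
        "code": int(match.group("code"), 16),
        "instruction_address": int(match.group("pc"), 16),
        "operation": match.group("op"),
        "target_address": int(match.group("target"), 16),
    }


info = parse_exception(line)
print(info["reason"], info["operation"], hex(info["target_address"]))
```

If the target address differs from task to task, that at least fits the "random failures" picture rather than one repeatable bug, though it does not prove a hardware cause.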
Dagorath

Joined: 20 Apr 06
Posts: 32
Credit: 29,176
RAC: 0
Message 59066 - Posted: 27 Jan 2009, 19:50:29 UTC - in response to Message 59065.  

One clue is here:


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7730B19B read attempt to address 0x8C5799DC


There was an Access Violation which means the application tried to read an address that was not part of its assigned address space. Programs are not allowed to read addresses that are not in their assigned memory space.

Access Violations are often caused by pointers gone awry. Faulty memory, an OC'd or faulty CPU, or programming errors can all cause a pointer to go awry and point to an address outside of the program's assigned address space. The error report itself doesn't give enough info to favor any one of those possible causes over the others. Experience with OC, however, tells us that if you're OCing and getting Access Violations then the first thing you should do is drop back to stock settings.
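
As a side note, the negative exit status in the task report and the hex code in the exception record are the same 32-bit value, one printed as a signed integer and the other as unsigned hex. A small Python sketch of the conversion follows; the `to_unsigned32` helper and the one-entry lookup table are illustrative names, not part of BOINC.

```python
def to_unsigned32(exit_status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return exit_status & 0xFFFFFFFF


# Illustrative lookup with a single well-known Windows status code.
KNOWN_CODES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

code = to_unsigned32(-1073741819)
print(hex(code))                          # 0xc0000005
print(KNOWN_CODES.get(code, "unknown"))   # STATUS_ACCESS_VIOLATION
```

So "Exit status -1073741819" and "Access Violation (0xc0000005)" describe one event, not two separate errors.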


NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59067 - Posted: 27 Jan 2009, 19:57:17 UTC

I am going to drop it back as soon as I get home from work. I called my wife and had her shut the PC down for now. I have a feeling it is the memory: it didn't like even +1 on the FSB over stock 1066, and I still had it on auto voltage, which puts it at 1.8 V, and I am pretty sure it needs to be at 2.1 V.

I will drop it back, get it running steady again for a couple of weeks and slowly start upping it. I know it will do 4 GHz easily, but it is a matter of me finding the right settings. I just need to realize it is going to take some time.
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 59070 - Posted: 27 Jan 2009, 20:35:45 UTC - in response to Message 59067.  
Last modified: 27 Jan 2009, 20:37:41 UTC

What your machine can do and what Rosie likes are two different things.
I noticed with Einstein I could push my machine to the max, but Rosie didn't like that speed at all. I was OC'ing my machine pretty hard and ran into access violations within minutes of completion or halfway through the task. I dropped my speed back about 25 MHz and then slowly crept it back up to just below the crash point, and everything is fine now. So you can use higher than stock speed, but you need to find how high you can go while staying below the speed you are at now.
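
The creep-up approach can be sketched as a simple loop in Python. Everything here is hypothetical: the `is_stable` callback stands in for "crunched Rosetta tasks for days with no compute errors" (something you check by hand, not in code), and the clock numbers in the example are made up for illustration.

```python
def highest_stable_clock(stock_mhz, failing_mhz, step_mhz, is_stable):
    """Walk up from stock in fixed steps and return the last clock that passed.

    is_stable is a caller-supplied check (hypothetical); the walk stops at the
    first failure or just below the clock already known to fail.
    """
    best = stock_mhz
    candidate = stock_mhz + step_mhz
    while candidate < failing_mhz:
        if not is_stable(candidate):
            break
        best = candidate
        candidate += step_mhz
    return best


# Made-up example: pretend the machine errors out above 3550 MHz.
result = highest_stable_clock(
    stock_mhz=3000,
    failing_mhz=3700,
    step_mhz=25,
    is_stable=lambda mhz: mhz <= 3550,
)
print(result)  # 3550
```

Small steps take longer but change only one thing at a time, which is the same point made earlier in the thread.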
NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59072 - Posted: 27 Jan 2009, 22:00:24 UTC - in response to Message 59070.  

I think I was trying to go too far too quickly. I just built my two new folders last week and got the last of the parts yesterday to finish mine, even though it was already running. I have a lot to learn with the newer systems and their many voltage and OC options. Hopefully I can get her under control and running right. Thank you all for your help.
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 59075 - Posted: 27 Jan 2009, 23:06:15 UTC - in response to Message 59072.  

If you have any Asus boards, you should use the AI Suite and just use the clock speed adjustment; don't mess with voltages.
NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59077 - Posted: 28 Jan 2009, 0:13:46 UTC - in response to Message 59075.  

I have Gigabyte UD3P boards. I could use EasyTune to do it, but I don't like having a bunch of extra programs running. I will just take my time and learn to be patient, and only change one thing at a time so I can keep track of what messes things up.
NUTNDUN

Joined: 31 Dec 08
Posts: 11
Credit: 1,047,811
RAC: 0
Message 59227 - Posted: 1 Feb 2009, 22:54:41 UTC

I just wanted to update my thread. I got everything taken care of. I didn't have my voltages set correctly; now everything is good to go and it is currently running at 4.03 GHz.

Thanks for the help.
Gil

Joined: 10 Oct 06
Posts: 16
Credit: 30,279,035
RAC: 0
Message 59293 - Posted: 4 Feb 2009, 21:41:59 UTC

Hi guys,

I am getting computation errors regularly.
They all display the following message:

Computation for 'TASK ID' finished
Output file 'TASK ID' absent

Does anyone know what could be wrong?

(I have an OC'd QX9650 @ 400 FSB x 9.) I hope that's not the issue, though.

Thanks in advance for a helpful answer!

Gil
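
For reference, the clock arithmetic behind "400 FSB x 9" can be worked out in a few lines of Python. The 3000 MHz stock figure is an assumption taken from the QX9650's rated 3.0 GHz speed; it is not stated in the post above.

```python
def core_clock_mhz(fsb_mhz, multiplier):
    """Core clock is the front-side bus clock times the CPU multiplier."""
    return fsb_mhz * multiplier


STOCK_MHZ = 3000  # assumption: QX9650 rated speed of 3.0 GHz

overclocked = core_clock_mhz(400, 9)
percent_over_stock = 100 * (overclocked - STOCK_MHZ) / STOCK_MHZ

print(overclocked)          # 3600
print(percent_over_stock)   # 20.0
```

Under that assumption the machine is running about 20% over its rated speed, which is a large enough overclock to rule out first.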
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59299 - Posted: 4 Feb 2009, 23:01:00 UTC

Gil, the 1.54 version has really stabilized the running of Rosetta on most environments. See my advice on OC here. Prove to yourself that OC is not the problem.

If you have further problems, I'd suggest you post in the thread on the Number Crunching board for the Rosetta version that was running when the problem occurred.
Rosetta Moderator: Mod.Sense
Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59305 - Posted: 4 Feb 2009, 23:26:27 UTC

Gil,

When I looked at the tasks on your computer I saw about 6-8 with random failures at various addresses. This is a fairly low number as these things go ... except that the indication SEEMS to me to be probably related to the OC ... perhaps the memory is not quite making it ...

So, as Mod.Sense says, that is the first step. Turn off the OC back to stock and run a hundred off ... if they are still erroring out ... we go to step two ...



