Message boards : Number crunching : Rosetta overheating my CPU
Author | Message |
---|---|
Fubar the Benevolent Despot Joined: 1 Jul 13 Posts: 2 Credit: 1,020,491 RAC: 172 |
It appears I'm going to have to quit crunching on Rosetta. I've been having an issue with my CPU giving out overheating warnings for some time, ever since I downloaded the Core Temp program for a completely unrelated issue. (I can only imagine how long it's been having the issue before that, probably since day 1 on this build.) I've gone all Sherlock Holmes, checking every part even remotely connected to cooling: power supply, heat sink fans, etc. I added 3 more case fans, two for intake, one for exhaust. I even removed and reapplied thermal paste. Finally, by sheer luck, I had left task mangler open as I was preparing to do other things and saw a huge spike in power usage at the same time as the CPU cooler fan kicked into high gear. I looked at the Core Temp program - also open - and saw the temp spike up. The power usage was coming from Rosetta. I suspended the work and reset the min & max in Core Temp. 3 days later, it hasn't gone within 10 degrees of warning, and only then when BOINC was running Milky Way. Without that running, I'm down 30 degrees below the warning level at maximum. I'm going to keep checking through the weekend but the data seems pretty clear: Rosetta puts too much load on my system. An AMD Ryzen 7 3700X (Matisse) with 16 GB RAM if you care. Sorry, folks. My computer is too important for me to fry it running Rosetta. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1743 Credit: 18,534,891 RAC: 3,788 |
"Sorry, folks. My computer is too important for me to fry it running Rosetta." Given that there are hundreds of thousands of computer systems that are capable of running at 100% load for days & weeks (and months & years) on end without overheating, the other option would be to fix the problem with your CPU cooling. Even at Milkyway your system misses the deadline 75% of the time, so there's a good chance your CPU is thermally throttling even there, as 12 days is plenty of time to return work; here at Rosetta it's 3 days. "I even removed and reapplied thermal paste." What were the CPU temperatures before & after re-doing the CPU heatsink, at idle & at full load? (I'd suggest Cinebench 2024, using the multicore test with all cores & threads, to load test it, with BOINC processing suspended.) If you used more than the slightest smear of paste, then there is your problem. Heatsink paste is meant to fill the air gaps between the heatsink & the CPU, and not come between the metal of the CPU & the metal of the heatsink, otherwise it acts as a heat insulator, not a conductor. Its heat conductivity is much greater than that of air, but way, way, way less than that of metal to metal. (I would also hope that the heatsink doesn't still have its protective plastic cover on it...) Grant Darwin NT |
Fubar the Benevolent Despot Joined: 1 Jul 13 Posts: 2 Credit: 1,020,491 RAC: 172 |
Thank you for pointing out several things I already knew. For your info, so you are aware of from whence I speak, I've been dealing with computers since 1981, yes, 1981. I have a programming degree and have owned and built computers for years before the internet even existed. For the geeks, my first was an 8 MHz 8088 with a single 5 1/4" floppy drive, a 1200 baud modem and a 12" CGA screen. And I've been running BOINC since it came out and SETI@Home before that. I've been around for a minute. I couldn't give a rat's patoot how long it takes to run a specific task, be it from Milkyway, Rosetta, Einstein or any of the others that are available for BOINC. Perhaps your system misses deadlines, I am not so afflicted. Rosetta is cranking out 10-15 F more than the second hottest thing on this box, Milkyway, and is the only thing on this box that causes overheat warnings. Milkyway is running 20-25 F more than the next hottest load. With BOINC suspended, I'm running 60+ F below warning levels and 65-70 F under Rosetta. I've dealt with "my cooling issues". I've rerouted wiring, added three case fans, et cetera, as detailed in the OP. Perhaps your reading skills are "sub optimal"? Or your comprehension. And I'm not going to go into a boring recitation of temperatures just for your amusement and edification. Over the course of several months, I've identified the problem, and dealt with it. I didn't post to get in a debate about what you think I should do, I posted to be polite and let y'all know I'm done and why. Peace out. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1743 Credit: 18,534,891 RAC: 3,788 |
"Perhaps your system misses deadlines, I am not so afflicted." Utter rubbish. Here at Rosetta, you aborted half, and the rest timed out, i.e. you missed the deadlines. For Milkyway, you completed 7, and 33 missed the deadline. Massive failure rate. "I've dealt with 'my cooling issues'." No, you haven't. Dealing with the issue would imply you fixed the problem. Since that isn't the case, your method of "dealing" with the problem is to just ignore that there is an issue, and avoid activities that bring that issue to the fore. "I didn't post to get in a debate about what you think I should do, I posted to be polite and let y'all know I'm done and why." Nothing polite about your response to someone offering suggestions as to how you might be able to fix the issue, as the only reason for someone to post here about such an issue would be to sort out the problem they are having. If they have no intention of fixing their problem, then there's no need for them to post here to say goodbye. But given your attitude and how little you have contributed to the project, and your choice to avoid workloads that show that there is a problem with your system that requires fixing, then your leaving here is our gain, and MilkyWay's loss. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2166 Credit: 41,606,783 RAC: 4,182 |
"Thank you for pointing out several things I already knew." At the risk of antagonising you further, can I make one further suggestion before you abandon altogether? My motivation being that there's a world of possibilities between overheating and running 60+F below warning levels. My background being I used to overclock my PCs to within (and beyond) an inch of their lives - having melted sockets into the motherboard when I've got it wrong, done all the same things on high capacity case fans, looked at airflow temp gradients within the case, installing AIO CPU coolers etc. All stuff you'd know and I've no doubt you've max'd out all those possibilities. But there's somewhere else you can go. Given Milky Way is only marginally less problematic than Rosetta it's reasonable to look at Boinc as a whole being the problem and there are settings within it that allow you to throttle Boinc before the strain on the CPU forces effectively the same thing. And I know this because I have 2 PCs - one which runs within thermal limits with Boinc at 100% and one that can't quite seem to manage it. Within Boinc, under Options/Computing Preferences, on the Computing tab, in both the sections "When computer is in use" and "When computer is not in use" there's a setting "Use at most 100% of the CPUs and at most 100% of CPU time". Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels. On my problem PC, in the summer months, I'm having to use a figure as low as 70%, while at this time of year I can up it to nearer 90% but never quite unrestricted. On my 'good' PC I run at 100% year round. I've never got to the bottom of why the 2 PCs are so different, but the point is that both contribute as much as they can while running safely and reliably, which is a better solution than the overkill of abandoning Boinc projects altogether. 
What I hope this also does is allow you to run tasks from all Boinc projects more consistently and to completion before deadlines, while your non-Boinc activity has more capacity to run too, at temps that don't make you think your CPU is about to fry, so everyone wins. If you've tried this already, I'll expect you to respond accordingly and that'll be fine too. I can't expect anything less. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1743 Credit: 18,534,891 RAC: 3,788 |
"Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels." The drawback with that method is thermal stress: you're maxing out the CPU, then stopping it, then maxing it out & then stopping it etc. Short, sharp heating & cooling cycles aren't good for electronics (really long cycles, no problem. Extremely short cycles, no problem). Reducing the number of cores/threads being used will result in lower CPU temperatures, and it won't result in longer processing times which will take the BOINC Manager some time to adjust to (and when 21% or less of downloaded work is actually being processed and returned in time, taking even longer to process work is just going to make that result even worse). Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2166 Credit: 41,606,783 RAC: 4,182 |
"Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels." "The drawback with that method is thermal stress: you're maxing out the CPU, then stopping it, then maxing it out & then stopping it etc. Short, sharp heating & cooling cycles aren't good for electronics (really long cycles, no problem. Extremely short cycles, no problem). Reducing the number of cores/threads being used will result in lower CPU temperatures, and it won't result in longer processing times which will take the BOINC Manager some time to adjust to (and when 21% or less of downloaded work is actually being processed and returned in time, taking even longer to process work is just going to make that result even worse)." I actually agree with you 100% if not for (or but or except) this is my badly running PC (Intel i5-9600K), set to run between 85 and 90% of the time (in use/not in use) with a 12hr (43200sec) task runtime. I have no idea at all why it's running with CPU time so close to wall-clock time and pretty much exactly to 12hrs (so no Boinc time adjustment) and delivering so much credit per task. If I hadn't told you how I've got it set up, could you guess? I certainly couldn't. There are no telltale hints at all that I can see. Before I dialled down the Boinc settings, because it's unattended 4 days/wk, I'd visit it and find it'd crashed days previously. I'd had it thoroughly cleaned out, along with case and CPU AIO fans, with only a marginal improvement, so I had no choice but to dial it down as described. And this is the result - no crashes through heat overload while unattended, runs tasks faithfully at temps ~10% below warning levels. I can hardly complain. And if it works for me I offer the information to others who may have an issue to see if it works for them too. Can't do any harm to try. |
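
For anyone wanting to try the two throttling approaches discussed above (fewer cores/threads versus a lower share of CPU time), here is a rough Python sketch that builds the local preferences override the BOINC client can read. It assumes the override file is `global_prefs_override.xml` in the BOINC data directory and that the relevant tags are `max_ncpus_pct` (share of CPUs) and `cpu_usage_limit` (share of CPU time); check both against your client version before relying on it. The helper name `build_override` and the 75% and 90% figures are only illustrative, not values anyone in this thread recommended.

```python
import xml.etree.ElementTree as ET


def build_override(max_ncpus_pct: float, cpu_usage_limit: float) -> str:
    """Return override XML for the two throttles discussed in the thread.

    max_ncpus_pct   -- "Use at most N% of the CPUs" (assumed tag name)
    cpu_usage_limit -- "and at most N% of CPU time" (assumed tag name)
    """
    settings = (
        ("max_ncpus_pct", max_ncpus_pct),
        ("cpu_usage_limit", cpu_usage_limit),
    )
    for name, value in settings:
        # Both preferences are percentages; reject anything outside (0, 100].
        if not 0 < value <= 100:
            raise ValueError(f"{name} must be in (0, 100], got {value}")

    root = ET.Element("global_preferences")
    for name, value in settings:
        ET.SubElement(root, name).text = f"{value:.1f}"
    return ET.tostring(root, encoding="unicode")


# Run on fewer threads, each at full duty (illustrative 75%).
fewer_threads = build_override(max_ncpus_pct=75, cpu_usage_limit=100)

# Run on all threads, but pause them part of the time (illustrative 90%).
duty_cycle = build_override(max_ncpus_pct=100, cpu_usage_limit=90)

print(fewer_threads)
print(duty_cycle)
```

The sketch only produces the XML text; writing it to disk and getting the client to re-read it is left out on purpose, since the data directory location differs between installs. Changing the same two percentages in the Computing preferences dialog should have the same effect without touching any files.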
©2025 University of Washington
https://www.bakerlab.org