Message boards : Number crunching : BOINC unsure how many CPUs to use
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I seem to have a problem on a dual-cpu box, which has only appeared since upgrading BOINC 5.4.11 -> 5.8.11. BOINC can't decide if the box has one or two cpus. When I look at the status in BoincView, sometimes one task is running and sometimes two.

I am attached to three projects, all of which have this host set to location Work. In my general prefs, Default and Home both have max 1 cpu. Work and School both had max 2 cpus. I have changed both of these to 4 cpus in the hope that the change would force a reload of whatever dodgy value is causing the problem, but the problem has recurred.

An additional point that makes troubleshooting harder is that 5.8.11 seems not to log when a task is suspended, only when it is started, restarted, or resumed. This makes it hard to tell from the log exactly when the second cpu is being stopped.

With all versions up to 5.4.11, when the client thought the number of usable cpus had changed, it ran the benchmarks, which had the side effect of making the change obvious in the log. This is not happening here.

Here is the log for the box; lines starting with ### have been inserted by me.

    21/02/2007 17:17:54||Starting BOINC client version 5.8.11 for windows_intelx86
    21/02/2007 17:17:54||log flags: task, file_xfer, sched_ops
    21/02/2007 17:17:54||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
    21/02/2007 17:17:54||Executing as a daemon
    21/02/2007 17:17:54||Data directory: C:\Program Files\BOINC
    21/02/2007 17:17:54||BOINC is running as a service and as a non-system user.
    21/02/2007 17:17:54||No application graphics will be available.
    21/02/2007 17:17:54||Processor: 2 GenuineIntel x86 Family 6 Model 8 Stepping 3 665MHz [x86 Family 6 Model 8 Stepping 3] [fpu tsc sse mmx]
    21/02/2007 17:17:54||Memory: 255.48 MB physical, 617.92 MB virtual
    21/02/2007 17:17:54||Disk: 4.34 GB total, 794.00 MB free
    21/02/2007 17:17:54|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 304101; location: work; project prefs: work
    21/02/2007 17:17:54|Leiden Classical|URL: http://boinc.gorlaeus.net/; Computer ID: 8747; location: work; project prefs: default
    21/02/2007 17:17:54|lhcathome|URL: http://lhcathome.cern.ch/; Computer ID: 2138390; location: work; project prefs: default
    21/02/2007 17:17:54||General prefs: from rosetta@home (last modified 2007-02-21 09:22:18)
    21/02/2007 17:17:54||Host location: work
    21/02/2007 17:17:54||General prefs: using separate prefs for work
    21/02/2007 17:17:55|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom018__1568_10058_0 using rosetta version 546
    ### note only one task started, yet there is another waiting to start
    21/02/2007 17:18:52|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:18:52|lhcathome|Requesting 11887 seconds of new work
    21/02/2007 17:18:57|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:18:57|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:18:57|lhcathome|Reason: requested by project
    21/02/2007 17:18:57|lhcathome|Deferring communication for 1 min 0 sec
    21/02/2007 17:18:57|lhcathome|Reason: no work from project
    21/02/2007 17:20:00|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:20:00|lhcathome|Requesting 11888 seconds of new work
    21/02/2007 17:20:05|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:20:05|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:20:05|lhcathome|Reason: requested by project
    21/02/2007 17:20:05|lhcathome|Deferring communication for 1 min 0 sec
    21/02/2007 17:20:05|lhcathome|Reason: no work from project
    21/02/2007 17:21:11|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:21:11|lhcathome|Requesting 11888 seconds of new work
    21/02/2007 17:21:16|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:21:16|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:21:16|lhcathome|Reason: requested by project
    21/02/2007 17:21:16|lhcathome|Deferring communication for 2 min 27 sec
    21/02/2007 17:21:16|lhcathome|Reason: no work from project
    21/02/2007 17:21:22|rosetta@home|Sending scheduler request: Requested by user
    21/02/2007 17:21:22|rosetta@home|(not requesting new work or reporting completed tasks)
    21/02/2007 17:21:26|rosetta@home|Scheduler RPC succeeded [server version 509]
    21/02/2007 17:21:26||General prefs: from rosetta@home (last modified 2007-02-21 17:21:05)
    21/02/2007 17:21:26||Host location: work
    21/02/2007 17:21:26||General prefs: using separate prefs for work
    21/02/2007 17:21:26|rosetta@home|Deferring communication for 4 min 2 sec
    21/02/2007 17:21:26|rosetta@home|Reason: requested by project
    21/02/2007 17:23:48|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:23:48|lhcathome|Requesting 11890 seconds of new work
    21/02/2007 17:23:53|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:23:53|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:23:53|lhcathome|Reason: requested by project
    21/02/2007 17:23:53|lhcathome|Deferring communication for 2 min 58 sec
    21/02/2007 17:23:53|lhcathome|Reason: no work from project
    21/02/2007 17:26:56|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:26:56|lhcathome|Requesting 11891 seconds of new work
    21/02/2007 17:27:01|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:27:01|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:27:01|lhcathome|Reason: requested by project
    21/02/2007 17:27:01|lhcathome|Deferring communication for 10 min 52 sec
    21/02/2007 17:27:01|lhcathome|Reason: no work from project
    21/02/2007 17:27:35|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
    ### suddenly for no obvious reason, it now decides to start the other task
    21/02/2007 17:37:56|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 17:37:56|lhcathome|Requesting 11896 seconds of new work
    21/02/2007 17:38:00|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 17:38:00|lhcathome|Deferring communication for 7 sec
    21/02/2007 17:38:00|lhcathome|Reason: requested by project
    21/02/2007 17:38:00|lhcathome|Deferring communication for 25 min 12 sec
    21/02/2007 17:38:00|lhcathome|Reason: no work from project
    21/02/2007 17:51:18|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
    ### and here it is resuming the second task again without any sign of having suspended it
    21/02/2007 18:03:16|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 18:03:16|lhcathome|Requesting 11906 seconds of new work
    21/02/2007 18:03:21|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 18:03:21|lhcathome|Deferring communication for 7 sec
    21/02/2007 18:03:21|lhcathome|Reason: requested by project
    21/02/2007 18:03:21|lhcathome|Deferring communication for 14 min 52 sec
    21/02/2007 18:03:21|lhcathome|Reason: no work from project
    21/02/2007 18:04:09|rosetta@home|Sending scheduler request: Requested by user
    21/02/2007 18:04:09|rosetta@home|(not requesting new work or reporting completed tasks)
    21/02/2007 18:04:13|rosetta@home|Scheduler RPC succeeded [server version 509]
    21/02/2007 18:04:13|rosetta@home|Deferring communication for 4 min 2 sec
    21/02/2007 18:04:13|rosetta@home|Reason: requested by project
    21/02/2007 18:18:21|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 18:18:21|lhcathome|Requesting 11909 seconds of new work
    21/02/2007 18:18:31|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 18:18:31|lhcathome|Deferring communication for 7 sec
    21/02/2007 18:18:31|lhcathome|Reason: requested by project
    21/02/2007 18:18:31|lhcathome|Deferring communication for 4 min 10 sec
    21/02/2007 18:18:31|lhcathome|Reason: no work from project
    21/02/2007 18:22:45|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 18:22:45|lhcathome|Requesting 11912 seconds of new work
    21/02/2007 18:22:56|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 18:22:56|lhcathome|Deferring communication for 7 sec
    21/02/2007 18:22:56|lhcathome|Reason: requested by project
    21/02/2007 18:22:56|lhcathome|Deferring communication for 1 min 0 sec
    21/02/2007 18:22:56|lhcathome|Reason: no work from project
    21/02/2007 18:23:59|lhcathome|Fetching scheduler list
    21/02/2007 18:24:04|lhcathome|Master file download succeeded
    21/02/2007 18:24:10|lhcathome|Sending scheduler request: To fetch work
    21/02/2007 18:24:10|lhcathome|Requesting 11912 seconds of new work
    21/02/2007 18:24:15|lhcathome|Scheduler RPC succeeded [server version 502]
    21/02/2007 18:24:15|lhcathome|Deferring communication for 7 sec
    21/02/2007 18:24:15|lhcathome|Reason: requested by project
    21/02/2007 18:24:15|lhcathome|Deferring communication for 1 min 0 sec
    21/02/2007 18:24:15|lhcathome|Reason: no work from project
    21/02/2007 18:24:48|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
    ### and again

Much more of the same follows. From the progress I'd estimate that the second task has run for less than 20min in over two hours, whereas with two tasks and two cpus there should be full opportunity to run both tasks. Is this an artefact of version 5.8.11 or something else? River~~ |
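Since the log above only records when a task starts, restarts, or resumes, one way to spot the missing suspensions is to pull out just those lines and look at the gaps between them. The following is a minimal sketch, assuming every log line has the `date time|project|message` shape seen in the post; the helper names (`parse_line`, `task_events`) are made up for illustration and are not part of BOINC, and the "Starting" verb is an assumption based on the post's wording.

```python
from datetime import datetime

# "Starting" is assumed from the post's wording; the sample only shows the other two.
TASK_VERBS = ("Starting", "Restarting", "Resuming")

def parse_line(line):
    """Split 'dd/mm/yyyy hh:mm:ss|project|message' into its three fields."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message

def task_events(lines):
    """Yield (time, verb, task name) for each task start/restart/resume line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("###"):  # skip blanks and hand-added notes
            continue
        when, _project, message = parse_line(line)
        words = message.split()
        if len(words) >= 3 and words[0] in TASK_VERBS and words[1] == "task":
            yield when, words[0], words[2]

# A few lines copied from the log in the post.
sample = [
    "21/02/2007 17:17:54||Starting BOINC client version 5.8.11 for windows_intelx86",
    "21/02/2007 17:17:55|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom018__1568_10058_0 using rosetta version 546",
    "### note only one task started, yet there is another waiting to start",
    "21/02/2007 17:27:35|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546",
    "21/02/2007 17:51:18|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546",
]

events = list(task_events(sample))
previous = None
for when, verb, task in events:
    gap = "" if previous is None else f" (+{when - previous} since last task event)"
    print(f"{when:%H:%M:%S} {verb:<10} {task}{gap}")
    previous = when
```

A task that is "resumed" twice in a row with no intervening start must have been suspended somewhere in between, which is the behaviour being described in the post.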
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Look at the status, after download units report "ready to run", if they are running they report "running", If they have been paused/preempted, they read "waiting to run". With the new memory settings in your "general prefs", if it decides you don't have enough memory, it used to (a couple of alpha versions ago) read "waiting for memory", and it will eventually change the status to "waiting to run" if memory wasn't available right away. The default settings are 50% while in use and 90% while not in use. If you haven't changed them, you might want to. Also look at the BOINC Manager status to see if that's what you're seeing. I had quite an issue with this prior to Rosetta updating their server software, and when I was using Linux. |
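To see why those percentages matter on a small machine, here is a rough back-of-envelope sketch. It assumes the 255.48 MB of physical memory reported in the first post's log and a figure of about 100 MB per Rosetta task (an estimate given later in this thread, not a measured constant). The real client's memory accounting is more involved; `tasks_that_fit` is a hypothetical helper for illustration only.

```python
def tasks_that_fit(physical_mb, max_fraction, per_task_mb):
    """Rough count of tasks whose combined memory fits inside the allowed share.

    This is plain arithmetic, not the BOINC client's actual scheduling logic.
    """
    budget_mb = physical_mb * max_fraction
    return int(budget_mb // per_task_mb)

PHYSICAL_MB = 255.48   # from the "Memory:" line in the log
PER_TASK_MB = 100      # assumed per-task footprint (estimate from later in the thread)

for label, fraction in (("in use, 50%", 0.50), ("not in use, 90%", 0.90), ("raised to 99%", 0.99)):
    count = tasks_that_fit(PHYSICAL_MB, fraction, PER_TASK_MB)
    print(f"{label}: budget {PHYSICAL_MB * fraction:.1f} MB, room for about {count} task(s)")
```

Under these assumptions the 50% setting leaves room for only one task, while 90% or more leaves room for two, which matches the symptom of the second task running only some of the time.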
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Look at the status, after download units report "ready to run", if they are running they report "running", If they have been paused/preempted, they read "waiting to run". Thanks Astro, spot on. And thanks too for a very quick reply. The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV. R~~ |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
It had my head scratching there for a while when I first saw that funky behavior, and even made a fool of myself trying to convince Dr. Anderson that there was a problem. LOL It seems Rosetta is one that either uses or reserves lots of memory, I'm still not positive how that works. |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
It had my head scratching there for a while when I first saw that funky behavior, and even made a fool of myself trying to convince Dr. Anderson that there was a problem. LOL Rosetta uses a lot of memory and increasingly so as it works through. Eventually it may just stop. I am unsure if, how, or when this gets fixed, or what happens when it occurs, but I do remember it coming up in alpha (Mikus, I think). Since you only have 256MB you will see this 'stalling' more often and little chance of getting 2xrosetta's to run at the same time. Team mauisun.org |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Rosetta uses a lot of memory and increasingly so as it works through. That's interesting. As it works through a single decoy, or as it works through a series of decoys in a long run? Since you only have 256MB you will see this 'stalling' more often and little chance of getting 2xrosetta's to run at the same time. Well, I admit running two in 256 is a bit cheeky when the spec is for 256 anyway. Tho I could be even more cheeky and point out that the spec does not ask for any more for multi-cpus... But in fact two are running happily together now I've given them 99% of the memory to share between them. The server had blanks in for these figures, so presumably they defaulted to something plausible. R~~ |
BobCat13 Send message Joined: 18 Jun 06 Posts: 4 Credit: 130,387 RAC: 0 |
An additional point that makes troubleshooting harder is that 5.8.11 seems not to log when a task is suspended, only when it is started, restarted, or resumed. This makes it hard to tell from the log exactly when the second cpu is being stopped.

You need to set a flag in cc_config.xml to see those messages on 5.8.x. If you already have a cc_config.xml file, check for the following:

    <cpu_sched>1</cpu_sched>

If you don't have a cc_config.xml file, create a blank text file, then add the following:

    <cc_config>
      <log_flags>
        <cpu_sched>1</cpu_sched>
      </log_flags>
    </cc_config>

and save it as cc_config.xml. You can find the flags and options available for cc_config.xml here. |
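If you would rather generate the file than type it by hand, the snippet below is a small sketch that builds the same `<cc_config>` / `<log_flags>` / `<cpu_sched>` structure described in the post using Python's standard library. The function name `build_cc_config` is hypothetical, and the script only prints the XML; saving it as cc_config.xml in the BOINC data directory is left to you, since that location differs between installations.

```python
import xml.etree.ElementTree as ET

def build_cc_config(log_flags):
    """Build a <cc_config> element with one child per log flag (1 = on, 0 = off)."""
    root = ET.Element("cc_config")
    flags_element = ET.SubElement(root, "log_flags")
    for name, enabled in log_flags.items():
        ET.SubElement(flags_element, name).text = "1" if enabled else "0"
    return root

# cpu_sched is the flag named in the post; add others only if you know they exist.
root = build_cc_config({"cpu_sched": True})

# Pretty-print where available (ET.indent was added in Python 3.9).
if hasattr(ET, "indent"):
    ET.indent(root)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The client reads this file at startup, so a restart of BOINC is the safe way to make sure the new flag takes effect.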
BobCat13 Send message Joined: 18 Jun 06 Posts: 4 Credit: 130,387 RAC: 0 |
The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV. Which version of BV are you using? The 1.4.1 and 1.4.2 beta versions have the "Waiting for memory" message listed, but I have never had BOINC pause for that reason so I can't be sure it is displayed. |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV. Good spot. Still running BV 1.2.2 :-( And thanks for the config file info. It still seems odd to me to have a default setting that shows things resuming but not pausing, to my mind it would seem more logical to have both or neither. But at least if it is settable then I can tweak it up to my liking. btw I like your handle which you share with my downstairs neighbour who runs a home publishing business called BobCat press. Named after his cat, Bob, of course... R~~ |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Since you only have 256MB you will see this 'stalling' more often and little chance of getting 2xrosetta's to run at the same time.

Been looking at this, FC. Where I win is by not having a GUI. Not windoze, not KDE, not Gnome, and not BoincMgr. Not even BV on the same machine. So here are my meminfo figures at a point where one Rosetta has been running 12hrs and is 4000 sec into the current decoy, and the other Rosetta has been running 600 sec on its first decoy:

    ric-gw-live:~# cat /proc/meminfo
    MemTotal: 256268 kB
    MemFree: 15204 kB
    Buffers: 17364 kB
    Cached: 51656 kB
    SwapCached: 0 kB
    Active: 184028 kB
    Inactive: 32752 kB
    HighTotal: 0 kB
    HighFree: 0 kB
    LowTotal: 256268 kB
    LowFree: 15204 kB
    SwapTotal: 682720 kB
    SwapFree: 682720 kB
    Dirty: 100 kB
    Writeback: 0 kB
    Mapped: 159144 kB
    Slab: 19728 kB
    Committed_AS: 298648 kB
    PageTables: 824 kB
    VmallocTotal: 770040 kB
    VmallocUsed: 3064 kB
    VmallocChunk: 766804 kB

As you can see, both Rosettas are fitting well into real memory. The problem was that before Astro's tip they were both trying to fit into half the memory, as BOINC seems clever enough to spot that this machine is also doing other stuff, like serving internal web pages and so on. And of course I'd have noticed if there was a genuine memory problem, as the machine's main mission would have suffered and max cpus would have been set back to 1 (or BOINC even removed).

The new memory limits will prove useful in future, when Rosetta does get big enough to cause this kind of issue, but on a Linux command-line only box there is a long way to go yet.

Edit, added: I won't bore you with another meminfo listing, but my experiments indicate that running a second Rosetta adds between 80MB and 100MB of memory usage. For example, top shows the two Rosettas each with around 30% to 35% of memory. The BOINC client (v5.8.11) weighs in at 1.3% of 256M = under 4MB, so the client is not a significant memory issue.

One area where the footprint of the second task is smaller than you'd expect is that all the shared library code is only loaded into real memory once, even if both tasks are using it and even if they both have it at different virtual addresses (the magic of the VM mapping; all credit to Intel for that, for their 386 memory design).

Perhaps there is a case for the System Requirements page to show a smaller figure for the memory needed on a command-line only machine, and larger requirements for people hoping to run multiple cpus. So for a GUI operating system (Win, Mac, KDE, Gnome) you need (I am suggesting) around 150MB overhead plus 100MB per cpu running Rosetta, which for one cpu is consistent with the advice to have 256MB installed. On a Linux command-line only box, something like 50MB overhead plus 100MB per cpu running Rosetta.

So the other way of looking at it is that by throwing out KDE/Windows I get to run a whole extra Rosetta. Seems like a good tradeoff on a box that doesn't even have its own monitor... R~~ |
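The rule of thumb suggested in the post above (a fixed overhead plus roughly 100MB per cpu running Rosetta) is easy to turn into a quick calculation. The sketch below uses only the poster's suggested figures, which are one user's estimates from 2007 and not official requirements; the names `OVERHEAD_MB`, `suggested_ram_mb`, and `max_tasks` are invented for this example.

```python
# Suggested figures from the post: overhead by system type, plus a per-task cost.
OVERHEAD_MB = {"gui": 150, "cli": 50}
PER_TASK_MB = 100

def suggested_ram_mb(system_kind, cpus):
    """RAM suggested by the rule of thumb for a given number of Rosetta tasks."""
    return OVERHEAD_MB[system_kind] + PER_TASK_MB * cpus

def max_tasks(system_kind, installed_mb):
    """How many Rosetta tasks the rule of thumb allows in the installed RAM."""
    spare_mb = installed_mb - OVERHEAD_MB[system_kind]
    return max(0, spare_mb // PER_TASK_MB)

for kind in ("gui", "cli"):
    print(f"{kind}: 1 cpu needs ~{suggested_ram_mb(kind, 1)} MB, "
          f"2 cpus need ~{suggested_ram_mb(kind, 2)} MB, "
          f"256 MB fits {max_tasks(kind, 256)} task(s)")
```

With 256MB installed, the estimate gives one task on a GUI system and two on a command-line only box, which is the "whole extra Rosetta" described in the post.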