Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109342 - Posted: 5 Jun 2024, 10:56:08 UTC - in response to Message 109341.  

They probably rebooted it.
It'd be nice if they fixed whatever it was that keeps causing it to die so they don't need to keep rebooting it.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109350 - Posted: 7 Jun 2024, 9:22:24 UTC - in response to Message 109342.  

They probably rebooted it.
It'd be nice if they fixed whatever it was that keeps causing it to die so they don't need to keep rebooting it.

It is very odd - it never used to happen.
Anyway, glad it got sorted before too long, and that they didn't need a nudge this time, seeing as I'm 2 days late in finding out.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109363 - Posted: 11 Jun 2024, 7:49:15 UTC

New work at Ralph, with new errors.
So some work has been done, but it looks like there's still quite a way to go.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_d_pred_188_16900_2_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv1\rf2aa\predict.py", line 733, in <module>
    with zipfile.ZipFile(args.z) as z:
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\zipfile.py", line 1268, in __init__
    self._RealGetContents()
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\zipfile.py", line 1335, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

</stderr_txt>
]]>
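The traceback above shows only that `zipfile.ZipFile` rejected the file passed as `args.z`. It does not show whether the file was truncated, empty, or something other than an archive. As a minimal sketch, assuming nothing about the real `predict.py` beyond what the traceback shows, a script could check the file first and report something more useful. The helper name `open_archive_or_explain` is made up for illustration.

```python
import zipfile
from pathlib import Path


def open_archive_or_explain(path):
    """Return an open ZipFile, or raise a clearer error if the file is not a zip."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"archive missing: {p}")
    if not zipfile.is_zipfile(p):
        # Report size and leading bytes; a non-empty zip normally starts with b"PK".
        head = p.read_bytes()[:4]
        raise ValueError(
            f"{p} is not a zip file "
            f"(size={p.stat().st_size} bytes, first bytes={head!r})"
        )
    return zipfile.ZipFile(p)
```

A size of 0 or a header that is not `PK` would point to a failed or partial download rather than a bug in the model code.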



RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_d_pred_60_16900_5_1

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
'C:\ProgramData\BOINC/projects/ralph.bakerlab.org\ev0\Scripts\activate.bat' is not recognized as an internal or external command,
operable program or batch file.

</stderr_txt>
]]>
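The second message only says that the shell could not run `activate.bat`; it does not say why. Two things worth ruling out are a missing file (for example, an environment that never unpacked) and the mix of `/` and `\` separators in the path. The sketch below covers the second, assuming the backslashes missing from the logged path were lost when the post was rendered. It uses `ntpath` from the standard library, so it behaves the same on any OS.

```python
import ntpath


def normalize_windows_path(raw):
    """Rewrite a Windows path so that every separator is a backslash."""
    return ntpath.normpath(raw)


# Assumed original path: the one in the log with its lost backslashes restored.
mixed = r"C:\ProgramData\BOINC/projects/ralph.bakerlab.org\ev0\Scripts\activate.bat"
print(normalize_windows_path(mixed))
```

Whether the separators are really the cause here is not confirmed anywhere in the thread.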

Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109364 - Posted: 11 Jun 2024, 14:51:39 UTC

Total queued jobs on the front page are down to 222k.
Advance warning: we may be out of new tasks in the next 24hrs unless we get lucky again.
Fingers crossed.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109365 - Posted: 12 Jun 2024, 7:36:48 UTC
Last modified: 12 Jun 2024, 7:39:06 UTC

Now out of new work.
Also, although the Server status shows all green, there is a backlog of Tasks waiting on Validation.
3,078 at the moment.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109366 - Posted: 12 Jun 2024, 9:55:29 UTC - in response to Message 109365.  

Also, although the Server status shows all green, there is a backlog of Tasks waiting on Validation.
3,078 at the moment.
Whatever was going on before, the backlog has now cleared.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109367 - Posted: 12 Jun 2024, 11:43:19 UTC - in response to Message 109365.  

Now out of new work

This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down.
It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.
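
The "almost treble" figure follows directly from the two runtimes. A small worked example, assuming a core runs tasks back to back for 24 hours (an idealisation; `tasks_per_core_day` is just an illustrative helper):

```python
DEFAULT_HOURS = 3  # runtime the affected tasks are reported to fall back to
TARGET_HOURS = 8   # explicit Target CPU run time suggested in the post


def tasks_per_core_day(runtime_hours):
    """Tasks one core gets through in 24h if it runs them back to back."""
    return 24 / runtime_hours


factor = TARGET_HOURS / DEFAULT_HOURS
print(f"runtime factor: {factor:.2f}x")
print(f"tasks per core per day at 3hrs: {tasks_per_core_day(DEFAULT_HOURS):.0f}")
print(f"tasks per core per day at 8hrs: {tasks_per_core_day(TARGET_HOURS):.0f}")
```

Under that assumption each core draws 8 tasks a day from the queue at 3hrs but only 3 at 8hrs, which is why a fixed batch would last longer.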

RDTSC
Joined: 29 Jan 24
Posts: 4
Credit: 1,096,106
RAC: 6,127
Message 109368 - Posted: 12 Jun 2024, 12:16:14 UTC

Their home page (https://boinc.bakerlab.org/rosetta/) could do with some updates; the last post there was almost two years ago. I get it: web hosting and administration are expensive, along with preparing, running, and maintaining massive job servers. It just seems to me that a little grease at the right points of this machine would greatly help it function.

kotenok2000
Joined: 22 Feb 11
Posts: 276
Credit: 513,050
RAC: 161
Message 109369 - Posted: 12 Jun 2024, 12:19:21 UTC

Hal jobs run for three hours because subtasks are short and produce many results per task.

Other jobs run for 8 hours.

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109370 - Posted: 12 Jun 2024, 12:45:17 UTC - in response to Message 109369.  

Hal jobs run for three hours because subtasks are short and produce many results per task.

Other jobs run for 8 hours.

No. All mine run for 12hrs because I set them to run for 12hrs.

They don't hit a top limit of decoys and end because some internal limit has been reached.

Rosetta Beta 6.04 tasks wrongly default to 3hrs CPU runtime, while Rosetta v4.20 tasks rightly default to 8hrs.

So set the Rosetta@home Target CPU Runtime explicitly to 8hrs so that CPU runtime matches what Boinc is told to assume, and not to 'not selected'.

Do more work, get more credits, Boinc schedules more correctly and sooner, batches of tasks issued by Rosetta last longer. Rosetta tasks run out less often. <Everyone> wins.

The alternative is what we have now - no new tasks. Everyone loses.

kotenok2000
Joined: 22 Feb 11
Posts: 276
Credit: 513,050
RAC: 161
Message 109371 - Posted: 12 Jun 2024, 12:48:37 UTC

Tasks starting with RosettaVS run for 8 hours for me.

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109374 - Posted: 12 Jun 2024, 19:05:03 UTC - in response to Message 109371.  

Tasks starting with RosettaVS run for 8 hours for me.

Great, but I'm not saying this about the ones that run as expected, only about all those that don't, of which there seem to be many.
Also, I don't recall seeing any RosettaVS tasks. I don't know how they behave.

Bryn Mawr
Joined: 26 Dec 18
Posts: 411
Credit: 12,359,416
RAC: 3,742
Message 109375 - Posted: 13 Jun 2024, 6:51:24 UTC - in response to Message 109367.  

Now out of new work

This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down.
It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.


I've always figured it's best to leave it on the default, as the project scientists who set the tasks up know their requirements better than I do.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1751
Credit: 18,534,891
RAC: 857
Message 109376 - Posted: 13 Jun 2024, 7:51:00 UTC

New batch of work over at Ralph, with new errors.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_148_16902_5_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 8, in <module>
    import torch
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\__init__.py", line 124, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

</stderr_txt>
]]>
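WinError 1455 is the Windows code for "the paging file is too small", meaning the system ran out of committable memory (RAM plus pagefile) while loading the DLL. A minimal sketch of how a wrapper script could recognise that case and print a hint instead of a bare traceback is shown below. The function names are hypothetical and nothing here is taken from the actual Ralph application.

```python
import importlib

PAGEFILE_TOO_SMALL = 1455  # Windows error code shown in the traceback


def explain_import_failure(err):
    """Return a hint for an OSError raised while importing a heavy module."""
    # OSError.winerror is only populated on Windows, hence getattr.
    if getattr(err, "winerror", None) == PAGEFILE_TOO_SMALL:
        return "paging file too small: raise the pagefile limit or free up memory"
    return f"unrelated OSError: {err}"


def try_import(name):
    """Import a module by name and turn the known failure modes into messages."""
    try:
        importlib.import_module(name)
        return "ok"
    except ImportError:
        return f"{name} is not installed"
    except OSError as err:
        return explain_import_failure(err)
```

Calling `try_import("torch")` on an affected host would then return the pagefile hint rather than exiting with the raw error.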




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_195_16901_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 698, in <module>
    b.write(base64.b64decode(f.read()))
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (65) cannot be 1 more than a multiple of 4

</stderr_txt>
]]>
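The base64 error is a length check. Every 4 base64 characters encode 3 bytes, and a trailing group of 2 or 3 characters is still decodable, but a single leftover character carries only 6 bits and cannot form a byte. 65 = 16 × 4 + 1, so the input was most likely cut short or corrupted. A minimal sketch of the same check, with `check_base64` as an illustrative name rather than anything from `predict.py`:

```python
import base64
import binascii
import string

B64_ALPHABET = set(string.ascii_letters + string.digits + "+/")


def check_base64(text):
    """Return (ok, reason) without raising, mirroring the length rule in the error."""
    # Count only alphabet characters; padding and whitespace are not data.
    data_chars = sum(1 for ch in text if ch in B64_ALPHABET)
    if data_chars % 4 == 1:
        return False, f"{data_chars} data characters: 1 more than a multiple of 4"
    try:
        base64.b64decode(text)
    except binascii.Error as err:
        return False, str(err)
    return True, "decodes cleanly"
```

Logging the input length alongside the result would show whether every failing task was truncated at the same point.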




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_119_16902_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 708, in <module>
    pred.predict(out_name+f'_{n}', 
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\predict.py", line 551, in predict
    logit_s, logit_aa_s, logit_pae, logit_pde, p_bind, pred_crds, alpha, pred_allatom, pred_lddt_binned,                msa_prev, pair_prev, state_prev = self.model(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\RoseTTAFoldModel.py", line 358, in forward
    msa, pair, xyz, alpha_s, xyz_allatom, state, symmsub = self.simulator(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 1106, in forward
    msa, pair, xyz, state, alpha, symmsub = self.main_block[i_m](msa, pair,
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 929, in forward
    xyz, state, alpha = self.str2str(
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\Track_module.py", line 503, in forward
    shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa\SE3_network.py", line 96, in forward
    return self.se3(G, node_features, edge_features)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\transformer.py", line 185, in forward
    node_feats = self.graph_modules(node_feats, edge_feats, graph=graph, basis=basis)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\transformer.py", line 47, in forward
    input = module(input, *args, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\attention.py", line 162, in forward
    fused_key_value = self.to_key_value(node_features, edge_features, graph, basis)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 347, in forward
    out += self.conv_in[str(degree_in)](feature, invariant_edge_feats,
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 186, in forward
    radial_weights = self.radial_func(invariant_edge_feats[e_i:e_j]) 
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\cv2\rf2aa/SE3Transformer\se3_transformer\model\layers\convolution.py", line 118, in forward
    return self.net(features)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\modules\linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\ProgramData\BOINC\projects\ralph.bakerlab.org\ev0\lib\site-packages\torch\nn\functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:79] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes.

</stderr_txt>]]>
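
For scale, the allocation that failed in the last traceback is a round power of two. A quick conversion is below; the float32 element count is an assumption, since the log does not state the tensor's dtype.

```python
def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


requested = 536_870_912  # bytes, from the RuntimeError in the log

print(f"{requested} bytes = {to_mib(requested):.0f} MiB")
print(f"power of two: {requested == 2**29}")
# Assumption: 4-byte float32 values; the real dtype is not in the log.
print(f"float32 elements: {requested // 4:,}")
```

So a single 512 MiB buffer could not be allocated on that host, on top of whatever the model had already allocated by that point.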

Grant
Darwin NT

kotenok2000
Joined: 22 Feb 11
Posts: 276
Credit: 513,050
RAC: 161
Message 109377 - Posted: 13 Jun 2024, 12:29:17 UTC
Last modified: 13 Jun 2024, 13:06:40 UTC

Did they port the Rosetta Python projects to native Windows?
Try increasing the pagefile size.
It helped with the GPUGrid Python project.
It even uses the GPU.

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109379 - Posted: 13 Jun 2024, 23:42:54 UTC - in response to Message 109375.  

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.

I've always figured it's best to leave it on the default, as the project scientists who set the tasks up know their requirements better than I do.

While generally true, it's clear imo this 3hr target runtime is an error as it's inconsistent with what Rosetta tells Boinc.
It only ever slips through when a new version of the app comes out.
I seem to recall it happened once before and was corrected, in the days when the admins paid more attention to us.
If the 8hr default ever changes I think something would be said - and seeing as no-one's saying anything these days I doubt it ever will change without a very specific reason.

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109380 - Posted: 14 Jun 2024, 3:20:41 UTC

Ooh, 360k tasks. We live to fight another day (or two)

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2025
Credit: 9,943,884
RAC: 6,777
Message 109383 - Posted: 15 Jun 2024, 6:48:29 UTC
Last modified: 15 Jun 2024, 6:48:45 UTC

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)


Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109385 - Posted: 15 Jun 2024, 9:32:32 UTC - in response to Message 109383.  

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

Yes, but they fail very quickly, so I'm not too worried by them.

More concerning are two Validate errors after running to completion:
hal_8a_i_hal_8aa_2jp5597_d99_0001_SAVE_ALL_OUT_2978378_13_0
hal_8a_i_hal_8aa_2jp1316_d224_0001_SAVE_ALL_OUT_2978378_13_0

Sid Celery
Joined: 11 Feb 08
Posts: 2185
Credit: 41,726,991
RAC: 6,784
Message 109387 - Posted: 17 Jun 2024, 20:29:44 UTC - in response to Message 109380.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.

©2025 University of Washington
https://www.bakerlab.org