Discussion: [OMPI users] ompio on Lustre
Dave Love
2018-10-05 10:15:19 UTC
Is romio preferred over ompio on Lustre for performance or correctness?
If it's relevant, the context is MPI-IO on Lustre mounts without flock,
which ompio doesn't seem to require.
Thanks.
Gabriel, Edgar
2018-10-05 13:56:24 UTC
It was originally for performance reasons, but this should be fixed at this point. I am not aware of correctness problems.

However, let me try to clarify your question: what exactly do you mean by "MPI I/O on Lustre mounts without flock"? Was the Lustre filesystem mounted without the flock option? If yes, that could lead to some problems; we had that on our Lustre installation for a while, and problems occurred even without MPI I/O (although I do not recall all the details, just that we had to change the mount options). Maybe just take a testsuite (either ours or HDF5), make sure to run it in a multi-node configuration, and see whether it works correctly.
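For illustration, here is a minimal sketch of the kind of independent multi-node check I mean (not code from either testsuite; the file name and access pattern are made up): each rank writes its rank number at its own offset and then reads back a neighbour's value, so running it across nodes exercises cross-client visibility on the shared file system.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, out, in, expect;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank writes its own rank number at its own offset */
    out = rank;
    MPI_File_open(MPI_COMM_WORLD, "iotest.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(int), &out, 1,
                          MPI_INT, MPI_STATUS_IGNORE);

    /* sync-barrier-sync so other ranks' writes become visible */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    /* read back what the next rank wrote */
    expect = (rank + 1) % size;
    MPI_File_read_at_all(fh, (MPI_Offset)expect * sizeof(int), &in, 1,
                         MPI_INT, MPI_STATUS_IGNORE);
    if (in != expect)
        printf("rank %d: read %d, expected %d\n", rank, in, expect);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}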

Thanks
Edgar
Dave Love
2018-10-05 14:42:38 UTC
Post by Gabriel, Edgar
It was originally for performance reasons, but this should be fixed at
this point. I am not aware of correctness problems.
However, let me try to clarify your question: what exactly do you mean
by "MPI I/O on Lustre mounts without flock"? Was the Lustre filesystem
mounted without the flock option?
No, it wasn't mounted with flock (and romio complains).
Post by Gabriel, Edgar
If yes, that could lead to some problems; we had that on our Lustre
installation for a while, and problems occurred even without MPI I/O
(although I do not recall all the details, just that we had to change
the mount options).
Yes, without at least localflock you might expect problems with things
like bdb and sqlite, but I couldn't see any file locking calls in the
Lustre component. If it is a problem, shouldn't the component fail
without it, like romio does?
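For reference, this is the kind of standalone fcntl byte-range lock
probe I have in mind (an illustration only, not code from the romio or
ompio components; the file name is arbitrary). On a Lustre mount
without flock or localflock I would expect the F_SETLK call to fail,
typically with ENOSYS or EOPNOTSUPP; on a lock-capable mount it should
succeed.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "lockprobe.tmp";
    struct flock fl;
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* try to take an exclusive byte-range lock on the first byte */
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    if (fcntl(fd, F_SETLK, &fl) == -1)
        printf("%s: fcntl locking unavailable (%s)\n", path, strerror(errno));
    else
        printf("%s: fcntl locking works\n", path);

    close(fd);
    unlink(path);
    return 0;
}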

I have suggested ephemeral PVFS^WOrangeFS but I doubt that will be
thought useful.
Post by Gabriel, Edgar
Maybe just take a testsuite (either ours or HDF5), make sure to run it
in a multi-node configuration, and see whether it works correctly.
For some reason I didn't think MTT, if that's what you mean, was
available, but I see it is; I'll see if I can drive it when I have a
chance. Tests from HDF5 might be easiest, thanks for the suggestion.
I'd tried with ANL's "testmpio", which was the only thing I found
immediately, but it threw up errors even on a local filesystem, at which
stage I thought it was best to ask... I'll report back if I get useful
results.
Dave Love
2018-10-08 15:20:24 UTC
I said I'd report back about trying ompio on Lustre mounted without flock.

I couldn't immediately figure out how to run MTT. I tried the parallel
hdf5 tests from hdf5 1.10.3, but I got errors with those even with the
relevant environment variable set to put the files on (local) /tmp.
Then it occurred to me rather late that romio would have tests. Using
the "runtests" script from the romio/test directory of ompi 3.1.2,
modified to use "--mca io ompio", after building the tests with an
installed ompi-3.1.2 and running on no-flock-mounted Lustre, it did
this and apparently hung at the end:

**** Testing simple.c ****
No Errors
**** Testing async.c ****
No Errors
**** Testing async-multiple.c ****
No Errors
**** Testing atomicity.c ****
Process 3: readbuf[118] is 0, should be 10
Process 2: readbuf[65] is 0, should be 10
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Process 1: readbuf[145] is 0, should be 10
**** Testing coll_test.c ****
No Errors
**** Testing excl.c ****
error opening file test
error opening file test
error opening file test

Then I ran on local /tmp as a sanity check and still got errors:

**** Testing I/O functions ****
**** Testing simple.c ****
No Errors
**** Testing async.c ****
No Errors
**** Testing async-multiple.c ****
No Errors
**** Testing atomicity.c ****
Process 2: readbuf[155] is 0, should be 10
Process 1: readbuf[128] is 0, should be 10
Process 3: readbuf[128] is 0, should be 10
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
**** Testing coll_test.c ****
No Errors
**** Testing excl.c ****
No Errors
**** Testing file_info.c ****
No Errors
**** Testing i_noncontig.c ****
No Errors
**** Testing noncontig.c ****
No Errors
**** Testing noncontig_coll.c ****
No Errors
**** Testing noncontig_coll2.c ****
No Errors
**** Testing aggregation1 ****
No Errors
**** Testing aggregation2 ****
No Errors
**** Testing hindexed ****
No Errors
**** Testing misc.c ****
file pointer posn = 265, should be 10
byte offset = 3020, should be 1080
file pointer posn = 265, should be 10
byte offset = 3020, should be 1080
file pointer posn = 265, should be 10
byte offset = 3020, should be 1080
file pointer posn in bytes = 3280, should be 1000
file pointer posn = 265, should be 10
byte offset = 3020, should be 1080
file pointer posn in bytes = 3280, should be 1000
file pointer posn in bytes = 3280, should be 1000
file pointer posn in bytes = 3280, should be 1000
Found 12 errors
**** Testing shared_fp.c ****
No Errors
**** Testing ordered_fp.c ****
No Errors
**** Testing split_coll.c ****
No Errors
**** Testing psimple.c ****
No Errors
**** Testing error.c ****
File set view did not return an error
Found 1 errors
**** Testing status.c ****
No Errors
**** Testing types_with_zeros ****
No Errors
**** Testing darray_read ****
No Errors

I even got an error with romio on /tmp (modifying the script to use
mpirun --mca io romio314):

**** Testing error.c ****
Unexpected error message MPI_ERR_ARG: invalid argument of some other kind
Found 1 errors
Gabriel, Edgar
2018-10-08 15:27:52 UTC
Hm, thanks for the report, I will look into this. I did not run the romio tests, but the hdf5 tests are run regularly and with 3.1.2 you should not have any problems on a regular unix fs. How many processes did you use, and which tests did you run specifically? The main tests that I execute from their parallel testsuite are testphdf5 and t_shapesame.

I will also look into the testmpio that you mentioned in the next couple of days.
Thanks
Edgar
Dave Love
2018-10-09 12:05:19 UTC
Post by Gabriel, Edgar
Hm, thanks for the report, I will look into this. I did not run the
romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
should not have any problems on a regular unix fs. How many processes
did you use, and which tests did you run specifically? The main tests
that I execute from their parallel testsuite are testphdf5 and
t_shapesame.
Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core
SMP job (so 24 processes), where $TMPDIR is on ext4:

export HDF5_PARAPREFIX=$TMPDIR
make check RUNPARALLEL='mpirun'

It stopped after testphdf5 spewed "Atomicity Test Failed" errors.
Gabriel, Edgar
2018-10-09 14:23:46 UTC
Ok, thanks. I usually run these tests with 4 or 8, but the major item is that atomicity is one of the areas that are not well supported in ompio (along with data representations), so a failure in those tests is not entirely surprising. Most of the work to support atomicity properly is actually in place, but we didn't have the manpower (or, to be honest, the requests) to finish that work.
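To illustrate what the atomicity tests exercise (a rough sketch only, not the ROMIO test code; the file name is made up): with atomic mode enabled, a concurrent reader must see a writer's data either completely or not at all, which an implementation typically guarantees with file locking.

#include <mpi.h>
#include <stdio.h>

#define N 256

int main(int argc, char **argv)
{
    int rank, i, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "atomic.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_atomicity(fh, 1);   /* request MPI atomic mode */

    if (rank == 0) {
        /* writer: the whole block must become visible as one unit */
        for (i = 0; i < N; i++)
            buf[i] = 10;
        MPI_File_write_at(fh, 0, buf, N, MPI_INT, MPI_STATUS_IGNORE);
    } else {
        /* readers run concurrently with the writer: with atomicity they
           must see either the complete block of 10s or none of it,
           never a partially updated block */
        for (i = 0; i < N; i++)
            buf[i] = 0;
        MPI_File_read_at(fh, 0, buf, N, MPI_INT, MPI_STATUS_IGNORE);
        for (i = 1; i < N; i++)
            if (buf[i] != buf[0])
                printf("rank %d: mixed old and new data at index %d\n",
                       rank, i);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}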

Thanks
Edgar
Dave Love
2018-10-10 08:45:34 UTC
Post by Gabriel, Edgar
Ok, thanks. I usually run these tests with 4 or 8, but the major item
is that atomicity is one of the areas that are not well supported in
ompio (along with data representations), so a failure in those tests
is not entirely surprising.
If it's not expected to work, could it be made to return a helpful
error, rather than just not working properly?
Gabriel, Edgar
2018-10-10 13:26:21 UTC
Well, good question. To be fair, the test passes if you run it with a lower number of processes. In addition, a couple of years back I had a discussion about this with one of the HDF5 developers, and it seemed to be OK to run it this way.

That being said, after thinking about it a bit, I think the fix to properly support it is at this point relatively easy; I will try to make it work in the next couple of days (a big chunk of code was brought in for another fix last fall, and I think we actually have everything in place to properly support the atomicity operations).

Edgar
Dave Love
2018-10-15 11:21:41 UTC
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock). I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.

OMPIO fails tests atomicity, misc, and error on ext4; it additionally
fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.

On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
shared_fp, ordered_fp, and error.

Please can OMPIO be changed to fail in the same way as ROMIO (with a
clear message) for the operations it can't support without flock.
Otherwise it looks as if you can potentially get invalid data, or at
least waste time debugging other errors.

I'd debug the common failure on the "error" test, but ptrace is disabled
on the system.

In case anyone else is in the same boat and can't get mounts changed, I
suggested staging data to and from a PVFS2^WOrangeFS ephemeral
filesystem on jobs' TMPDIR local mounts if they will fit. Of course
other libraries will potentially corrupt data on nolock mounts.
Latham, Robert J.
2018-10-15 14:45:13 UTC
Post by Dave Love
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock). I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.
OMPIO fails tests atomicity, misc, and error on ext4; it additionally
fails noncontig_coll2, fp, shared_fp, and ordered_fp on
Lustre/noflock.
On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
shared_fp, ordered_fp, and error.
Please can OMPIO be changed to fail in the same way as ROMIO (with a
clear message) for the operations it can't support without flock.
Otherwise it looks as if you can potentially get invalid data, or at
least waste time debugging other errors.
I'd debug the common failure on the "error" test, but ptrace is disabled
on the system.
In case anyone else is in the same boat and can't get mounts changed, I
suggested staging data to and from a PVFS2^WOrangeFS ephemeral
filesystem on jobs' TMPDIR local mounts if they will fit. Of course
other libraries will potentially corrupt data on nolock mounts.
ROMIO uses fcntl locks for Atomic mode, Shared file pointer updates,
and to prevent false sharing in the data sieving optimization for
noncontiguous writes.
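For example, a shared file pointer write looks roughly like this (a
sketch only, not the ROMIO implementation; the file name is arbitrary).
Each call appends at a single file pointer shared by all ranks, and
ROMIO coordinates those updates with fcntl locks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n;
    char line[64];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "shared.log",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each call appends at the current position of the shared file
       pointer, so the lines end up in the file in some serial order */
    n = snprintf(line, sizeof(line), "hello from rank %d\n", rank);
    MPI_File_write_shared(fh, line, n, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}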

It's hard to implement fcntl-lock-free versions of Atomic mode and
Shared file pointers, so file systems like PVFS don't support those
modes (and return an error indicating as much at open time).

You can run lock-free for noncontiguous writes, though at a significant
performance cost. In ROMIO we can disable data sieving for writes by
setting the hint "romio_ds_write" to "disable", which falls back to
piecewise operations. That could be OK if you know your noncontiguous
accesses are only a little bit noncontiguous.
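A rough sketch of passing that hint through an info object (the file
name is arbitrary; whether the hint is honored is up to the MPI-IO
implementation in use):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* ask ROMIO to skip data sieving on writes, so noncontiguous writes
       are done piecewise instead of read-modify-write under a lock */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_write", "disable");

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... noncontiguous writes via a file view or MPI_File_write_at ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}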

Perhaps OMPIO has a similar option, but I am not familiar with its
tuning knobs.

==rob
Dave Love
2018-10-16 14:58:37 UTC
Post by Latham, Robert J.
It's hard to implement fcntl-lock-free versions of Atomic mode and
Shared file pointers, so file systems like PVFS don't support those
modes (and return an error indicating as much at open time).
Ah. For some reason I thought PVFS had the support to pass the tests
somehow, but it's been quite a while since I used it.
Post by Latham, Robert J.
You can run lock-free for noncontiguous writes, though at a significant
performance cost. In ROMIO we can disable data sieving for writes by
setting the hint "romio_ds_write" to "disable", which falls back to
piecewise operations. That could be OK if you know your noncontiguous
accesses are only a little bit noncontiguous.
Does that mean it could actually support more operations (without
failing due to missing flock)?

Of course, I realize one should just use flock mounts with Lustre, as I
used to. I don't remember this stuff being written down explicitly
anywhere, though -- is it somewhere?

Thanks for the info.
Gabriel, Edgar
2018-10-15 22:04:45 UTC
Dave,
Thank you for your detailed report and testing; that is indeed very helpful. We will definitely have to do something.
Here is what I think is potentially doable.

a) If we detect a Lustre file system without flock support, we can print out an error message. Completely disabling MPI I/O is not possible in the ompio architecture at the moment: the Lustre component can disqualify itself, but the generic Unix FS component would kick in in that case and execution would still continue. To be more precise, the query function of the Lustre component has no way to return anything other than "I am interested to run" or "I am not interested to run".

b) I can add an MCA parameter that would allow the Lustre component to abort execution of the job entirely. While this parameter would probably be set to 'false' by default, a system administrator could configure it to 'true' on a particular platform.

I will also discuss this with a couple of other people in the next couple of days.
Thanks
Edgar
Dave Love
2018-10-16 15:13:01 UTC
Post by Gabriel, Edgar
a) If we detect a Lustre file system without flock support, we can
print out an error message. Completely disabling MPI I/O is not
possible in the ompio architecture at the moment: the Lustre component
can disqualify itself, but the generic Unix FS component would kick in
in that case and execution would still continue. To be more precise,
the query function of the Lustre component has no way to return
anything other than "I am interested to run" or "I am not interested
to run".
b) I can add an MCA parameter that would allow the Lustre component to
abort execution of the job entirely. While this parameter would
probably be set to 'false' by default, a system administrator could
configure it to 'true' on a particular platform.
Assuming the operations which didn't fail for me are actually OK with
noflock (and maybe they're not in other circumstances), can't you just
do the same as ROMIO and fail with an explanation on just the ones that
will fail without flock? That seems the best from a user's point of
view if there's an advantage to using OMPIO rather than ROMIO.

I guess it might be clearer which operations are problematic if I
understood what in fs/lustre requires flock mounts and what the full
semantics of the option are, which seem to be more than is documented.

Thanks for looking into it, anyhow.
