Discussion:
[OMPI users] MPI I/O gives undefined behavior if the amount of bytes described by a filetype reaches 2^32
Nils Moschuering
2017-04-28 10:51:50 UTC
Dear OpenMPI Mailing List,

I have a problem with MPI I/O when running on more than 1 rank with very
large filetypes. In order to reproduce the problem, please use the
attached program "mpi_io_test.c". After compilation it should be run on 2
nodes.

The program will do the following for a variety of different parameters:
1. Create an elementary datatype (commonly referred to as etype in the
MPI Standard) of a specific size given by the parameter bsize (in
multiples of bytes). This datatype is called blk_filetype.
2. Create a complex filetype, which is different for each rank. This
filetype divides the file into a number of blocks, given by the parameter
nr_blocks, each of size bsize. Each rank only gets access to a subarray containing
nr_blocks_per_rank = nr_blocks / size
blocks (where size is the number of participating ranks). The respective
subarray of each rank starts at
rank * nr_blocks_per_rank
This guarantees that the regions of the different ranks don't overlap.
The resulting datatype is called full_filetype.
3. Allocate enough memory on each rank to be able to write a whole
block.
4. Fill the allocated memory with the rank number to be able to check
the resulting file for correctness.
5. Open a file named fname and set the view using the previously
generated blk_filetype and full_filetype.
6. Write one block on each rank, using the collective routine.
7. Clean up.
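
For readers without the attachment, a minimal sketch of steps 1-6 in MPI C
could look as follows. This is not the attached mpi_io_test.c itself; the
use of MPI_Type_contiguous for the etype and MPI_Type_create_subarray for
the per-rank filetype, as well as all names and the parameter handling, are
assumptions made for illustration only.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int bsize = 500 * 1024 * 1024;            /* block size in bytes */
    int nr_blocks = 4;                        /* total blocks in the file */
    int nr_blocks_per_rank = nr_blocks / size;

    /* 1. elementary datatype: one block of bsize bytes */
    MPI_Datatype blk_filetype, full_filetype;
    MPI_Type_contiguous(bsize, MPI_BYTE, &blk_filetype);
    MPI_Type_commit(&blk_filetype);

    /* 2. per-rank filetype: nr_blocks_per_rank blocks starting at
       block rank * nr_blocks_per_rank */
    int sizes[1]    = { nr_blocks };
    int subsizes[1] = { nr_blocks_per_rank };
    int starts[1]   = { rank * nr_blocks_per_rank };
    MPI_Type_create_subarray(1, sizes, subsizes, starts, MPI_ORDER_C,
                             blk_filetype, &full_filetype);
    MPI_Type_commit(&full_filetype);

    /* 3. + 4. one block of memory, filled with the rank number */
    char *buf = malloc(bsize);
    memset(buf, rank, bsize);

    /* 5. + 6. open the file, set the view, write one block collectively */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "fname", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, blk_filetype, full_filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, 1, blk_filetype, MPI_STATUS_IGNORE);

    /* 7. clean up */
    MPI_File_close(&fh);
    free(buf);
    MPI_Type_free(&full_filetype);
    MPI_Type_free(&blk_filetype);
    MPI_Finalize();
    return 0;
}

In this sketch the extent of full_filetype is nr_blocks * bsize bytes, which
is the quantity that crosses 2^32 in the failing cases described below.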

The above will be repeated for different values of bsize and nr_blocks.
Please note that none of these parameter values overflows the basic
datatype int.
The output is verified using
hexdump fname
which performs a hexdump of the file. This tool collapses runs of
identical lines into a single output line, marking the squeezed repeats
with a *. The resulting output of a call to hexdump has a structure
comparable to the following:
00000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 |................|
*
1f400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
3e800000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 |................|
*
5dc00000
This example is to be read in the following manner:
- From byte 00000000 to 1f400000 (which is equal to 500 MiB) the file
contains the value 01 in each byte.
- From byte 1f400000 to 3e800000 (which is equal to 1000 MiB) the file
contains the value 00 in each byte.
- From byte 3e800000 to 5dc00000 (which is equal to 1500 MiB) the file
contains the value 02 in each byte.
- The file ends here.
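(For reference: 0x1f400000 = 524,288,000 bytes = 500 * 2^20 bytes = 500 MiB,
0x3e800000 = 1000 MiB and 0x5dc00000 = 1500 MiB, so the file is 1500 MiB in
total.)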
This is the correct output of the above outlined program with parameters
bsize=500*1024*1024
nr_blocks=4
running on 2 ranks. The attached file contains many tests for different
cases. These were made to pinpoint the source of the problem and to
exclude other, potentially important, factors.
I deem an output wrong if it doesn't follow from the parameters or if
the program crashes on execution.
The only difference between OpenMPI and Intel MPI, according to my
tests, is the behavior on error: OpenMPI will mostly write wrong data
but won't crash, whereas Intel MPI mostly crashes.

The tests and their results are described in comments in the source.
The final conclusions I derive from the tests are the following:

1. If the filetype used in the view describes an amount of bytes equal
to or exceeding 2^32 = 4 GiB, the code produces wrong output. For values
slightly smaller (the second example with fname="test_8_blocks" uses a
total filetype size of 4000 MiB, which is smaller than 4 GiB) the code
works as expected.
2. The act of actually writing the described regions is not important.
When the filetype describes an area >= 4 GiB but the program only writes
to regions much smaller than that, the code still produces undefined
behavior (please refer to the 6th example with
fname="test_too_large_blocks").
3. It doesn't matter whether the block size or the number of blocks
pushes the filetype over the 4 GiB limit (refer to the 5th and 6th
example, with filenames "test_16_blocks" and "test_too_large_blocks",
respectively).
4. If the binary is launched using only one rank, the output is always
as expected (refer to the 3rd and 4th example, with filenames
"test_too_large_blocks_single" and
"test_too_large_blocks_single_even_larger", respectively).

There are, of course, many other things one could test.
It seems that the implementations use 32-bit integer variables to compute
the byte composition inside the filetype. Since the filetype is defined
using two 32-bit integer variables, this can easily lead to integer
overflows if the user supplies large values. It seems that no
implementation expects this problem, and therefore none of them act
gracefully when it occurs.
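
To illustrate the suspected failure mode (this is plain C for illustration,
not code taken from any MPI implementation): multiplying a 32-bit block
count by a 32-bit block size overflows before the result can be stored in
a wider variable.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int nr_blocks = 16;
    int bsize     = 500 * 1024 * 1024;   /* 500 MiB per block */

    /* int * int overflows (undefined behavior in C, in practice the
       value wraps) before the assignment widens the result */
    int64_t wrong = nr_blocks * bsize;

    /* widening one operand first gives the intended 8000 MiB */
    int64_t right = (int64_t)nr_blocks * bsize;

    printf("wrong: %lld\n", (long long)wrong);
    printf("right: %lld\n", (long long)right);
    return 0;
}

Both operands are well within the range of int, as in the test program;
only their product exceeds 2^31.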

I looked at ILP64 <https://software.intel.com/en-us/node/528914>
support, but it only adapts the function parameters, not the internally
used variables, and it is also not available for C.

I looked at integer overflow
<https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html#Program%20Error%20Signals>
(FPE_INTOVF_TRAP) trapping, which could help to verify the source of the
problem, but it doesn't seem to be possible for C. Intel does not
<https://software.intel.com/en-us/forums/intel-c-compiler/topic/306156>
offer any built-in integer overflow trapping.

There are ways to circumvent this problem in most cases. It is only
unavoidable if the logic of a program requires complex, non-repeating
data structures with sizes of 4 GiB or more. Even then, one could split
up the filetype and use a different displacement in two distinct write
calls.
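
A hedged sketch of that workaround (the function and variable names below
are illustrative only, not taken from mpi_io_test.c): describe only half of
the region with a filetype whose extent stays below 4 GiB, and shift the
second write by an explicit byte displacement in MPI_File_set_view.

#include <mpi.h>

/* Write a region of >= 4 GiB as two views, each described by a filetype
   whose extent stays below 4 GiB. half_filetype and half_bytes are assumed
   to describe one half of the region; in a real program the two calls
   would typically write different buffers or buffer offsets. */
static void write_in_two_views(MPI_File fh, MPI_Datatype etype,
                               MPI_Datatype half_filetype,
                               MPI_Offset half_bytes,
                               const void *buf, int count)
{
    /* first half: displacement 0 */
    MPI_File_set_view(fh, 0, etype, half_filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, count, etype, MPI_STATUS_IGNORE);

    /* second half: same filetype, shifted by half_bytes */
    MPI_File_set_view(fh, half_bytes, etype, half_filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, count, etype, MPI_STATUS_IGNORE);
}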

Still, this problem violates the standard, as it produces undefined
behavior even when the API is used in a conforming way. The
implementation should at least provide a warning to the user, but should
ideally use larger datatypes in the filetype computations. When a user
stumbles on this problem, they will have a hard time debugging it.

Thank you very much for reading everything ;)

Kind Regards,

Nils
Christoph Niethammer
2017-04-28 11:14:35 UTC
Hello,

Which MPI version are you using?
This looks to me like it triggers https://github.com/open-mpi/ompi/issues/2399

You can check if you are running into this problem by playing around with the mca_io_ompio_cycle_buffer_size parameter.
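
For example, something along these lines (assuming the parameter is passed
on the command line in the usual way, i.e. without the mca_ prefix, and
with a placeholder value):

mpirun --mca io_ompio_cycle_buffer_size <bytes> -np 2 ./mpi_io_test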

Best
Christoph Niethammer

--

Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: ***@hlrs.de
http://www.hlrs.de/people/niethammer



g***@rist.or.jp
2017-04-28 11:49:29 UTC
Before v1.10, the default is ROMIO, and you can force OMPIO with
mpirun --mca io ompio ...

From v2, the default is OMPIO (unless you are running on Lustre, IIRC),
and you can force ROMIO with
mpirun --mca io ^ompio ...

Maybe that can help for the time being.

Cheers,

Gilles

Edgar Gabriel
2017-04-28 13:26:22 UTC
Thank you for the detailed analysis, I will have a look into it. It
would be really important to know which version of Open MPI triggers
this problem.

Christoph, I doubt that it is

https://github.com/open-mpi/ompi/issues/2399

due to the fact that the test uses collective I/O, which internally breaks the operations down into cycles (typically 32 MB), so that issue should not be triggered. If it is OMPIO, I would rather suspect that it has to do with how we treat/analyze the file view. We did have test cases exceeding 100 GB in overall file size that worked correctly, but I am not sure whether we exceeded a 4 GB 'portion' of a file view per rank; I will look into that.

Thanks
Edgar
--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
Edgar Gabriel
2017-04-28 13:46:19 UTC
Actually, reading through the email in more detail, I doubt that it is
OMPIO.

"I deem an output wrong if it doesn't follow from the parameters or if
the program crashes on execution.
The only difference between OpenMPI and Intel MPI, according to my
tests, is the behavior on error: OpenMPI will mostly write wrong data
but won't crash, whereas Intel MPI mostly crashes."

I will still look into that though.

Thanks

Edgar
Edgar Gabriel
2017-04-28 14:52:24 UTC
A short update on this: master does not finish with either OMPIO or
ROMIO. It admittedly segfaults earlier with OMPIO than with ROMIO, but
with one little tweak I can make them both fail at the same spot. There
is clearly memory corruption going on for the larger cases; I will try
to narrow it down further.
Jeff Hammond
2017-05-02 21:23:58 UTC
Post by Nils Moschuering
The only difference between OpenMPI and Intel MPI, according to my
tests, is the behavior on error: OpenMPI will mostly write wrong data but
won't crash, whereas Intel MPI mostly crashes.
Intel MPI is based on MPICH so you should verify that this bug appears in
MPICH and then report it here: https://github.com/pmodels/mpich/issues.
This is particularly useful because the person most responsible for MPI-IO
in MPICH (Rob Latham) also happens to be interested in integer-overflow
issues.
Post by Nils Moschuering
I looked at ILP64 <https://software.intel.com/en-us/node/528914>
support, but it only adapts the function parameters, not the internally
used variables, and it is also not available for C.
As far as I know, this won't fix anything, because it will run into all
the internal implementation issues with overflow. The ILP64 feature for
Fortran is just a workaround for the horrors of default integer width
promotion by Fortran compilers.
Post by Nils Moschuering
I looked at integer overflow
<https://www.gnu.org/software/libc/manual/html_node/Program-Error-Signals.html#Program%20Error%20Signals>
(FPE_INTOVF_TRAP) trapping, which could help to verify the source of the
problem, but it doesn't seem to be possible for C. Intel does not
<https://software.intel.com/en-us/forums/intel-c-compiler/topic/306156>
offer any built-in integer overflow trapping.
You might be interested in http://blog.regehr.org/archives/1154 and linked
material therein. I think it's possible to implement the effective
equivalent of a hardware trap using the compiler, although I don't know any
(production) compiler that supports this.
Post by Nils Moschuering
Still, this problem violates the standard, as it produces undefined
behavior even when the API is used in a conforming way. The
implementation should at least provide a warning to the user, but should
ideally use larger datatypes in the filetype computations. When a user
stumbles on this problem, they will have a hard time debugging it.
Indeed, this is a problem. There is an effort to fix the API in MPI-4 (see
https://github.com/jeffhammond/bigmpi-paper) but as you know, there are
implementation defects that break correct MPI-3 programs that use datatypes
to work around the limits of C int. We were able to find a bunch of
problems in MPICH using BigMPI but clearly not all of them.
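
For readers unfamiliar with that workaround, the idea BigMPI automates looks
roughly like the following (a sketch, not BigMPI's actual API; write_big, the
chunk size and the struct construction are illustrative): a count larger than
INT_MAX is folded into a derived datatype so the I/O call itself only ever
passes small counts. This is exactly the kind of correct MPI-3 pattern that
the implementation defects mentioned above still break.

#include <mpi.h>

/* Write 'count' elements of type 'elem' even when count > INT_MAX, by
 * packing chunks plus a remainder into one derived datatype. */
static void write_big(MPI_File fh, const void *buf, MPI_Offset count,
                      MPI_Datatype elem)
{
    const MPI_Offset chunk = 1 << 30;        /* elements per full chunk   */
    MPI_Offset nchunks = count / chunk;
    MPI_Offset rest    = count % chunk;

    MPI_Datatype chunk_t, big_t;
    MPI_Type_contiguous((int)chunk, elem, &chunk_t);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(elem, &lb, &extent);

    /* nchunks chunks followed by the remainder, glued with a struct type. */
    int          blocklens[2] = { (int)nchunks, (int)rest };
    MPI_Aint     displs[2]    = { 0, (MPI_Aint)(nchunks * chunk * extent) };
    MPI_Datatype types[2]     = { chunk_t, elem };
    MPI_Type_create_struct(2, blocklens, displs, types, &big_t);
    MPI_Type_commit(&big_t);

    MPI_File_write_all(fh, buf, 1, big_t, MPI_STATUS_IGNORE);

    MPI_Type_free(&big_t);
    MPI_Type_free(&chunk_t);
}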

Jeff
Post by Nils Moschuering
Thank you very much for reading everything ;)
Kind Regards,
Nils
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
g***@rist.or.jp
2017-05-03 01:55:57 UTC
Permalink
Jeff and all,

I already reported the issue and posted a patch for ad_nfs at
https://github.com/pmodels/mpich/pull/2617

A bug was also identified in Open MPI (related to datatype handling), and
a first draft is available at https://github.com/open-mpi/ompi/pull/3439

Cheers,

Gilles

Nils Moschuering
2017-05-03 14:17:08 UTC
Permalink
Hello,

thank you very much for your interest in this problem. Please excuse the
late response. I'm encountering the error on versions 1.10.2 and 1.10.3.
Should I open a ticket on the GitHub issue tracker?

Kind Regards,

Nils
