Discussion:
[OMPI users] sharedfp/lockedfile collision between multiple program instances
Nicolas Joly
2017-03-03 13:36:27 UTC
Hi,

We just got hit by a problem with the sharedfp/lockedfile component under
v2.0.1 (should be identical with v2.0.2). We had 2 instances of an MPI
program running concurrently on the same input file and using the
MPI_File_read_shared() function ...

If the shared file pointer is maintained with the lockedfile
component, a "XXX.lockedfile" file is created next to the data
file. Unfortunately, this fixed name will collide when multiple
instances of a tool run on the same file ;)
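
To illustrate the collision outside of MPI, here is a minimal Python sketch. It is not the Open MPI implementation: the helpers sidecar_name(), read_shared() and demo() are hypothetical, and it assumes the sidecar file simply stores the current shared offset as one 8-byte integer (the 8-byte size in the lsof output is consistent with that, but nothing here confirms it). File locking is left out because the calls run one after another. The per-instance suffix is only one possible remedy, not necessarily the actual fix.

```python
import os
import struct
import tempfile

OFFSET_FMT = "<q"  # assumption: the sidecar holds one 8-byte signed offset


def sidecar_name(data_path, instance_id=None):
    """Return the name of the sidecar file holding the shared file pointer.

    With instance_id=None this mimics the fixed "<file>.lockedfile" name
    described above; passing an id is a hypothetical remedy.
    """
    if instance_id is None:
        return data_path + ".lockedfile"
    return "{}.lockedfile.{}".format(data_path, instance_id)


def read_shared(data_path, nbytes, instance_id=None):
    """Read nbytes at the stored shared offset, then advance that offset."""
    sidecar = sidecar_name(data_path, instance_id)
    offset = 0
    if os.path.exists(sidecar):
        with open(sidecar, "rb") as f:
            offset = struct.unpack(OFFSET_FMT, f.read(8))[0]
    with open(data_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(nbytes)
    with open(sidecar, "wb") as f:
        f.write(struct.pack(OFFSET_FMT, offset + len(chunk)))
    return chunk


def demo():
    """Let two independent 'instances' read the same input file in turn."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "input.dat")
        with open(data, "wb") as f:
            f.write(b"ABCDEFGH")
        # Fixed name: the second instance picks up the first one's offset.
        fixed = (read_shared(data, 4), read_shared(data, 4))
        # Per-instance name: each instance starts again from offset 0.
        unique = (read_shared(data, 4, "job1"), read_shared(data, 4, "job2"))
    return fixed, unique
```

With the fixed name, the second call to read_shared() starts where the first one stopped, so two unrelated program instances end up sharing one file pointer. With a distinct suffix per instance, each one reads the file from the beginning.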

Running 2 instances of the following command line (source code
attached) on the same machine will show the problematic behaviour.

mpirun -n 1 --mca sharedfp lockedfile ./shrread -v input.dat

Confirmed with lsof(8) output:

***@tars [~]> lsof input.dat.lockedfile
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
shrread 5876 njoly 21w REG 0,30 8 13510798885996031 input.dat.lockedfile
shrread 5884 njoly 21w REG 0,30 8 13510798885996031 input.dat.lockedfile

Thanks in advance.
--
Nicolas Joly

Cluster & Computing Group
Biology IT Center
Institut Pasteur, Paris.
Edgar Gabriel
2017-03-03 14:44:11 UTC
Nicolas,

Thank you for the bug report; I can confirm the behavior. I will work on
a patch and try to get it into the next release. It should hopefully
not be too complicated.

Thanks

Edgar
Howard Pritchard
2017-03-03 15:00:10 UTC
Hi Edgar

Please open an issue too so we can track the fix.

Howard
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Edgar Gabriel
2017-03-03 15:32:58 UTC
Done. I added 2.0.3 as the milestone, since I am not sure what the
timeline for 2.1.0 is. I will try to get the fix in over the weekend,
and we can go from there.

Thanks

Edgar