# Artis Virgo

Submit scripts using local storage for the 3D Monte Carlo radiative transfer code [Artis](https://github.com/artis-mcrt/artis).

## Rationale

The MPI code [Artis](https://github.com/artis-mcrt/artis) uses a file-per-process I/O pattern, which tends to be inefficient and unstable on the `Lustre` shared file system.

Additionally, the code generates a lot of small output files and performs `write` operations (using the C I/O calls `fprintf(...)` and `fwrite(...)`) as well as `read` operations (using `fread(...)`) during runtime.

The code itself shows excellent scalability properties but suffers from sporadic drops in performance caused by this particular I/O pattern.

## Submit scripts
New submit scripts have been written in order to make use of the local storage `/tmp` on the nodes participating in the MPI job.
This is possible since the Artis code does not use parallel I/O (MPI I/O) and does not share files between processes.
The scripts are available in the `/scripts` directory of this repository:

```
- set_packages.sh
- artis-local.sh
- artis-local-submit.sh
```

## artis-local.sh
This script defines a local storage directory `$MYTMP` as `/tmp/$USER/$SLURM_JOB_ID`, into which all the relevant input files and executables needed to run Artis are copied.
When the job is submitted to the queue system, the variable `$USER` is translated into your user name and `$SLURM_JOB_ID` into the job ID number.
For example, when a user `collins` submits a new job that gets job ID `4423` assigned by `SLURM`, `$MYTMP` on the compute nodes becomes `/tmp/collins/4423`.
This eliminates the chance that the `$MYTMP` content is overwritten by another job that user `collins` may submit later.
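
A minimal sketch of what this staging step could look like (the executable name `sn3d` and the `*.txt` input pattern are placeholders; the real `artis-local.sh` may differ in detail):

```bash
# define the node-local working directory
export MYTMP=/tmp/$USER/$SLURM_JOB_ID

# create it on every node of the allocation (one task per node)
srun --ntasks-per-node=1 mkdir -p "$MYTMP"

# stage executable and input files from the Lustre submit directory to each node;
# the bash -c wrapper lets the *.txt glob expand on the compute node
srun --ntasks-per-node=1 bash -c "cp $SLURM_SUBMIT_DIR/sn3d $SLURM_SUBMIT_DIR/*.txt $MYTMP/"
```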
When Artis completes, all result files, i.e.
```
- estimators*.out
- output_*.out
- packets00*.out
- packets0*.tmp
```
will be copied back to a newly created directory named `output_$SLURM_JOB_ID`, which will then contain all the Artis output files with exactly the same file layout.

Post-analysis can then be done without any code modification.
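
As an illustration, the copy-back could look roughly like this (a sketch; `$SLURM_SUBMIT_DIR` as the Lustre target and the exact `srun` invocation are assumptions, only the file patterns come from the list above):

```bash
# create the output directory on the shared Lustre file system
OUTDIR=$SLURM_SUBMIT_DIR/output_$SLURM_JOB_ID
mkdir -p "$OUTDIR"

# let every node copy its own local results back; the bash -c wrapper makes
# sure the globs are expanded on the compute nodes, not on the submitting one
srun --ntasks-per-node=1 bash -c \
    "cp $MYTMP/estimators*.out $MYTMP/output_*.out $MYTMP/packets00*.out $MYTMP/packets0*.tmp $OUTDIR/"
```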

## artis-local.sh copy mechanism

Two copy options are supported:

```bash
./artis-local.sh -c cp    # direct copy
./artis-local.sh -c tar   # archive + copy
```

- `Direct copy mechanism`: all Artis output files will be sequentially copied to the newly created `output_$SLURM_JOB_ID` Lustre directory.

- `Archiving + copy`: the local `/tmp/$USER/$SLURM_JOB_ID` directory is first archived with `tar` and only the archive is then copied to the `Lustre` output directory. This option can be useful when the job runs on multiple nodes, to avoid excessive I/O traffic from the local to the shared filesystem. Both modes are sketched below.
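
A rough sketch of how the two modes could differ inside `artis-local.sh` (`$COPY_MODE` stands in for the value passed with `-c`, `$OUTDIR` for the `output_$SLURM_JOB_ID` directory on Lustre; the actual implementation may differ):

```bash
case "$COPY_MODE" in
  cp)
    # direct copy: every result file is transferred to Lustre individually
    cp "$MYTMP"/*.out "$MYTMP"/*.tmp "$OUTDIR"/
    ;;
  tar)
    # archive first, then move a single file per node to Lustre,
    # which reduces the number of transfers and metadata operations
    tar -C "$MYTMP" -cf "$OUTDIR/output_${SLURM_JOB_ID}_$(hostname).tar" .
    ;;
esac
```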


## artis-local-submit.sh
The job is submitted to the cluster queue with the `SLURM` command `sbatch`, using `artis-local-submit.sh` as the main submit script.
This script sets up the software dependencies once, i.e. mainly the `gcc` compiler and the `openMPI` and `gsl` external libraries.
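
Judging from the job output shown further below, the dependency setup presumably boils down to a few `spack load` calls, roughly like this (a sketch, not the verbatim content of the scripts):

```bash
# load the tool chain and libraries from the spack installation
for pkg in openmpi gcc gsl ; do
    echo "spack loading $pkg"
    spack load "$pkg"
done

# quick sanity check that compiler and MPI wrapper are picked up
type gcc
type mpicc
```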

## Integrate with Artis
To integrate with your Artis program, simply copy all the scripts into the main Artis working directory on `Lustre` and submit from that directory.
SLURM output and error will be redirected to:

```
- $SLURM_JOB_ID.out.log
- $SLURM_JOB_ID.err.log
```
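
In `SLURM` terms this corresponds to header directives along these lines (a sketch; the `--job-name` value is taken from the cleanup example further below, and the directives actually used in `artis-local-submit.sh` may differ):

```bash
#!/bin/bash
#SBATCH --job-name=artis_l
#SBATCH --output=%j.out.log   # %j expands to the SLURM job ID
#SBATCH --error=%j.err.log
```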
When everything works, a typical output for `10 nodes` would be:

```
spack loading openmpi
spack loading gcc
spack loading gsl
gcc is /cvmfs/vae.gsi.de/centos7/spack-0.17/opt/linux-centos7-x86_64/gcc-4.8.5/gcc-8.1.0-nswpump2zjkpne3ipmxkqt75dq6s2g7w/bin/gcc
mpicc is /cvmfs/vae.gsi.de/centos7/spack-0.17/opt/linux-centos7-x86_64/gcc-8.1.0/openmpi-3.1.6-55d4p7423fcg6figm3s67efkv5vlefcc/bin/mpicc
creating dir:  /tmp/dbertini/51281895  on node: lxbk1034
creating dir:  /tmp/dbertini/51281895  on node: lxbk1036
creating dir:  /tmp/dbertini/51281895  on node: lxbk1037
creating dir:  /tmp/dbertini/51281895  on node: lxbk1050
creating dir:  /tmp/dbertini/51281895  on node: lxbk1075
creating dir:  /tmp/dbertini/51281895  on node: lxbk1048
creating dir:  /tmp/dbertini/51281895  on node: lxbk1035
creating dir:  /tmp/dbertini/51281895  on node: lxbk1073
creating dir:  /tmp/dbertini/51281895  on node: lxbk1074
creating dir:  /tmp/dbertini/51281895  on node: lxbk1049
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1034
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1075
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1074
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1048
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1050
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1036
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1037
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1035
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1073
copying bak to dir:  /lustre/rz/dbertini/ccollins/test_run/output_51281895  from node:  lxbk1049
removing local directory: /tmp/dbertini/51281895  on node: lxbk1075
removing local directory: /tmp/dbertini/51281895  on node: lxbk1035
removing local directory: /tmp/dbertini/51281895  on node: lxbk1073
removing local directory: /tmp/dbertini/51281895  on node: lxbk1036
removing local directory: /tmp/dbertini/51281895  on node: lxbk1074
removing local directory: /tmp/dbertini/51281895  on node: lxbk1050
removing local directory: /tmp/dbertini/51281895  on node: lxbk1048
removing local directory: /tmp/dbertini/51281895  on node: lxbk1049
removing local directory: /tmp/dbertini/51281895  on node: lxbk1037
removing local directory: /tmp/dbertini/51281895  on node: lxbk1034
```

## Cleanup scripts
If for any reason your job crashed or was cancelled, you will need to clean up the `/tmp` directories on the cluster nodes that were used.
To ease this process, a cleanup script is provided in the `utils` directory of this repository.
To use it you will need to give as arguments:

- how many `days` to look back in time

- the `job_name`

For example, to clean up the `/tmp` directories used by all jobs with name `artis_l` during the last 2 days:

```
./cleanup.sh -d 2 -j artis_l
```
giving the output:

```
Cleanup jobs with job_name:  artis-l from date:  2022-05-28
cleanup will execute on nodelist:  lxbk[1047-1056]  corresponding to:  10  nodes.
Submitted batch job 52711719
```
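
For reference, a rough sketch of what such a cleanup script could do, inferred from the output above (the real `utils/cleanup.sh` may work differently):

```bash
#!/bin/bash
# illustrative sketch only, not the shipped cleanup.sh
DAYS=$1       # e.g. 2       (the -d argument of the real script)
JOB_NAME=$2   # e.g. artis_l (the -j argument of the real script)

SINCE=$(date -d "-${DAYS} days" +%Y-%m-%d)
echo "Cleanup jobs with job_name: $JOB_NAME from date: $SINCE"

# nodes used by matching jobs since that date (compressed form, e.g. lxbk[1047-1056]);
# for simplicity only the first matching job's node list is handled here
NODELIST=$(sacct -X --noheader --name="$JOB_NAME" --starttime="$SINCE" \
                 --format=NodeList%64 | awk 'NF {print $1; exit}')
NNODES=$(scontrol show hostnames "$NODELIST" | wc -l)
echo "cleanup will execute on nodelist: $NODELIST corresponding to: $NNODES nodes."

# submit a small job that removes the user's stale /tmp directories, one task per node
sbatch --nodelist="$NODELIST" --nodes="$NNODES" --ntasks-per-node=1 \
       --wrap="srun bash -c 'rm -rf /tmp/$USER/*'"
```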