# pp-ana

An [OpenPMD](https://github.com/openPMD/openPMD-standard) Python post-processor using parallel I/O.
## Motivation

Realistic PIC simulations, for instance those involving 3D grids and/or collisions, usually produce large datasets that do not fit into memory on a single machine.

**pp-ana** uses parallel computing techniques to efficiently process arbitrarily large datasets produced by any [OpenPMD](https://github.com/openPMD/openPMD-standard)-compatible PIC code, such as [WarpX](https://github.com/ECP-WarpX/WarpX), [PIConGPU](https://github.com/ComputationalRadiationPhysics/picongpu), or [fbpic](https://github.com/fbpic/fbpic).

By leveraging [MPI (Message Passing Interface)](https://www.open-mpi.org/) and the [openpmd-api](https://openpmd-api.readthedocs.io/en/0.15.2/) library, **pp-ana** reads input datasets in chunks, which mitigates memory limitations on a single node. Furthermore, **pp-ana** runs well on High-Performance Computing clusters, allowing for scalable and efficient data processing.
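The chunked-read pattern boils down to giving each MPI rank a contiguous slice of the dataset. A minimal sketch of that decomposition (the function name and the even-splitting policy are illustrative, not pp-ana's actual code):

```python
def chunk_bounds(n_total: int, n_ranks: int, rank: int):
    """Return (offset, extent) of the contiguous chunk owned by `rank`.

    Elements are split as evenly as possible; the first
    n_total % n_ranks ranks each receive one extra element.
    """
    base, extra = divmod(n_total, n_ranks)
    offset = rank * base + min(rank, extra)
    extent = base + (1 if rank < extra else 0)
    return offset, extent

# 10 elements over 3 ranks -> (0, 4), (4, 3), (7, 3)
bounds = [chunk_bounds(10, 3, r) for r in range(3)]
```

Each rank would then pass its `(offset, extent)` pair to the openpmd-api chunk-loading call, so that only its own slice is ever resident in memory.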

## Overview

The **pp-ana** design follows a two-step approach:

1. The input OpenPMD simulation data, which usually consist of large datasets, are reduced using selections (kinematical, geometrical, ...), re-sampling, or particle-merging methods. The data is then stored as reduced datasets. This step is done once, or only a few times.
2. The post-processing (analysis), which usually consists in visualizing the data with different kinds of plots, can be performed by directly reading the reduced datasets provided by the first step. This step is usually run many times and thus naturally benefits from the smaller datasets to be read.

To implement this two-step approach, **pp-ana** provides two main post-processing components:

- (`opmd_filter.py`) the filtering program, which processes OpenPMD data files in parallel and generates [Parquet files](https://parquet.apache.org/) for both field and particle data.

- (`opmd_pq_reader.py`) the main analysis program, which reads the generated [Parquet files](https://parquet.apache.org/) to produce histograms and visualizations ([matplotlib](https://matplotlib.org/)).

The code is designed to leverage parallel processing using [MPI (Message Passing Interface)](https://www.open-mpi.org/) via the [mpi4py python interface](https://mpi4py.readthedocs.io/en/stable/) and [Dask](https://www.dask.org/) for efficient data handling.
## Workflow

1. **Data Processing**: The filtering script reads the simulation data from OpenPMD files in parallel, extracts/selects the relevant field and particle information, and saves it in an efficient structured columnar format, [Parquet](https://parquet.apache.org/), for further analysis.

2. **Parallel Computing**: The code utilizes MPI for parallel processing, allowing multiple processes to work on different parts of the data simultaneously. This is particularly useful for large datasets where reducing memory usage per node is essential.
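A common way to realize this parallel step is a static round-robin split of the work items (files or iterations) over the ranks. This sketch is illustrative and not necessarily pp-ana's actual scheduling; the file names are invented for the example:

```python
def assign_round_robin(items, n_ranks):
    """Give each rank every n_ranks-th work item, starting at its own index."""
    return {rank: items[rank::n_ranks] for rank in range(n_ranks)}

files = ["data_000.bp", "data_100.bp", "data_200.bp", "data_300.bp"]
owned = assign_round_robin(files, 2)
# rank 0 -> data_000.bp, data_200.bp ; rank 1 -> data_100.bp, data_300.bp
```

With MPI, each rank would look up its own entry via `comm.Get_rank()` and process only those files.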

3. **Data Storage**: The use of [Parquet files](https://parquet.apache.org/) provides an efficient way to store large amounts of data with support for compression, making it easier to read and write data in a distributed environment. This format has proven to be very efficient on the [Lustre filesystem](https://www.lustre.org/) installed on the [GSI Virgo cluster](https://hpc.gsi.de/virgo/).

4. **Data Analysis**: The analysis code reads the Parquet files and performs various analyses, including generating histograms of particle energy and field data visualizations.
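The particle-energy part of the analysis reduces to a standard relativistic formula plus a NumPy histogram. A sketch, assuming momenta are stored normalized as u = p/(m·c) and taking the electron rest energy; the momentum values are invented:

```python
import numpy as np

# Normalized momenta u = p / (m c) for a few macro-particles.
ux = np.array([0.0, 1.0, 2.0])
uy = np.zeros(3)
uz = np.zeros(3)

gamma = np.sqrt(1.0 + ux**2 + uy**2 + uz**2)  # Lorentz factor
mc2_mev = 0.511                               # electron rest energy in MeV
ekin_mev = (gamma - 1.0) * mc2_mev            # kinetic energy per particle

counts, edges = np.histogram(ekin_mev, bins=4, range=(0.0, 1.0))
```

`counts` and `edges` would then be drawn with matplotlib, e.g. `plt.stairs(counts, edges)`.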
## Main components

### Filtering Script (`opmd_filter.py`)

- **Command Line Arguments**:
  - `--opmd_dir` or `-d`: Directory containing OpenPMD input files.
  - `--opmd_file` or `-f`: Specific OpenPMD file to process (optional).

  - `--output_dir` or `-o`: Directory to save the output [Parquet files](https://parquet.apache.org/).

  - `--species` or `-s`: Particle species name (default: "electrons").
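The options above map onto a standard `argparse` setup. The following is a sketch of an equivalent interface, not a copy of the script's actual code:

```python
import argparse

parser = argparse.ArgumentParser(description="Filter OpenPMD data (sketch).")
parser.add_argument("--opmd_dir", "-d", help="directory with OpenPMD input files")
parser.add_argument("--opmd_file", "-f", help="specific OpenPMD file (optional)")
parser.add_argument("--output_dir", "-o", help="directory for the Parquet output")
parser.add_argument("--species", "-s", default="electrons",
                    help="particle species name")

# Parse an example command line instead of sys.argv:
args = parser.parse_args(["-d", "./sim", "-o", "./out"])
```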

- **Implemented Features**:

  - Traverses the specified OpenPMD directory to find relevant simulation files.
  - Reads electric field data (Ex, Ey, Ez) and particle data (positions and momenta).
  - Normalizes and filters particle data based on energy thresholds.

  - Saves electric field and particle data as [Parquet files](https://parquet.apache.org/), with metadata for field information.
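The energy-threshold filtering amounts to a boolean mask over the particle arrays; a minimal sketch (the threshold and energy values are invented):

```python
import numpy as np

# Illustrative per-particle kinetic energies (MeV) and a lower cut.
ekin_mev = np.array([0.05, 0.8, 2.3, 0.01, 5.0])
threshold_mev = 0.5

mask = ekin_mev > threshold_mev   # True for particles passing the cut
selected = ekin_mev[mask]         # reduced dataset written to Parquet
```

The same mask would be applied to every per-particle column (positions and momenta) so the reduced dataset stays consistent.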

### Analysis Script (`opmd_pq_reader.py`)

- **Command Line Arguments**:
  - `--pq_dir` or `-d`: Directory containing the Parquet files.
  - `--output_dir` or `-o`: Directory to save the output plots.
  - `--opmd_file` or `-f`: Specific OpenPMD file to analyze.
  - `--species` or `-s`: Particle species name (default: "electrons").
  - `--analyze` or `-a`: Type of analysis to perform: 'field', 'particle', or 'full' (default: 'full').

- **Implemented Features**:

  - Initializes analyzers for field and particle data based on user input.

  - Reads particle data and calculates energy, generating 2D/1D histograms of particle distributions.

  - Analyzes the divergence of particle momenta.

  - Reads electric/magnetic field data and generates visualizations for any 2D projection, sliced at the midpoint of the non-visible direction.
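The divergence can be sketched as the angle between a particle's momentum and the propagation axis (taking z as that axis here; this is a common convention, not necessarily the script's exact definition, and the momentum values are invented):

```python
import numpy as np

# Normalized momenta; z is assumed to be the propagation axis.
ux = np.array([0.0, 0.1, -0.1])
uz = np.ones(3)

theta_x = np.arctan2(ux, uz)   # divergence angle in the x-z plane (rad)
theta_x_mrad = theta_x * 1e3   # often quoted in milliradians
```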

## Requirements
- Python 3.x
- Required libraries:
  - `numpy`
  - `pandas`
  - `dask`
  - `mpi4py`
  - `pyarrow`
  - `scipy`
  - `openpmd-api`
  - `openpmd-viewer`

You can install the required libraries using pip:
```bash
pip install numpy pandas dask mpi4py pyarrow scipy openpmd-api openpmd-viewer
```
## Installation
### Clone the repository:
```bash
git clone https://git.gsi.de/d.bertini/pp-ana
cd pp-ana
```
The main components are located in the `analysis/` directory.
## Usage
### Run the filtering process
```bash
mpirun -np <num_processes> python opmd_filter.py -d <opmd_directory> -f <opmd_file> -o <output_directory>
```
- `mpirun -np <num_processes>`: Specifies the number of parallel processes to run. Replace `<num_processes>` with the desired number of MPI processes.
- `python opmd_filter.py`: The command to execute the filtering script.
- `-d <opmd_directory>` or `--opmd_dir <opmd_directory>`: The directory containing the OpenPMD input files. Replace `<opmd_directory>` with the path to your OpenPMD data.
- `-f <opmd_file>` or `--opmd_file <opmd_file>`: (Optional) The specific OpenPMD file to process. If not provided, the script processes all files in the specified directory.
- `-o <output_directory>` or `--output_dir <output_directory>`: The directory where the output Parquet files will be saved. Replace `<output_directory>` with the desired output path.
- `-s <species>` or `--species <species>`: (Optional) The particle species name to filter (default: "electrons"). Replace `<species>` with the desired species name.
### Run the analysis process
```bash
mpirun -np <num_processes> python opmd_pq_reader.py -d <parquet_directory> -o <output_directory> -f <opmd_file> -a <analysis_type>
```
- `mpirun -np <num_processes>`: Specifies the number of parallel processes to run. Replace `<num_processes>` with the desired number of MPI processes.
- `python opmd_pq_reader.py`: The command to execute the analysis script.
- `-d <parquet_directory>` or `--pq_dir <parquet_directory>`: The directory containing the Parquet files generated by the filtering script. Replace `<parquet_directory>` with the path to your Parquet files.
- `-o <output_directory>` or `--output_dir <output_directory>`: The directory where the output plots will be saved. Replace `<output_directory>` with the desired output path.
- `-f <opmd_file>` or `--opmd_file <opmd_file>`: The specific OpenPMD file to analyze. This should match the file processed in the filtering step.
- `-s <species>` or `--species <species>`: (Optional) The particle species name to analyze (default: "electrons"). Replace `<species>` with the desired species name.
- `-a <analysis_type>` or `--analyze <analysis_type>`: Specifies which type of analysis to run. Options include:
  - `field`: Analyze only the electric field data.
  - `particle`: Analyze only the particle data.
  - `full`: Perform both field and particle analyses (default).
### Examples
- To filter data from a specific OpenPMD file:
```bash
mpirun -np 4 python opmd_filter.py -d /path/to/opmd_data -f simulation.bp -o /path/to/output
```
- To analyze the generated Parquet files and create histograms:
```bash
mpirun -np 4 python opmd_pq_reader.py -d /path/to/output/simulation/ -o /path/to/plots/ -f simulation.bp -a full
```
## Acknowledgments

Special thanks to the [openPMD](https://github.com/openPMD) community, and particularly the [openPMD-api](https://github.com/openPMD/openPMD-api) developers, for their support and feedback.