Merge branch 'development' into 'main'

New design using analyse steps 1 - reducing openpmd files to columnar parquet... See merge request !2

Commit 96b47517 authored 3 months ago by d.bertini
--- a/README.md
+++ b/README.md
-# PP Ana
+# pp-ana
+
 [OpenPMD](https://github.com/openPMD/openPMD-standard) python post processor using parallel I/O.

-It enables the reading of large PIC simulation [openPMD](https://github.com/openPMD/openPMD-standard) datasets.
+## Overview
+**pp-ana** enables the reading of large PIC simulation [openPMD](https://github.com/openPMD/openPMD-standard) datasets.
 It uses internally the [openpmd-api](https://openpmd-api.readthedocs.io/en/0.15.2/) for parallel
 reading/writing. 

+The **pp-ana** design follows a two-step approach:
+- 1) The input openPMD simulation data, which usually consists of large datasets, is reduced using selections (kinematic, geometric, ...), re-sampling, or
+particle-merging methods. The result is then stored as reduced datasets. This step is done once or only a few times.
+
+- 2) The post-processing (analysis), which usually consists of visualizing the data with different kinds of plots, reads the reduced
+datasets produced by the first step directly. This step is usually repeated many times, so it benefits from the smaller amount of data to read.
+
+To implement this two-step approach, **pp-ana** contains two main post-processing components:
+- (`opmdfilter.py`) the filtering program, which processes OpenPMD data files in parallel and generates [Parquet files](https://parquet.apache.org/)
+for both field and particle data
+- (`analyze.py`) the main analysis program, which reads the generated [Parquet files](https://parquet.apache.org/)
+to produce histograms and visualizations ([matplotlib](https://matplotlib.org/)).
+
+The code is designed to leverage parallel processing using [MPI (Message Passing Interface)](https://www.open-mpi.org/) via the [mpi4py Python interface](https://mpi4py.readthedocs.io/en/stable/), and [Dask](https://www.dask.org/) for efficient data handling.
+
+## Workflow
+
+1. **Data Processing**: The filtering script reads the simulation data from OpenPMD files in parallel, extracts and selects the relevant field and particle information, and saves it in an efficient, structured columnar format ([Parquet](https://parquet.apache.org/)) for further analysis.
+
+2. **Parallel Computing**: The code utilizes MPI for parallel processing, allowing multiple processes to work on different parts of the data simultaneously. This is particularly useful for large datasets where reducing memory usage per node is essential.
+
+3. **Data Storage**: [Parquet files](https://parquet.apache.org/) store large amounts of data efficiently and support compression, which makes reading and writing data in a distributed environment easier. This format has proven to be very efficient on the [Lustre filesystem](https://www.lustre.org/) installed on the [GSI Virgo cluster](https://hpc.gsi.de/virgo/).
+
+4. **Data Analysis**: The analysis code reads the Parquet files and performs various analyses, including generating histograms of particle energy and field data visualizations.
+
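The parallel-computing step above can be illustrated with a minimal sketch of how a list of work items (for example, openPMD iterations) might be divided among MPI ranks. The function name `chunk_for_rank` and the contiguous-block strategy are assumptions made for illustration; the distribution scheme actually used by pp-ana is not described in this README. In a real run, `rank` and `size` would come from `mpi4py` (`MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()`); here they are passed in explicitly so the sketch runs without MPI.

```python
def chunk_for_rank(items, rank, size):
    """Return the contiguous slice of `items` assigned to `rank` out of `size` ranks.

    The first `len(items) % size` ranks receive one extra item, so chunk
    sizes differ by at most one.
    """
    if size < 1 or not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    base, extra = divmod(len(items), size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return items[start:stop]


if __name__ == "__main__":
    iterations = list(range(10))  # hypothetical iteration indices
    for rank in range(4):
        # Each rank would process only its own chunk, keeping memory per process low.
        print(rank, chunk_for_rank(iterations, rank, 4))
```

Because every rank handles only its own chunk, the memory footprint per process shrinks roughly in proportion to the number of ranks.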
+## Main components
+
+### Filtering Script (`opmdfilter.py`)
+
+- **Command Line Arguments**:
+  - `--opmd_dir` or `-d`: Directory containing OpenPMD input files.
+  - `--opmd_file` or `-f`: Specific OpenPMD file to process (optional).
+  - `--output_dir` or `-o`: Directory to save the output [Parquet files](https://parquet.apache.org/).
+  - `--species` or `-s`: Particle species name (default: "electrons").
+
+- **Implemented Features**:
+  - Traverses the specified OpenPMD directory to find relevant simulation files.
+  - Reads electric field data (Ex, Ey, Ez) and particle data (positions and momenta).
+  - Normalizes and filters particle data based on energy thresholds.
+  - Saves electric field and particle data as [Parquet files](https://parquet.apache.org/), with metadata for field information.
+
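The energy-threshold filtering listed above can be sketched as follows. This is a minimal illustration, assuming momenta are given in normalized units u = p / (m c) and that the cut is applied to the kinetic energy. The function names, the threshold value, and the unit convention are assumptions, not the script's actual implementation.

```python
import math

ELECTRON_REST_ENERGY_MEV = 0.51099895  # m_e c^2 in MeV


def kinetic_energy_mev(ux, uy, uz, rest_energy_mev=ELECTRON_REST_ENERGY_MEV):
    """Kinetic energy from normalized momentum components u = p / (m c)."""
    gamma = math.sqrt(1.0 + ux * ux + uy * uy + uz * uz)
    return (gamma - 1.0) * rest_energy_mev


def filter_by_energy(particles, e_min_mev):
    """Keep particles (ux, uy, uz) whose kinetic energy is at least e_min_mev."""
    return [p for p in particles if kinetic_energy_mev(*p) >= e_min_mev]


if __name__ == "__main__":
    # Hypothetical sample: one particle at rest and two relativistic ones.
    sample = [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 10.0)]
    print(filter_by_energy(sample, e_min_mev=1.0))
```

Applying such a cut before writing the Parquet files is what makes the stored datasets smaller than the original simulation output.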
+### Analysis Script (`analyze.py`)
+
+- **Command Line Arguments**:
+  - `--pq_dir` or `-d`: Directory containing the Parquet files.
+  - `--output_dir` or `-o`: Directory to save the output plots.
+  - `--opmd_file` or `-f`: Specific OpenPMD file to analyze.
+  - `--species` or `-s`: Particle species name (default: "electrons").
+  - `--analyze` or `-a`: Type of analysis to perform: 'field', 'particle', or 'full' (default: 'full').
+
+- **Implemented Features**:
+  - Initializes analyzers for field and particle data based on user input.
+  - Reads particle data and calculates energy, generating 2D/1D histograms of particle distributions.
+  - Analyzes divergence of particle momenta.
+  - Reads electric/magnetic field data and generates visualizations in any 2D projection, taken at the midpoint of the direction that is not displayed.
+
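As an illustration of the divergence and histogram features listed above, the sketch below computes a per-particle divergence angle and bins the result. It assumes that x is the propagation direction and uses a plain-Python histogram to stay dependency-free; both choices, as well as the function names, are assumptions for illustration (in practice a library routine such as `numpy.histogram` would be the more likely choice).

```python
import math


def divergence_angle(px, py, pz):
    """Angle (radians) between the momentum vector and the x axis.

    Assumes x is the propagation direction (an illustrative choice).
    """
    return math.atan2(math.hypot(py, pz), px)


def histogram_1d(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins on [lo, hi]; out-of-range values are dropped."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bin.
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical momenta (px, py, pz) in arbitrary units.
    momenta = [(10.0, 0.1, 0.0), (10.0, 0.5, 0.5), (5.0, 1.0, 0.0)]
    angles = [divergence_angle(*p) for p in momenta]
    print(histogram_1d(angles, n_bins=4, lo=0.0, hi=0.2))
```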
+## Requirements
+
+- Python 3.x
+- Required libraries:
+  - `numpy`
+  - `pandas`
+  - `dask`
+  - `mpi4py`
+  - `pyarrow`
+  - `scipy`
+  - `openpmd-api`
+  - `openpmd-viewer`
+  
+   
+You can install the required libraries using pip:
+
+```bash
+pip install numpy pandas dask mpi4py pyarrow scipy openpmd-api openpmd-viewer
+```
+
+## Installation
+
+### Clone the repository:
+   ```bash
+   git clone https://git.gsi.de/d.bertini/pp-ana
+   cd pp-ana
+   ```
+The main components are located in the `/analysis` directory.
+
+## Usage
+
+### Run the filtering process
+
+```bash
+mpirun -np <num_processes> python opmd_filter.py -d <opmd_directory> -f <opmd_file> -o <output_directory> 
+```
+  
+- `mpirun -np <num_processes>`: Specifies the number of parallel processes to run. Replace `<num_processes>` with the desired number of MPI processes.
+
+- `python opmd_filter.py`: The command to execute the filtering script.
+
+- `-d <opmd_directory>` or `--opmd_dir <opmd_directory>`: The directory containing the OpenPMD input files. Replace `<opmd_directory>` with the path to your OpenPMD data.
+
+- `-f <opmd_file>` or `--opmd_file <opmd_file>`: (Optional) The specific OpenPMD file to process. If not provided, the script will process all files in the specified directory.
+
+- `-o <output_directory>` or `--output_dir <output_directory>`: The directory where the output Parquet files will be saved. Replace `<output_directory>` with the desired output path.
+
+- `-s <species>` or `--species <species>`: (Optional) The particle species name to filter (default is "electrons"). Replace `<species>` with the desired species name. 
+   
+   
+### Run the analysis process
+
+```bash
+mpirun -np <num_processes> python opmd_pq_reader.py -d <parquet_directory> -o <output_directory> -f <opmd_file> -a <analysis_type>
+```
+	
+- `mpirun -np <num_processes>`: Specifies the number of parallel processes to run. Replace `<num_processes>` with the desired number of MPI processes.
+
+- `python opmd_pq_reader.py`: The command to execute the analysis script.
+
+- `-d <parquet_directory>` or `--pq_dir <parquet_directory>`: The directory containing the Parquet files generated by the filtering script. Replace `<parquet_directory>` with the path to your Parquet files.
+
+- `-o <output_directory>` or `--output_dir <output_directory>`: The directory where the output plots will be saved. Replace `<output_directory>` with the desired output path.
+
+- `-f <opmd_file>` or `--opmd_file <opmd_file>`: The specific OpenPMD file to analyze. This should match the file processed in the filtering step.
+
+- `-s <species>` or `--species <species>`: (Optional) The particle species name to analyze (default is "electrons"). Replace `<species>` with the desired species name.
+
+-  `-a <analysis_type>` or `--analyze <analysis_type>`: Specifies which type of analysis to run. Options include:
+	- `field`: Analyze only the electric field data.
+	- `particle`: Analyze only the particle data.
+	- `full`: Perform both field and particle analyses (default).
+
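The options described above map naturally onto Python's standard `argparse` module. The following is a minimal sketch of such a parser, not the script's actual code: the option names and defaults are taken from this README, while the function name `build_parser`, the help strings, and the choice of which options are required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented analysis options."""
    parser = argparse.ArgumentParser(
        description="Analyze Parquet files (illustrative sketch)."
    )
    # Treating -d and -o as required is an assumption of this sketch.
    parser.add_argument("-d", "--pq_dir", required=True,
                        help="Directory containing the Parquet files.")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="Directory where the output plots are saved.")
    parser.add_argument("-f", "--opmd_file",
                        help="OpenPMD file that was processed in the filtering step.")
    parser.add_argument("-s", "--species", default="electrons",
                        help="Particle species name.")
    parser.add_argument("-a", "--analyze", choices=["field", "particle", "full"],
                        default="full", help="Type of analysis to run.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-d", "/path/to/output/simulation/", "-o", "/path/to/plots/", "-a", "field"]
    )
    print(args.pq_dir, args.species, args.analyze)
```

Restricting `--analyze` with `choices` lets `argparse` reject unsupported analysis types before any data is read.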
+### Examples
+
+- To filter data from a specific OpenPMD file:
+
+	```bash
+	mpirun -np 4 python opmd_filter.py -d /path/to/opmd_data -f simulation.bp -o /path/to/output
+	```
+
+- To analyze the generated Parquet files and create histograms:
+
+	```bash
+	mpirun -np 4 python opmd_pq_reader.py -d /path/to/output/simulation/ -o /path/to/plots/ -f simulation.bp -a full
+	```
+
+## Acknowledgments
+
+Special thanks to the [openpmd](https://github.com/openPMD) community and particularly the [openpmd-api](https://github.com/openPMD/openPMD-api)
+developers for their support and feedback.
+
+## Contact
+
+For any questions or inquiries, please contact [d.bertini@gsi.de](mailto:D.Bertini@gsi.de) or [j.hornung@gsi.de](mailto:J.Hornung@gsi.de).