Complete Merging and Aggregation Documentation

This document describes the data merging, aggregation, and combination operations in the AnalysisG framework: template functions, selection merging, and multi-sample aggregation strategies.

Overview

The merging and aggregation system provides:

- Template functions for combining data structures
- Selection template merging across samples
- Event and graph aggregation
- Hierarchical data combination strategies

Primary Module: typecasting (src/AnalysisG/modules/typecasting/include/tools/merge_cast.h)

Related Modules:

- selection: Selection template merging (selection_template::merge(), selection_template::merger())
- container: Multi-sample data management
- analysis: Orchestration of merging operations

Template Merge Functions

Core Merging Templates

The merge_cast.h header provides generic template functions for merging different data types:

Location: src/AnalysisG/modules/typecasting/include/tools/merge_cast.h

1. merge_data() - Combine Data Structures

Merges data from one container into another:

// Vector merging - concatenates vectors
template <typename G>
void merge_data(std::vector<G>* out, std::vector<G>* p2) {
    out->insert(out->end(), p2->begin(), p2->end());
}

// Scalar merging - overwrites
template <typename G>
void merge_data(G* out, G* p2) {
    (*out) = *p2;
}

// Map merging - recursive merge of values
template <typename g, typename G>
void merge_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        merge_data(&(*out)[itr->first], &itr->second);
    }
}

Use Cases:

- Combining selections across multiple samples
- Aggregating event data from different sources
- Merging histogram results (scalar values are overwritten, so use sum_data() when bin contents should be added)

Example - Merging Vectors:

#include <tools/merge_cast.h>

std::vector<float> sample1_jets = {45.2, 67.8, 23.1};
std::vector<float> sample2_jets = {89.4, 12.7};

// Merge sample2 into sample1
merge_data(&sample1_jets, &sample2_jets);
// Result: sample1_jets = {45.2, 67.8, 23.1, 89.4, 12.7}

Example - Merging Maps:

std::map<std::string, std::vector<float>> sample1_data;
sample1_data["jet_pt"] = {45.2, 67.8};
sample1_data["jet_eta"] = {-1.2, 0.5};

std::map<std::string, std::vector<float>> sample2_data;
sample2_data["jet_pt"] = {23.1};
sample2_data["jet_eta"] = {2.3};

// Recursive merge
merge_data(&sample1_data, &sample2_data);
// Result:
// sample1_data["jet_pt"] = {45.2, 67.8, 23.1}
// sample1_data["jet_eta"] = {-1.2, 0.5, 2.3}
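
The map examples above only use vector values. The following sketch shows what happens when the map values are scalars: keys are unioned, and for shared keys the scalar overload overwrites the existing value. The overloads are copied from the listing above so the snippet compiles on its own; merge_counts_demo and its keys are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Overloads reproduced from the listing above.
template <typename G>
void merge_data(std::vector<G>* out, std::vector<G>* p2) {
    out->insert(out->end(), p2->begin(), p2->end());
}

template <typename G>
void merge_data(G* out, G* p2) {
    (*out) = *p2;
}

template <typename g, typename G>
void merge_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        merge_data(&(*out)[itr->first], &itr->second);
    }
}

// Hypothetical helper: merges two maps with scalar values.
std::map<std::string, int> merge_counts_demo() {
    std::map<std::string, int> a;
    a["passed"] = 10;
    a["failed"] = 4;

    std::map<std::string, int> b;
    b["passed"] = 7;
    b["skipped"] = 2;

    merge_data(&a, &b);

    assert(a["passed"] == 7);   // overwritten by b, not summed to 17
    assert(a["failed"] == 4);   // only in a, left untouched
    assert(a["skipped"] == 2);  // only in b, inserted
    return a;
}
```

If the intent is to add the counts instead, sum_data() (next section) is the matching template.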

2. sum_data() - Accumulate Data

Accumulates data with the += operator. Vectors are concatenated rather than added element-wise:

// Scalar summation
template <typename G>
void sum_data(G* out, G* p2) {
    (*out) += (*p2);
}

// Vector concatenation (same as merge_data for vectors)
template <typename G>
void sum_data(std::vector<G>* out, std::vector<G>* p2) {
    out->insert(out->end(), p2->begin(), p2->end());
}

// Map recursive summation
template <typename g, typename G>
void sum_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        sum_data(&(*out)[itr->first], &itr->second);
    }
}

Use Cases:

- Accumulating event counts
- Summing weights across samples
- Combining histograms stored as maps (vector-based histograms are concatenated, not added bin by bin)

Example - Sum of Weights:

float total_weight = 1523.4;
float sample_weight = 876.2;

sum_data(&total_weight, &sample_weight);
// Result: total_weight = 2399.6

Example - Accumulating Histograms:

std::map<int, int> hist1;  // Bin -> Count
hist1[0] = 120;
hist1[1] = 450;
hist1[2] = 230;

std::map<int, int> hist2;
hist2[0] = 80;
hist2[1] = 310;
hist2[2] = 190;

sum_data(&hist1, &hist2);
// Result: hist1[0] = 200, hist1[1] = 760, hist1[2] = 420
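
Two details of sum_data() are easy to miss: a bin that exists only in the second map is created from a zero-initialized value, and vectors are concatenated instead of being added element by element. The sketch below illustrates both, with the overloads copied from the listing above; the demo function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Overloads reproduced from the listing above.
template <typename G>
void sum_data(G* out, G* p2) {
    (*out) += (*p2);
}

template <typename G>
void sum_data(std::vector<G>* out, std::vector<G>* p2) {
    out->insert(out->end(), p2->begin(), p2->end());
}

template <typename g, typename G>
void sum_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        sum_data(&(*out)[itr->first], &itr->second);
    }
}

// Map-based histogram with a bin missing on each side.
std::map<int, int> sum_sparse_hist_demo() {
    std::map<int, int> h1;
    h1[0] = 120;
    h1[1] = 450;

    std::map<int, int> h2;
    h2[0] = 80;
    h2[2] = 190;

    sum_data(&h1, &h2);
    assert(h1[0] == 200);  // shared bin is summed
    assert(h1[1] == 450);  // bin only in h1 is unchanged
    assert(h1[2] == 190);  // bin only in h2 starts from 0
    return h1;
}

// Vector-based "histogram": the bins are appended, not added.
std::size_t sum_vector_hist_demo() {
    std::vector<int> bins1 = {120, 450, 230};
    std::vector<int> bins2 = {80, 310, 190};
    sum_data(&bins1, &bins2);
    assert(bins1[3] == 80);
    return bins1.size();
}
```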

3. contract_data() - Flatten Nested Structures

Converts nested structures into flat vectors:

// Add single element
template <typename g>
void contract_data(std::vector<g>* out, g* p2) {
    out->push_back(*p2);
}

// Flatten 1D vector
template <typename g>
void contract_data(std::vector<g>* out, std::vector<g>* p2) {
    for (size_t i(0); i < p2->size(); ++i) {
        contract_data(out, &p2->at(i));
    }
}

// Flatten 2D vector with reservation
template <typename g>
void contract_data(std::vector<g>* out, std::vector<std::vector<g>>* p2) {
    long ix = 0;
    reserve_count(p2, &ix);
    out->reserve(ix);
    for (size_t i(0); i < p2->size(); ++i) {
        contract_data(out, &p2->at(i));
    }
}

Use Cases:

- Converting event-wise data to flat arrays
- Preparing data for machine learning
- Flattening jet collections across events

Example - Flatten Jet Collections:

// Per-event jet pT collections
std::vector<std::vector<float>> event_jets = {
    {45.2, 67.8, 23.1},   // Event 1: 3 jets
    {89.4, 12.7},         // Event 2: 2 jets
    {34.5, 56.7, 78.9, 90.1}  // Event 3: 4 jets
};

// Flatten to single vector
std::vector<float> all_jets;
contract_data(&all_jets, &event_jets);
// Result: all_jets = {45.2, 67.8, 23.1, 89.4, 12.7, 34.5, 56.7, 78.9, 90.1}
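
The three contract_data() overloads can be mixed on the same output vector, because each call appends to whatever is already there. A minimal sketch, with the templates copied from the listings in this document (reserve_count is declared first so the 2D overload can see it); contract_demo and the values are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reserve_count overloads, reproduced from section 4.
template <typename g>
void reserve_count(g* inp, long* ix) {
    (void)inp;  // unused; silences compiler warnings
    *ix += 1;
}

template <typename g>
void reserve_count(std::vector<g>* inp, long* ix) {
    for (size_t x(0); x < inp->size(); ++x) {
        reserve_count(&inp->at(x), ix);
    }
}

// contract_data overloads, reproduced from the listing above.
template <typename g>
void contract_data(std::vector<g>* out, g* p2) {
    out->push_back(*p2);
}

template <typename g>
void contract_data(std::vector<g>* out, std::vector<g>* p2) {
    for (size_t i(0); i < p2->size(); ++i) {
        contract_data(out, &p2->at(i));
    }
}

template <typename g>
void contract_data(std::vector<g>* out, std::vector<std::vector<g>>* p2) {
    long ix = 0;
    reserve_count(p2, &ix);
    out->reserve(ix);
    for (size_t i(0); i < p2->size(); ++i) {
        contract_data(out, &p2->at(i));
    }
}

// Appends a single value, a 1D vector, and a 2D vector in turn.
std::vector<float> contract_demo() {
    std::vector<float> flat;
    float leading = 45.2f;
    std::vector<float> event = {67.8f, 23.1f};
    std::vector<std::vector<float>> more = {{89.4f}, {}, {12.7f, 34.5f}};

    contract_data(&flat, &leading);  // single element: push_back
    contract_data(&flat, &event);    // 1D: element-by-element append
    contract_data(&flat, &more);     // 2D: empty inner vectors add nothing

    assert(flat.size() == 6);
    return flat;
}
```

Note that the 2D overload reserves only the count of incoming elements, so when the output vector is already populated the reservation is a hint rather than a guarantee against reallocation.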

4. reserve_count() - Pre-calculate Size

Recursively counts elements for vector reservation:

template <typename g>
void reserve_count(g* inp, long* ix) {
    *ix += 1;
}

template <typename g>
void reserve_count(std::vector<g>* inp, long* ix) {
    for (size_t x(0); x < inp->size(); ++x) {
        reserve_count(&inp->at(x), ix);
    }
}

Use Cases:

- Optimizing memory allocation
- Pre-calculating total element count

Example:

std::vector<std::vector<int>> nested = {{1, 2}, {3}, {4, 5, 6}};
long count = 0;
reserve_count(&nested, &count);
// Result: count = 6
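
Because reserve_count() adds to the counter instead of resetting it, the same counter can be reused across several containers, and the recursion works at any nesting depth. The sketch below copies the two overloads from the listing above; the demo function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Overloads reproduced from the listing above.
template <typename g>
void reserve_count(g* inp, long* ix) {
    (void)inp;  // unused; silences compiler warnings
    *ix += 1;
}

template <typename g>
void reserve_count(std::vector<g>* inp, long* ix) {
    for (size_t x(0); x < inp->size(); ++x) {
        reserve_count(&inp->at(x), ix);
    }
}

// Three levels of nesting, including an empty inner vector.
long count_three_levels_demo() {
    std::vector<std::vector<int>> ev1 = {{1, 2}, {3}};
    std::vector<std::vector<int>> ev2 = {{}, {4, 5, 6}};
    std::vector<std::vector<std::vector<int>>> nested = {ev1, ev2};

    long count = 0;
    reserve_count(&nested, &count);
    assert(count == 6);  // only leaf elements are counted
    return count;
}

// One counter shared by two containers.
long count_accumulates_demo() {
    std::vector<int> a = {1, 2, 3};
    std::vector<int> b = {4, 5};

    long count = 0;
    reserve_count(&a, &count);
    reserve_count(&b, &count);  // adds to the existing value
    assert(count == 5);
    return count;
}
```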

Selection Template Merging

Overview

The selection_template class provides two merging methods:

1. merge() - User-overridable method for custom merging logic
2. merger() - Internal method that calls merge() and handles bookkeeping

Location: src/AnalysisG/modules/selection/include/templates/selection_template.h

merge() Method - User Interface

Signature:

virtual void merge(selection_template* sel);

Purpose: User-defined logic for merging selection results from another selection instance

Override Pattern:

class MySelection : public selection_template {
    public:
        // Custom merging logic
        void merge(selection_template* other) override {
            MySelection* other_sel = dynamic_cast<MySelection*>(other);
            if (!other_sel) return;

            // Merge your custom data
            merge_data(&this->my_jets, &other_sel->my_jets);
            merge_data(&this->my_leptons, &other_sel->my_leptons);
            sum_data(&this->total_weight, &other_sel->total_weight);
        }

        std::vector<Jet*> my_jets;
        std::vector<Lepton*> my_leptons;
        float total_weight = 0.0;
};

When Called: Automatically invoked by the analysis framework when combining selections across samples

merger() Method - Internal Framework

Signature:

void merger(selection_template* sl2);

Purpose: Internal method that:

1. Calls the user's merge() method
2. Handles internal bookkeeping
3. Manages sequence tracking
4. Coordinates with write operations

Not User-Overridable: This method handles framework internals

Complete Merging Workflow

Multi-Sample Analysis Merging

When analyzing multiple samples, the framework merges results:

1. Per-Sample Processing:

// Sample 1: ttbar
MySelection* ttbar_sel = new MySelection();
ttbar_sel->process_sample("ttbar.root");

// Sample 2: signal
MySelection* signal_sel = new MySelection();
signal_sel->process_sample("signal.root");

2. Automatic Merging:

// Framework internally calls:
ttbar_sel->merger(signal_sel);
// Which calls:
ttbar_sel->merge(signal_sel);  // User-defined logic

3. Result Combination:

// After merge, ttbar_sel contains combined results:
// - All jets from both samples
// - All leptons from both samples
// - Sum of weights from both samples
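
The relationship between merger() and merge() described above can be sketched with a small stand-in class. This is not the real selection_template: toy_selection, toy_jets, and the n_merged counter are hypothetical, and the actual bookkeeping done by the framework is not shown in this document. The sketch only illustrates the delegation pattern, where a non-virtual entry point calls a virtual user hook.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for selection_template (NOT the framework class).
struct toy_selection {
    virtual ~toy_selection() = default;

    // User hook: overridden by concrete selections.
    virtual void merge(toy_selection*) {}

    // Non-virtual entry point: calls the hook, then does bookkeeping.
    void merger(toy_selection* other) {
        if (!other) return;
        merge(other);
        ++n_merged;  // placeholder for the framework's internal bookkeeping
    }

    int n_merged = 0;
};

// Hypothetical concrete selection holding jet pT values and a weight.
struct toy_jets : toy_selection {
    void merge(toy_selection* other) override {
        auto* o = dynamic_cast<toy_jets*>(other);
        if (!o) return;
        jet_pt.insert(jet_pt.end(), o->jet_pt.begin(), o->jet_pt.end());
        total_weight += o->total_weight;
    }

    std::vector<float> jet_pt;
    float total_weight = 0.0f;
};

// Mirrors the ttbar/signal workflow above with in-memory data.
toy_jets merge_two_samples_demo() {
    toy_jets ttbar;
    ttbar.jet_pt = {45.0f, 67.5f};
    ttbar.total_weight = 100.0f;

    toy_jets signal;
    signal.jet_pt = {89.0f};
    signal.total_weight = 50.0f;

    ttbar.merger(&signal);  // calls toy_jets::merge internally
    assert(ttbar.jet_pt.size() == 3);
    return ttbar;
}
```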

Example: Multi-Sample Selection Merging

Complete example showing selection merging across samples:

#include <templates/selection_template.h>
#include <tools/merge_cast.h>

class TopAnalysisSelection : public selection_template {
    public:
        void merge(selection_template* other) override {
            auto* other_top = dynamic_cast<TopAnalysisSelection*>(other);
            if (!other_top) return;

            // Merge event counts
            sum_data(&this->n_events_processed, &other_top->n_events_processed);
            sum_data(&this->n_events_passed, &other_top->n_events_passed);

            // Merge selected objects
            merge_data(&this->selected_tops, &other_top->selected_tops);
            merge_data(&this->selected_jets, &other_top->selected_jets);
            merge_data(&this->selected_leptons, &other_top->selected_leptons);

            // Sum histograms bin by bin (merge_data would overwrite
            // scalar bin values instead of adding them)
            sum_data(&this->top_mass_hist, &other_top->top_mass_hist);
            sum_data(&this->jet_pt_hist, &other_top->jet_pt_hist);

            // Sum weights
            sum_data(&this->total_weight, &other_top->total_weight);
        }

        // Event statistics
        long n_events_processed = 0;
        long n_events_passed = 0;

        // Selected objects
        std::vector<Top*> selected_tops;
        std::vector<Jet*> selected_jets;
        std::vector<Lepton*> selected_leptons;

        // Histograms
        std::map<int, float> top_mass_hist;  // Bin -> Weight
        std::map<int, float> jet_pt_hist;

        // Weights
        float total_weight = 0.0;
};

// Usage in analysis
void run_multi_sample_analysis() {
    TopAnalysisSelection* combined = new TopAnalysisSelection();

    // Process ttbar sample
    TopAnalysisSelection* ttbar = new TopAnalysisSelection();
    // ... process ttbar events ...
    ttbar->n_events_processed = 100000;
    ttbar->n_events_passed = 5230;
    ttbar->total_weight = 15432.5;

    // Merge ttbar into combined
    combined->merger(ttbar);

    // Process signal sample
    TopAnalysisSelection* signal = new TopAnalysisSelection();
    // ... process signal events ...
    signal->n_events_processed = 50000;
    signal->n_events_passed = 1245;
    signal->total_weight = 8234.7;

    // Merge signal into combined
    combined->merger(signal);

    // Result: combined now contains:
    // n_events_processed = 150000
    // n_events_passed = 6475
    // total_weight = 23667.2
    // All selected objects from both samples
}

Advanced Merging Patterns

Conditional Merging

Only merge if certain conditions are met:

void merge(selection_template* other) override {
    auto* other_sel = dynamic_cast<MySelection*>(other);
    if (!other_sel) return;

    // Only merge if selections are compatible
    if (this->analysis_mode != other_sel->analysis_mode) {
        return;  // Different modes, don't merge
    }

    // Conditional data merging
    if (this->include_systematics && other_sel->include_systematics) {
        merge_data(&this->systematic_variations,
                  &other_sel->systematic_variations);
    }

    // Always merge baseline results
    merge_data(&this->baseline_results, &other_sel->baseline_results);
}

Weighted Merging

Merge with sample-specific weighting:

void merge(selection_template* other) override {
    auto* other_sel = dynamic_cast<MySelection*>(other);
    if (!other_sel) return;

    // Get relative weights
    float this_weight = this->total_weight;
    float other_weight = other_sel->total_weight;
    float total = this_weight + other_weight;

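    // Weighted average for central values (skipped when the combined
    // weight is zero, which would otherwise divide by zero)
    if (total != 0.0f) {
        this->avg_jet_pt = (this->avg_jet_pt * this_weight +
                           other_sel->avg_jet_pt * other_weight) / total;
    }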

    // Sum for counts and total weight
    sum_data(&this->total_weight, &other_sel->total_weight);
    merge_data(&this->all_jets, &other_sel->all_jets);
}
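
The weighted average in this pattern can be checked in isolation. The helper below is a hypothetical free function, not part of AnalysisG; returning 0 when both weights are zero is an assumption of this sketch, chosen only to avoid a division by zero.

```cpp
#include <cassert>

// Hypothetical helper: weighted mean of two values a and b
// with weights wa and wb.
float weighted_mean(float a, float wa, float b, float wb) {
    float total = wa + wb;
    if (total == 0.0f) {
        return 0.0f;  // assumption: nothing contributes, report 0
    }
    return (a * wa + b * wb) / total;
}

// Worked example: (50 * 100 + 80 * 50) / (100 + 50) = 9000 / 150 = 60
void weighted_mean_demo() {
    assert(weighted_mean(50.0f, 100.0f, 80.0f, 50.0f) == 60.0f);
    // A sample with zero weight leaves the other value unchanged.
    assert(weighted_mean(50.0f, 100.0f, 80.0f, 0.0f) == 50.0f);
}
```

The order of operations matters in merge(): the average must be computed before total_weight is updated, since it needs the weight of this selection alone.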

Hierarchical Merging

Merge nested data structures:

void merge(selection_template* other) override {
    auto* other_sel = dynamic_cast<MySelection*>(other);
    if (!other_sel) return;

    // Merge maps of vectors
    merge_data(&this->category_events, &other_sel->category_events);
    // Result: Each category's event list is concatenated

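    // Merge maps of maps
    merge_data(&this->region_histograms, &other_sel->region_histograms);
    // Result: Each region's histograms are recursively merged. Scalar
    // bin values from `other` overwrite existing bins; use sum_data()
    // instead when bins should be added.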
}
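
For nested maps whose leaves are scalars, the choice between merge_data() and sum_data() changes the result. The sketch below runs both on the same input, using the scalar and map overloads copied from the listings earlier in this document; the region_hists alias, region names, and demo functions are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Scalar and map overloads reproduced from the earlier listings.
template <typename G>
void merge_data(G* out, G* p2) {
    (*out) = *p2;
}

template <typename g, typename G>
void merge_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        merge_data(&(*out)[itr->first], &itr->second);
    }
}

template <typename G>
void sum_data(G* out, G* p2) {
    (*out) += (*p2);
}

template <typename g, typename G>
void sum_data(std::map<g, G>* out, std::map<g, G>* p2) {
    typename std::map<g, G>::iterator itr = p2->begin();
    for (; itr != p2->end(); ++itr) {
        sum_data(&(*out)[itr->first], &itr->second);
    }
}

// Hypothetical alias: region name -> (bin -> weight)
using region_hists = std::map<std::string, std::map<int, float>>;

// Builds the same pair of inputs for both demos.
void fill_inputs(region_hists* a, region_hists* b) {
    (*a)["SR"][0] = 1.5f;
    (*b)["SR"][0] = 2.0f;
    (*b)["CR"][1] = 4.0f;
}

float merged_sr_bin_demo() {
    region_hists a, b;
    fill_inputs(&a, &b);
    merge_data(&a, &b);
    assert(a["CR"][1] == 4.0f);  // new region is inserted
    return a["SR"][0];           // overwritten by b
}

float summed_sr_bin_demo() {
    region_hists a, b;
    fill_inputs(&a, &b);
    sum_data(&a, &b);
    assert(a["CR"][1] == 4.0f);  // new region starts from 0
    return a["SR"][0];           // sum of both inputs
}
```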

Integration with Analysis Pipeline

The analysis class orchestrates merging operations:

// In analysis::build_selections()
void build_selections() {
    selection_template* combined = nullptr;

    // Process each sample
    for (auto& [sample_name, sample_path] : file_labels) {
        // Create selection for this sample
        selection_template* sel = selection_names[sample_name]->clone();

        // Process sample events
        process_sample(sel, sample_path);

        // Merge into combined results
        if (!combined) {
            combined = sel;
        } else {
            combined->merger(sel);
            // Caution: if merge() copied raw pointers owned by sel,
            // they dangle once sel is deleted
            delete sel;
        }
    }

    // combined now contains results from all samples
}

Best Practices

1. Always Use merge_data() for Standard Types:

// Good
merge_data(&this->jets, &other->jets);

// Bad (manual iteration)
for (auto& jet : other->jets) {
    this->jets.push_back(jet);
}

2. Use sum_data() for Accumulation:

// Event counts, weights, statistics
sum_data(&this->n_events, &other->n_events);
sum_data(&this->total_weight, &other->total_weight);

3. Dynamic Cast for Safety:

void merge(selection_template* other) override {
    auto* typed_other = dynamic_cast<MySelection*>(other);
    if (!typed_other) {
        std::cerr << "Type mismatch in merge!" << std::endl;
        return;
    }
    // Safe to use typed_other now
}

4. Document Merge Behavior:

/**
 * Merges another TopSelection into this one.
 *
 * Merge behavior:
 * - Event counts: summed
 * - Selected objects: concatenated
 * - Histograms: bin-wise summed
 * - Weights: summed
 */
void merge(selection_template* other) override;

5. Test Merging Logic:

void test_merge() {
    MySelection sel1, sel2;

    // Setup test data
    sel1.jets = {jet1, jet2};
    sel1.total_weight = 100.0;

    sel2.jets = {jet3};
    sel2.total_weight = 50.0;

    // Merge
    sel1.merge(&sel2);

    // Verify
    assert(sel1.jets.size() == 3);
    assert(sel1.total_weight == 150.0);
}
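
Exact equality checks on floating-point weights, as in the test above, hold for values such as 100.0 and 50.0 but can be fragile once many weights are accumulated. A tolerance-based comparison is a common alternative. The sketch below is a minimal version; approx_equal and its default tolerance are hypothetical choices, not part of AnalysisG, and the scalar sum_data overload is copied from the earlier listing.

```cpp
#include <cassert>
#include <cmath>

// Scalar overload reproduced from the sum_data listing.
template <typename G>
void sum_data(G* out, G* p2) {
    (*out) += (*p2);
}

// Hypothetical helper: relative comparison with a floor of 1.0 on the scale,
// so values near zero are compared with an absolute tolerance.
bool approx_equal(float a, float b, float tol = 1e-5f) {
    float scale = std::fmax(1.0f, std::fmax(std::fabs(a), std::fabs(b)));
    return std::fabs(a - b) <= tol * scale;
}

// Accumulates ten weights of 0.1 and checks the total with a tolerance.
float accumulate_weights_demo() {
    float total = 0.0f;
    float w = 0.1f;
    for (int i = 0; i < 10; ++i) {
        sum_data(&total, &w);
    }
    assert(approx_equal(total, 1.0f));
    return total;
}
```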