sparrow 2.0.0
C++20 idiomatic APIs for the Apache Arrow Columnar Format
Record Batch

Overview

The record_batch class provides a table-like data structure for storing columnar data with named fields. It represents a collection of equal-length arrays mapped to unique column names, similar to a database table or DataFrame where each array forms a column.

This class is particularly useful for:

  • Working with tabular data in Arrow format
  • Interoperating with Arrow C structures
  • Batch processing of columnar data
  • Serialization and data exchange

Key Features

  • Column-oriented storage: Data is stored as arrays with associated names
  • Arrow compatibility: Direct integration with Arrow C structures (ArrowArray and ArrowSchema)
  • Type safety: Built on top of the struct_array class
  • Efficient access: O(1) column access by index, O(1) average access by name
  • Validation: Ensures all arrays have the same length and column names are unique
  • Flexible construction: Multiple ways to create record batches

Basic Usage

Creating a Record Batch

From Separate Names and Arrays

// Create arrays
sparrow::primitive_array<int32_t> id_array({1, 2, 3, 4, 5});
// Strings are variable-length, so they go in a string_array rather than a primitive_array
sparrow::string_array name_array(std::vector<std::string>{"Alice", "Bob", "Charlie", "David", "Eve"});
sparrow::primitive_array<int32_t> age_array({25, 30, 35, 40, 45});
// Create column names
std::vector<std::string> names = {"id", "name", "age"};
// Create arrays list
std::vector<sparrow::array> columns = {
    sparrow::array(std::move(id_array)),
    sparrow::array(std::move(name_array)),
    sparrow::array(std::move(age_array))
};
// Create record batch
sparrow::record_batch batch(names, columns, "employee_data");

From Named Arrays

// Create arrays with names
auto iota = std::ranges::iota_view{0, 10};
sparrow::primitive_array<uint16_t> pr0(
    iota | std::views::transform([](auto i) { return static_cast<uint16_t>(i); }),
    true,    // nullable
    "first"  // column name
);
auto iota2 = std::ranges::iota_view{4, 14};
sparrow::primitive_array<int32_t> pr1(iota2, true, "second");
auto iota3 = std::ranges::iota_view{2, 12};
sparrow::primitive_array<int32_t> pr2(iota3, true, "third");
// Create record batch from named arrays
std::vector<sparrow::array> named_columns = {
    sparrow::array(std::move(pr0)),
    sparrow::array(std::move(pr1)),
    sparrow::array(std::move(pr2))
};
sparrow::record_batch batch(named_columns, "my_batch");

From Initializer List

// Each entry pairs a column name with its array
sparrow::record_batch batch{
    {"id", id_array},
    {"name", name_array},
    {"age", age_array}
};

From Struct Array

Accessing Data

Getting Column Information

// Get number of columns and rows
std::size_t num_cols = batch.nb_columns();  // Returns 3
std::size_t num_rows = batch.nb_rows();     // Returns 5
// Check if a column exists
bool has_age = batch.contains_column("age");        // Returns true
bool has_salary = batch.contains_column("salary");  // Returns false
// Get column name by index
const std::string& col_name = batch.get_column_name(0);  // Returns "id"
// Get record batch name
const auto& batch_name = batch.name();  // Returns std::optional<std::string>

Accessing Columns

// Get column by name
const sparrow::array& age_col = batch.get_column("age");
// Get column by index
const sparrow::array& first_col = batch.get_column(0);
// Get all column names
auto names = batch.names();  // Returns a range of column names
for (const auto& name : names) {
    std::cout << name << std::endl;
}
// Get all columns
auto columns = batch.columns();  // Returns a range of arrays
for (const auto& col : columns) {
    // Process each column
}

Mutable Access

//Get mutable reference to column
sparrow::array& col = batch.get_column("age");
//Modify column (if applicable)
//Note: The array itself determines if modifications are allowed

Modifying a Record Batch

Adding Columns

// Add column with explicit name
sparrow::primitive_array<double> salary_array({50000.0, 60000.0, 70000.0, 80000.0, 90000.0});
batch.add_column("salary", sparrow::array(std::move(salary_array)));
// Add column using array's internal name
sparrow::string_array dept_array(
    std::vector<std::string>{"HR", "IT", "Sales", "IT", "HR"},
    true,         // nullable
    "department"  // column name
);
batch.add_column(sparrow::array(std::move(dept_array)));

Extracting Struct Array

// Extract the underlying struct array.
// extract_struct_array() moves the internal columns into a struct_array and empties the record batch.
sparrow::struct_array extracted = batch.extract_struct_array();
// Note: After extraction, the record batch is in a moved-from state

Arrow C Interface Integration

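The record_batch class can be constructed directly from Arrow C structures (ArrowArray and ArrowSchema), either by taking ownership of them or by referencing structures that are owned elsewhere.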

From Arrow C Structures

Taking Ownership

//Transfer ownership of both ArrowArray and ArrowSchema
ArrowArray arr = /* ... */;
ArrowSchema sch = /* ... */;
sparrow::record_batch batch(std::move(arr), std::move(sch));

Taking Ownership of Array Only

//Transfer ownership of ArrowArray, reference ArrowSchema
ArrowArray arr = /* ... */;
ArrowSchema* sch = /* ... */;
sparrow::record_batch batch(std::move(arr), sch);

Referencing Existing Structures

//Reference both structures (no ownership transfer)
ArrowArray* arr = /* ... */;
ArrowSchema* sch = /* ... */;
sparrow::record_batch batch(arr, sch);
//Works with const pointers too
const ArrowArray* const_arr = /* ... */;
const ArrowSchema* const_sch = /* ... */;
sparrow::record_batch batch2(const_arr, const_sch);
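In the Arrow C data interface, ownership travels with the release callback: whoever holds a structure with a non-null release is responsible for calling it once, and a moved-from structure has release set to null. The sketch below illustrates the difference between the owning and referencing constructors on a standalone copy of the ArrowArray layout. The classes owning_array and array_view and the helpers make_demo_array and counting_release are hypothetical names used for illustration; this is not sparrow's implementation.

```cpp
#include <cstdint>
#include <utility>

// Layout of ArrowArray as defined by the Arrow C data interface.
struct ArrowArray
{
    std::int64_t length;
    std::int64_t null_count;
    std::int64_t offset;
    std::int64_t n_buffers;
    std::int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

// Hypothetical owner: takes over the structure and releases it exactly once.
class owning_array
{
public:
    explicit owning_array(ArrowArray&& src) noexcept : m_array(src)
    {
        src.release = nullptr;  // mark the source as moved-from
    }
    owning_array(const owning_array&) = delete;
    owning_array& operator=(const owning_array&) = delete;
    ~owning_array()
    {
        if (m_array.release != nullptr)
        {
            m_array.release(&m_array);
        }
    }
    std::int64_t length() const { return m_array.length; }

private:
    ArrowArray m_array;
};

// Hypothetical borrower: only reads through the pointer, never releases.
class array_view
{
public:
    explicit array_view(const ArrowArray* src) noexcept : m_array(src) {}
    std::int64_t length() const { return m_array->length; }

private:
    const ArrowArray* m_array;
};

// Demo producer: private_data points at a counter of release calls.
inline void counting_release(ArrowArray* a)
{
    ++*static_cast<int*>(a->private_data);
    a->release = nullptr;  // a released structure must have a null release
}

inline ArrowArray make_demo_array(std::int64_t length, int* release_counter)
{
    ArrowArray a{};  // zero-initialize every field
    a.length = length;
    a.release = &counting_release;
    a.private_data = release_counter;
    return a;
}

// Runs both access modes and returns how many times release was called.
inline int demo_release_count()
{
    int released = 0;
    ArrowArray source = make_demo_array(3, &released);
    {
        array_view view(&source);  // borrowing: nothing is released at scope exit
        (void) view.length();
    }
    {
        owning_array owner(std::move(source));  // owning: released at scope exit
        (void) owner.length();
    }
    return released;
}
```

With a referencing constructor, the caller must keep the structures alive for as long as the record batch uses them and must release them afterwards; with an owning constructor, the caller must not release them again.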

Comparison and Equality

sparrow::record_batch batch1(names1, columns1);
sparrow::record_batch batch2(names2, columns2);
// Compare for equality
if (batch1 == batch2) {
    std::cout << "Batches are equal" << std::endl;
}
// Check inequality
if (batch1 != batch2) {
    std::cout << "Batches are different" << std::endl;
}

Two record batches are equal if:

  • They have the same number of columns
  • Column names match in the same order
  • Corresponding arrays are equal
  • Record batch names match (both present and equal, or both absent)
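The four rules above can be sketched on a simplified stand-in type. The struct toy_batch and the function batches_equal are hypothetical names, and plain std::vector<int> columns replace sparrow arrays; this shows the comparison logic only, not how sparrow implements operator==.

```cpp
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for a record batch: an optional batch name,
// ordered column names, and one vector of values per column.
struct toy_batch
{
    std::optional<std::string> name;
    std::vector<std::string> column_names;
    std::vector<std::vector<int>> columns;
};

inline bool batches_equal(const toy_batch& a, const toy_batch& b)
{
    // Rule 1: same number of columns
    if (a.columns.size() != b.columns.size())
    {
        return false;
    }
    // Rule 2: column names match in the same order
    if (a.column_names != b.column_names)
    {
        return false;
    }
    // Rule 3: corresponding arrays are equal
    if (a.columns != b.columns)
    {
        return false;
    }
    // Rule 4: std::optional compares equal when both are empty,
    // or when both hold a value and the values are equal
    return a.name == b.name;
}
```

Because column names are compared in order, two batches holding the same columns in a different order are reported as different.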

Formatting and Printing

If C++20 <format> is available, record_batch supports formatting:

#if defined(__cpp_lib_format)
const sparrow::record_batch batch = /* ... */;
std::string formatted = std::format("{}", batch);
std::cout << formatted << std::endl;
// Output example:
// |first|second|third|
// --------------------
// |    0|     4|    2|
// |    1|     5|    3|
// |    2|     6|    4|
// ...
#endif

Stream output is also supported:

std::cout << batch << std::endl;

Copy and Move Semantics

Copy Operations

//Copy constructor
sparrow::record_batch batch1 = /* ... */;
sparrow::record_batch batch2(batch1); // Deep copy
//Copy assignment
sparrow::record_batch batch3 = /* ... */;
batch3 = batch1; // Deep copy

Move Operations

// Move constructor
sparrow::record_batch batch1 = /* ... */;
sparrow::record_batch batch2(std::move(batch1));  // batch1 is now in a moved-from state
// Move assignment
sparrow::record_batch batch3 = /* ... */;
batch3 = std::move(batch2);  // batch2 is now in a moved-from state

Constraints and Invariants

Preconditions

  • Equal array lengths: All arrays must have the same length
  • Unique column names: Column names must be unique within a record batch
  • Non-empty names: When constructing from named arrays, each array must have a non-empty name
  • Size matching: When providing separate names and arrays, both ranges must have the same size
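As an illustration of what these preconditions mean in practice, the following sketch checks them on plain data: a list of column names and a list of column lengths. The function check_preconditions is a hypothetical helper, and reporting violations with std::invalid_argument is an assumption made for the example; this section does not state how sparrow itself reports a violated precondition.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Illustration only: names[i] is the name of column i,
// lengths[i] is the number of rows in column i.
inline void check_preconditions(
    const std::vector<std::string>& names,
    const std::vector<std::size_t>& lengths
)
{
    // Size matching: one name per column
    if (names.size() != lengths.size())
    {
        throw std::invalid_argument("names and columns differ in size");
    }

    // Non-empty names and unique column names
    std::unordered_set<std::string> seen;
    for (const std::string& name : names)
    {
        if (name.empty())
        {
            throw std::invalid_argument("empty column name");
        }
        if (!seen.insert(name).second)
        {
            throw std::invalid_argument("duplicate column name: " + name);
        }
    }

    // Equal array lengths: every column has as many rows as the first one
    for (const std::size_t length : lengths)
    {
        if (length != lengths.front())
        {
            throw std::invalid_argument("columns differ in length");
        }
    }
}
```

Calling check_preconditions with names {"id", "age"} and lengths {5, 5} returns normally, while a repeated name or lengths {5, 4} throws.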

Postconditions

  • Consistent state: The record batch maintains internal consistency between names, arrays, and the name-to-array mapping
  • O(1) access: Column access by index is O(1), by name is O(1) average time
  • Thread safety: Read operations are thread-safe; write operations require external synchronization

Performance Considerations

  • Name lookup: Uses an internal hash map for O(1) average-case column lookup by name
  • Lazy map construction: The name-to-array map is built lazily and cached
  • Copy costs: Copying a record batch performs a deep copy of all arrays
  • Move efficiency: Move operations are efficient and don't copy array data
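The lazy, cached name lookup described above can be sketched as follows. The class name_index and its members are hypothetical and model only the column names; sparrow's internal data structures may differ.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of a lazily built, cached name-to-index map.
class name_index
{
public:
    explicit name_index(std::vector<std::string> names)
        : m_names(std::move(names))
    {
    }

    // O(1) average lookup by name; the map is built on the first call only.
    std::optional<std::size_t> find(const std::string& name) const
    {
        if (!m_map.has_value())
        {
            m_map.emplace();
            for (std::size_t i = 0; i < m_names.size(); ++i)
            {
                m_map->emplace(m_names[i], i);
            }
        }
        const auto it = m_map->find(name);
        if (it == m_map->end())
        {
            return std::nullopt;
        }
        return it->second;
    }

    // Adding a column invalidates the cache; it is rebuilt on the next lookup.
    void add(std::string name)
    {
        m_names.push_back(std::move(name));
        m_map.reset();
    }

    bool map_is_built() const { return m_map.has_value(); }

private:
    std::vector<std::string> m_names;
    // mutable so that a const lookup can fill the cache; this sketch does
    // no locking, so concurrent first lookups would need synchronization.
    mutable std::optional<std::unordered_map<std::string, std::size_t>> m_map;
};
```

The design trades a one-time O(n) build on the first name lookup for O(1) average lookups afterwards, and costs nothing for code that only accesses columns by index.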

Complete Example

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <ranges>
#include <string>
#include <vector>

// Header paths assumed from the sparrow source layout
#include <sparrow/array.hpp>
#include <sparrow/primitive_array.hpp>
#include <sparrow/record_batch.hpp>

// Builds three columns of data_size rows each: 0.., 4.. and 2..
std::vector<sparrow::array> make_array_list(const std::size_t data_size)
{
    auto iota = std::ranges::iota_view{std::size_t(0), std::size_t(data_size)};
    sparrow::primitive_array<std::uint16_t> pr0(
        iota | std::views::transform([](auto i) {
            return static_cast<std::uint16_t>(i);
        })
    );
    auto iota2 = std::ranges::iota_view{std::int32_t(4), 4 + std::int32_t(data_size)};
    sparrow::primitive_array<std::int32_t> pr1(iota2);
    auto iota3 = std::ranges::iota_view{std::int32_t(2), 2 + std::int32_t(data_size)};
    sparrow::primitive_array<std::int32_t> pr2(iota3);
    return {
        sparrow::array{std::move(pr0)},
        sparrow::array{std::move(pr1)},
        sparrow::array{std::move(pr2)}
    };
}

int main()
{
    const std::vector<std::string> name_list = {"first", "second", "third"};
    constexpr std::size_t data_size = 10;
    const std::vector<sparrow::array> array_list = make_array_list(data_size);
    // Create record batch
    const sparrow::record_batch record{name_list, array_list, "record batch name"};
    // Verify properties
    assert(record.name() == "record batch name");
    assert(record.nb_columns() == array_list.size());
    assert(record.nb_rows() == data_size);
    assert(record.contains_column(name_list[0]));
    assert(record.get_column_name(0) == name_list[0]);
    assert(record.get_column(0) == array_list[0]);
    assert(std::ranges::equal(record.names(), name_list));
    assert(std::ranges::equal(record.columns(), array_list));
    return EXIT_SUCCESS;
}

See Also