Overview
The record_batch class provides a table-like data structure for storing columnar data with named fields. It represents a collection of equal-length arrays mapped to unique column names, similar to a database table or DataFrame where each array forms a column.
This class is particularly useful for:
- Working with tabular data in Arrow format
- Interoperating with Arrow C structures
- Batch processing of columnar data
- Serialization and data exchange
Key Features
- Column-oriented storage: Data is stored as arrays with associated names
- Arrow compatibility: Direct integration with Arrow C structures (ArrowArray and ArrowSchema)
- Type safety: Built on top of the struct_array class
- Efficient access: O(1) column access by index, O(1) average access by name
- Validation: Ensures all arrays have the same length and column names are unique
- Flexible construction: Multiple ways to create record batches
Basic Usage
Creating a Record Batch
From Separate Names and Arrays
std::vector<std::string> names = {"id", "name", "age"};
std::vector<sparrow::array> columns = {
};
From Named Arrays
auto iota = std::ranges::iota_view{0, 10};
iota | std::views::transform([](auto i) { return static_cast<uint16_t>(i); }),
true,
"first"
);
auto iota2 = std::ranges::iota_view{4, 14};
auto iota3 = std::ranges::iota_view{2, 12};
std::vector<sparrow::array> named_columns = {
};
From Initializer List
{"id", id_array},
{"name", name_array},
{"age", age_array}
};
From Struct Array
Accessing Data
Getting Column Information
std::size_t num_rows = batch.nb_rows();
const auto& batch_name = batch.name();
The member functions relevant here are:
- const std::optional<name_type>& name() const: Gets the name of the record batch.
- bool contains_column(const name_type& key) const: Checks if the record batch contains a column with the specified name.
- const name_type& get_column_name(size_type index) const: Gets the name of the column at the specified index.
- size_type nb_rows() const: Gets the number of rows in the record batch.
- size_type nb_columns() const: Gets the number of columns in the record batch.
Accessing Columns
auto names = batch.names();
for(const auto& name : names) {
    std::cout << name << std::endl;
}
for(const auto& col : columns) {
}
The member functions relevant here are:
- const array& get_column(const name_type& key) const: Gets the column with the specified name.
- name_range names() const: Gets a range view of the column names.
- auto columns() const: Gets a range view of the columns.
Mutable Access
Modifying a Record Batch
Adding Columns
{"HR", "IT", "Sales", "IT", "HR"},
true,
"department"
);
void add_column(name_type name, array column): Adds a new column to the record batch with the specified name.
Extracting Struct Array
struct_array extract_struct_array(): Moves the internal columns into a struct_array and empties the record batch.
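The "move out and leave empty" behaviour described here can be illustrated with a small standalone sketch. It does not use sparrow: toy_struct, toy_batch_owner and extract are hypothetical names, and a struct array is modeled as a plain vector of integer columns. It only shows one common way to hand ownership to the caller while leaving the source object in a valid, empty state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: a "struct array" is modeled as a list of columns.
using toy_struct = std::vector<std::vector<int>>;

// Hypothetical owner type, not sparrow's record_batch.
class toy_batch_owner
{
public:

    toy_batch_owner(std::vector<std::string> names, toy_struct columns)
        : m_names(std::move(names))
        , m_columns(std::move(columns))
    {
    }

    // Hands the columns to the caller and leaves this object empty.
    toy_struct extract()
    {
        m_names.clear();
        // std::exchange returns the old columns and stores an empty
        // container in their place, so no column data is copied.
        return std::exchange(m_columns, toy_struct{});
    }

    std::size_t nb_columns() const
    {
        return m_columns.size();
    }

private:

    std::vector<std::string> m_names;
    toy_struct m_columns;
};
```

After calling extract() on such an object, it reports zero columns, and the returned value holds the data that was previously stored.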
Arrow C Interface Integration
The record_batch class provides seamless integration with Arrow C structures.
From Arrow C Structures
Taking Ownership
Taking Ownership of Array Only
Referencing Existing Structures
Comparison and Equality
if(batch1 == batch2) {
std::cout << "Batches are equal" << std::endl;
}
if(batch1 != batch2) {
std::cout << "Batches are different" << std::endl;
}
Two record batches are equal if:
- They have the same number of columns
- Column names match in the same order
- Corresponding arrays are equal
- Record batch names match (both present and equal, or both absent)
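The four equality rules above can be sketched on a simplified model. This is not sparrow's implementation: toy_batch is a hypothetical type whose columns are plain integer vectors, used only to show how ordered column names, column contents, and an optional batch name combine into one comparison.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified model of a record batch (not sparrow's class).
struct toy_batch
{
    std::optional<std::string> name;        // optional batch name
    std::vector<std::string> column_names;  // ordered column names
    std::vector<std::vector<int>> columns;  // one vector per column
};

inline bool operator==(const toy_batch& lhs, const toy_batch& rhs)
{
    // Vector comparison checks the size first and then the elements in
    // order, which covers column count, name order and array contents.
    // std::optional comparison is true when both sides are empty or both
    // hold equal values.
    return lhs.column_names == rhs.column_names
        && lhs.columns == rhs.columns
        && lhs.name == rhs.name;
}
```

In this model, two batches with the same columns in a different order compare unequal, as do two batches where only one has a name.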
Formatting and Printing
If C++20 <format> is available, record_batch supports formatting:
#if defined(__cpp_lib_format)
std::string formatted = std::format("{}", batch);
std::cout << formatted << std::endl;
#endif
Stream output is also supported:
std::cout << batch << std::endl;
Copy and Move Semantics
Copy Operations
Move Operations
batch3 = std::move(batch2);
Constraints and Invariants
Preconditions
- Equal array lengths: All arrays must have the same length
- Unique column names: Column names must be unique within a record batch
- Non-empty names: When constructing from named arrays, each array must have a non-empty name
- Size matching: When providing separate names and arrays, both ranges must have the same size
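As an illustration, the preconditions above could be checked up front before a batch is built. The following is a minimal standalone sketch under stated assumptions, not the library's validation code: toy_column and validate_batch are hypothetical names, a column is modeled as a std::vector<int>, and the non-empty name check is applied to the separate name list for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an Arrow array: just a vector of ints.
using toy_column = std::vector<int>;

// Returns std::nullopt when the inputs satisfy the preconditions,
// otherwise a short description of the first violation found.
inline std::optional<std::string> validate_batch(
    const std::vector<std::string>& names,
    const std::vector<toy_column>& columns
)
{
    // Size matching: one name per column.
    if (names.size() != columns.size())
    {
        return "names and columns differ in size";
    }
    // Non-empty and unique column names.
    std::unordered_set<std::string> seen;
    for (const auto& name : names)
    {
        if (name.empty())
        {
            return "empty column name";
        }
        if (!seen.insert(name).second)
        {
            return "duplicate column name: " + name;
        }
    }
    // Equal array lengths: every column must match the first one.
    for (const auto& column : columns)
    {
        if (column.size() != columns.front().size())
        {
            return "columns differ in length";
        }
    }
    return std::nullopt;
}
```

Returning a description instead of throwing is a choice made for this sketch only; it says nothing about how the library reports violations.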
Postconditions
- Consistent state: The record batch maintains internal consistency between names, arrays, and the name-to-array mapping
- O(1) access: Column access by index is O(1), by name is O(1) average time
- Thread safety: Read operations are thread-safe; write operations require external synchronization
Performance Considerations
- Name lookup: Uses an internal hash map for O(1) average-case column lookup by name
- Lazy map construction: The name-to-array map is built lazily and cached
- Copy costs: Copying a record batch performs a deep copy of all arrays
- Move efficiency: Move operations are efficient and don't copy array data
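The lazy, cached name lookup mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the library's code: name_index, add, find and cache_built are hypothetical names, and the map stores column indices instead of arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch of a lazily built, cached name-to-index map.
class name_index
{
public:

    explicit name_index(std::vector<std::string> names)
        : m_names(std::move(names))
    {
    }

    // Appending a name invalidates the cache; it is rebuilt on the
    // next lookup.
    void add(std::string name)
    {
        m_names.push_back(std::move(name));
        m_cache.reset();
    }

    // Average O(1) lookup once the map exists. The first call after
    // construction or modification pays an O(n) build cost.
    std::optional<std::size_t> find(const std::string& name) const
    {
        if (!m_cache)
        {
            m_cache.emplace();
            for (std::size_t i = 0; i < m_names.size(); ++i)
            {
                m_cache->emplace(m_names[i], i);
            }
        }
        const auto it = m_cache->find(name);
        if (it == m_cache->end())
        {
            return std::nullopt;
        }
        return it->second;
    }

    bool cache_built() const
    {
        return m_cache.has_value();
    }

private:

    std::vector<std::string> m_names;
    // mutable so that const lookups can populate the cache.
    mutable std::optional<std::unordered_map<std::string, std::size_t>> m_cache;
};
```

Note that this sketch is not thread-safe: filling a mutable cache from a const member function needs synchronization (for example a mutex or std::call_once) before concurrent reads can be considered safe.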
Complete Example
#include <cassert>
#include <cstdlib>
std::vector< sparrow::array > make_array_list(const std::size_t data_size)
{
    auto iota = std::ranges::iota_view{std::size_t(0), std::size_t(data_size)};
    iota | std::views::transform([](auto i) {
        return static_cast<std::uint16_t>(i);
    })
    );
    auto iota2 = std::ranges::iota_view{std::int32_t(4), 4 + std::int32_t(data_size)};
    auto iota3 = std::ranges::iota_view{std::int32_t(2), 2 + std::int32_t(data_size)};
    return {
    };
}
int main()
{
    const std::vector<std::string> name_list = {"first", "second", "third"};
    constexpr std::size_t data_size = 10;
    assert(record.name() == "record batch name");
    assert(record.nb_columns() == array_list.size());
    assert(record.nb_rows() == data_size);
    assert(record.contains_column(name_list[0]));
    assert(record.get_column_name(0) == name_list[0]);
    assert(record.get_column(0) == array_list[0]);
    assert(std::ranges::equal(record.names(), name_list));
    assert(std::ranges::equal(record.columns(), array_list));
    return EXIT_SUCCESS;
}
See Also