sparrow 1.3.0
Loading...
Searching...
No Matches
Builder

Motivation

Arrow data structures are "structs of arrays" (SoA) which are all composed of flat arrays. This is a very efficient way to store data, but might be cumbersome to create such data structures. To illustrate this, consider the following example. Suppose you have a list of lists of integers, like [[1, 2], [3, 4, 5], [6, 7]]. The usual way of storing (and thinking about) this data is as a vector of vectors std::vector<std::vector<int>>. In Arrow format, this data is stored as two flat arrays (ignoring the null bitmap for now):

  • A flat array of integers [1, 2, 3, 4, 5, 6, 7]
  • An array of offsets [0, 2, 5, 7] which indicates the start of each list.

The last element of the offsets array is the total number of elements in the flat array. Therefore, to build the arrow data structure, we need to think in terms of the flattened array. While this is still relatively simple for the example above, it can become quite complex once we have even more nested data structures. For a triple nested list of integers [[[1, 2], [3, 4]], [[5, 6], [7, 8]]], we would need to create the following arrays:

  • A flat array of integers [1, 2, 3, 4, 5, 6, 7, 8]
  • An array of offsets [0, 2, 4, 8] which indicates the start of the inner lists.
  • An array of offsets [0, 2, 4] which indicates the start of the outer lists.

This is where the builder comes in. It provides a convenient way to build Arrow data structures from nested STL containers, i.e., from things like std::vector<std::vector<int>>, std::vector<std::tuple<int, double>>, etc.

Type / Layout Mapping

The sparrow::builder function is a template function that takes arbitrary nested STL containers and returns an Arrow data structure. The mapping between STL containers and Arrow data structures is conceptually the following:

Concept Arrow Layout Example
range of scalars primitive layout std::vector<int>
range of strings variable sized binary layout std::vector<std::string>
range of ranges list layout std::vector<std::vector<int>>
range of fixed-size ranges fixed size list layout std::vector<std::array<int,3>>
range of tuples struct layout std::vector<std::tuple<int, double>>
range of variants sparse/dense union layout std::vector<std::variant<int, double>>

When nesting containers, the above rules are applied recursively. For example, a std::vector<std::vector<int>> would be converted to a list layout of primitive layout (i.e., list[int]). Here are some more examples:

Type Arrow Layout
std::vector<std::vector<int>> list[int]
std::vector<std::vector<std::string>> list[utf8]
std::vector<std::vector<std::tuple<int, double>>> list[struct[int, double]]
std::vector<std::vector<std::variant<int, double>>> list[variant[int, double]]
std::vector<std::vector<std::array<int, 3>>> list[fixed_size_list[int]]
std::vector<std::tuple<std::vector<int>, std::string>> struct[list[int], utf8]

Null Support

Arrow has support for null values. This means an array element at a certain index can be missing. The sparrow::builder function supports this by using sparrow::nullable (similar to std::optional). Nullables can be "injected" into all levels of the nested STL containers. For example, a std::vector<sparrow::nullable<int>>{ 1, 2, sp::nullval, 4} would be converted to a primitive layout where the third element is missing. std::vector<std::tuple<int, sp::nullable<double>>>{ {1, 2.0}, {3, sp::nullval}, {4, 5.0} } would be converted to a struct layout with the second element of the second tuple missing.

The union layout in sparrow does not have its own bitmap. Instead, the bitmap of the elements in the union is used. Therefore, for having nullable values with unions, the nullable must be injected into the union elements: i.e., std::vector<std::variant<sp::nullable<int>, sp::nullable<double>>> instead of std::vector<sp::nullable<std::variant<int, double>>>.

Dict Encoding and Run End Encoding

Arrow has two special encodings: dict encoding and run end encoding. Dict encoding is used for columns with a small number of unique values. Instead of storing the values themselves, the indices of the values in a dictionary are stored. Run end encoding is used for columns with long runs of the same value. Instead of storing the value for each element, the value is stored once and the number of times it is repeated is stored.

Since any array of arbitrary layout can be "dict-encoded" or "run-end-encoded", it is ambiguous how to handle these encodings in the builder. A std::vector<int> could be mapped to a primitive layout, a dict-encoded primitive layout, or a run-end-encoded primitive layout.

sparrow::dict_encode<std::vector<int>> will be a dict-encoded primitive layout, sparrow::run_end_encode<std::vector<int>> will be a run-end-encoded primitive layout.

The dict-encoding and run-end-encoding can also appear inside the nested STL containers. For example, std::vector<std::tuple<int, sparrow::dict_encode<std::string>>> would be a struct layout with the second element being a dict encoded variable sized binary layout.

Note that the builder does not support consecutive nesting of dict_encoding and run_end_encoding. I.e., sparrow::dict_encode<sparrow::dict_encode<std::vector<int>>> is not supported. Similarly, sparrow::dict_encode<sparrow::run_end_encode<std::vector<int>>> is not supported.

Usage Example

Primitive Array

Primitive Array of Integers

// using initializer_list
auto arr = sparrow::build({1, 2, 3, 4, 5});
// using vector
std::vector<int> v{1, 2, 3, 4, 5};
auto arr2 = sparrow::build(v);
// using list
std::list<int> l{1, 2, 3, 4, 5};
auto arr3 = sparrow::build(l);
// using any range
auto iota = std::views::iota(1, 6)
| std::views::transform(
[](int i)
{
return static_cast<int>(i);
}
);
auto arr4 = sparrow::build(iota);
// all of the arrays above are equivalent to the manually built array
auto arr5 = sparrow::primitive_array<int>({1, 2, 3, 4, 5});
assert(arr == arr2);
assert(arr == arr3);
assert(arr == arr4);
assert(arr == arr5);

Primitive Array of Integers with Nulls

// using initializer_list (here we have to explicitly specify the type when using an
// initializer list with nulls)
// using vector
std::vector<sparrow::nullable<int>> v{1, 2, sparrow::nullval, 4, 5};
auto arr2 = sparrow::build(v);
// using list
std::list<sparrow::nullable<int>> l{1, 2, sparrow::nullval, 4, 5};
auto arr3 = sparrow::build(l);
// using any range
auto iota = std::views::iota(1, 6)
| std::views::transform(
{
}
);
auto arr4 = sparrow::build(iota);
// all of the arrays above are equivalent to the manually built array
std::vector<std::size_t> where_nulls{2};
sparrow::u8_buffer<int> values{1, 2, 3, 4, 5};
auto arr5 = sparrow::primitive_array<int>(std::move(values), where_nulls);
assert(arr == arr2);
assert(arr == arr3);
assert(arr == arr4);

List Array

ListArray of String

// [["hello", "world","!"], ["Another", "sentence"]]
std::vector<std::vector<std::string>> v{{"hello", "world", "!"}, {"Another", "sentence"}};
auto arr = sparrow::build(v);

ListArray of Strings with Nulls

// [["hello", "world","!"],NULL , ["Another", "sentence"]]
using string_vector = std::vector<std::string>;
using nullable_string_vector = sparrow::nullable<string_vector>;
std::vector<nullable_string_vector> v{
nullable_string_vector{string_vector{"hello", "world", "!"}},
nullable_string_vector{},
nullable_string_vector{string_vector{"Another", "sentence"}}
};
auto arr = sparrow::build(v);

ListArray of Struct Array

/*
[
[
(1, 2.5),
(2, 3.5)
],
[
(3, 5.5),
(5, 6.5),
(6, 7.5)
],
[
(7, 8.5)
]
]
*/
std::vector<std::vector<std::tuple<int, float>>> v{
{std::tuple<int, float>{1, 2.5f}, std::tuple<int, float>{2, 3.5f}},
{std::tuple<int, float>{3, 5.5f}, std::tuple<int, float>{5, 6.5f}, std::tuple<int, float>{6, 7.5f}},
{std::tuple<int, float>{7, 8.5f}}
};
auto arr = sparrow::build(v);

Fixed-Sized List

Fixed-Sized List Array of Variable-Sized Binary Array

std::vector<std::array<std::string, 2>> v{{"hello", "world"}, {"Another", "sentence"}, {"This", "is"}};
auto arr = sparrow::build(v);

Fixed-Sized List Array of Union Array

using variant_type = std::variant<int, float>;
using array_type = std::array<variant_type, 2>;
std::vector<array_type> v{{1, 2.5f}, {2, 3.5f}, {3, 4.5f}};
auto arr = sparrow::build(v);

Struct Array

using tuple_type = std::tuple<int, std::array<std::string, 2>, sparrow::nullable<float>>;
std::vector<tuple_type> v{
{1, {"hello", "world"}, 2.5f},
{2, {"Another", "sentence"}, sparrow::nullval},
{3, {"This", "is"}, 3.5f}
};
auto arr = sparrow::build(v);

Union Array

using variant_type = std::variant<int, std::array<std::string, 2>, sparrow::nullable<float>>;
std::vector<variant_type> v{
int{1},
std::array<std::string, 2>{{"A", "sentence"}},
};
auto arr = sparrow::build(v);

Dict Encoded Array

Dict Encoded Variable-Sized Binary Array

"hello",
"world",
"hello",
"world",
"hello",
}};
auto arr = sparrow::build(v);

Run End Encoded Array

"hello",
"hello",
"hello",
"world",
"world",
}};
auto arr = sparrow::build(v);