FAVROPARSER

Parses data from an Avro file.

Parses data from an Avro file. The following requirements apply:

  • Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.

  • FAVROPARSER does not support Avro files with separate schema files. The Avro file must include the schema.

You can load complex types in the Avro source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps to false.

This parser can notify you if it finds keys in the data that are not part of the table definition. See Unmatched Keys.

When loading into a flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data.

This parser does not support apportioned load or cooperative parse.

Syntax

FAVROPARSER ( [parameter=value[,...]] )

Parameters

flatten_maps
Boolean, whether to flatten all Avro maps. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.

This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.

Default: true

flatten_arrays
Boolean, whether to flatten all Avro arrays. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.

This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.

Default: false

flatten_records
Boolean, whether to flatten all Avro records. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.

This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.

Default: true

reject_on_materialized_type_error

Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to NULL.

If the column is an array and the data to be loaded is too large, then false sets the column value to NULL and true rejects the row.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.

Default: false

suppress_warnings
String, which warnings to suppress:
  • unmatched_key (see Unmatched Keys)

  • true or t (suppress all warnings)

  • false or f (do not suppress warnings)

Default: false

Primitive data types

FAVROPARSER supports the following primitive data types, including as element types and field values in complex types.

AVRO Data Type Vertica Data Type Value
NULL NULL value No value
boolean Boolean data type A binary value
int INTEGER 32-bit signed integer
long INTEGER 64-bit signed integer
float

DOUBLE PRECISION (FLOAT)

Synonymous with 64-bit IEEE FLOAT

Single precision (32-bit) IEEE 754 floating-point number
double DOUBLE PRECISION (FLOAT) Double precision (64-bit) IEEE 754 floating-point number
bytes VARBINARY Sequence of 8-bit unsigned bytes
string VARCHAR Unicode character sequence

Avro logical types

FAVROPARSER supports the following Avro logical types. The target column must use a Vertica data type that supports the logical type. When you attempt to load data using an invalid logical type, the logical type is ignored and the underlying Avro type is used.

AVRO Logical Type Base Avro Type Supported Vertica Data Types

decimal

0 < precision ≤ 1024

0 ≤ scaleprecision

bytes or fixed

NUMERIC, Character

Vertica rejects the value if:

  • The Avro precision setting is greater than the precision setting for the target column.

  • For fixed types, the precision value is greater than what is allowed by the size attribute.

If the data type for the target column uses the default precision setting, the precision setting in the Avro schema overrides the default.

date integer DATE, Character
time-micros long

TIME/TIMETZ, Character

The time logical type does not provide a time zone value. For target columns that use the TIMETZ data type, Vertica uses UTC as the default.

time-millis int
timestamp-micros long

TIMESTAMP/TIMESTAMPTZ, TIME/TIMETZ

For timestamp-millis only, the timezone is included and is represented as an offset to UTC. Additionally, the millisecond values are right-extended with padded zeros.

timestamp-millis long
duration fixed INTERVAL, Character

Avro complex data types

The Avro format supports several complex data types. When loading into strongly-typed columns, you can use the ROW and ARRAY types to represent them. For example, Avro Record and Enums are structs (ROWs); see the Avro specification.

You can use ARRAY[ROW] to match an Avro map. You must name the ROW fields key and value. These are the names that the Avro format uses for those fields in the data, and the parser relies on field names to match data to table columns.

If the total size of an array exceeds the size defined by the target table, the parser sets the value to null.

When loading into flex tables or using flexible complex types, this parser handles Avro complex types as follows:

Record

The name of each field is used as a virtual column name. If flatten_records is true and several nesting levels are present, Vertica concatenates the record names to create the key name.

Map

The value of each map key is used as a virtual column name. If flatten_maps is true and several nesting levels are present, Vertica concatenates the key names to create the key name.

Enum

Vertica treats Avro Enums like records, with the name of the Enum as the key and the value as the value.

Array

Vertica treats Avro Arrays as key/value pairs. By default, the index of each element is the key. In the following example, product_detail is a Record with a field, product_category, that is an Array:

=> CREATE FLEX TABLE products;
CREATE TABLE

=> COPY products FROM :datafile WITH PARSER FAVROPARSER();
 Rows Loaded
-------------
           2
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
                    maptostring
--------------------------------------------------------------------------------
 {
    "__name__": "Order",
    "customer_id": "111222",
    "order_details": {
        "0.__name__": "OrderDetail",
        "0.product_detail.__name__": "Product",
        "0.product_detail.price": "46.21",
        "0.product_detail.product_category": {
            "0": "electronics",
            "1": "printers",
            "2": "computers"
        },
        "0.product_detail.product_description": "hp printer X11ew description :\
P",
        "0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0004",
        "0.product_detail.product_id": "999012",
        "0.product_detail.product_map.one": "1.1",
        "0.product_detail.product_map.two": "1.1",
        "0.product_detail.product_name": "hp printer X11ew",
        "0.product_detail.product_status": "ONLY_FEW_LEFT",
        "0.quantity": "3",
        "0.total": "354.34"
    },
    "order_id": "2389646",
    "total": "132.43"
}
...

If flatten_arrays is true and several nesting levels are present, Vertica concatenates the indices to create the key name.

=> COPY products FROM :datafile WITH PARSER FAVROPARSER(flatten_arrays=true);
 Rows Loaded
-------------
           2
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
                    maptostring
--------------------------------------------------------------------------------

 {
    "__name__": "Order",
    "customer_id": "111222",
    "order_details.0.__name__": "OrderDetail",
    "order_details.0.product_detail.__name__": "Product",
    "order_details.0.product_detail.price": "46.21",
    "order_details.0.product_detail.product_category.0": "electronics",
    "order_details.0.product_detail.product_category.1": "printers",
    "order_details.0.product_detail.product_category.2": "computers",
    "order_details.0.product_detail.product_description": "hp printer X11ew des\
cription :P",
    "order_details.0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0\
004",
    "order_details.0.product_detail.product_id": "999012",
    "order_details.0.product_detail.product_map.one": "1.1",
    "order_details.0.product_detail.product_map.two": "1.1",
    "order_details.0.product_detail.product_name": "hp printer X11ew",
    "order_details.0.product_detail.product_status": "ONLY_FEW_LEFT",
    "order_details.0.quantity": "3",
    "order_details.0.total": "354.34",
    "order_id": "2389646",
    "total": "132.43"
}
...

Union

Vertica treats Avro Unions as arrays.

Unmatched keys

Data being loaded can contain keys that are not part of the table definition. If you are loading into a flex table (or a flexible complex type column), no data is lost. For a table with strongly-defined columns, however, new keys cannot be loaded because the table does not have a place to put them.

This parser emits warnings if it finds new keys and if both of the following are true:

  • The target table is not a flex table.

  • The new key is not nested within a flexible complex type column.

New keys are logged in the UDX_EVENTS system table. If a new key is a complex type with nested keys, only the top-level key is logged. When you see a warning about unmatched keys, you can query this table and then use ALTER TABLE to modify your table definition for future loads.

Querying an external table loads data and thus can trigger these warnings. To prevent them, set the suppress_warnings parameter to 'unmatched_keys' or 'true':

=> CREATE EXTERNAL TABLE restaurants(
            name VARCHAR(50),
            menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100])
   AS COPY FROM '/data/rest.json'
   PARSER FAVROPARSER(suppress_warnings='unmatched_key');

Examples

This example shows how to create and load a flex table with Avro data using favroparser. After loading the data, you can query virtual columns:

=> CREATE FLEX TABLE avro_basic();
CREATE TABLE

=> COPY avro_basic FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();
Rows Loaded
-------------
5
(1 row)

=> SELECT station, temp, time FROM avro_basic;
station | temp |     time
---------+------+---------------
mohali  | 0    | -619524000000
lucknow | 22   | -619506000000
norwich | -11  | -619484400000
ams     | 111  | -655531200000
baddi   | 78   | -655509600000
(5 rows)

For more information, see Avro data.