Avro data

Use FAVROPARSER to load Avro data files.

Use FAVROPARSER to load Avro data files. This parser supports both columnar and Flex tables.

A column can be of any scalar type, a strongly-typed complex type, or a flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.

The following requirements apply:

  • Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.

  • The Avro file must have its related schema in the file being loaded; FAVROPARSER does not support Avro files with separate schema files.

In the COPY statement, use the PARSER parameter to specify the Avro parser as in the following example:

=> COPY weather FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER()

This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.

Avro schemas and columnar tables

Avro includes the schema with the data. When you load Avro data into a columnar table, the column names in the schema in the data must match the column names in the table. You do not need to load all of the columns in the data.

For example, the following Avro schema uses the Avro record type to represent a user profile:

{
  "type": "record",
  "name": "Profile",
  "fields" : [
      {"name": "UserName", "type": "string"},
      {"name": "Email", "type": "string"},
      {"name": "Address", "type": "string"}
   ]
}

To successfully load the data into a columnar table with this schema, each target column name must match the "name" value in the schema. In the following example, the profiles table does not load values corresponding to the schema's Email field because the target column is named EmailAddr:

=> COPY profiles FROM '/home/dbadmin/data/user_profile.avro' PARSER FAVROPARSER();

=> SELECT * FROM profiles;
     UserName    |     EmailAddr      |       Address
-----------------+--------------------+---------------------
    dbadmin      |                    |    123 Main St.

Strongly-typed complex types

Avro data can contain arrays, structs, and combinations of the two. You can read this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.

Use the ARRAY and ROW types in the table definition as usual:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city ARRAY[VARCHAR(80)],
   menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
  )
AS COPY FROM :avro_file PARSER FAVROPARSER();

You can use strong typing in both external tables and native tables.

An alternative to strong typing for complex types is flexible complex types.

Unmatched fields

The Avro parser loads all fields from the data that are part of the table definition. If the data contains other fields, the parser produces a warning and logs the new fields in a system table. You can review the logged events and decide whether to modify your table definition. For details, see Schema evolution.

Rejecting data on materialized column type errors

By default, if FAVROPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.

If a flex table has a materialized column, the data being loaded must be coercible to that column's type. For example, if a materialized column is declared as a FLOAT and you try to load a VARCHAR value for that key, FAVROPARSER rejects the data row.

See also