Vertica provides two general approaches to developing a table definition during data exploration without resorting to inspecting data files by hand: inference and flex tables.
If your data is in Parquet, ORC, JSON, or Avro format, you can use a Vertica function to inspect a sample file and automatically generate a "first draft" of a table definition. The output indicates where the function could not make a decision, so that you can focus on those areas as you refine the definition. Using this approach, you iteratively test sample data, refine the table definition, and test more data until you arrive at a satisfactory table definition.
Alternatively, you can use flex tables, a special type of table designed for exploring data with an unknown or varying schema. You can use flex functions to explore this data and materialize individual columns in your table. When you have identified the columns that should consistently be part of your table, you can then define a regular (non-flex) table.
1 - Inference
If your data is in Parquet, ORC, JSON, or Avro format, you can use a Vertica function to inspect a sample file and automatically generate a "first draft" of a table definition. You can then refine this definition, such as by specifying VARCHAR lengths, NUMERIC precision, and ARRAY bounds. If the data format supports it, you can ignore columns in the data that you do not need for your table.
When you have an initial table definition, you can load more data. New data might have different properties, so it's important to adjust your table definition in light of more data. Test the table definition with enough data to feel confident before putting it into production.
You might find it easier to do initial exploration with external tables, whose definitions are more easily modified.
Create first draft of table
The INFER_TABLE_DDL function inspects a data file and returns a CREATE TABLE or CREATE EXTERNAL TABLE statement based on its contents. The column definitions in the returned statement can be incomplete, and sometimes the function cannot determine a column's data type. The function also uses the column names as they appear in the data, even if they violate Vertica restrictions. For all of these reasons, you must review and adjust the proposed definition.
In the following example, the input path contains data for a product catalog. Note the warnings about data types:
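For example, a call like the following, using a hypothetical Parquet sample file, returns a draft definition plus warnings. The output is abbreviated and illustrative; the column names anticipate the product-catalog example used below:

=> SELECT INFER_TABLE_DDL('/home/data/products/sample.parquet'
       USING PARAMETERS format='parquet', table_name='products');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                    INFER_TABLE_DDL
----------------------------------------------------------------
 create table "products"(
   "id" int,
   "msrp" numeric(8,2),
   "name" varchar,
   "description" varchar,
   "vendors" Array[Row(
     "name" varchar,
     "price" numeric(8,2)
   )]
 );
(1 row)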
The table definition contains several VARCHAR values with unspecified lengths. The Vertica default length for VARCHAR is 80, and for a native table, Vertica allocates memory based on the maximum length. If a value is unlikely to be that large, such as for a name, you can specify a smaller size to improve query performance. Conversely, if a value could be much longer, such as the description field here, you can specify a larger size so that Vertica does not truncate the value.
For numeric values, INFER_TABLE_DDL suggests precision and scale based on the data in the sample file. As with VARCHAR, values that are too high cause more memory to be allocated. In addition, computations could produce unexpected results by using the extra precision instead of rounding.
Arrays, by default, are unbounded, and Vertica allocates memory to accommodate the maximum binary size. To reduce this footprint, you can specify a maximum number of elements for an array or its maximum binary size. For more information about array bounds, see Limits on Element Count and Collection Size on the ARRAY reference page.
In this example, you might make the following changes, reflected in the revised definition after this list:
Shorten the product name and vendor name fields, for example VARCHAR(32).
Lengthen the description field, for example VARCHAR(1000).
Accept the suggested numeric precision and scale.
Specify a maximum number of elements for the vendors array.
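Applying those changes to the draft produces a definition like the following; the 10-element bound on vendors and the exact sizes are judgment calls, not requirements:

=> CREATE TABLE products(
       id INT,
       msrp NUMERIC(8,2),
       name VARCHAR(32),
       description VARCHAR(1000),
       vendors ARRAY[ROW(name VARCHAR(32), price NUMERIC(8,2)), 10]
   );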
As a sanity check, start by loading the data file you used for INFER_TABLE_DDL. All rows should load successfully. The following query output has added carriage returns for readability:
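Continuing the sketch, the reload and a sample query might look like this; the row shown echoes the product-catalog sample that appears later on this page:

=> COPY products FROM '/home/data/products/sample.parquet' PARQUET;
 Rows Loaded
-------------
           3
(1 row)

=> SELECT id, name, msrp, description, vendors FROM products LIMIT 1;
  id  |    name    |  msrp  |       description        |               vendors
------+------------+--------+--------------------------+--------------------------------------
 1064 | 4k 48in TV | 329.99 | something long goes here | [{"name":"Amazon","price":"299.99"},
 {"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]
(1 row)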
INFER_TABLE_DDL bases its recommendations on a small sample of your data. As you load more data you might find outliers. Before you use your new table in production, test it with more data.
In the following example, a data file contains values that do not satisfy the table definition:
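Suppose a second, hypothetical file contains three rows, but COPY loads only two of them and emits a warning (message text omitted here). Querying the new rows reveals the other symptom; the values are illustrative:

=> COPY products FROM '/home/data/products/more_products.parquet' PARQUET;
 Rows Loaded
-------------
           2
(1 row)

=> SELECT id, name, msrp FROM products WHERE id > 1064;
  id  |               name               |  msrp
------+----------------------------------+--------
 1070 | Four-Slice Toaster               |  40.00
 1075 | Ultra HD Curved Smart TV with Re | 899.99
(2 rows)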
If you compare this query output to that of the previous query, one immediate difference is the width of the name column. The value in the second row uses the full VARCHAR(32) length and appears to be truncated.
By default, COPY truncates VARCHAR values that are too long instead of rejecting them. The warning from the COPY statement indicates that in addition to the name length, there is another problem that caused data not to be loaded.
The first rejection is caused by the long string. The second is caused by a price that exceeds the defined precision. You can increase both using ALTER TABLE:
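For instance, assuming name was declared as VARCHAR(32) and msrp as NUMERIC(8,2), the following statements widen both; Vertica generally permits enlarging a VARCHAR and raising NUMERIC precision at the same scale:

=> ALTER TABLE products ALTER COLUMN name SET DATA TYPE VARCHAR(64);
=> ALTER TABLE products ALTER COLUMN msrp SET DATA TYPE NUMERIC(10,2);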
Continue to load more data using the ENFORCELENGTH and REJECTED DATA options, adjusting your table definition as needed, until the rate of new problems drops to an acceptable level. Most data sets of any size contain some anomalies that will be rejected. You can use the REJECTMAX option to set a limit on rejections per data load; if the limit is exceeded, COPY aborts the load.
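A load using these options might look like the following; the wildcard path, rejection table name, and limit are all illustrative:

=> COPY products FROM '/home/data/products/*.parquet' PARQUET
   ENFORCELENGTH
   REJECTED DATA AS TABLE products_rejected
   REJECTMAX 100;

=> SELECT rejected_reason FROM products_rejected;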
Data formats
INFER_TABLE_DDL supports data in Parquet, ORC, Avro, and JSON formats. The first three formats enforce a schema on each file, but JSON data can contain any mix of fields and data types. When inspecting JSON data, INFER_TABLE_DDL can report multiple options. The iterative process of loading samples, examining the results, and adjusting the table definition is the same, but some details are different for JSON.
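For JSON, you can ask the function for more than one candidate definition; the path and the max_candidates value here are illustrative:

=> SELECT INFER_TABLE_DDL('/home/data/people/*.json'
       USING PARAMETERS format='json', table_name='people', max_candidates=3);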
For data in other formats, including delimited text data, you cannot use INFER_TABLE_DDL. You can examine sample data and create your own initial table definition, and then test iteratively. Alternatively, you can use a flex table to explore the data.
2 - Flex tables
Flex tables are designed for data with an unknown or varying schema. While flex tables generally do not meet the performance needs of a production system, you can use them to explore data and develop a table definition. You can use flex tables with all data formats, including delimited (CSV).
A flex table has a special column, named __raw__, that contains all fields from data you load into it. Each row is independent and might have different fields in this column. You can use flex functions to explore this data and materialize individual columns in your table. When you have identified the columns that should consistently be part of your table, you can then define a regular (non-flex) table.
You can iterate by loading and exploring small amounts of data, as described in Inference. However, flex tables allow you to load larger amounts of data without producing errors and rejections, so you might choose to start with a larger sample.
Column discovery
The simplest flex table does not specify any columns. In this example, the first thousand rows are pulled from a much larger data set:
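A minimal sketch, assuming the sample rows were exported to their own file; the table and file names are hypothetical:

=> CREATE FLEX TABLE reviews_flex();
CREATE TABLE
=> COPY reviews_flex FROM '/home/data/reviews/sample_1000.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
        1000
(1 row)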
The COMPUTE_FLEXTABLE_KEYS function extrapolates field names and candidate data types from the raw data. The more data you have loaded, the more work this function needs to do and the longer it takes to finish. If the function runs slowly, try a smaller sample size.
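Running the function and then inspecting the keys table it populates might look like this. The keys table name follows the flex-table convention of appending _keys; the frequency and data_type_guess values are illustrative, except for the text field, whose guess is referenced below:

=> SELECT COMPUTE_FLEXTABLE_KEYS('reviews_flex');
            COMPUTE_FLEXTABLE_KEYS
-----------------------------------------------------
 Please see public.reviews_flex_keys for updated keys
(1 row)

=> SELECT * FROM reviews_flex_keys ORDER BY key_name;
  key_name   | frequency | data_type_guess
-------------+-----------+-----------------
 business_id |      1000 | Varchar(44)
 cool        |      1000 | Integer
 date        |      1000 | Timestamp
 funny       |      1000 | Integer
 review_id   |      1000 | Varchar(44)
 stars       |      1000 | Float
 text        |      1000 | Varchar(9878)
 useful      |      1000 | Integer
 user_id     |      1000 | Varchar(44)
(9 rows)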
In this example, all 1000 rows have all of the fields. This suggests that the larger data set is probably consistent, particularly if the sample was taken from the middle. In this case, the data was exported from another database. Had the data been more varied, you would see different numbers of occurrences in the keys table.
Unlike INFER_TABLE_DDL, COMPUTE_FLEXTABLE_KEYS proposes lengths for VARCHAR values; like INFER_TABLE_DDL, it suggests precision and scale for NUMERIC values. The function uses the largest values found in the data currently in the table. Future data could exceed these limits, so make sure you are using representative data or a large enough sample.
If you query a flex table using SELECT *, the results include the raw column. This column is in a binary format, so querying it is not especially useful. To see a sample of the data, query the fields explicitly, as if they were columns. The following example omits the text of the review, which COMPUTE_FLEXTABLE_KEYS reported as potentially quite large (VARCHAR(9878)):
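For example (the review IDs and values are illustrative):

=> SELECT review_id, stars, useful, funny, cool FROM reviews_flex LIMIT 3;
       review_id        | stars | useful | funny | cool
------------------------+-------+--------+-------+------
 gkcPdbblTvZDMSwx8nVEKw |     5 |      0 |     0 |    0
 mQfl6ci46mu0xaZrkRUhlA |     4 |      1 |     0 |    0
 D3Pl0O1Yyt2SbbyGU8MXAg |     3 |      0 |     0 |    0
(3 rows)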
This sample shows several integer fields with values of 0. These fields (useful, funny, and cool) appear to be counters indicating how many people found the review to have those qualities. To see how often they are used, which might help you decide whether to include them in your table, you can use the COUNT function:
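For example, the following query counts the rows in which each counter is nonzero. The casts are needed because flex fields are untyped until materialized:

=> SELECT COUNT(CASE WHEN useful::INT > 0 THEN 1 END) AS useful_used,
          COUNT(CASE WHEN funny::INT  > 0 THEN 1 END) AS funny_used,
          COUNT(CASE WHEN cool::INT   > 0 THEN 1 END) AS cool_used
   FROM reviews_flex;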
All of the fields in this data except text are fairly small. How much of the review text you want to include depends on how you plan to use the data. If you plan to analyze the text, such as to perform sentiment analysis, you might set a larger size. If you don't plan to use the text, you might either ignore the text entirely or set a small size and allow the data to be truncated on load:
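A definition along these lines, ignoring text entirely, might look like the following; the column names come from the keys computed above, and the sizes are judgment calls. The COPY then raises the warning about unmatched keys described next:

=> CREATE TABLE reviews(
       review_id VARCHAR(44),
       user_id VARCHAR(44),
       business_id VARCHAR(44),
       stars FLOAT,
       date TIMESTAMP,
       useful INT,
       funny INT,
       cool INT
   );
CREATE TABLE
=> COPY reviews FROM '/home/data/reviews/sample_1000.json' PARSER FJSONPARSER();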
The JSON parser warns about keys found in the data that are not part of the table definition. Some other parsers require you to consume all of the fields. Consult the reference pages for specific parsers for more information about whether and how you can ignore fields.
To find out what unmatched fields appeared in the data, query the UDX_EVENTS system table:
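The unmatched key names appear in the __raw__ column of UDX_EVENTS, which you can expand with MAPTOSTRING; a minimal sketch:

=> SELECT event_type, MAPTOSTRING(__raw__) AS details FROM UDX_EVENTS;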
The UDX_EVENTS table shows that the only unmatched field is the previously seen text field. You can safely commit the load transaction. When you have tested with enough data to feel confident, you can suppress the parser warning as explained on the FJSONPARSER reference page.
Complex types
Flex tables can also help you discover the structure of complex types such as arrays and structs. In the following example, the product catalog data from the inference examples has been loaded into a flex table named products_flex. The vendors field has a type of LONG VARBINARY, but the name of the column implies a collection. You can use the MAPTOSTRING function to sample the data:
=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;
                                  MAPTOSTRING
-------------------------------------------------------------------------------
 {
    "description": "something long goes here",
    "id": "1064",
    "msrp": "329.99",
    "name": "4k 48in TV",
    "vendors": {
       "0.name": "Amazon",
       "0.price": "299.99",
       "1.name": "Best Buy",
       "1.price": "289.99",
       "2.name": "Bob's Hardware",
       "2.price": "309.99"
    }
 }
(1 row)
By default, the JSON parser "flattens" JSON maps (key/value pairs). In this example, the value of the vendors field is presented as a set of six key/value pairs, but the key names (number.name and number.price) indicate that the data is more structured. Disabling this flattening makes the true structure of the JSON data more clear:
=> TRUNCATE TABLE products_flex;
TRUNCATE TABLE
=> COPY products_flex FROM '/home/data/products/sample.json' PARSER FJSONPARSER(flatten_maps=false);
 Rows Loaded
-------------
           3
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;
                                  MAPTOSTRING
--------------------------------------------------------------------------------
 {
    "description": "something long goes here",
    "id": "1064",
    "msrp": "329.99",
    "name": "4k 48in TV",
    "vendors": {
       "0": {
          "name": "Amazon",
          "price": "299.99"
       },
       "1": {
          "name": "Best Buy",
          "price": "289.99"
       },
       "2": {
          "name": "Bob's Hardware",
          "price": "309.99"
       }
    }
 }
(1 row)
This data indicates that the vendors field holds an array of structs with two fields, name and price. Using this information and the computed keys for the other fields in the data, you can define a table with strongly-typed columns:
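The result is essentially the refined definition reached in the inference section; the sizes and the array bound remain judgment calls:

=> CREATE TABLE products(
       id INT,
       name VARCHAR(32),
       msrp NUMERIC(8,2),
       description VARCHAR(1000),
       vendors ARRAY[ROW(name VARCHAR(32), price NUMERIC(8,2)), 10]
   );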
The JSON format supports arrays, structs, and combinations of the two. Some formats are more limited. Delimited (CSV) data can represent arrays of scalar types, but not structs. Further, not all parsers that you can use with flex tables support complex types.
The parser has mistaken the arrays for strings, and COMPUTE_FLEXTABLE_KEYS has mistaken an array of numbers for an INTERVAL and an array of strings for a VARCHAR. These are limitations of flex tables used with certain file formats, not of Vertica in general. If you see unusual results, try adding real columns to the flex table:
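For example, assuming delimited data with two array fields, you might declare just those columns and leave everything else to the __raw__ column. The table, column, and file names are hypothetical, and the CREATE statement may emit a warning, discussed next:

=> CREATE FLEX TABLE readings_flex(
       scores ARRAY[INT],
       labels ARRAY[VARCHAR(20)]
   );
=> COPY readings_flex FROM '/home/data/readings/sample.csv' PARSER FDELIMITEDPARSER(delimiter=',');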
For data exploration, you need not be concerned with such warnings. When you are ready to define a production table, heed any warnings you receive.
After loading the data, you can confirm that the arrays were loaded correctly:
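For example, continuing the hypothetical table above:

=> SELECT scores, labels FROM readings_flex;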
You can continue to use flex tables to explore data, creating real columns for complex types as you encounter them. FDELIMITEDPARSER is specific to flex tables; for native tables (created with CREATE TABLE), DELIMITED is the default parser:
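Once the definition is settled, the native equivalent drops the flex machinery. Because DELIMITED is the default parser, no PARSER clause is needed (names hypothetical):

=> CREATE TABLE readings(
       scores ARRAY[INT],
       labels ARRAY[VARCHAR(20)]
   );
=> COPY readings FROM '/home/data/readings/*.csv';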