This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Flex parsers reference

Vertica supports several parsers to load different types of data into flex tables.

Vertica supports several parsers to load different types of data into flex tables. The parsers in this section are specific to flex tables. The parsers described in Data formats can also load data into flex tables.

All flex parsers store the data as a single Vmap in the LONG VARBINARY __raw__ column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.

For information about how you can use each type of flex parser, see Using flex table parsers.

1 - FCEFPARSER

Parses ArcSight Common Event Format (CEF) log files.

Parses ArcSight Common Event Format (CEF) log files. The fcefparser loads values directly into any table column with a column name that matches a source data key. The parser stores the data loaded into a flex table in a single VMap.

Syntax

FAVROPARSER ( [parameter-name='value'[,...]] )

Parameters

delimiter

Single-character delimiter.

Default: ' '

record_terminator

Single-character record terminator.

**Default ****value: **newline

trim

Boolean, specifies whether to trim white space from header names and key values.

Default: true

reject_on_unescaped_delimiter

Boolean, specifies whether to reject rows containing unescaped delimiters. The CEF standard does not permit them.

Default: false

Examples

The following example illustrates creating a sample flex table for CEF data, with two real columns, eventId and priority.

  1. Create a flex table cefdata:

    => create flex table cefdata();
    CREATE TABLE
    
  2. Load some basic CEF data, using the flex parser fcefparser:

    => copy cefdata from stdin parser fcefparser();
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> CEF:0|ArcSight|ArcSight|2.4.1|machine:20|New alert|High|
    >> \.
    
  3. Use the maptostring() function to view the contents of your cefdata flex table:

    => select maptostring(__raw__) from cefdata;
                      maptostring
    -------------------------------------------------------------
     {
       "deviceproduct" : "ArcSight",
       "devicevendor" : "ArcSight",
       "deviceversion" : "2.4.1",
       "name" : "New alert",
       "severity" : "High",
       "signatureid" : "machine:20",
       "version" : "0"
    }
    
    
    (1 row)
    
  4. Select some virtual columns from the cefdata flex table:

    
    = select deviceproduct, severity, deviceversion from cefdata;
     deviceproduct | severity | deviceversion
    ---------------+----------+---------------
     ArcSight      | High     | 2.4.1
    (1 row)
    

    For more information, see Common event format (CEF) data

    See also

2 - FCSVPARSER

Parses CSV format (comma-separated values) data.

Parses CSV format (comma-separated values) data. Use this parser to load CSV data into columnar, flex, and hybrid tables. All data must be encoded in Unicode UTF-8 format. The fcsvparser parser supports the RFC 4180 standard for CSV data, and other options, to accommodate variations in CSV file format definitions. Invalid records are rejected. For more information about data formats, see Handling Non-UTF-8 input.

Syntax

FCSVPARSER ( [parameter='value'[,...]] )

Parameters

type

The default parameter values for the parser, one of the following strings:

  • rfc4180

  • traditional

You do not have to use the type parameter when loading data that conforms to the RFC 4180 standard (such as MS Excel files). See Loading CSV data for the RFC4180 default parameters, and other options you can specify for traditional CSV files.

Default: RFC4180

delimiter

A single-character value used to separate fields in the CSV data.

Default: , (for rfc4180 and traditional)

escape

A single-character value used as an escape character to interpret the next character in the data literally.

Default:

  • rfc4180: "

  • traditional: \

enclosed_by

A single-character value. Use enclosed_by to include a value that is identical to the delimiter, but should be interpreted literally. For example, if the data delimiter is a comma (,), and you want to use a comma within the data ("my name is jane, and his is jim").

Default: "

record_terminator

A single-character value used to specify the end of a record.

Default:

  • rfc4180: \n

  • traditional: \r\n

header

Boolean, specifies whether to use the first row of data as a header column. When header=true (default), and no header exists, fcsvparser uses a default column heading. The default header consists of ucoln, where n is the column offset number, starting with 0 for the first column. You can specify custom column heading names using the header_names parameter, described next.

If you specify header=false, the fcsvparser parses the first row of input as data, rather than as column headers.

Default: true

header_names A list of column header names, delimited by the character defined by the parser's delimiter parameter. Use this parameter to specify header names in a CSV file without a header row, or to override the column names present in the CSV source. To override one or more existing column names, specify the header names to use. This parameter overrides any header row in the data.
trim

Boolean, specifies whether to trim white space from header names and key values.

Default: true

omit_empty_keys

Boolean, specifies how the parser handles header keys without values. If true, keys with an empty value in the header row are not loaded.

Default: false

reject_on_duplicate

Boolean, specifies whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.

Default: false

reject_on_empty_key

Boolean, specifies whether to reject any row containing a key without a value.

Default: false

reject_on_materialized_type_error

Boolean, specifies whether to reject any materialized column value that the parser cannot coerce into a compatible data type. See Loading CSV data.

Default: false

Examples

This example shows how you can use fcsvparser to load a flex table, build a view, and then query that view.

  1. Create a flex table for CSV data:

    => CREATE FLEX TABLE rfc();
    CREATE TABLE
    
  2. Use fcsvparser to load the data from STDIN. Specify that no header exists, and enter some data as shown:

    => COPY rfc FROM stdin PARSER fcsvparser(header='false');
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 10,10,20
    >> 10,"10",30
    >> 10,"20""5",90
    >> \.
    
  3. Run the compute_flextable_keys_and_build_view function, and query the rfc_view. Notice that the default enclosed_by character permits an escape character (") within a field ("20""5"). Thus, the resulting value was parsed correctly. Since no header existed in the input data, the function added ucoln for each column:

    => SELECT compute_flextable_keys_and_build_view('rfc');
                               compute_flextable_keys_and_build_view
    --------------------------------------------------------------------------------------------
     Please see public.rfc_keys for updated keys
    The view public.rfc_view is ready for querying
    (1 row)
    
    => SELECT * FROM rfc_view;
     ucol0 | ucol1 | ucol2
    -------+-------+-------
     10    | 10    | 20
     10    | 10    | 30
     10    | 20"5  | 90
    (3 rows)
    

    For more information and examples using other parameters of this parser, see Loading CSV data.

See also

3 - FDELIMITEDPAIRPARSER

Parses delimited data files.

Parses delimited data files. This parser provides a subset of the functionality in the parser fdelimitedparser. Use the fdelimitedpairparser when the data you are loading specifies pairs of column names with data in each row.

Syntax

FDELIMITEDPAIRPARSER ( [parameter-name='value'[,...]] )

Parameters

delimiter

Specifies a single-character delimiter.

Default: ' '

record_terminator

Specifies a single-character record terminator.

Default: newline

trim

Boolean specifies whether to trim white space from header names and key values.

Default: true

Examples

The following example illustrates creating a sample flex table for simple delimited data, with two real columns, eventId and priority.

  1. Create a table:

    => create flex table CEFData(eventId int default(eventId::int), priority int default(priority::int) );
    CREATE TABLE
    
  2. Load a sample delimited OpenText ArcSight log file into the CEFData table, using the fcefparser:

    => copy CEFData from '/home/release/kmm/flextables/sampleArcSight.txt' parser fdelimitedpairparser();
    Rows Loaded | 200
    
  3. After loading the sample data file, use maptostring() to display the virtual columns in the __raw__ column of CEFData:

    => select maptostring(__raw__) from CEFData limit 1;                                                                                                                                                                                                                                        maptostring
    -----------------------------------------------------------
       "agentassetid" : "4-WwHuD0BABCCQDVAeX21vg==",
       "agentzone" : "3083",
       "agt" : "265723237",
       "ahost" : "svsvm0176",
       "aid" : "3tGoHuD0BABCCMDVAeX21vg==",
       "art" : "1099267576901",
       "assetcriticality" : "0",
       "at" : "snort_db",
       "atz" : "America/Los_Angeles",
       "av" : "5.3.0.19524.0",
       "cat" : "attempted-recon",
       "categorybehavior" : "/Communicate/Query",
       "categorydevicegroup" : "/IDS/Network",
       "categoryobject" : "/Host",
       "categoryoutcome" : "/Attempt",
       "categorysignificance" : "/Recon",
       "categorytechnique" : "/Scan",
       "categorytupledescription" : "An IDS observed a scan of a host.",
       "cnt" : "1",
       "cs2" : "3",
       "destinationgeocountrycode" : "US",
       "destinationgeolocationinfo" : "Richardson",
       "destinationgeopostalcode" : "75082",
       "destinationgeoregioncode" : "TX",
       "destinationzone" : "3133",
       "device product" : "Snort",
       "device vendor" : "Snort",
       "device version" : "1.8",
       "deviceseverity" : "2",
       "dhost" : "198.198.121.200",
       "dlat" : "329913940429",
       "dlong" : "-966644973754",
       "dst" : "3334896072",
       "dtz" : "America/Los_Angeles",
       "dvchost" : "unknown:eth1",
       "end" : "1364676323451",
       "eventid" : "1219383333",
       "fdevice product" : "Snort",
       "fdevice vendor" : "Snort",
       "fdevice version" : "1.8",
       "fdtz" : "America/Los_Angeles",
       "fdvchost" : "unknown:eth1",
       "lblstring2label" : "sig_rev",
       "locality" : "0",
       "modelconfidence" : "0",
       "mrt" : "1364675789222",
       "name" : "ICMP PING NMAP",
       "oagentassetid" : "4-WwHuD0BABCCQDVAeX21vg==",
       "oagentzone" : "3083",
       "oagt" : "265723237",
       "oahost" : "svsvm0176",
       "oaid" : "3tGoHuD0BABCCMDVAeX21vg==",
       "oat" : "snort_db",
       "oatz" : "America/Los_Angeles",
       "oav" : "5.3.0.19524.0",
       "originator" : "0",
       "priority" : "8",
       "proto" : "ICMP",
       "relevance" : "10",
       "rt" : "1099267573000",
       "severity" : "8",
       "shost" : "198.198.104.10",
       "signature id" : "[1:469]",
       "slat" : "329913940429",
       "slong" : "-966644973754",
       "sourcegeocountrycode" : "US",
       "sourcegeolocationinfo" : "Richardson",
       "sourcegeopostalcode" : "75082",
       "sourcegeoregioncode" : "TX",
       "sourcezone" : "3133",
       "src" : "3334891530",
       "start" : "1364676323451",
       "type" : "0"
    }
    
    (1 row)
    
  4. Select the eventID and priority real columns, along with two virtual columns, atz and destinationgeoregioncode:

    
    =>  select eventID, priority, atz, destinationgeoregioncode from CEFData limit 10;
      eventID   | priority |         atz         | destinationgeoregioncode
    ------------+----------+---------------------+--------------------------
     1218325417 |        5 | America/Los_Angeles |
     1219383333 |        8 | America/Los_Angeles | TX
     1219533691 |        9 | America/Los_Angeles | TX
     1220034458 |        5 | America/Los_Angeles | TX
     1220034578 |        9 | America/Los_Angeles |
     1220067119 |        5 | America/Los_Angeles | TX
     1220106960 |        5 | America/Los_Angeles | TX
     1220142122 |        5 | America/Los_Angeles | TX
     1220312009 |        5 | America/Los_Angeles | TX
     1220321355 |        5 | America/Los_Angeles | CA
    (10 rows)
    

See also

4 - FDELIMITEDPARSER

Parses data using a delimiter character to separate values.

Parses data using a delimiter character to separate values. The fdelimitedparser loads delimited data, storing it in a single-value VMap. You can use this parser to load data into columnar and flex tables.

Syntax

FDLIMITEDPARSER ( [parameter-name='value'[,...]] )

Parameters

delimiter

Single character delimiter.

Default: |

record_terminator

Single-character record terminator.

Default: \n

trim

Boolean, specifies whether to trim white space from header names and key values.

Default: true

header

Boolean, specifies that a header column exists. The parser uses col### for the column names if you use this parameter but no header exists.

Default: true

omit_empty_keys

Boolean, specifies how the parser handles header keys without values. If omit_empty_keys=true, keys with an empty value in the headerrow are not loaded.

Default: false

reject_on_duplicate

Boolean, specifies whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.

Default: false

reject_on_empty_key

Boolean, specifies whether to reject any row containing a key without a value.

Default: false

reject_on_materialized_type_error

Boolean, specifies whether to reject any row value for a materialized column that the parser cannot coerce into a compatible data type. See Using flex table parsers.

Default: false

treat_empty_val_as_null

Boolean, specifies that empty fields become NULLs, rather than empty strings ('').

Default: true

Examples

  1. Create a flex table for delimited data:

    t=> CREATE FLEX TABLE delim_flex ();
    CREATE TABLE
    
  2. Use the fdelimitedparser to load some delimited data from STDIN, specifying a comma (,) column delimiter:

    => COPY delim_flex FROM STDIN parser fdelimitedparser (delimiter=',');
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> deviceproduct, severity, deviceversion
    >> ArcSight, High, 2.4.1
    >> \.
    

You can now query virtual columns in the delim_flex flex table:

=> SELECT deviceproduct, severity, deviceversion from delim_flex;
 deviceproduct | severity | deviceversion
---------------+----------+---------------
 ArcSight      | High     | 2.4.1
(1 row)

See also

5 - FREGEXPARSER

Parses a regular expression, matching columns to the contents of the named regular expression groups.

Parses a regular expression, matching columns to the contents of the named regular expression groups.

Syntax

FREGEXPARSER ( pattern=[parameter-name='value'[,...]] )

Parameters

pattern

Specifies the regular expression of data to match.

Default: Empty string ("")

use_jit

Boolean, specifies whether to use just-in-time compiling when parsing the regular expression.

Default: false

record_terminator

Specifies the character used to separate input records.

Default: \n

logline_column

A string that captures the destination column containing the full string that the regular expression matched.

Default: Empty string ("")

Example

These examples use the following regular expression, which searches for information that includes the timestamp, date, thread_name, and thread_id strings.

This example expression loads any thread_id hex value, regardless of whether it has a 0x prefix, (<thread_id>(?:0x)?[0-9a-f]+).

'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
 (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
  1. Create a flex table (vlog) to contain the results of a Vertica log file. For this example, we made a copy of a log file in the directory /home/dbadmin/data/vertica.log:

    => CREATE FLEX TABLE vlog1();
    CREATE TABLE
    
  2. Use the fregexparser with the sample regular expression to load data from the log file. Be sure to remove any line characters before using this expression shown here:

    =>  COPY vlog1 FROM '/home/dbadmin/tempdat/KMvertica.log'
    PARSER FREGEXPARSER(pattern=
    '^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
     (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
    -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
    \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
    ?(?<text>.*)'
    
    ); Rows Loaded ------------- 31049 (1 row)
  3. After successfully loading data, use the MAPTOSTRING() function with the table's __raw__ column. The four rows (limt 4) that the query returns are regular expression results of the KMvertica.log file, parsed with fregexparser. The output shows thread_id values with a preceding 0x or without:

    => SELECT maptostring(__raw__) FROM vlog1 LIMIT 4;
                            maptostring
    -------------------------------------------------------------------------------------
    {
      "text" : " [Init] <INFO> Log /home/dbadmin/VMart/v_vmart_node0001_catalog/vertica.log 
    opened; #2",
      "thread_id" : "0x7f2157e287c0",
      "thread_name" : "Main",
      "time" : "2017-03-21 23:30:01.704"
    }
    
    {
      "text" : " [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -D 
    /home/dbadmin/VMart/v_vmart_node0001_catalog -C VMart -n v_vmart_node0001 -h 
    10.20.100.247 -p 5433 -P 4803 -Y ipv4",
      "thread_id" : "0x7f2157e287c0",
      "thread_name" : "Main",
      "time" : "2017-03-21 23:30:01.704"
    }
    
    {
      "text" : " [Init] <INFO> Starting up Vertica Analytic Database v8.1.1-20170321",
      "thread_id" : "7f2157e287c0",
      "thread_name" : "Main",
      "time" : "2017-03-21 23:30:01.704"
    }
    
    {
      "text" : " [Init] <INFO> Compiler Version: 4.8.2 20140120 (Red Hat 4.8.2-15)",
      "thread_id" : "7f2157e287c0",
      "thread_name" : "Main",
      "time" : "2017-03-21 23:30:01.704"
    }
    (4 rows)
    

See also