Google Cloud Storage (GCS)

The Switchboard Google Cloud Storage (GCS) connector provides automated, scheduled ingestion of data from GCS in a variety of formats.

Prerequisites

To configure access to the Google Cloud Storage (GCS) connector, you need:

  • Google OAuth or Google service account credentials
  • Google Cloud Project Name
  • Bucket Path
  • File Path

To obtain these credentials, contact the administrator of your Google Cloud Storage Account.

Scheduling

The Google Cloud Storage (GCS) connector can be scheduled to run multiple times a day, at user-defined hours and in a user-defined timezone.

  • To configure this schedule, use the delay_hours parameter.
  • By default, the connector will run once at 6am PT.

Sample Switchboard Script

import gs_file_raw from {
    type: "gcs_ng";
    regex: "gs://source_bucket/my_file_name.*";
    datetime_pattern: "*_YYYY-MM-DD_*";
    key: "gcs_key";
    lookback_days: 1;
} using {
    filename: string;
    date: datetime;
    bx_pub_name: string;
    placement: string;
    device: string;
    demand_source: string;
};

For parquet files, use the raw version of this downloader.

The raw version does not support the format, encoding, and compression parameters that the downloader above supports. It accepts only the parquet format, with no other options.

download t from {
    type: "gcs:raw";
    pattern: "gs://source_bucket/my_file_name.*";
    datetime_pattern: "*file-YYYY-MM-DD.parquet";
    format: "parquet";
} using {
    idcol: integer;
    *
};

Parameters

| Parameter | Description | Required/Optional |
| --- | --- | --- |
| File pattern | A list of requested file patterns. For example: pattern: "gs://source_bucket/my_pattern"; or regex: "gs://source_bucket/my_file_name."; | Required |
| lookback_days | Limits how many previous days the datetime pattern applies to. For example: lookback_days: 5; | Optional |
| period_hours | How often, in hours, the system checks for updates. For example: period_hours: 3; | Optional |
| delay_hours | Delay, in hours, that the system waits between the previous update and the next update after midnight. For example: delay_hours: 11; | Optional |
| format | Specifies a format type. For example: format: "csv"; For additional CSV specific parameters, see the File Formats and Encoding section. | Optional |
| datetime_pattern | Specifies the date and time pattern type. For example: datetime_pattern: "YYYY-MM-DD"; For additional information, see the Datetime patterns section. | Optional |

Specifying Patterns

File Patterns

Switchboard matches target GCS files using either wildcard patterns or regular expressions. It polls the source bucket for new files that match the pattern or regular expression provided. By default, Switchboard re-ingests a file when it detects a change in the source file's checksum.

  • To specify a file match by pattern, use the pattern parameter. The * character is used as a wildcard pattern match:
    pattern: "gs://source_bucket/my_pattern*";
    
  • To specify a file match using a regular expression, use the regex parameter containing a valid matching pattern:
    regex: "gs://source_bucket/my_pattern(a|b)_\d{6}.csv";
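
The difference between the two matching styles can be sketched in Python. This is only an illustration, not Switchboard's implementation: the object names are hypothetical, the wildcard is assumed to behave like a shell-style `*`, and the regular expression is assumed to have to match the whole object name. The sketch also escapes the dot before `csv` so that it matches a literal period.

```python
import fnmatch
import re

# Hypothetical object names in a source bucket (illustration only).
objects = [
    "gs://source_bucket/my_patterna_202001.csv",
    "gs://source_bucket/my_patternb_202002.csv",
    "gs://source_bucket/my_patternc_202003.csv",
    "gs://source_bucket/other_file.csv",
]


def match_wildcard(names, pattern):
    """Return the names that match a shell-style wildcard (* matches any run of characters)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def match_regex(names, regex):
    """Return the names that the regular expression matches in full (an assumption)."""
    compiled = re.compile(regex)
    return [n for n in names if compiled.fullmatch(n)]


wildcard_hits = match_wildcard(objects, "gs://source_bucket/my_pattern*")
regex_hits = match_regex(objects, r"gs://source_bucket/my_pattern(a|b)_\d{6}\.csv")

print(wildcard_hits)  # the three my_pattern* objects
print(regex_hits)     # only the "a" and "b" variants
```

The wildcard is the simpler choice when a common prefix is enough; the regular expression is useful when only some variants of a name should be ingested.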
    

Datetime patterns

Configure Switchboard to poll for file names that match a date pattern. A date pattern also makes it possible to import a date-range backfill from the Switchboard UI.

  • Add a datetime_pattern to the import configuration. Because target objects may have multiple dates in the filename, specify a pattern that matches the exact date string required.
    • To match the first date string in a GCS object name of gs://my_bucket/my_file_name_2020-01-01_2020-01-08.csv, use the following pattern:
      datetime_pattern: "*my_file_name_YYYY-MM-DD_*";
      
  • To limit how many previous days the datetime pattern applies to, use the lookback_days parameter. If no files are found for the lookback period, Switchboard treats this as an error.
    • To match files whose date falls within the past 5 days:
      lookback_days: 5;
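
As a rough Python illustration of how a datetime pattern and a lookback window could work together, the sketch below extracts the first date from an object name and checks whether it falls inside the window. It is not Switchboard's implementation. The helper names are hypothetical, and the sketch assumes that `*` matches any run of characters, that `YYYY`, `MM`, and `DD` stand for digit groups, and that the lookback window is inclusive of both ends.

```python
import re
from datetime import date, timedelta


def datetime_pattern_to_regex(pattern):
    """Translate a pattern such as "*my_file_name_YYYY-MM-DD_*" into a compiled regex.

    Assumption: * is a wildcard and YYYY, MM, DD are fixed-width digit groups.
    Literal parts of the pattern must not themselves contain YYYY, MM, or DD.
    """
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*?")
    escaped = escaped.replace("YYYY", r"(?P<year>\d{4})")
    escaped = escaped.replace("MM", r"(?P<month>\d{2})")
    escaped = escaped.replace("DD", r"(?P<day>\d{2})")
    return re.compile(escaped)


def extract_date(object_name, pattern):
    """Return the date captured by the pattern, or None if the name does not match."""
    match = datetime_pattern_to_regex(pattern).fullmatch(object_name)
    if match is None:
        return None
    return date(int(match["year"]), int(match["month"]), int(match["day"]))


def within_lookback(file_date, today, lookback_days):
    """True if file_date is no more than lookback_days before today (inclusive)."""
    return today - timedelta(days=lookback_days) <= file_date <= today


name = "gs://my_bucket/my_file_name_2020-01-01_2020-01-08.csv"
found = extract_date(name, "*my_file_name_YYYY-MM-DD_*")

print(found)
print(within_lookback(found, date(2020, 1, 5), 5))
print(within_lookback(found, date(2020, 1, 10), 5))
```

Anchoring the pattern on the text just before the date (`my_file_name_`) is what selects the first date rather than the second one in the name.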
      

File Formats and Encoding

Switchboard imports files with a variety of formats, encodings, and compression schemes.

File Formats

To specify a format type, use the format parameter:

format: "csv";

The available options are:

| Format | Description |
| --- | --- |
| csv | Character-separated values, one record per row. See CSV specific options. |
| json | Newline-delimited JSON |
| parquet | Parquet file format (use the gcs:raw downloader instead) |
| avro | Avro file format |

CSV specific options

| Options | Description |
| --- | --- |
| header_row | Boolean: skip the header row |
| preamble_rows | Count of leading rows to skip |
| postamble_rows | Count of trailing rows to skip |
| delimiter | Delimiter character (see the list below) |

Delimiter characters:

  • comma: ","
  • pipe: "|"
  • tab: "\t"
  • thorn: "þ"
  • space: " "
  • caret: "^"
  • semicolon: ";"
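
To make the effect of these options concrete, here is a minimal Python sketch using the standard csv module. It is an illustration under assumptions, not Switchboard's parser: the function name `read_rows` and the sample data are hypothetical, and the sketch assumes that the header row is the first row left after the preamble rows have been dropped.

```python
import csv
import io


def read_rows(text, delimiter=",", header_row=False, preamble_rows=0, postamble_rows=0):
    """Parse delimited text, dropping leading rows, trailing rows, and an optional header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    end = len(rows) - postamble_rows
    rows = rows[preamble_rows:end]
    if header_row and rows:
        rows = rows[1:]  # assumed: the header follows the preamble
    return rows


# Hypothetical pipe-delimited export with one preamble row and one postamble row.
sample = (
    "report generated 2020-01-01\n"
    "placement|device\n"
    "home|mobile\n"
    "news|desktop\n"
    "total rows: 2\n"
)

rows = read_rows(sample, delimiter="|", header_row=True, preamble_rows=1, postamble_rows=1)
print(rows)
```

With these settings only the two data records remain, which is the usual goal when an export wraps its data in a title line and a summary line.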

Encodings

By default, Switchboard assumes files are encoded in UTF-8. Encoding names use the standard Java Charset format strings.

  • To read a file encoded in Latin Alphabet No. 1 (ISO-8859-1), provide the encoding parameter in the import statement:
    encoding: "ISO-8859-1";
    

Compression

Switchboard can ingest files compressed with gzip or zip.

  • To specify files in gzip format, provide the compression parameter in the import statement:
    compression: "gzip";
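
The Python sketch below shows why the encoding and compression settings both matter when a file is read: the bytes are decompressed first and then decoded with the declared character set. It is an illustration only, not Switchboard's implementation. The function name `decode_payload` and the sample data are hypothetical, and only gzip is handled to keep the example short.

```python
import gzip


def decode_payload(raw, encoding="UTF-8", compression=None):
    """Decompress raw bytes if requested, then decode them to text."""
    if compression == "gzip":
        raw = gzip.decompress(raw)
    elif compression is not None:
        raise ValueError(f"unsupported compression in this sketch: {compression}")
    return raw.decode(encoding)


# Hypothetical file content: Latin-1 text, gzip-compressed.
original = "café,10\n"
payload = gzip.compress(original.encode("ISO-8859-1"))

text = decode_payload(payload, encoding="ISO-8859-1", compression="gzip")
print(text == original)

# Decoding the same bytes with the UTF-8 default fails, because the
# Latin-1 byte for "é" is not valid UTF-8 in this position.
try:
    decode_payload(payload, compression="gzip")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

A file that contains only ASCII characters decodes identically under UTF-8 and ISO-8859-1, so a wrong encoding setting may go unnoticed until the first accented character appears.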