Amazon S3
The Switchboard Amazon Simple Storage Service (S3) connector provides automated, scheduled ingestion of data from S3 in a variety of formats.
- Prerequisites
- Scheduling
- Parameters
- Specifying Patterns
- File Formats and Encoding
- Sample Switchboard Script
Prerequisites
To configure access to the Amazon S3 connector, you need:
- Amazon AWS IAM Key/Secret Pair
- Bucket Path
- File Path
To add a new Amazon AWS IAM Key/Secret Pair:
- Log in to the Dashboard.
- Click on the Keys tab.
- In the Options menu, select AWS.
NOTE: The Amazon AWS IAM Key/Secret Pair must have read access to the source bucket.
Scheduling
The Amazon S3 connector can be scheduled to run multiple times a day at a user-defined hour and time zone.
- To configure this schedule, use the delay_hours parameter.
- By default, the connector will run once at 6am PT.
Using Switchboard Static IP
If required by IT or security policy, this connector can be configured to route traffic through one of Switchboard’s static IP addresses. To do so, include the parameter static_ip: true; in the Switchboard Script import statement.
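As a minimal sketch of where this parameter sits, the import below adds static_ip to a statement modeled on the sample script at the end of this page. The table name s3_static_ip_raw, the bucket path, and the two-column schema are placeholders chosen for illustration, not values from a real configuration.

```
import s3_static_ip_raw from {
  type: "s3_ng";
  pattern: "s3://source_bucket/my_pattern*";
  key: "aws_key";
  static_ip: true;
} using {
  filename: string;
  date: datetime;
};
```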
Parameters
- File Pattern string
- required
- A list of requested file patterns.
- Example: pattern: "s3://source_bucket/my_pattern*";
- Example: regex: "s3://source_bucket/my_file_name.*";
- datetime_pattern string
- optional
- Specifies the date and time pattern used to match file names.
- Example: datetime_pattern: "YYYY-MM-DD";
- For additional information, see the Datetime Patterns section.
- delay_hours integer
- optional
- Number of hours the system waits between the previous update and the next update. Connector-specific variations are described in each connector's Scheduling section.
- Example: delay_hours: 11;
- format string
- optional
- Specifies a format type.
- Example: format: "csv";
- For additional CSV specific parameters, see the File Formats and Encoding section.
- lookback_days integer
- optional
- Limits how many previous days the datetime pattern applies to.
- Example: lookback_days: 5;
- period_hours integer
- optional
- How often, in hours, the system checks for updates.
- Example: period_hours: 3;
Specifying Patterns
File Patterns
Switchboard matches target S3 files using wildcard patterns or regular expressions. It polls the source bucket for new files that match the pattern or regular expression provided. By default, Switchboard re-ingests a file when it detects a change in the source file's checksum.
- To specify a file match by pattern, use the pattern parameter. The * character is used as a wildcard pattern match:
pattern: "s3://source_bucket/my_pattern*";
- To specify a file match using a regular expression, use the regex parameter containing a valid matching pattern:
regex: "s3://source_bucket/my_pattern(a|b)_\d{6}\.csv";
Datetime Patterns
Switchboard can be configured to poll for file names that match a date pattern. A datetime pattern also makes date-range backfills available in the Switchboard UI.
- Add a datetime_pattern to the import configuration. Since target objects may have multiple dates in the filename, it is important to specify a pattern that matches the specific date string required.
- To match the first date string in an S3 object name of s3://my_bucket/my_file_name_2020-01-01_2020_01-08.csv, use the following pattern:
datetime_pattern: "*my_file_name_YYYY-MM-DD_*";
- To limit how many previous days the datetime pattern applies to, use the lookback_days parameter. If no files are found within the lookback period, Switchboard treats this as an error.
- To locate files with a DateTime pattern that matches files within the past 5 days:
lookback_days: 5;
File Formats and Encoding
Switchboard imports files with a variety of formats, encodings, and compression schemes.
File Formats
To specify a format type, use the format parameter:
format: "csv";
The available options are:
Format | Description |
---|---|
csv | Character-separated values, one record per row. See CSV specific options |
json | Newline-delimited JSON |
parquet | Parquet file format. Note: use the s3:raw downloader instead |
avro | Avro file format |
CSV specific options
Options | Description |
---|---|
header_row | Boolean: Skip header row |
preamble_rows | Count of leading rows to skip |
postamble_rows | Count of trailing rows to skip |
delimiter | Delimiter character. Supported values: comma ",", pipe "|", tab "\t", thorn "þ", space " ", caret "^", semicolon ";" |
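The options in this table can be sketched together as follows, assuming they are set next to format inside the import block, as the Parameters section implies. The table name s3_csv_raw, the row counts, the pipe delimiter, and the schema are illustrative placeholders. The import skeleton is borrowed from the sample script at the end of this page.

```
import s3_csv_raw from {
  type: "s3_ng";
  pattern: "s3://source_bucket/my_pattern*";
  key: "aws_key";
  format: "csv";
  header_row: true;
  preamble_rows: 2;
  postamble_rows: 1;
  delimiter: "|";
} using {
  filename: string;
  date: datetime;
  placement: string;
};
```

With these hypothetical values, a pipe-delimited file would have its header row, two leading rows, and one trailing row skipped. How preamble_rows and header_row are counted relative to each other is not specified on this page, so confirm against a test file.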
Encodings
By default, Switchboard expects files to be encoded in UTF-8. Encoding names are standard Java Charset names.
- To read a file encoded in Latin Alphabet No. 1, provide the encoding parameter in the import statement:
encoding: "ISO-8859-1";
Compression
Switchboard ingests files in gzip or zip compression formats.
- To ingest files in gzip format, provide the compression parameter in the import statement:
compression: "gzip";
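The encoding and compression parameters can appear in the same import statement. The sketch below assumes a gzip-compressed CSV file encoded in ISO-8859-1. The table name s3_latin1_gzip_raw, the bucket path, and the schema are placeholders for illustration only.

```
import s3_latin1_gzip_raw from {
  type: "s3_ng";
  pattern: "s3://source_bucket/my_pattern*";
  key: "aws_key";
  format: "csv";
  encoding: "ISO-8859-1";
  compression: "gzip";
} using {
  filename: string;
  date: datetime;
  device: string;
};
```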
Sample Switchboard Script
import s3_file_raw from {
  type: "s3_ng";
  regex: "s3://source_bucket/my_file_name.*";
  datetime_pattern: "*_YYYY-MM-DD_*";
  key: "aws_key";
  lookback_days: 1;
} using {
  filename: string;
  date: datetime;
  bx_pub_name: string;
  placement: string;
  device: string;
  demand_source: string;
};
For Parquet files, use the raw version of this downloader. The raw version does not support the same format, encoding, and compression parameters as the connector above. It supports only the parquet format, with no other options.
download t from {
  type: "s3:raw";
  pattern: "s3://mybucket/somepath/*.parquet";
  datetime_pattern: "*file-YYYY-MM-DD.parquet";
  format: "parquet";
} using {
  idcol: integer;
  *
};