AWS Data Pipeline

Table of Contents

Topics

Objects

Everything in AWS Data Pipeline(DP) is an object with some fields.

Each object has id and type. id is the unique identifier within a DP. type is for specifying the feature of the object.

Schedule

{
  "type": "Schedule",
  "id": "Daily",
  "startDateTime":"2016-10-07T01:00:00",
  "period": "1 day"
}

Basic fields of a Schedule object are as above.

There is the backfill feature, which fills the task of the past.

If you deploy a pipeline which has the daily schedule's startDateTime specifies 10 days ago, 10 jobs will be triggered at the time you firstly deploy.

Every other object should specify its schedule field like this:

{
  "type": "EmrCluster",
  "id": "ParquetryCluster",

  "masterInstanceType": "m1.large",
  "coreInstanceType": "m1.large",
  "coreInstanceCount": "2",
  "releaseLabel": "emr-5.0.0",
  "applications": ["spark"],
  "terminateAfter": "3 hours",

  "scheduleType": "cron",
  "schedule": {"ref": "Daily"},
}

Note

"schedule": {"ref": "<schedule-object-id>"}

We can reference other objects by putting an object with ref field whose value is the target object's id.

There is scehduleType field. It must be one of cron or ondemand, timeseries. Mostly just put cron, and it will work as expected.

Resource

Objects which have its type value as EmrCluster or Ec2Resource are Resources. They are nodes which activities run on. They are created at every scheduled moment. They are terminated when every scheduled activity ends.

Activity

Activities are the tasks which we want to be done. Activities must specify both schedule and resource:

{
  "type": "EmrActivity",
  "id": "MyJob",
  "schedule": {"ref": "Daily"},
  "runsOn": {"ref": "ParquetryCluster"},
}

The Default Object

{
  "id": "Default",
  "failureAndRerunMode": "cascade",

  "keyPair": "my-key-name",
  "pipelineLogUri": "s3://datapipeline.k.nexon.com/Parquetry/",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",

  "scheduleType": "cron",
  "schedule": {"ref": "Daily"}
}

Default object is the object whose id is Default.

It's a special object. It's the only object which doesn't have type field. Other objects inherit Default object's fields.

We can use Default object to specify common fields like schedule, role, etc.

Custom Fields

{
  "id": "MyObject",
  "type": "Schedule",
  ...
  "my_something": "hello, world",
  "mySomething": "Good bye"
}

Within an object, fields prefixed with my are custom fields.

Referencing Values

{
  "id": "A",
  "type": "Ec2Resource",
  "my_value": "#{id}"
}

With enclosing field names with #{}, we can reference its values. There are also some functions and operators to tweak the values.