Releases: GoogleCloudPlatform/DataflowJavaSDK
Version 2.1.0
Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam
2.1.0 release notes for additional change information.
Issues
Known issue: When running in batch mode, Gauge metrics are not reported.
Updates and improvements
- Added `Metrics` support for `DataflowRunner` in streaming mode (see the sketch after this list).
- Added `OnTimeBehavior` to `WindowingStrategy` to control emitting of `ON_TIME` panes.
- Added default file name policy for windowed `FileBasedSink`s which consume windowed input.
- Fixed an issue in which processing time timers for expired windows were ignored.
- Fixed an issue in which `DatastoreIO` failed to make progress when Datastore was slow to respond.
- Fixed an issue in which `bzip2` files were being partially read; added support for concatenated `bzip2` files.
- Addressed several stability, performance, and documentation issues.
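With the new `Metrics` support, a user-defined metric can be reported from inside a `DoFn`. A minimal sketch (the counter name is illustrative):

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Minimal sketch: count elements as they are processed. As of 2.1.0, counters
// like this are reported by the DataflowRunner in streaming mode.
class CountingFn extends DoFn<String, String> {
  private final Counter processed = Metrics.counter(CountingFn.class, "elements-processed");

  @ProcessElement
  public void processElement(ProcessContext c) {
    processed.inc();
    c.output(c.element());
  }
}
```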
Version 1.9.1
- Fixed an issue in which Dataflow jobs that read from `CompressedSource`s with compression type set to `BZIP2` could potentially lose data during processing. For more information, see Issue #596.
Version 2.0.0
The Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Dataflow SDK for Java, based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.
Note for users upgrading from version 1.x
This is a new major version, and therefore comes with the following caveats:
- Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.
Updates and improvements since 2.0.0-beta3
Version 2.0.0 is based on a subset of Apache Beam 2.0.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Added new API in `BigQueryIO` for writing into multiple tables, possibly with different schemas, based on data. See `BigQueryIO.Write.to(SerializableFunction)` and `BigQueryIO.Write.to(DynamicDestinations)`, and the sketch after this list.
- Added new API for writing windowed and unbounded collections to `TextIO` and `AvroIO`. For example, see `TextIO.Write.withWindowedWrites()` and `TextIO.Write.withFilenamePolicy(FilenamePolicy)`.
- Added `TFRecordIO` to read and write TensorFlow TFRecord files.
- Added the ability to automatically register `CoderProvider`s in the default `CoderRegistry`. `CoderProvider`s are registered by a `ServiceLoader` via concrete implementations of a `CoderProviderRegistrar`.
- Changed order of parameters for `ParDo` with side inputs and outputs.
- Changed order of parameters for `MapElements` and `FlatMapElements` transforms when specifying an output type.
- Changed the pattern for reading and writing custom types to `PubsubIO` and `KafkaIO`.
- Changed the syntax for reading from and writing to `TextIO`, `AvroIO`, `TFRecordIO`, `KinesisIO`, and `BigQueryIO`.
- Changed syntax for configuring windowing parameters other than the `WindowFn` itself using the `Window` transform.
- Consolidated `XmlSource` and `XmlSink` into `XmlIO`.
- Renamed `CountingInput` to `GenerateSequence` and unified the syntax for producing bounded and unbounded sequences.
- Renamed `BoundedSource#splitIntoBundles` to `#split`.
- Renamed `UnboundedSource#generateInitialSplits` to `#split`.
- Output from `@StartBundle` is no longer possible. Instead of accepting a parameter of type `Context`, this method may optionally accept an argument of type `StartBundleContext` to access `PipelineOptions`.
- Output from `@FinishBundle` now always requires an explicit timestamp and window. Instead of accepting a parameter of type `Context`, this method may optionally accept an argument of type `FinishBundleContext` to access `PipelineOptions` and emit output to specific windows.
- `XmlIO` is no longer part of the SDK core. It must be added manually using the new `xml-io` package.
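As a sketch of the new multi-table write API, the `SerializableFunction` overload computes a destination table per element. The project, dataset, and `type` field below are illustrative, and schema and disposition settings are omitted for brevity:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Given a PCollection<TableRow> named rows, route each element to a per-type table.
rows.apply(BigQueryIO.writeTableRows()
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
      @Override
      public TableDestination apply(ValueInSingleWindow<TableRow> value) {
        String type = (String) value.getValue().get("type");  // illustrative field
        return new TableDestination(
            "my-project:my_dataset.events_" + type,  // table spec
            "Events of type " + type);               // table description
      }
    }));
```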
More information
Please see Cloud Dataflow documentation and release notes for version 2.0.
Version 2.0.0-beta3
The Dataflow SDK for Java 2.0.0-beta3 is the third 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It's intended to let you begin the transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
Updates since 2.0.0-beta2
Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Changed `TextIO` to only operate on strings.
- Changed `KafkaIO` to specify type parameters explicitly.
- Renamed factory functions of `ToString`.
- Changed `Count`, `Latest`, `Sample`, and `SortValues` transforms.
- Renamed `Write.Bound` to `Write`.
- Renamed `Flatten` transform classes.
- Split the `GroupByKey.create` method into `create` and `createWithFewKeys` methods.
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Version 2.0.0-beta2
The Dataflow SDK for Java 2.0.0-beta2 is the second 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It's intended to let you begin the transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
Updates since 2.0.0-beta1
This release is based on a subset of Apache Beam 0.5.0. The most relevant changes in this release for Cloud Dataflow customers include:
- `PubsubIO` functionality: `Read` and `Write` now provide access to Cloud Pub/Sub message attributes.
- New scenario: support for stateful pipelines via the new State API.
- New scenario: support for timers via the new Timer API (limited to the `DirectRunner` in this release).
- Change to `PubsubIO` construction: `PubsubIO.Read` and `PubsubIO.Write` must now be instantiated using `PubsubIO.<T>read()` and `PubsubIO.<T>write()` instead of static factory methods such as `PubsubIO.Read.topic(String)`. Specifying a coder via `.withCoder(Coder)` for the output type is required for `Read`. Specifying a coder for the input type or specifying a format function via `.withAttributes(SimpleFunction<T, PubsubMessage>)` is required for `Write`. (See the sketch after this list.)
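A minimal sketch of the new construction pattern, assuming the Beam 0.5.0-era instance-method configuration implied above; the package location, topic path, and the `pipeline` variable are illustrative:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;  // package location assumed
import org.apache.beam.sdk.values.PCollection;

// Instantiate via read() and configure with instance methods; the coder is
// required for Read under the new pattern.
PCollection<String> messages =
    pipeline.apply(PubsubIO.<String>read()
        .topic("projects/my-project/topics/my-topic")
        .withCoder(StringUtf8Coder.of()));
```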
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Version 2.0.0-beta1
The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It's intended to let you begin the transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
New Functionality
This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:
- Improvements to compression: `CompressedSource` supports reading ZIP-compressed files. `TextIO.Write` and `AvroIO.Write` support compressed output.
- `AvroIO` functionality: `Write` supports the addition of custom user metadata.
- `BigQueryIO` functionality: `Write` now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, enabling it to handle very large datasets.
- `BigtableIO` functionality: `Write` supports unbounded `PCollection`s and so can be used in the `DataflowRunner` in streaming mode.
For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.
Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.
Major changes from Dataflow SDK 1.x for Java
This release has a number of significant changes from the 1.x series of releases.
All users need to read and understand these changes in order to upgrade to 2.x versions.
Package rename and restructuring
As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.
- Renamed `com.google.cloud.dataflow` to `org.apache.beam`

  The SDK is now declared in the package `org.apache.beam` instead of `com.google.cloud.dataflow`. You need to update all of your import statements with this change (see the sketch after this list).

- New subpackages: `runners.dataflow`, `runners.direct`, and `io.gcp`

  Runners have been reorganized into their own packages, so many things from `com.google.cloud.dataflow.sdk.runners` have moved into either `org.apache.beam.runners.direct` or `org.apache.beam.runners.dataflow`.

  Pipeline options specific to running on the Dataflow service have moved from `com.google.cloud.dataflow.sdk.options` to `org.apache.beam.runners.dataflow.options`.

  Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, `BigQueryIO` has moved from `com.google.cloud.dataflow.sdk.io` to `org.apache.beam.sdk.io.gcp.bigquery`.

  Most IDEs will be able to help identify the new locations. To verify the new location for specific files, you can use `t` to search the code on GitHub. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.
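For example, the import changes for a typical pipeline might look like this (the classes shown are chosen for illustration):

```java
// Dataflow SDK 1.x for Java:
// import com.google.cloud.dataflow.sdk.Pipeline;
// import com.google.cloud.dataflow.sdk.io.BigQueryIO;
// import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;

// Dataflow SDK 2.x for Java:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
```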
Runners
- Removed `Pipeline` from Runner names

  The names of all Runners have been shortened by removing `Pipeline` from the names. For example, `DirectPipelineRunner` is now `DirectRunner`, and `DataflowPipelineRunner` is now `DataflowRunner`.

- Require setting `--tempLocation` to a Google Cloud Storage path

  Instead of allowing you to specify only one of `--stagingLocation` or `--tempLocation` and having Dataflow infer the other, the Dataflow service now requires `--gcpTempLocation` to be set to a Google Cloud Storage path, though it can be inferred from the more general `--tempLocation`. Unless overridden, this will also be used for the `--stagingLocation`.

- `DirectRunner` supports unbounded `PCollection`s

  The `DirectRunner` continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs. It more closely aligns with the documented Beam model, and may (correctly) cause additional unit test failures.

  As this functionality is now in the `DirectRunner`, the `InProcessPipelineRunner` (Dataflow SDK 1.6+ for Java) has been removed.

- Replaced `BlockingDataflowPipelineRunner` with `PipelineResult.waitUntilFinish()`

  The `BlockingDataflowPipelineRunner` has been removed. If your code programmatically expects to run a pipeline and wait until it has terminated, it should use the `DataflowRunner` and explicitly call `pipeline.run().waitUntilFinish()`, as shown in the sketch after this list.

  If you used `--runner BlockingDataflowPipelineRunner` on the command line to interactively induce your main program to block until the pipeline has terminated, then this is a concern of the main program; it should provide an option such as `--blockOnRun` that will induce it to call `waitUntilFinish()`.
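A minimal sketch of the replacement pattern, assuming an illustrative main class and that the runner is selected via `--runner=DataflowRunner` on the command line:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WaitingMain {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... construct the pipeline here ...
    PipelineResult result = p.run();  // submits the job and returns immediately
    result.waitUntilFinish();         // blocks until the pipeline terminates
  }
}
```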
ParDo and DoFn
- `DoFn`s use annotations instead of method overrides

  In order to allow for more flexibility and customization, `DoFn` now uses method annotations to customize processing instead of requiring users to override specific methods.

  The differences between the new `DoFn` and the old are demonstrated in the following code sample. Previously (with Dataflow SDK 1.x for Java), your code would look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @Override
    public void processElement(ProcessContext c) { … }
  }
  ```

  Now (with Dataflow SDK 2.x for Java), your code will look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @ProcessElement  // <-- This is the only difference
    public void processElement(ProcessContext c) { … }
  }
  ```

  If your `DoFn` accessed `ProcessContext#window()`, then there is a further change. Instead of this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
    @Override
    public void processElement(ProcessContext c) {
      … (MyWindowType) c.window() …
    }
  }
  ```

  you will write this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  }
  ```

  or:

  ```java
  return new DoFn<Foo, Baz>() {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  };
  ```

  The runtime will automatically provide the window to your `DoFn`.
- `DoFn`s are reused across multiple bundles

  In order to allow for performance improvements, the same `DoFn` may now be reused to process multiple bundles of elements, instead of guaranteeing a fresh instance per bundle. Any `DoFn` that keeps local state (e.g. instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.

  To manage the lifecycle, new `@Setup` and `@Teardown` methods have been added. The full lifecycle is as follows, while a failure may truncate the lifecycle at any point (see the sketch after the note below):

  - `@Setup`: Per-instance initialization of the `DoFn`, such as opening reusable connections.
  - Any number of the sequence:
    - `@StartBundle`: Per-bundle initialization, such as resetting the state of the `DoFn`.
    - `@ProcessElement`: The usual element processing.
    - `@FinishBundle`: Per-bundle concluding steps, such as flushing side effects.
  - `@Teardown`: Per-instance teardown of the resources held by the `DoFn`, such as closing reusable connections.
Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.
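A minimal sketch of these lifecycle hooks; the buffering and logging below are illustrative placeholders for real resource management:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

class LifecycleFn extends DoFn<String, String> {
  private transient List<String> bundleBuffer;  // per-bundle state

  @Setup
  public void setup() {
    // Per-instance initialization, e.g. opening a reusable connection.
  }

  @StartBundle
  public void startBundle() {
    // Instances are reused across bundles, so reset per-bundle state here.
    bundleBuffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    bundleBuffer.add(c.element());
    c.output(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Conclude the bundle, e.g. flush buffered side effects.
    System.out.println("flushing " + bundleBuffer.size() + " elements");
  }

  @Teardown
  public void teardown() {
    // Per-instance teardown, e.g. closing the reusable connection.
  }
}
```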
PTransforms
- Removed `.named()`

  Removed the `.named()` methods from `PTransform`s and sub-classes. Instead use `PCollection.apply("name", PTransform)`.

- Renamed `PTransform.apply()` to `PTransform.expand()`

  `PTransform.apply()` was renamed to `PTransform.expand()` to avoid confusion with `PCollection.apply()`. All user-written composite transforms will need to rename the overridden `apply()` method to `expand()`, as sketched below. There is no change to how pipelines are constructed.
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Version 1.9.0
- Added the `ValueProvider` interface for use in pipeline options. Making an option of type `ValueProvider<T>` instead of `T` allows its value to be supplied at runtime (rather than pipeline construction time) and enables Dataflow templates. Support for `ValueProvider` has been added to `TextIO`, `PubSubIO`, and `BigQueryIO`, and can be added to arbitrary `PTransform`s as well (see the sketch after this list).
- Added the ability to automatically save profiling information to Google Cloud Storage using the `--saveProfilesToGcs` pipeline option. For more information on profiling pipelines executed by the `DataflowPipelineRunner`, see issue #72.
- Deprecated the `--enableProfilingAgent` pipeline option that saved profiles to the individual worker disks. For more information on profiling pipelines executed by the `DataflowPipelineRunner`, see issue #72.
- Changed `FileBasedSource` to throw an exception when reading from a file pattern that has no matches. Pipelines will now fail at runtime rather than silently reading no data in this case. This change affects `TextIO.Read` or `AvroIO.Read` when configured `withoutValidation`.
- Enhanced `Coder` validation in the `DirectPipelineRunner` to catch coders that cannot properly encode and decode their input.
- Improved display data throughout core transforms, including properly handling arrays in `PipelineOptions`.
- Improved performance for pipelines using the `DataflowPipelineRunner` in streaming mode.
- Improved scalability of the `InProcessPipelineRunner`, enabling testing with larger datasets.
- Improved the cleanup of temporary files created by `TextIO`, `AvroIO`, and other `FileBasedSource` implementations.
- Modified the default version range in the archetypes to exclude beta releases of Dataflow SDK for Java, version 2.0.0 and later.
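A minimal sketch of a templatable option using `ValueProvider` (the option name is illustrative):

```java
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.ValueProvider;

// Declaring the option as ValueProvider<String> (rather than String) lets its
// value be supplied at runtime, which is what enables Dataflow templates.
public interface MyOptions extends PipelineOptions {
  @Description("Path of the file to read from")
  ValueProvider<String> getInputFile();
  void setInputFile(ValueProvider<String> value);
}
```

A source with `ValueProvider` support can then consume the option directly, e.g. `TextIO.Read.from(options.getInputFile())`, per this release's `TextIO` support.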
Version 1.8.1
- Improved the performance of bounded side inputs in the `DataflowPipelineRunner`.
Version 1.8.0
- Added support to `BigQueryIO.Read` for queries in the new BigQuery Standard SQL dialect using `.withStandardSQL()` (see the sketch after this list).
- Added support in `BigQueryIO` for the new `BYTES`, `TIME`, `DATE`, and `DATETIME` types.
- Added support to `BigtableIO.Read` for reading from a restricted key range using `.withKeyRange(ByteKeyRange)`.
- Improved initial splitting of large uncompressed files in `CompressedSource`, leading to better performance when executing batch pipelines that use `TextIO.Read` on the Cloud Dataflow service.
- Fixed a performance regression when using `BigQueryIO.Write` in streaming mode.
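A minimal sketch of the Standard SQL support, using the `.withStandardSQL()` method named above; the query and the `pipeline` variable are illustrative:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Standard SQL uses backtick-quoted table names rather than []-quoted ones.
PCollection<TableRow> rows = pipeline.apply(
    BigQueryIO.Read
        .fromQuery("SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`")
        .withStandardSQL());
```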
Version 1.7.0
- Added support for Cloud Datastore API v1 in the new `com.google.cloud.dataflow.sdk.io.datastore.DatastoreIO`. Deprecated the old `DatastoreIO` class that supported only the deprecated Cloud Datastore API v1beta2.
- Improved `DatastoreIO.Read` to support dynamic work rebalancing, and added an option to control the number of query splits using `withNumQuerySplits`.
- Improved `DatastoreIO.Write` to work with an unbounded `PCollection`, supporting writing to Cloud Datastore when using the `DataflowPipelineRunner` in streaming mode.
- Added the ability to delete Cloud Datastore `Entity` objects directly using `Datastore.v1().deleteEntity` or to delete entities by key using `Datastore.v1().deleteKey`.
- Added support for reading from a `BoundedSource` to the `DataflowPipelineRunner` in streaming mode. This enables the use of `TextIO.Read`, `AvroIO.Read`, and other bounded sources in these pipelines.
- Added support for optionally writing a header and/or footer to text files produced with `TextIO.Write` (see the sketch after this list).
- Added the ability to control the number of output shards produced when using a `Sink`.
- Added `TestStream` to enable testing of triggers with multiple panes and late data with the `InProcessPipelineRunner`.
- Added the ability to control the rate at which `UnboundedCountingInput` produces elements using `withRate(long, Duration)`.
- Improved performance and stability for pipelines using the `DataflowPipelineRunner` in streaming mode.
- To support `TestStream`, reimplemented `DataflowAssert` to use `GroupByKey` instead of side inputs to check assertions. This is an update-incompatible change to `DataflowAssert` for pipelines run on the `DataflowPipelineRunner` in streaming mode.
- Fixed an issue in which a `FileBasedSink` would produce no files when writing an empty `PCollection`.
- Fixed an issue in which `BigQueryIO.Read` could not query a table in a non-`US` region when using the `DirectPipelineRunner` or the `InProcessPipelineRunner`.
- Fixed an issue in which the combination of timestamps near the end of the global window and a large `allowedLateness` could cause an `IllegalStateException` for pipelines run in the `DirectPipelineRunner`.
- Fixed a `NullPointerException` that could be thrown during pipeline submission when using an `AfterWatermark` trigger with no late firings.
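A minimal sketch of the header/footer support added to `TextIO.Write`; the exact setter names (`withHeader`/`withFooter`) are assumed here, and `lines` stands for an existing `PCollection<String>`:

```java
import com.google.cloud.dataflow.sdk.io.TextIO;

// Writes a header at the top and a footer at the bottom of each output shard.
lines.apply(TextIO.Write
    .to("gs://my-bucket/output/results")  // illustrative output prefix
    .withHeader("name,score")
    .withFooter("-- end of shard --"));
```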