In this post I’ll review StreamSets Data Collector, one of the most popular tools for building smart data pipelines for streaming, batch and change data capture (CDC), which lets you move data around in near real time.
First I’d like to point out that the whole review is my own personal opinion based on several years of developing, designing and administering data pipelines based on StreamSets technology.
Here I’d like to share my experience of using its products.
Essentially, StreamSets has a SaaS offering for those of you who are fully in the Cloud, and an on-prem option for the majority of today’s enterprises.
On the following link you can find what StreamSets has to offer:
Although there are many players on the market that offer essentially the same thing, StreamSets’ combination of ease of use, a large number of connectors, open source policy and flexible licensing was a winning combination at the time of evaluation (a couple of years ago), and it attracted a lot of enterprises to give it a try.
Since it’s an open source project, the platform is extensible (you can create your own stages).
It’s also one of the most popular options for getting data out of Oracle in real time by using CDC (change data capture).
With StreamSets Data Collector you can create data pipelines quickly, in most cases without writing a single line of code, and if something is not covered, you can always extend the existing functionality by adding a Groovy, Jython or JavaScript stage.
You will still need to design your pipeline in the right way, and you definitely need expertise in the technologies you are connecting (e.g. Oracle, Kafka, Postgres…), especially if you need high performance and low impact on data sources, but essentially you can build your pipelines really fast.
Here is an example of moving data in real time from an Oracle database into a Kafka topic, and from there wherever you want.
StreamSets Data Collector allows you to be self-contained, delivering data on-prem and in the Cloud (with one or more Cloud providers) at the same time (a hybrid architecture, which is the dominant architectural style).
In the evaluation phase, I tried or inspected some other products such as Oracle Golden Gate, Striim and Apache NiFi, but at that time StreamSets was the best option for our use case.
Technically, Oracle Golden Gate would be the first choice since it has far superior CDC technology when compared to StreamSets or any other similar product, but it was out of our budget due to Oracle’s licensing policy.
Striim, another competitor in the same market space, at the time of evaluation used the same technology for Oracle CDC as StreamSets, but had a slightly more restrictive licensing policy and is not an open source project.
On the other hand, Apache NiFi is an open source project with the most flexible licensing policy (free) and many connectors available.
Unfortunately, Apache NiFi supports CDC only for MySQL databases.
For all other databases (Oracle, MS SQL, Postgres…) you need to use a workaround and change the design of your pipeline.
I’ll write about Apache NiFi in the future, as it is gaining a lot of traction and is definitely worth trying out.
All in all, several years ago when I was evaluating products from several vendors, StreamSets Data Collector was a clear winner for our use case.
After a few years of extensive usage, I’m generally satisfied with Data Collector, but there are some things I don’t like that will hopefully be improved in the future.
1. Playing a ping-pong game with StreamSets support
When I find a bug and open a ticket, I too often get the reply that what I’ve found is actually a feature request, not a bug.
This usually ends up with a lot of frustration and wasted time.
One such example is committed offsets, an extremely important functionality that allows you to start the pipeline from a specific point (similar to point-in-time recovery in the database world).
This functionality is exposed for some data sources (e.g. Oracle, MySQL, MS SQL…), but it’s not implemented for some of the most popular connectors such as Postgres or Kafka.
In the following picture you can see what it looks like when the offset is populated (in this example, the Oracle stage)
and below you can see what it looks like when it is not populated at all (example taken from the Kafka stage).
Although this issue can be easily fixed (just one function call to Kafka, or one SELECT against pg_replication_slots in the case of Postgres), months and years have passed, and the bug (or “feature request”) is still present.
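To make the Postgres case concrete, here is a minimal Python sketch of the kind of lookup I have in mind. The pg_replication_slots view and its slot_name and confirmed_flush_lsn columns are standard PostgreSQL (the latter is populated for logical slots). The constant and function names, the %s placeholder style (as used by psycopg2-style drivers) and the idea of showing this value as the committed offset are my own assumptions, not StreamSets code.

```python
# Sketch only: the names below are hypothetical, not part of StreamSets or PostgreSQL.

# One SELECT is enough to read the position a logical replication slot has confirmed.
# The %s placeholder assumes a psycopg2-style driver.
SLOT_OFFSET_QUERY = (
    "SELECT slot_name, confirmed_flush_lsn "
    "FROM pg_replication_slots "
    "WHERE slot_name = %s"
)


def lsn_to_int(lsn):
    """Convert a textual LSN ('high/low', both hexadecimal) to a single integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_offset(slot_name, lsn):
    """Build a display string for one row returned by SLOT_OFFSET_QUERY."""
    return "{}: {} ({})".format(slot_name, lsn, lsn_to_int(lsn))
```

With an open driver connection, executing SLOT_OFFSET_QUERY with the slot name as its single parameter and fetching one row would give the two values to pass to format_offset.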
2. Capacity to resolve bugs in a reasonable amount of time
This is something I’ve seen very often when dealing with smaller, fast-growing companies.
A couple of years ago, when evaluating different products, I found a bug in the Oracle CDC connector which shows up when only a PK (primary key) column has been updated.
In such cases you’ll get an error:
JDBC_90 – Record doesn’t have any columns for table <table_name>
When I opened a ticket, it started with questions about why I needed to update a primary key column, and continued with claims that what I’d found was a feature request, not a bug (see point 1).
I put a lot of effort into debugging the pipeline in question and the Oracle database, just to provide enough evidence to move forward.
I even proved that Oracle in certain cases updates the PK column internally, which is the argument for why that functionality is a must-have; otherwise the whole Oracle connector would be useless.
The point is that the bug is still not fixed (a fix is expected later this year).
If updates on the PK column are not easy to fix, I have another example that could be fixed in a matter of minutes.
While upgrading Data Collector from 3.19 to 3.22, I got the following error:
JDBC_610 – Error while querying the database incarnation history: ORA-00942: table or view does not exist.
After investigating, I found that this is a documentation bug (a privilege is missing, and it’s not documented in the Upgrade part of the documentation).
To fix this bug, all it takes is to update one HTML page.
More than two months have passed since I raised the ticket, but the documentation error is still there.
3. Lack of important functionalities
Although Data Collector is an extremely flexible and extensible platform, some important functionalities are still missing.
In the previous point I mentioned a bug in the Oracle connector.
My advice when evaluating different software solutions is to do a deep-dive analysis of the most important connectors.
If the majority of your data is in an Oracle database, you need to spend the majority of your time checking the Oracle connector.
Discovering that your Oracle connector doesn’t support PK updates will be a huge surprise once you start building pipelines (Striim, Golden Gate and many others, for example, have that functionality implemented).
Another very common request from developers is to extract both new and old values when performing CDC.
In such cases you need to check whether UNDO data is also available from the Oracle connector.
For the StreamSets Oracle connector, only new values (from REDO) are available, while old values (from UNDO) are not, which is a pity since they are exposed in the V$LOGMNR_CONTENTS view, from which Oracle LogMiner (the Oracle tool/package used by StreamSets Data Collector to perform CDC on an Oracle database) reads data.
If that functionality is important to you, again, you should look at a different product that already has it.
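For reference, the Python sketch below shows roughly what a query returning both new and old values from LogMiner could look like. V$LOGMNR_CONTENTS and its SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO and SQL_UNDO columns are real; the function name and bind names are hypothetical, a LogMiner session must already have been started with DBMS_LOGMNR.START_LOGMNR, and this is not the query Data Collector itself runs.

```python
def build_logminer_query(owner, table):
    """Return (sql, binds) selecting both redo and undo SQL for one table.

    Hypothetical helper for illustration; only the view and column names
    come from Oracle.
    """
    sql = (
        "SELECT scn, operation, sql_redo, sql_undo "
        "FROM v$logmnr_contents "
        "WHERE seg_owner = :seg_owner AND table_name = :tab_name "
        "AND operation IN ('INSERT', 'UPDATE', 'DELETE')"
    )
    # Oracle stores unquoted identifiers in upper case.
    binds = {"seg_owner": owner.upper(), "tab_name": table.upper()}
    return sql, binds


# Example with a made-up schema and table name.
sql, binds = build_logminer_query("hr", "employees")
```

SQL_UNDO holds the statement that would reverse the change, so for an UPDATE it carries the old values that SQL_REDO alone does not give you.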
There might be restrictions you are not aware of.
You should ask a lot of questions: whether Oracle RAC (Real Application Clusters) is supported, what about multitenant support, which versions (or editions) are supported, etc.
The best way to check whether you’ve asked all the important questions is to take a look at a competitor that has it all.
In the case of Oracle CDC that is definitely Oracle Golden Gate, and here is a link to a great document that can help you articulate your questions:
Check the chapter “Challenges with LogMiner”, since all vendors except Oracle Golden Gate (which has its own native CDC API) use LogMiner for CDC on Oracle databases.
All in all, it is not enough to check whether a product supports Oracle CDC.
You need to dig deeper to check what is covered, what the performance impact is, what technology is used, and many other criteria.
I might write a blog post about the criteria you need to check to get the best solution for your use case.
4. SDC 4 and later versions are no longer open source
At the time of evaluation, StreamSets Data Collector was an open source project, and that fact was one of its major advantages over its competitors.
It allowed you not only to inspect the code to find bugs, but also to extend the whole platform by creating new stages, which explains why the open source movement is so important (and one of the main selection criteria).
While the whole industry is moving towards open source, StreamSets took the opposite approach: from open source to classic closed source.
It is interesting to note that Data Collector was an open source project before version 4, but the project was always under StreamSets governance (it was never under Apache Foundation control), which is a classic approach when you want to establish a customer base first and charge them later.
- with Data Collector 3, if you don’t need support, you could run production pipelines totally free (only server registration is required) with all connectors (Oracle, Postgres, Kafka…)
- with Data Collector 4 (closed source) you must purchase a commercial license to run production pipelines
- with the upcoming version 5, you must purchase a commercial license for Data Collector and a license for Control Hub to run production pipelines
5. Connectors/stages are frequently removed/deprecated
This topic is about the product statement of direction and why it is important.
Large software vendors provide a clear upgrade path, and you won’t experience dramatic changes when upgrading to a new release.
A new release comes with a lot of new functionality, but the old functionality stays in place, and when something is deprecated it can take years or even decades until it is finally removed from the product.
This is how large software vendors protect customers’ investment in their product.
With smaller vendors, customers are more exposed to sudden changes, and often no clear upgrade path is provided.
In the case of StreamSets Data Collector 4, there are many connectors such as Flume/Value replacer, Spark evaluator, SDC RPC, Kafka consumer etc. which are now deprecated and will be removed in future releases.
Due to frequent updates, you don’t have years or a decade to redesign or redevelop your pipelines when some component is deprecated, only months (a year at most).
This means you need to spend man-days to redesign, redevelop, build, retest and redeploy pipelines, which adds significant extra maintenance costs (have you heard of the term TCO, total cost of ownership?) just to get essentially the same functionality you had before.
For some components such as Kafka consumer, you have a clear upgrade path (Kafka Multitopic consumer), but if you are using one of the other components mentioned above, you are out of luck.
Here is an example from real life.
I’ve been using the SDC RPC component extensively; it is a convenient way of recording errors in the pipeline and propagating those errors into Kafka and relational databases.
Since SDC RPC is one of the deprecated components for which there is no upgrade path, I need to reinvest my time and repeat the whole development cycle to get functionality I already have, instead of investing that time in building new pipelines.
I often hear the argument that I don’t need to upgrade the software in such a situation.
But if you stay with the current release, you’ll lose support for your product within a year and you won’t be able to open a ticket (there is no such thing as LTS, Long Term Support, versions in the StreamSets release cycle).
Due to frequent releases, there is no guarantee that components you are using in your pipelines now won’t be dropped within a year, which is why you need to plan extra maintenance costs for pipeline redesign in advance.
The best you can do is to create pipelines with minimal logic inside, pipelines that can be easily replaced, upgraded or migrated to some other software in case the technology you use is dropped completely.
6. Data Collector user interface is deprecated
Data Collector is the most important product in the StreamSets offering.
It started as an open source product, but now it’s a classic commercial product (see point 4).
Currently Data Collector (SDC) can be used not only to run, but also to design, develop, deploy, monitor and debug data pipelines.
It is a really perfect, self-contained, simple to use and maintain, stable and resilient application for real-time data pipelines.
Unfortunately, this won’t be the case any longer.
My guess is that StreamSets wants to boost sales of another product, Control Hub, and for that reason they dropped support for the user interface, which will prevent you from developing pipelines without Control Hub from the next major release of Data Collector, version 5.
With Control Hub in place, a once perfect solution becomes more fragile and exposed to many types of outage, because the whole architecture is now complicated and difficult to maintain.
Namely, Control Hub needs to communicate with all Data Collectors (or Transformers), so any network issue between components will affect communication between Control Hub and Data Collector.
Furthermore, with Control Hub you need to maintain (back up, patch, restore etc.) 14 new Postgres or MySQL databases (yes, you read that correctly: 14 new databases), although Data Collector itself has everything it needs without all those databases in place.
It is also required to create one instance of InfluxDB (a time series database), which adds an additional burden to the whole picture, since your DBA team probably won’t be familiar with that technology (how to back up, maintain, restore, patch, tune, automate…).
To perform all development tasks (design/develop/build/deploy/monitor/debug pipelines):
- before Data Collector 5, you have only one component/moving part (Data Collector itself)
- with the upcoming release of Data Collector 5, you will have 17 moving parts to perform the same tasks:
  - 1 Data Collector
  - 1 Control Hub
  - 14 Postgres or MySQL databases
  - 1 InfluxDB database
A possible solution is to use the DataOps platform instead, which runs Control Hub in the Cloud (all three major cloud vendors are supported) and is maintained by StreamSets, but that raises many questions, starting with the even greater complexity and fragility of such a hybrid architecture (Control Hub in the Cloud while Data Collector runs on-prem), data sovereignty, security of the hybrid architecture, dependency (a lesson learned from the current global health crisis, where everyone wants to be self-contained), network latency, etc.
Although it lost some of the advantages it once had, StreamSets Data Collector is still a good choice for DataOps.
It is no longer open source, and it has lost much of its price advantage over its competitors along with the simplicity it once had, but it’s still a valid option if you need an on-prem or SaaS DataOps solution.
I’ll write more about StreamSets Data Collector in the future, since I’ve solved numerous issues and have a lot of things to share if you want to achieve performance, robustness and flexibility.