December 13, 2016
If your company is preparing to take on real-time data analytics, prepare for some significant changes in the realm of network administration. Real-time analytics delivers powerful BI and insight, but it also brings considerable challenges to the network and compute infrastructure. It is valuable for evaluating website traffic, social media activity, dynamic marketing and much more.
There is no shortage of platforms for taking on real-time analytics, also called data streaming. Most are part of the Hadoop ecosystem, which you're probably familiar with if big data and analytics are already a part of your organization's operations. But Hadoop itself was designed around batch processing and isn't built for real-time analytics, which is why these streaming platforms exist. There are five major players in this arena that network administrators and other IT professionals need to be aware of.
Spark has set the real-time data streaming world on fire. It's probably one of the most popular products out there, meaning you can get a lot of help, insight, and advice from the open source community forums. Apache has thriving developer groups, and the members are quite enthusiastic about helping newcomers and answering their questions.
Spark is probably the best known of all streaming analytics platforms. This open source framework runs in memory and across clusters. It works with the Hadoop ecosystem, but you do not need Hadoop in order to use Apache Spark. Spark can run on top of Hadoop YARN, and it can read streaming data directly from HDFS. Spark is notable for its in-memory processing capabilities, graph processing, and its promise in the realm of machine learning. It is in use at many notable companies, including Yahoo!, Intel, and Groupon. The network administrator will appreciate the relative leanness of Spark.
Almost as popular as Apache Spark, Storm is another open source, real-time analytics platform that processes streaming data and works with the Hadoop ecosystem. Storm is used for all sorts of real-time analytical operations, including machine learning and continuous computation. Perhaps the strongest selling point for Storm is that it is compatible with a wide variety of programming languages. Like Spark, Storm can run atop Hadoop YARN, and it can be used with other products, like Flume. It is already in use at organizations like Spotify, Yelp, and WebMD. If you are a network administrator who also does some coding, you'll appreciate the compatibility options that come with Apache Storm.
Far less well known than either Spark or Storm is Apache Samza. This is a distributed stream processing platform that is built on a couple of other popular Apache products: YARN and Kafka. Samza delivers a simple, callback-based API that is much like the more familiar MapReduce. Samza comes with some perks like snapshot management and fault tolerance, and is quite a durable and scalable product to consider.
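To make "callback based" more concrete, here is a minimal sketch in plain Python. It is an analogy only, not Samza's actual Java API: the class PageViewCounterTask, the function run_stream, and the (page, user) message shape are all hypothetical names invented for illustration. The idea it shows is that you write a single process() callback, and a framework loop (here a simple stand-in for reading from a Kafka topic) calls it once for every incoming message.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical message shape for this sketch: (page, user).
Message = Tuple[str, str]


class PageViewCounterTask:
    """Hypothetical task: the framework calls process() once per message."""

    def __init__(self) -> None:
        # Local state that survives between callbacks.
        self.counts: Counter = Counter()

    def process(self, message: Message, emit: Callable[[Message], None]) -> None:
        page, _user = message
        self.counts[page] += 1
        # Emit the running total downstream, much as a reducer emits a result.
        emit((page, str(self.counts[page])))


def run_stream(task: PageViewCounterTask, messages: Iterable[Message]) -> List[Message]:
    """Stand-in for the framework loop that would normally read from a topic."""
    output: List[Message] = []
    for message in messages:
        task.process(message, output.append)
    return output


if __name__ == "__main__":
    events = [("/home", "alice"), ("/pricing", "bob"), ("/home", "carol")]
    for record in run_stream(PageViewCounterTask(), events):
        print(record)
```

The application code never owns the loop. It only reacts to messages, which is what lets the platform handle distribution, restarts, and state snapshots on its behalf.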
For network administrators working in companies that already use some of Amazon's other data-related products and services, Kinesis is a good option. It is also a real-time processing and data streaming platform, built to work in the cloud.
Kinesis is designed to work with other Amazon products and services through connectors, including S3, Redshift, and DynamoDB. It comes with a complete library, the Kinesis Client Library (KCL), which lets you develop applications that use streaming data to feed dashboards, deliver alerts, or even drive dynamic pricing.
Other Proprietary Solutions
Many software vendors are developing their own real-time analytical platforms, or are partnering with open source projects like those at Apache. If your vendor recommends a specific streaming platform, that's probably the best to use with their products. But many options, like Spark and Storm, work with numerous other products, so you aren't forced to use anything in particular.
If you're already partnering with specific vendors, your best bet is to find out what real-time platform they have developed or are partnering with. Some vendors are utilizing the Hadoop ecosystem and taking advantage of the open source communities like Apache, while others are engaging in their own development projects, hoping that their big data analytics platforms take off.
The most important part of choosing a real-time data analytics platform is looking for one that offers true enterprise-level features and functionality. The network administrator needs to be aware that almost all of these big data analytics products ship with most of their security features turned off by default. That means that security and other underlying features have to be deliberately turned on.
You can't always depend on your data analytics team to do that, because they're focused primarily on the analytics, whereas it's up to the network administrator and systems administrator to sweat over things like performance configurations and security settings.
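As one illustration of what "off by default" looks like in practice, the sketch below scans a Spark-style configuration for security switches that have not been enabled. The four property names are Spark security settings that default to false. Everything else is an assumption made for this example: the helper names parse_properties and disabled_security_flags are hypothetical, the parser only handles simple whitespace-separated "key value" lines of the kind found in spark-defaults.conf, and the list of flags is a starting point rather than a complete audit.

```python
from typing import Dict, List

# Spark security properties that default to false unless set explicitly.
# This list is illustrative, not exhaustive.
SECURITY_FLAGS = (
    "spark.authenticate",
    "spark.network.crypto.enabled",
    "spark.io.encryption.enabled",
    "spark.ssl.enabled",
)


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key value' lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def disabled_security_flags(props: Dict[str, str]) -> List[str]:
    """Return the flags that are missing or not set to 'true'."""
    return [
        flag
        for flag in SECURITY_FLAGS
        if props.get(flag, "false").lower() != "true"
    ]


if __name__ == "__main__":
    sample = "# demo config\nspark.authenticate true\nspark.ssl.enabled false\n"
    for flag in disabled_security_flags(parse_properties(sample)):
        print("still off:", flag)
```

A check like this can run as part of deployment so that a cluster with its defaults untouched is flagged before it goes into production. Other platforms have their own settings, so the list of flags would need to be adapted for each one.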
What does it take to build a network and IT department that can leverage all of the latest technologies and data analytics products with ease and finesse? It's all about building a "frictionless enterprise." You can learn all about the frictionless enterprise, and how to achieve this lofty goal in your organization, when you download our white paper: The Frictionless Enterprise.