Before you can process and analyze big data efficiently, you first have to collect it, and doing both well is the only way to become a truly data-driven company. So why is big data collection such a challenge for so many organizations?
Nearly all companies report that they’re collecting data in order to gain better business insights, but just 39 percent of them say that they’ve invested in a big data platform. Plus, the big data collection process is extremely time-consuming and error-prone when done manually. According to another survey, data scientists spend 19 percent of their time collecting data. That's equivalent to roughly one full day per workweek.
The good news is that you don’t have to go it alone: everything comes as a service these days, including big data collection. Right now, there’s no shortage of platforms that would be happy to help you centralize all your enterprise data under one roof (sometimes called a “data hub”).
But with so many options to choose from, how can you find the best big data collection platform for you? In this article, we’ll review 5 of the top platforms for collecting big data—including features, pros and cons, and user reviews—so that you can make the choice that’s right for your business.
1) S3/CloudFront Logging
Amazon S3 (Simple Storage Service) provides simple object storage as part of the Amazon Web Services public cloud platform. Amazon CloudFront, meanwhile, is a content delivery network from AWS that can integrate with S3 to deliver objects more quickly, securely, and reliably.
Using S3 and CloudFront in combination doesn’t give you a big data collection platform per se, but it might be close enough to work for your purposes. S3 and CloudFront provide automated weblogging for objects in the object store. This allows you to track events by sending an HTTP request for a 1x1 pixel image from a relevant S3 directory.
Making this request will generate a log entry in W3C format with all of the relevant HTTP request parameters: IP address, browser, date/time, etc. You can also pass extra session-level data, such as username or mouse position, via the query string. Place these images in directories named after the events they track, and voilà: you have a data collection service.
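To make the idea concrete, here is a minimal Python sketch of both halves: building a pixel URL that carries extra data in the query string, and reading the resulting log back. The CloudFront domain, the `pixel.gif` file name, the directory layout, and the sample log lines are all made up for illustration, and the field list is a shortened subset of what a real W3C-style log would contain.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical CloudFront domain; substitute your own distribution.
PIXEL_BASE = "https://d1234example.cloudfront.net"


def pixel_url(event: str, **params: str) -> str:
    """Build the URL of a 1x1 pixel stored in a directory named after the event."""
    query = urlencode(params)
    return f"{PIXEL_BASE}/{event}/pixel.gif" + (f"?{query}" if query else "")


def parse_w3c_log(lines):
    """Yield one dict per request from W3C extended log lines.

    Column names are taken from the '#Fields:' directive, so the parser
    does not hard-code any particular column order.
    """
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split("\t")))


# Illustrative log content (tab-separated values, invented for this example).
sample_log = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs-uri-query",
    "2024-01-15\t10:30:00\t203.0.113.7\tGET\t/signup/pixel.gif\t200\tusername=alice&x=120&y=340",
]

records = list(parse_w3c_log(sample_log))
# The directory name identifies the event; the query string holds the extras.
event = records[0]["cs-uri-stem"].split("/")[1]
extras = parse_qs(records[0]["cs-uri-query"])
```

On the page, you would embed the result of `pixel_url("signup", username="alice", x="120", y="340")` as an image source. Later, a batch job could run something like `parse_w3c_log` over the delivered log files to recover each event and its parameters.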
Sound like an interesting option? For more info, check out our blog post on how to implement S3/CloudFront logging.
2) Loggly
Loggly is an enterprise-class log management and analytics solution that runs in the cloud. The Loggly big data collection platform provides centralized logging, log searches, and log graphs, without the need to install the software on-premises.
The Loggly platform is built on open-source technologies such as Elasticsearch, Apache Lucene, and Apache Kafka. You can load data from Loggly into Integrate.io via S3, join it with other sources, and export it to your data warehouse.
The Loggly pricing model is as follows:
- Free tier: Centralized log management, automated summaries, and searches for a single user.
- Standard tier ($79/month): 3 users and source groups, email alerts, monitoring charts and dashboards, and direct access to Loggly support.
- Pro tier ($159/month): 5 users and source groups, email alerts and push notifications to platforms such as Slack and PagerDuty, API access, and archiving to Amazon S3.
- Enterprise tier ($279/month): Unlimited users and source groups, custom data retention periods, the Live Tail feature for near real-time monitoring, anomaly detection, and integrations with GitHub and JIRA.
On the business software review website G2, Loggly currently has an average rating of 4.2 out of 5 stars, based on 11 user reviews. IT administrator Joshua R. writes:
“I can log both my development and live servers to the same system, and either join or separate their content at will. The existing collection tools for Loggly are very strong, and the setup for those tools is incredibly easy. Beyond that, they offer the necessary endpoints and formats to receive logs from a wide variety of candidates.”
However, some reviews complain that the Loggly web interface is “slow” and “clunky,” while others dislike the inflexible pricing model.
3) Segment
Segment is a customer data platform (CDP) that collects user behavior data from mobile apps and the web. From there, you can send your data on to customer analytics services such as Google Analytics or Woopra, or even store it in Amazon Redshift for highly customized analyses. All you have to do on your end is add some logging lines to your code, and Segment will take care of the rest.
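As a rough idea of what those "logging lines" can look like, the Python sketch below builds a single tracking call using only the standard library. It assumes Segment's HTTP Tracking API (a `POST` to `https://api.segment.io/v1/track`, authenticated with your write key as the Basic auth username); treat the endpoint, payload fields, and auth scheme as assumptions to confirm against Segment's current documentation. The helper name, write key, user ID, and event are placeholders.

```python
import base64
import json
from urllib.request import Request

# Assumption: Segment's HTTP Tracking API endpoint; verify before relying on it.
SEGMENT_TRACK_URL = "https://api.segment.io/v1/track"


def build_track_request(write_key: str, user_id: str, event: str, properties: dict) -> Request:
    """Build (but do not send) a POST request describing one tracked event."""
    body = json.dumps({"userId": user_id, "event": event, "properties": properties})
    # HTTP Basic auth: the write key is the username and the password is empty.
    token = base64.b64encode(f"{write_key}:".encode()).decode()
    return Request(
        SEGMENT_TRACK_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder values for illustration only.
req = build_track_request("YOUR_WRITE_KEY", "user-123", "Signed Up", {"plan": "team"})
# Passing req to urllib.request.urlopen would actually send the event.
```

In practice, most teams would use one of Segment's official client libraries rather than hand-building requests. The sketch only shows how little information a single event needs: who did it, what happened, and any extra properties.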
The Segment big data collection platform offers three pricing tiers:
- Free tier: 1,000 visitors per month and two sources, with access to more than 300 integrations.
- Team tier (starting at $120/month): 10,000 visitors per month and unlimited sources.
- Business tier (custom pricing): Custom volume of visitors per month, dedicated support plans, and advanced roles and permissions.
Segment has received an average rating of 4.7 out of 5 stars on G2, earning it the title of “Leader” in the field of CDP software. Technical architect Julien W. writes:
"Segment is uniquely positioned as a customer data infrastructure and logging platform. It is well liked by both our development teams and marketing teams, and is the glue that holds everything together when it comes to leveraging our company's data."
Despite the generally positive reviews, common complaints about Segment include the tool’s high cost and the lack of certain features, which you would need programming skills to implement yourself.
The best part? Segment integrates with Integrate.io, so that you can process the data even further. For more information, check out our guide on processing your customer data with Segment using Integrate.io.
4) Rapid7 insightOps
Rapid7 insightOps (formerly known as Logentries) claims to offer “ridiculously easy log management.” Similar to Loggly, insightOps is a big data collection platform that provides central, cloud-based log management, including monitoring, tailing, and querying.
The features of insightOps include:
- Visual Search for automatically visualizing hidden insights and trends in your data.
- Live streaming, monitoring, and alerts.
- Custom dashboards and ad hoc reporting.
There are two tiers in the insightOps pricing model: standard and enterprise.
- Standard tier (starting at $48/month): 30-day retention period for data. Monthly pricing is based on data consumption, e.g. $48 for 30 gigabytes and $433 for 300 gigabytes. Larger amounts require a custom quote.
- Enterprise tier (custom pricing): 90-day retention period for data and automatic data normalization (converting unstructured data into structured data).
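Since standard-tier pricing scales with data volume, it can help to work out the effective cost per gigabyte. The short Python sketch below uses only the two price points quoted above; the variable and function names are our own, and actual pricing may differ by the time you read this.

```python
# Standard-tier price points quoted above: gigabytes -> dollars per month.
STANDARD_TIER = {30: 48, 300: 433}


def price_per_gb(gigabytes: int, monthly_price: float) -> float:
    """Effective monthly cost per gigabyte, rounded to the nearest cent."""
    return round(monthly_price / gigabytes, 2)


per_gb = {gb: price_per_gb(gb, usd) for gb, usd in STANDARD_TIER.items()}
# per_gb is {30: 1.6, 300: 1.44}

# Discount of the 300 GB plan relative to paying the 30 GB rate ten times over.
linear_cost = STANDARD_TIER[30] * 10  # 480
discount_pct = round((linear_cost - STANDARD_TIER[300]) / linear_cost * 100, 1)
# discount_pct is 9.8
```

In other words, going from 30 to 300 gigabytes lowers the effective rate from about $1.60 to about $1.44 per gigabyte, a volume discount of roughly 10 percent.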
insightOps also offers a 30-day free trial, so that you can thoroughly try the platform for yourself before you buy.
The insightOps platform currently has an average rating of 3.1 out of 5 stars on G2, based on 11 reviews. Site reliability engineer Micah H. writes that insightOps “hits the perfect sweet spot for us in terms of functionality, pricing, support, and management”:
“Its app is intuitive and its query language (LEQL) is powerful, yet easy to use for simple cases… Their support is quick and knowledgeable. I've personally filed over 15 support tickets and they have always promptly answered.”
However, reviewers also mention that insightOps has a few noteworthy drawbacks, including slow query times, a confusing user interface, and the lack of certain desirable features.
5) Papertrail
Papertrail is another cloud-hosted log management service that, like Loggly, is owned by the IT monitoring software company SolarWinds. Compared with Loggly, Papertrail is more of a no-frills service with less flashy dashboards. Here at Integrate.io, we use Papertrail for our own debugging purposes and integrate Papertrail with Integrate.io to process the logs.
The Papertrail big data collection platform offers a wide range of pricing tiers, from $7/month for 1 gigabyte to $230/month for 25 gigabytes. There’s also a free option with 50 megabytes per month, which lets you search through data from the past 48 hours and archive data for 7 days.
After 12 reviews on G2, Papertrail has an average rating of 4.3 out of 5 stars. Software architect Kevin V. writes:
“Papertrail handles log aggregation well and the price is great. It may not be as full-featured as other log aggregators… however, the price is substantially less. It gets the job done well for basic log aggregation. Papertrail also has a fantastic support team that has delivered every time.”
However, reviewers also mention disadvantages such as the fickle search functionality, an outdated user interface, and the lack of analytics capabilities.
Now that we’ve looked at 5 of the top big data collection platforms, it’s a little clearer which options are best for which use cases. For example:
- S3/CloudFront logging: Existing customers of Amazon Web Services.
- Segment: Users who need a big data collection platform specifically for customer data.
- Papertrail: Users who want a great deal of flexibility in the platform’s pricing model.
No matter which platform for collecting big data you choose, you need a way to get that information into a data warehouse for efficient processing and analysis. That’s exactly why we built the Integrate.io ETL platform for building powerful, robust data pipelines to your cloud data warehouse or data lake.
Integrate.io offers an intuitive drag-and-drop interface that even non-technical users can work with, plus more than 100 integrations with data warehouses and databases, business intelligence and analytics tools, and more. Ready to get started? Schedule a call with our team to talk about your business needs and objectives, or to start your free trial of the Integrate.io platform.