What alternatives are there to Google BigQuery?

This Google Cloud product is a hidden champion

The cloud market is still fiercely competitive - Microsoft and Amazon are clearly (still?) the top dogs when it comes to public cloud computing. In recent years, Google Cloud has established itself as a very good alternative to these two providers. For a new big data analytics project, we chose Google Cloud for the following reasons:

  1. Managed Kubernetes Cluster
  2. Intuitive interface and product overview, especially if there are many software developers on the team
  3. The hidden champion: Google BigQuery

In this article we would like to share our experiences with BigQuery - but first, the problem.

Problem

Have you ever tried to perform a calculation in an Excel file of a few gigabytes? It is no fun and costs valuable time. In our case, we designed an analytics platform that can evaluate terabytes of transaction data in a reasonable time. Not only does Excel fail here, but so do most databases, such as MySQL, MSSQL and Oracle. Of course, you can use EXASOL, SAP HANA or similar MPP (massively parallel processing) solutions for such challenges, but these applications cost several thousand, or even tens of thousands of, euros for the licenses alone.

Open source and "serverless" MPP databases

In contrast to the licensed MPP solutions mentioned above, open source and “serverless” cloud MPP databases have been on the market for some time. We looked at various solutions in depth and ultimately focused on Apache Impala (Cloudera), Pivotal Greenplum and Google BigQuery.

 

Costs

Impala and Greenplum are convincing in terms of price, as the software itself is free of charge. BigQuery, as a column-oriented cloud solution, uses a pay-as-you-go model. Storage costs around 2 cents per gigabyte per month. In addition, Google charges for queries against the database. At 5 € / TB, this should not be underestimated - although, to be fair, only the columns a query selects count towards the data volume billed.

Accordingly, queries should be designed with great care, and unnecessary ones avoided where possible.
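To make the pricing more tangible, here is a rough back-of-the-envelope sketch in Python. It uses only the two prices quoted above (around 2 cents per gigabyte per month for storage, 5 € per terabyte queried). Everything else is an assumption made purely for illustration: the column names and sizes are hypothetical, the storage price is treated as euro cents, and 1 TB is counted as 1000 GB. Actual billing may differ.

```python
from decimal import Decimal

# Prices as quoted in this article; treat them as rough assumptions,
# not as current list prices.
STORAGE_EUR_PER_GB_MONTH = Decimal("0.02")
QUERY_EUR_PER_TB = Decimal("5")
GB_PER_TB = Decimal("1000")  # assumption: decimal terabytes

# Hypothetical table: size of each column in gigabytes.
EXAMPLE_COLUMN_SIZES_GB = {
    "transaction_id": 200,
    "amount": 100,
    "customer_note": 1700,
}


def monthly_storage_cost(stored_gb):
    """Storage cost per month in euros for the given number of gigabytes."""
    return Decimal(stored_gb) * STORAGE_EUR_PER_GB_MONTH


def query_cost(column_sizes_gb, selected_columns):
    """Query cost in euros, counting only the columns the query selects."""
    scanned_gb = sum(Decimal(column_sizes_gb[name]) for name in selected_columns)
    return scanned_gb / GB_PER_TB * QUERY_EUR_PER_TB


total_gb = sum(EXAMPLE_COLUMN_SIZES_GB.values())
print("storage per month:", monthly_storage_cost(total_gb), "EUR")
print("query on all columns:", query_cost(EXAMPLE_COLUMN_SIZES_GB, list(EXAMPLE_COLUMN_SIZES_GB)), "EUR")
print("query on 'amount' only:", query_cost(EXAMPLE_COLUMN_SIZES_GB, ["amount"]), "EUR")
```

With these made-up numbers, a query that touches every column of the 2 TB table costs 10 €, while a query that reads only the 100 GB amount column costs 0.50 €. That is why selecting only the columns you need matters so much with a column-oriented, pay-per-query model.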

Usability & documentation

Anyone who has ever managed a database knows how labor-intensive this task can be. With the open source solutions, you can of course take over the hosting yourself - with the resulting overhead and presumably frustrated employees: installation, maintenance, updates and even more maintenance ... Alternatively, a third-party provider (e.g. Pivotal or Cloudera) will take care of the hosting.

Google BigQuery is only available as a "serverless" offering - you activate the API in Google Cloud and voilà: MPP database as a service. It is particularly worth mentioning that BigQuery scales automatically and hides the complexity of the infrastructure from the end user.

The documentation of all three products is very good, and there are many tutorials for each of them. We were particularly impressed by the BigQuery documentation, especially because it follows the same familiar style as the documentation of other Google products.

The thing about the SQL standards ...

Many providers officially state that they support certain SQL standards, and this statement is mostly true. By the vendors' own account, Greenplum Database supports the SQL 2003 specification and Google BigQuery supports SQL 2011. It is worth mentioning, however, that you repeatedly run into limits, as data types and standard functions are sometimes missing. Our advice is to check all the features you need first (YOURSELF!). In the case of BigQuery, Decimal / Numeric, a data type used to avoid rounding errors, was not yet officially supported. After several emails with Google Support, however, we were given access to the beta version with Numeric.
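Why a Decimal / Numeric type matters for transaction data can be shown outside the database. The following minimal sketch uses Python's standard decimal module as a stand-in for an exact decimal column type. It is not BigQuery code, and the ten payments of 0.10 are hypothetical values chosen for illustration.

```python
from decimal import Decimal

# Hypothetical transaction amounts: ten payments of 0.10 each.
float_amounts = [0.10] * 10
decimal_amounts = [Decimal("0.10")] * 10

# Binary floating point cannot represent 0.10 exactly, so the error accumulates.
float_total = sum(float_amounts)

# Decimal arithmetic keeps the amounts exact.
exact_total = sum(decimal_amounts)

print("float total equals 1.0:", float_total == 1.0)
print("decimal total equals 1.00:", exact_total == Decimal("1.00"))
```

The floating point total is not exactly 1.0, while the decimal total is. Errors of this kind add up quickly when sums run over terabytes of transactions.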

Performance

Our performance ranking is based on the benchmarks from [1] and [2], which used the TPC-DS and TPC-H test data. Impala and BigQuery are neck and neck. BigQuery is convincing in the “Concurrency” benchmark and therefore takes first place for our purposes. Unfortunately, Greenplum could not hold a candle to Impala and BigQuery. Our own SQL queries, which analyze several hundred gigabytes, were answered within a few seconds.

Our recommendation: Google BigQuery

We clearly recommend BigQuery. In our eyes, its simplicity combined with this performance is currently unsurpassed. BigQuery is suitable both for rapid data prototyping and for use as a data warehouse (data vault). Of course, BigQuery still has some teething problems and missing features (roles and permissions leave something to be desired ...) - but we still think that BigQuery is at least worth a look. Taking a first look does not cost anything.

 

Sources:

[1] https://de.slideshare.net/cloudera/new-performance-benchmarks-apache-impala-incubating-leads-traditional-analytic-database

[2] http://blog.atscale.com/BI-Benchmarks-with-Google-BigQuery

/ by Muhammed Demircan