What are the best practices while using BigQuery?

BigQuery is affordable and fast, but it can quickly consume a lot of processing power, increasing the bills if not used properly. In BigQuery, slots decide the amount of processing power that directly impacts cost. First, let’s understand slots.

Slots

In BigQuery, slots are the computational capacity required to execute a SQL query. Slots play a crucial role in pricing and resource allocation. BigQuery manages the slot allocation, and the number of slots allocated to a job depends on

  1. Query size: The amount of data processed
  2. Complexity: The amount of information shuffled

In supplement to consumption-based pricing, BigQuery also provides flat-rate pricing, that is, you can buy a fixed number of slots over some period which provides more granularity in planning capital.

BigQuery is fast because it uses heavy parallel processing by converting the declarative SQL into stages.

Without further delay, let’s see the best practices in BigQuery for:

  • Cost controls
  • Query performance
  • Storage optimization

Cost controls

  • BigQuery is a columnar database. In other words, all columns from a table are stored separately at the physical level. Hence, we should always avoid using select * in our queries. This practice will limit the data processed.

  • For looking at the sample data, we should use the preview option instead of the query. The preview feature in BigQuery is free.

  • We should use a pricing calculator to check the costs before pushing the query to production. On-demand queries are charged based on the number of bytes read, and we can calculate the price based on bytes read.

  • We should use dashboard for viewing costs and logs. The dashboard can provide us with good insights on our usage of BigQuery.

  • Partitioning in BigQuery allows us to scan only the required data and reduce costs.

Query performance

  • We should avoid querying non-required partitions.
  • We should use external data sources suitably. There’s no guarantee of good performance when using external sources.
  • We should avoid excessive use of wildcard tables, for example, FROM bigquery-public-data.noaa_gsod.good*.
  • JavaScript user-defined functions reduce overall performance. We should avoid them.
  • Certain joins, such as a cartesian product, can create more output than input, and should be avoided.
  • We should avoid DML statements with updating or inserting single rows.

Storage optimization

  • The expiration time for datasets/tables/partitions should be set. It allows us to control storage expenditures and optimize the usage of BigQuery storage.
  • Using BigQuery to process the data, we can also use it to store in tables and partitions for an extended period. It has lower charges for the data, which is not modified in the last 90 days.
  • Lastly, we can use the pricing calculator to check storage costs.

Free Resources