Running a group by query produces results with blank primary keys

CrateDB version: 2.3.11

Environment description:

  • JVM version: 1.8.0_181
  • Kernel: Linux 4.4.38
  • Distribution: Ubuntu 16.04
  • Number of nodes: 3

Problem description: Running a group by query returns several valid rows plus one row whose primary key value is blank. The blank row also seems to affect the values in the rest of the results.

I’ve hit this issue before, and running a pointless update query fixed it then. This time I also tried changing the replica count and running an optimize query. I then checked the error log on the master node, which is full of errors (I have attached the whole log, as I don’t understand enough of the errors to know which are relevant). I also noticed that the health of the cluster keeps switching between yellow and green, with one table occasionally reporting underreplicated shards but no underreplicated records.

This issue, and issues like it, seem to occur when one or more of the cluster nodes are restarted or go down, as they did recently.

Steps to reproduce: Here’s the query I ran:

SELECT armada.wind_turbine_data_daily.device_uuid AS armada_wind_turbine_data_daily_device_uuid,
       sum(armada.wind_turbine_data_daily.energy) AS energy,
       avg(armada.wind_turbine_data_daily.wind_speed) AS wind_speed,
       sum(armada.wind_turbine_data_daily.availability * armada.wind_turbine_data_daily.samples) / sum(armada.wind_turbine_data_daily.samples) AS availability
FROM armada.wind_turbine_data_daily
WHERE armada.wind_turbine_data_daily.timestamp >= '2018-09-01'
  AND armada.wind_turbine_data_daily.timestamp <= '2018-09-30'
GROUP BY armada.wind_turbine_data_daily.device_uuid
ORDER BY device_uuid
LIMIT 1000;

Here’s the table schema:

CREATE TABLE IF NOT EXISTS "armada"."wind_turbine_data_daily" (
   "activity" FLOAT,
   "availability" FLOAT,
   "created_at" TIMESTAMP,
   "device_uuid" STRING,
   "direction_wind" FLOAT,
   "energy" FLOAT,
   "energy_cumulative" FLOAT,
   "interval_duration" INTEGER,
   "power_active" FLOAT,
   "power_active_filtered_sum" FLOAT,
   "power_active_sum" FLOAT,
   "power_theoretical_filtered_sum" FLOAT,
   "power_theoretical_sum" FLOAT,
   "rpm_generator" FLOAT,
   "rpm_rotor" FLOAT,
   "samples" INTEGER,
   "seconds_observed" INTEGER,
   "status" STRING,
   "timestamp" TIMESTAMP,
   "updated_at" TIMESTAMP,
   "wind_speed" FLOAT,
   "wind_speed_max" FLOAT,
   PRIMARY KEY ("timestamp", "device_uuid")
)
CLUSTERED INTO 4 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.write" = false,
   column_policy = 'dynamic',
   "mapping.total_fields.limit" = 1000,
   number_of_replicas = '0-1',
   "recovery.initial_shards" = 'quorum',
   refresh_interval = 1000,
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "warmer.enabled" = true,
   "write.wait_for_active_shards" = 'all'
)

Here’s the update query I ran, which has previously fixed issues like this:

UPDATE armada.wind_turbine_data_daily SET seconds_observed = seconds_observed;

What I did with replicas was:

alter table armada.wind_turbine_data_daily set (number_of_replicas='0');

and then

alter table armada.wind_turbine_data_daily set (number_of_replicas='0-1');

crate.log

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

nocturnaltortoise commented, Sep 13, 2018

Yeah, so the cluster machines went down, and for a time only one cluster node was up. Nothing should have been able to write to the individual node, although attempts might have been made; most errored because of the state of the cluster.

Something interesting does seem to have happened - some data exists in the table I’m querying with blank primary key (device_uuid in this case) values - removing those rows fixes it, but in a very similar case in a different table there are no such blank primary key rows to delete.
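
For readers who want to check their own table for the rows described above, here is a minimal sketch of the detection and cleanup queries. It is not taken from the issue: it uses Python's built-in sqlite3 module as a stand-in for CrateDB, the sample rows are invented, and the table is created without a primary key constraint so that blank and missing keys can be inserted at all. The SELECT and DELETE statements use only standard SQL, but they have not been tested against CrateDB 2.3.11, and the sketch does not cover the second case mentioned above, where no blank-key rows exist.

```python
import sqlite3

# SQLite stands in for CrateDB here (an assumption for illustration only).
# Table and column names follow the issue; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wind_turbine_data_daily ("
    " timestamp TEXT, device_uuid TEXT, energy REAL, samples INTEGER)"
)
conn.executemany(
    "INSERT INTO wind_turbine_data_daily VALUES (?, ?, ?, ?)",
    [
        ("2018-09-01", "turbine-a", 10.0, 4),
        ("2018-09-02", "turbine-a", 12.0, 4),
        ("2018-09-01", "turbine-b", 7.5, 4),
        ("2018-09-01", "", 3.0, 1),    # blank key, as described in the comment
        ("2018-09-02", None, 1.0, 1),  # missing key, a second way a key can look blank
    ],
)

# Predicate matching rows whose grouping key is missing or empty.
BLANK_KEY = "device_uuid IS NULL OR device_uuid = ''"


def count_blank_keys(connection):
    """Return how many rows have a NULL or empty device_uuid."""
    return connection.execute(
        f"SELECT count(*) FROM wind_turbine_data_daily WHERE {BLANK_KEY}"
    ).fetchone()[0]


def energy_by_device(connection):
    """Run a simplified version of the issue's GROUP BY query."""
    return connection.execute(
        "SELECT device_uuid, sum(energy) FROM wind_turbine_data_daily"
        " GROUP BY device_uuid ORDER BY device_uuid"
    ).fetchall()


before = energy_by_device(conn)   # includes groups for the blank keys
blank_rows = count_blank_keys(conn)
conn.execute(f"DELETE FROM wind_turbine_data_daily WHERE {BLANK_KEY}")
after = energy_by_device(conn)    # only the real devices remain

print("blank-key rows:", blank_rows)
print("groups before cleanup:", before)
print("groups after cleanup:", after)
```

Running the count first, before any DELETE, shows how many rows would be removed and leaves time to export them for later inspection.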

nocturnaltortoise commented, Sep 13, 2018

I have also tried recreating the table (renaming the old table, creating a new one with only the bare minimum of the create table statement, and copying the data across from the renamed table), and that does not seem to have fixed the issue.
