644302 - [input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message

Reporter

Description

•

13 years ago

I'm not sure what changed but I noticed today that I'm getting 500 errors 100% of the time when I go to:

http://m.input.stage.mozilla.com/en-US/beta/

This is blocking me from testing.

Matt Brandt [:mbrandt]

Reporter

Updated

•

13 years ago

Whiteboard: [stage]

Corey Shields [:cshields]

Comment 1

•

13 years ago

Going to let Jake help you with this.

Jake, this is on mrapp-stage02.mozilla.org..  current root password should get you in to put your key in there.

Assignee: server-ops → nmaul

Matt Brandt [:mbrandt]

Reporter

Comment 2

•

13 years ago

Thanks Corey! And Jake .. thank you.

James Socol [:jsocol, :james]

Comment 3

•

13 years ago

If input.stage is using tm-stage01-slave01 you should talk to Dave Miller, too.

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Updated

•

13 years ago

Summary: Consistently getting 500 errors → [input-stage] Consistently getting 500 errors

Stephen Donner [:stephend] Not actively reading bugmail

Updated

•

13 years ago

Summary: [input-stage] Consistently getting 500 errors → [input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message

Dave Miller [:justdave]

Comment 4

•

13 years ago

So, Jake tells me this query is the one that's blocking things up:

SELECT `feedback_term`.`id`, `feedback_term`.`term`, `feedback_term`.`hidden`,
       COUNT(`feedback_opinion_terms`.`opinion_id`) AS `cnt`
FROM `feedback_term`
     INNER JOIN `feedback_opinion_terms` ON 
                 (`feedback_term`.`id` = `feedback_opinion_terms`.`term_id`) WHERE `feedback_term`.`hidden` = 0
GROUP BY `feedback_term`.`id`
ORDER BY cnt DESC
LIMIT 10;

Explain on that says:

+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
| id | select_type | table                  | type  | possible_keys                   | key                             | key_len | ref                                      | rows   | Extra                                        |
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | feedback_term          | index | PRIMARY                         | PRIMARY                         | 4       | NULL                                     | 467796 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | feedback_opinion_terms | ref   | feedback_opinion_terms_f1e83452 | feedback_opinion_terms_f1e83452 | 4       | input_stage_mozilla_com.feedback_term.id |      5 |                                              | 
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+

The "Using where" means there's no applicable key for the WHERE part.  It's using the PRIMARY key as the next best index because the column with the primary key is half of the join condition for the table it's joining to.

This *likely* implies that it's doing the JOIN first, and then the filesort on that to sift it for the WHERE condition after it's done (which is what would be causing the on-disk temporary table).

The entire WHERE condition is "WHERE `feedback_term`.`hidden` = 0".  As silly as it sounds (the only possible values are 1 and 0) I'm going to suggest that adding an index on feedback_term.hidden would likely help.

Assignee: nmaul → nobody

Component: Server Operations → Input

Product: mozilla.org → Webtools

QA Contact: mrz → input

Dave Miller [:justdave]

Comment 5

•

13 years ago

This table is 38 MB on disk, it probably won't take very long to generate an index on that much data, if you'd like to me to experiment with this on stage...

Dave Miller [:justdave]

Comment 6

•

13 years ago

OK, davedash said to go ahead and run the experiment...

adding the index took 27 seconds.

Explain after the change:
|  1 | SIMPLE      | feedback_term          | ref  | PRIMARY,hidden                  | hidden                  
                     | 1       | const                                    | 237342 | Using where; Using temporary; Using filesort | 

+1 : it's actually using the index
-1 : it still wants a temporary table :|

Running the query after this change takes 39 seconds instead of 47, so it was an improvement.

Dropping the "ORDER BY" makes the temporary table go away and the query returns almost instantly, but the LIMIT is useless without it.  And leaving off the limit, returns 15 MB of data, and the time to transmit that across the wire probably negates any benefit you'd get out of sorting it client-side in the app.

Dave Miller [:justdave]

Comment 7

•

13 years ago

The error message was being generated by Zeus...  on a whim I checked the Zeus settings for this virtual host, and it had a 40 second timeout in it waiting for a response from the backend.  I bumped that timeout up to 60 seconds, and the page now loads.

That's probably longer than you want a user to wait, but it actually works now.

Fred Wenzel [:wenzel]

Comment 8

•

13 years ago

Does input.stage have proper memcache settings in settings_local.py? This query is supposed to run like once every 5-or-so minutes, then be cached for awhile.

Jake Maul [:jakem]

Assignee

Comment 9

•

13 years ago

CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

 2815 ?        Ssl   69:55 memcached -d -p 11211 -u nobody -c 1024 -m 64

Looks okay to me...

Fred Wenzel [:wenzel]

Comment 10

•

13 years ago

As promised, I am just crafting some SQL to remove the long tail from the terms table (and the opinion_terms many-to-many relationship).

Fred Wenzel [:wenzel]

Comment 11

•

13 years ago

Attached file SQL to truncate long tail on terms table — Details

Here's the SQL I was using to reduce the terms table significantly: It'll only keep the terms around that have more than 100 occurrences.

Matt Brandt [:mbrandt]

Reporter

Comment 12

•

13 years ago

May lot's of rock-stardom, chocolate, and good coffee come your way soon (comment 11)

Fred Wenzel [:wenzel]

Comment 13

•

13 years ago

Jake/Justdave, will you run the SQL in comment 11 on stage? If that reduces our footprint of the database enough, we'll also run it on prod.

Fred Wenzel [:wenzel]

Updated

•

13 years ago

Assignee: nobody → server-ops

Component: Input → Server Operations

Product: Webtools → mozilla.org

QA Contact: input → mrz

Fred Wenzel [:wenzel]

Updated

•

13 years ago

Severity: critical → major

Hardware: x86 → All

Phong Tran [:phong]

Updated

•

13 years ago

Assignee: server-ops → justdave

Jake Maul [:jakem]

Assignee

Updated

•

13 years ago

Assignee: justdave → nmaul

Jake Maul [:jakem]

Assignee

Comment 14

•

13 years ago

This is completed on input_stage_mozilla_com database. As you can see, the vast majority of records were removed.

mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme SELECT ft.id FROM feedback_term AS ft LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 463564 rows affected (46.56 sec)
Records: 463564  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1594718 rows affected (7 min 52.46 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 463564 rows affected (22.14 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.03 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7108 | 
+----------+

Jake Maul [:jakem]

Assignee

Updated

•

13 years ago

Status: NEW → ASSIGNED

Jake Maul [:jakem]

Assignee

Updated

•

13 years ago

Assignee: nmaul → fwenzel

Jake Maul [:jakem]

Assignee

Updated

•

13 years ago

Assignee: fwenzel → nobody

Component: Server Operations → Input

Product: mozilla.org → Webtools

QA Contact: mrz → input

Corey Shields [:cshields]

Comment 15

•

13 years ago

Fred/Matt, is this resolved?

Corey Shields [:cshields]

Updated

•

13 years ago

Blocks: 636173

Fred Wenzel [:wenzel]

Comment 16

•

13 years ago

Matt, please confirm if this has improved today. if so, we'll need to run the same SQL on prod today, as tomorrow morning the prod DB will be mirrored to stage again.

Matt Brandt [:mbrandt]

Reporter

Comment 17

•

13 years ago

Fred, this appears to have vastly improved our situation on stage. I think we can move forward with running on prod.

Fred Wenzel [:wenzel]

Comment 18

•

13 years ago

I asked Jake to run this on prod now.

Jake Maul [:jakem]

Assignee

Comment 19

•

13 years ago

This has been run on the prod DB as well.

mysql> \u input_mozilla_com

Database changed
mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme SELECT ft.id FROM feedback_term AS ft LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 470984 rows affected (17.44 sec)
Records: 470984  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1619791 rows affected (2 min 56.91 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 470984 rows affected (42.23 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7209 | 
+----------+

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Fred Wenzel [:wenzel]

Comment 20

•

13 years ago

Verified per comments 19 and 17.

Assignee: nobody → nmaul

Status: RESOLVED → VERIFIED

David Burns :automatedtester

Comment 21

•

13 years ago

We have had reports of this during the WebQA Automation Test day at 13:52 GMT+1

Status: VERIFIED → REOPENED

Resolution: FIXED → ---

[:Aleksej]

Comment 22

•

13 years ago

Attached image screenshot of "Service Unavailable", made by a Selenium test — Details

Timestamps of the screenshot files (UTC):

"Service Unavailable" in red on white background:
* 2011-03-29 12:45:00.725571518
* 2011-03-29 12:45:39.773568008
* 2011-03-29 12:52:30.573572318
"Search Unavailable" with "Query has timed out." as the error message on blue and white background:
* 2011-03-29 12:50:01.992567647

[:Aleksej]

Updated

•

13 years ago

Whiteboard: [stage] → [stage] [WebQA-TestDay]

Jake Maul [:jakem]

Assignee

Comment 23

•

13 years ago

The input_stage_mozilla_com.feedback_term table on tm-stage01-master01 has grown from ~7000 rows after the cleaning script Fred made, up to almost 500,000 rows. This is likely what caused this problem to reoccur.

The same table in production (input_mozilla_com.feedback_term on tp-a01-master01) is only ~11,000 rows, so the change stuck fairly well there.

Perhaps something happened in testing to import a huge swell of new rows?

Fred Wenzel [:wenzel]

Comment 24

•

13 years ago

I am just guessing, but perhaps the prod dump applied last night was older than when you ran the query on prod (I don't know just how frequently this is applied). Jake, will you rerun the query on stage? We should be fine for next week (when prod is pushed to stage again) then. Sorry about that and thanks.

Jake Maul [:jakem]

Assignee

Comment 25

•

13 years ago

That is an interesting possibility. I'm not sure how exactly the prod->stage transfer occurs, but I can see how that would be the case. I have re-run the pruning job on tm-stage01-master01, so this is once again fixed. I will close this bug again, but please re-open if the problem re-occurs.

Thanks!

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Matt Brandt [:mbrandt]

Reporter

Comment 26

•

13 years ago

Thanks for the group effort in resolving this. The stage environment is very stable now.

QA verfied fixed.

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

13 years ago

Component: Input → General

Product: Webtools → Input

SQL to truncate long tail on terms table 13 years ago Fred Wenzel [:wenzel] 381 bytes, text/plain		Details
screenshot of "Service Unavailable", made by a Selenium test 13 years ago [:Aleksej] 8.52 KB, image/png		Details