Closed Bug 644302 Opened 13 years ago Closed 13 years ago

[input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message

Categories

(Input :: General, defect)

defect
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mbrandt, Assigned: nmaul)

References

()

Details

(Whiteboard: [stage] [WebQA-TestDay])

Attachments

(2 files)

I'm not sure what changed but I noticed today that I'm getting 500 errors 100% of the time when I go to:

http://m.input.stage.mozilla.com/en-US/beta/

This is blocking me from testing.
Whiteboard: [stage]
Going to let Jake help you with this.

Jake, this is on mrapp-stage02.mozilla.org..  current root password should get you in to put your key in there.
Assignee: server-ops → nmaul
Thanks Corey! And Jake .. thank you.
If input.stage is using tm-stage01-slave01 you should talk to Dave Miller, too.
Summary: Consistently getting 500 errors → [input-stage] Consistently getting 500 errors
Summary: [input-stage] Consistently getting 500 errors → [input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message
So, Jake tells me this query is the one that's blocking things up:

SELECT `feedback_term`.`id`, `feedback_term`.`term`, `feedback_term`.`hidden`,
       COUNT(`feedback_opinion_terms`.`opinion_id`) AS `cnt`
FROM `feedback_term`
     INNER JOIN `feedback_opinion_terms` ON 
                 (`feedback_term`.`id` = `feedback_opinion_terms`.`term_id`) WHERE `feedback_term`.`hidden` = 0
GROUP BY `feedback_term`.`id`
ORDER BY cnt DESC
LIMIT 10;

Explain on that says:

+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
| id | select_type | table                  | type  | possible_keys                   | key                             | key_len | ref                                      | rows   | Extra                                        |
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | feedback_term          | index | PRIMARY                         | PRIMARY                         | 4       | NULL                                     | 467796 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | feedback_opinion_terms | ref   | feedback_opinion_terms_f1e83452 | feedback_opinion_terms_f1e83452 | 4       | input_stage_mozilla_com.feedback_term.id |      5 |                                              | 
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+

The "Using where" means there's no applicable key for the WHERE part.  It's using the PRIMARY key as the next best index because the column with the primary key is half of the join condition for the table it's joining to.

This *likely* implies that it's doing the JOIN first, and then the filesort on that to sift it for the WHERE condition after it's done (which is what would be causing the on-disk temporary table).

The entire WHERE condition is "WHERE `feedback_term`.`hidden` = 0".  As silly as it sounds (the only possible values are 1 and 0) I'm going to suggest that adding an index on feedback_term.hidden would likely help.
Assignee: nmaul → nobody
Component: Server Operations → Input
Product: mozilla.org → Webtools
QA Contact: mrz → input
This table is 38 MB on disk, it probably won't take very long to generate an index on that much data, if you'd like to me to experiment with this on stage...
OK, davedash said to go ahead and run the experiment...

adding the index took 27 seconds.

Explain after the change:
|  1 | SIMPLE      | feedback_term          | ref  | PRIMARY,hidden                  | hidden                  
                     | 1       | const                                    | 237342 | Using where; Using temporary; Using filesort | 

+1 : it's actually using the index
-1 : it still wants a temporary table :|

Running the query after this change takes 39 seconds instead of 47, so it was an improvement.

Dropping the "ORDER BY" makes the temporary table go away and the query returns almost instantly, but the LIMIT is useless without it.  And leaving off the limit, returns 15 MB of data, and the time to transmit that across the wire probably negates any benefit you'd get out of sorting it client-side in the app.
The error message was being generated by Zeus...  on a whim I checked the Zeus settings for this virtual host, and it had a 40 second timeout in it waiting for a response from the backend.  I bumped that timeout up to 60 seconds, and the page now loads.

That's probably longer than you want a user to wait, but it actually works now.
Does input.stage have proper memcache settings in settings_local.py? This query is supposed to run like once every 5-or-so minutes, then be cached for awhile.
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

 2815 ?        Ssl   69:55 memcached -d -p 11211 -u nobody -c 1024 -m 64

Looks okay to me...
As promised, I am just crafting some SQL to remove the long tail from the terms table (and the opinion_terms many-to-many relationship).
Here's the SQL I was using to reduce the terms table significantly: It'll only keep the terms around that have more than 100 occurrences.
May lot's of rock-stardom, chocolate, and good coffee come your way soon (comment 11)
Jake/Justdave, will you run the SQL in comment 11 on stage? If that reduces our footprint of the database enough, we'll also run it on prod.
Assignee: nobody → server-ops
Component: Input → Server Operations
Product: Webtools → mozilla.org
QA Contact: input → mrz
Severity: critical → major
Hardware: x86 → All
Assignee: server-ops → justdave
Assignee: justdave → nmaul
This is completed on input_stage_mozilla_com database. As you can see, the vast majority of records were removed.

mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme SELECT ft.id FROM feedback_term AS ft LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 463564 rows affected (46.56 sec)
Records: 463564  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1594718 rows affected (7 min 52.46 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 463564 rows affected (22.14 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.03 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7108 | 
+----------+
Status: NEW → ASSIGNED
Assignee: nmaul → fwenzel
Assignee: fwenzel → nobody
Component: Server Operations → Input
Product: mozilla.org → Webtools
QA Contact: mrz → input
Fred/Matt, is this resolved?
Blocks: 636173
Matt, please confirm if this has improved today. if so, we'll need to run the same SQL on prod today, as tomorrow morning the prod DB will be mirrored to stage again.
Fred, this appears to have vastly improved our situation on stage. I think we can move forward with running on prod.
I asked Jake to run this on prod now.
This has been run on the prod DB as well.

mysql> \u input_mozilla_com

Database changed
mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme SELECT ft.id FROM feedback_term AS ft LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 470984 rows affected (17.44 sec)
Records: 470984  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1619791 rows affected (2 min 56.91 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 470984 rows affected (42.23 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7209 | 
+----------+
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Verified per comments 19 and 17.
Assignee: nobody → nmaul
Status: RESOLVED → VERIFIED
We have had reports of this during the WebQA Automation Test day at 13:52 GMT+1
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Timestamps of the screenshot files (UTC):

"Service Unavailable" in red on white background:
* 2011-03-29 12:45:00.725571518
* 2011-03-29 12:45:39.773568008
* 2011-03-29 12:52:30.573572318
"Search Unavailable" with "Query has timed out." as the error message on blue and white background:
* 2011-03-29 12:50:01.992567647
Whiteboard: [stage] → [stage] [WebQA-TestDay]
The input_stage_mozilla_com.feedback_term table on tm-stage01-master01 has grown from ~7000 rows after the cleaning script Fred made, up to almost 500,000 rows. This is likely what caused this problem to reoccur.

The same table in production (input_mozilla_com.feedback_term on tp-a01-master01) is only ~11,000 rows, so the change stuck fairly well there.

Perhaps something happened in testing to import a huge swell of new rows?
I am just guessing, but perhaps the prod dump applied last night was older than when you ran the query on prod (I don't know just how frequently this is applied). Jake, will you rerun the query on stage? We should be fine for next week (when prod is pushed to stage again) then. Sorry about that and thanks.
That is an interesting possibility. I'm not sure how exactly the prod->stage transfer occurs, but I can see how that would be the case. I have re-run the pruning job on tm-stage01-master01, so this is once again fixed. I will close this bug again, but please re-open if the problem re-occurs.

Thanks!
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Thanks for the group effort in resolving this. The stage environment is very stable now.

QA verfied fixed.
Status: RESOLVED → VERIFIED
Component: Input → General
Product: Webtools → Input
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: