Bug 644302 · Opened 13 years ago · Closed 13 years ago
[input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message
Categories: Input :: General, defect
Tracking: (Not tracked)
Status: VERIFIED FIXED
People: (Reporter: mbrandt, Assigned: nmaul)
Whiteboard: [stage] [WebQA-TestDay]
Attachments: 2 files
I'm not sure what changed, but I noticed today that I'm getting 500 errors 100% of the time when I go to: http://m.input.stage.mozilla.com/en-US/beta/ This is blocking me from testing.
Reporter · Updated • 13 years ago
Whiteboard: [stage]
Comment 1 • 13 years ago
Going to let Jake help you with this. Jake, this is on mrapp-stage02.mozilla.org. The current root password should get you in to put your key in there.
Assignee: server-ops → nmaul
Reporter · Comment 2 • 13 years ago
Thanks Corey! And Jake .. thank you.
Comment 3 • 13 years ago
If input.stage is using tm-stage01-slave01 you should talk to Dave Miller, too.
Updated • 13 years ago
Summary: Consistently getting 500 errors → [input-stage] Consistently getting 500 errors
Updated • 13 years ago
Summary: [input-stage] Consistently getting 500 errors → [input-stage] Consistently getting 500 errors / Zeus "Service Unavailable" message
Comment 4 • 13 years ago
So, Jake tells me this query is the one that's blocking things up:

SELECT `feedback_term`.`id`, `feedback_term`.`term`, `feedback_term`.`hidden`,
       COUNT(`feedback_opinion_terms`.`opinion_id`) AS `cnt`
FROM `feedback_term`
INNER JOIN `feedback_opinion_terms`
       ON (`feedback_term`.`id` = `feedback_opinion_terms`.`term_id`)
WHERE `feedback_term`.`hidden` = 0
GROUP BY `feedback_term`.`id`
ORDER BY cnt DESC
LIMIT 10;

EXPLAIN on that says:

+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
| id | select_type | table                  | type  | possible_keys                   | key                             | key_len | ref                                      | rows   | Extra                                        |
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | feedback_term          | index | PRIMARY                         | PRIMARY                         | 4       | NULL                                     | 467796 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | feedback_opinion_terms | ref   | feedback_opinion_terms_f1e83452 | feedback_opinion_terms_f1e83452 | 4       | input_stage_mozilla_com.feedback_term.id |      5 |                                              |
+----+-------------+------------------------+-------+---------------------------------+---------------------------------+---------+------------------------------------------+--------+----------------------------------------------+

The "Using where" means there's no applicable key for the WHERE part. It's using the PRIMARY key as the next best index, because the primary-key column is half of the join condition for the table it's joining to. This *likely* implies that it's doing the JOIN first, and then the filesort on that to sift it for the WHERE condition after it's done (which is what would be causing the on-disk temporary table). The entire WHERE condition is "WHERE `feedback_term`.`hidden` = 0".
As silly as it sounds (the only possible values are 1 and 0), I'm going to suggest that adding an index on feedback_term.hidden would likely help.
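The query shape from this comment can be sketched in miniature. This is an illustrative mock-up in SQLite with made-up toy data, not the MySQL stage database; the index on the low-cardinality `hidden` column is the one proposed above, and it lets the planner filter hidden terms before grouping:

```python
# Toy reconstruction of the blocking query, using SQLite in memory.
# Table names match the bug; the rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE feedback_term "
            "(id INTEGER PRIMARY KEY, term TEXT, hidden INTEGER)")
cur.execute("CREATE TABLE feedback_opinion_terms "
            "(opinion_id INTEGER, term_id INTEGER)")
# The suggested index: silly-sounding for a 0/1 column, but it gives the
# planner a way to satisfy "WHERE hidden = 0" without a full scan.
cur.execute("CREATE INDEX idx_feedback_term_hidden ON feedback_term (hidden)")

cur.executemany("INSERT INTO feedback_term VALUES (?, ?, ?)",
                [(1, "crash", 0), (2, "slow", 0), (3, "spam", 1)])
cur.executemany("INSERT INTO feedback_opinion_terms VALUES (?, ?)",
                [(n, 1) for n in range(5)] +    # "crash" tagged 5 times
                [(n, 2) for n in range(3)] +    # "slow" tagged 3 times
                [(n, 3) for n in range(9)])     # hidden "spam" tagged 9 times

rows = cur.execute("""
    SELECT t.id, t.term, t.hidden, COUNT(ot.opinion_id) AS cnt
    FROM feedback_term t
    INNER JOIN feedback_opinion_terms ot ON t.id = ot.term_id
    WHERE t.hidden = 0
    GROUP BY t.id
    ORDER BY cnt DESC
    LIMIT 10
""").fetchall()
print(rows)  # hidden term filtered out; highest count first
```

The hidden term never reaches the result even though it has the most tags, which is exactly the filtering the WHERE clause exists to do.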
Assignee: nmaul → nobody
Component: Server Operations → Input
Product: mozilla.org → Webtools
QA Contact: mrz → input
Comment 5 • 13 years ago
This table is 38 MB on disk; it probably won't take very long to generate an index on that much data, if you'd like me to experiment with this on stage...
Comment 6 • 13 years ago
OK, davedash said to go ahead and run the experiment... adding the index took 27 seconds. EXPLAIN after the change:

| 1 | SIMPLE | feedback_term | ref | PRIMARY,hidden | hidden | 1 | const | 237342 | Using where; Using temporary; Using filesort |

+1: it's actually using the index
-1: it still wants a temporary table :|

Running the query after this change takes 39 seconds instead of 47, so it was an improvement. Dropping the "ORDER BY" makes the temporary table go away and the query returns almost instantly, but the LIMIT is useless without it. And leaving off the LIMIT returns 15 MB of data, and the time to transmit that across the wire probably negates any benefit you'd get out of sorting it client-side in the app.
Comment 7 • 13 years ago
The error message was being generated by Zeus... on a whim I checked the Zeus settings for this virtual host, and it had a 40 second timeout in it waiting for a response from the backend. I bumped that timeout up to 60 seconds, and the page now loads. That's probably longer than you want a user to wait, but it actually works now.
Comment 8 • 13 years ago
Does input.stage have proper memcache settings in settings_local.py? This query is supposed to run like once every 5-or-so minutes, then be cached for a while.
Assignee · Comment 9 • 13 years ago
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

 2815 ?        Ssl   69:55 memcached -d -p 11211 -u nobody -c 1024 -m 64

Looks okay to me...
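The pattern comment 8 describes (run the expensive query at most once per interval, serve the cached result otherwise) can be sketched like this. This is a hand-rolled illustration, not Input's actual code; the real app used Django's cache framework backed by the memcached instance shown above, and the function and key names here are invented:

```python
# Minimal sketch of a TTL cache around an expensive query.
import time

CACHE_TTL = 300  # seconds: the "every 5-or-so minutes" from comment 8
_cache = {}      # key -> (expires_at, value)

def cached(key, ttl, compute):
    """Return the cached value for `key`, recomputing after `ttl` seconds."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]
    value = compute()
    _cache[key] = (now + ttl, value)
    return value

calls = 0
def expensive_terms_query():
    # Stand-in for the 39-second top-terms SQL query.
    global calls
    calls += 1
    return ["crash", "slow"]

first = cached("top_terms", CACHE_TTL, expensive_terms_query)
second = cached("top_terms", CACHE_TTL, expensive_terms_query)
print(calls)  # second call is served from cache
```

With a working cache, only the unlucky request that hits an expired entry pays the full query cost, which is why broken cache settings would make every page load slow.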
Comment 10 • 13 years ago
As promised, I am just crafting some SQL to remove the long tail from the terms table (and the opinion_terms many-to-many relationship).
Comment 11 • 13 years ago
Here's the SQL I was using to reduce the terms table significantly: It'll only keep the terms around that have more than 100 occurrences.
Reporter · Comment 12 • 13 years ago
May lots of rock-stardom, chocolate, and good coffee come your way soon (comment 11).
Comment 13 • 13 years ago
Jake/Justdave, will you run the SQL in comment 11 on stage? If that reduces the database footprint enough, we'll also run it on prod.
Updated • 13 years ago
Assignee: nobody → server-ops
Component: Input → Server Operations
Product: Webtools → mozilla.org
QA Contact: input → mrz
Updated • 13 years ago
Severity: critical → major
Hardware: x86 → All
Updated • 13 years ago
Assignee: server-ops → justdave
Assignee · Updated • 13 years ago
Assignee: justdave → nmaul
Assignee · Comment 14 • 13 years ago
This is completed on the input_stage_mozilla_com database. As you can see, the vast majority of records were removed.

mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme
    -> SELECT ft.id FROM feedback_term AS ft
    -> LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id
    -> GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 463564 rows affected (46.56 sec)
Records: 463564  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1594718 rows affected (7 min 52.46 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 463564 rows affected (22.14 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.03 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7108 |
+----------+
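The pruning pass above can be reproduced in miniature. This is an adapted sketch in SQLite with toy data and a toy threshold of 2 (the real run used 100); SQLite has no multi-table `DELETE ... JOIN`, so `IN (SELECT ...)` subqueries stand in for the MySQL syntax, and the count here is over matched opinion rows rather than `count(ft.id)`:

```python
# Miniature re-creation of the "deleteme" pruning logic from comment 14.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE feedback_term (id INTEGER PRIMARY KEY, term TEXT)")
cur.execute("CREATE TABLE feedback_opinion_terms "
            "(opinion_id INTEGER, term_id INTEGER)")
cur.executemany("INSERT INTO feedback_term VALUES (?, ?)",
                [(1, "popular"), (2, "rare")])
cur.executemany("INSERT INTO feedback_opinion_terms VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 1),   # "popular": 3 occurrences
                 (4, 2)])                  # "rare": 1 occurrence

THRESHOLD = 2  # the real run in comment 14 used 100

# Collect the term ids below the threshold...
cur.execute(f"""
    CREATE TEMPORARY TABLE deleteme AS
    SELECT ft.id FROM feedback_term ft
    LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id
    GROUP BY ft.id HAVING COUNT(ot.term_id) < {THRESHOLD}
""")
# ...then delete them from both sides of the many-to-many relation.
cur.execute("DELETE FROM feedback_opinion_terms "
            "WHERE term_id IN (SELECT id FROM deleteme)")
cur.execute("DELETE FROM feedback_term "
            "WHERE id IN (SELECT id FROM deleteme)")

remaining = cur.execute("SELECT term FROM feedback_term").fetchall()
print(remaining)  # only the frequent term survives
```

The long tail of rarely-used terms disappears from both tables, which is what shrank feedback_term from ~467k rows to ~7k and brought the query back under the timeout.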
Assignee · Updated • 13 years ago
Status: NEW → ASSIGNED
Assignee · Updated • 13 years ago
Assignee: nmaul → fwenzel
Assignee · Updated • 13 years ago
Assignee: fwenzel → nobody
Component: Server Operations → Input
Product: mozilla.org → Webtools
QA Contact: mrz → input
Comment 15 • 13 years ago
Fred/Matt, is this resolved?
Comment 16 • 13 years ago
Matt, please confirm whether this has improved today. If so, we'll need to run the same SQL on prod today, as tomorrow morning the prod DB will be mirrored to stage again.
Reporter · Comment 17 • 13 years ago
Fred, this appears to have vastly improved our situation on stage. I think we can move forward with running it on prod.
Comment 18 • 13 years ago
I asked Jake to run this on prod now.
Assignee · Comment 19 • 13 years ago
This has been run on the prod DB as well.

mysql> \u input_mozilla_com
Database changed

mysql> CREATE TEMPORARY TABLE IF NOT EXISTS deleteme (id INT);
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO deleteme
    -> SELECT ft.id FROM feedback_term AS ft
    -> LEFT JOIN feedback_opinion_terms ot ON ft.id = ot.term_id
    -> GROUP BY ft.id HAVING count(ft.id) < 100;
Query OK, 470984 rows affected (17.44 sec)
Records: 470984  Duplicates: 0  Warnings: 0

mysql> DELETE ot FROM feedback_opinion_terms ot INNER JOIN deleteme ON ot.term_id = deleteme.id;
Query OK, 1619791 rows affected (2 min 56.91 sec)

mysql> DELETE ft FROM feedback_term ft NATURAL JOIN deleteme;
Query OK, 470984 rows affected (42.23 sec)

mysql> DROP TABLE deleteme;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from feedback_term;
+----------+
| count(*) |
+----------+
|     7209 |
+----------+
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 20 • 13 years ago
Verified per comments 19 and 17.
Assignee: nobody → nmaul
Status: RESOLVED → VERIFIED
Comment 21 • 13 years ago
We have had reports of this during the WebQA Automation Test day at 13:52 GMT+1.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Comment 22 • 13 years ago
Timestamps of the screenshot files (UTC):

"Service Unavailable" in red on white background:
* 2011-03-29 12:45:00.725571518
* 2011-03-29 12:45:39.773568008
* 2011-03-29 12:52:30.573572318

"Search Unavailable" with "Query has timed out." as the error message on blue and white background:
* 2011-03-29 12:50:01.992567647
Updated • 13 years ago
Whiteboard: [stage] → [stage] [WebQA-TestDay]
Assignee · Comment 23 • 13 years ago
The input_stage_mozilla_com.feedback_term table on tm-stage01-master01 has grown from ~7000 rows after the cleaning script Fred made, up to almost 500,000 rows. This is likely what caused this problem to reoccur. The same table in production (input_mozilla_com.feedback_term on tp-a01-master01) is only ~11,000 rows, so the change stuck fairly well there. Perhaps something happened in testing to import a huge swell of new rows?
Comment 24 • 13 years ago
I am just guessing, but perhaps the prod dump applied last night predates your running the query on prod (I don't know just how frequently it is applied). Jake, will you rerun the query on stage? We should then be fine until next week (when prod is pushed to stage again). Sorry about that, and thanks.
Assignee · Comment 25 • 13 years ago
That is an interesting possibility. I'm not sure how exactly the prod->stage transfer occurs, but I can see how that would be the case. I have re-run the pruning job on tm-stage01-master01, so this is once again fixed. I will close this bug again, but please re-open if the problem re-occurs. Thanks!
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Reporter · Comment 26 • 13 years ago
Thanks for the group effort in resolving this. The stage environment is very stable now. QA verified fixed.
Status: RESOLVED → VERIFIED
Updated • 13 years ago
Component: Input → General
Product: Webtools → Input