Closed Bug 428123 Opened 16 years ago Closed 15 years ago

win32 buildbot slaves should reboot ready for use

Categories

(Release Engineering :: General, defect, P2)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: bhearsum)

Attachments

(7 files, 2 obsolete files)

Splitting out from bug#417887, as each o.s. will have different gotchas.

Basically, how do we make each buildbot master/slave reboot cleanly, reconnect and handle new jobs?
Buildbot has a script that will add it as a Windows Service. IIRC there were problems related to not having a real console or not having a desktop to launch things on. This requires further investigation.
Attached image services.png
Example service properties dialog on Vista. Note the "Interact With Desktop" switch.
I believe there's also a way to set a service's running policy through the properties dialog. I don't have a local copy of win2k3 to test this though.
Summary: win32 buildbot masters/slaves should reboot ready for use → win32 buildbot slaves should reboot ready for use
Blocks: 472517
We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains.
Status: NEW → ASSIGNED
Component: Release Engineering: Future → Release Engineering
Priority: -- → P3
Assignee: nobody → bhearsum
Started working on this today. It looks promising: Buildbot was able to launch Firefox after being started as a win32 service. Initial problems:
* $PATH is different; I imagine the service doesn't inherit the system- or user-set $PATH. We can probably fix this from Buildbot.
* Noticed MochiTest saying a lot of things like 'INFO Error: Unable to restore focus, expect failures and timeouts.' yet the tests still pass.
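One way the $PATH difference might be handled from the Buildbot side is to stop relying on the inherited environment and pass an explicit one to whatever gets launched (Buildbot's ShellCommand steps take an `env` dictionary). A minimal sketch of building such an environment; the helper name `build_env` and the directories in the usage note are illustrative assumptions, not anything from our configs:

```python
import os


def build_env(extra_dirs, base_env=None):
    """Return a copy of base_env with extra_dirs prepended to PATH.

    Entries already present are not repeated, so calling this more than
    once does not grow PATH. base_env defaults to the current process
    environment (on Windows, os.environ upper-cases variable names, so
    the key is "PATH" there as well).
    """
    env = dict(os.environ if base_env is None else base_env)
    current = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    merged = []
    for entry in list(extra_dirs) + current:
        if entry not in merged:
            merged.append(entry)
    env["PATH"] = os.pathsep.join(merged)
    return env
```

For example, `build_env([r"D:\mozilla-build\msys\bin"])` (a hypothetical directory) returns an environment that could be handed to `subprocess.Popen(..., env=...)` or to a build step.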

I haven't run a full set of unittests yet, but I plan to soon. I'm sure there are going to be more problems down the road, but I'm encouraged by these initial results.

*fingers crossed*
Priority: P3 → P2
So, it turns out we get *tons* of failures when Buildbot is started as a service. I suspect this is entirely because firefox.exe isn't running in any sort of "real" Desktop. I tried a few things to work around this, including running as the Local System Account with "allow desktop access" checked, but that made no difference.

Before going further into this black hole I'm doing a test to see if running unittests from the "console" (that is to say, the real display) makes a difference in terms of time. Going this route is much better documented. There's lots of information about how to automatically log on in Windows 2003, start processes, etc.

If it increases test run time by a significant amount I'll try and track down the issues with running as a service.
As it turns out, running tests in the "console" session of a win32 VM makes almost no difference in timing. Unit tests overall took 2 minutes longer - which is a trivial amount in the grand scheme of things.

Additionally, unittests pass completely when running here (nb. I tripped a legitimate mochitest leak, filed in bug 477066).

Given the above, I'm going to work on clean reboots using the console session.
Made good progress today. Here's a summary of what I did on moz2-win32-slave21:
* Installed RealVNC
* Turned off Firewall
* Edited its VMX file by hand to add the following:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Added a couple of batch files (to be attached shortly) to aid in starting buildbot on boot.
* Edited the registry to automatically login cltbld
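Since the VMX edit has to be repeated on every slave, it could be scripted rather than done by hand. A rough sketch, assuming the file is plain `key = value` lines as shown above; the function name `apply_vmx_settings` is made up for illustration, and the VM should be shut down before its file is touched:

```python
# Display settings from the list above, kept as strings so they are
# written back exactly as given.
SVGA_SETTINGS = {
    "svga.maxHeight": "1024",
    "svga.maxWidth": "1280",
    "svga.vramSize": "16777216",
}


def apply_vmx_settings(vmx_text, settings=SVGA_SETTINGS):
    """Return vmx_text with each setting replaced in place or appended.

    Running it twice gives the same result, so re-applying it to an
    already-updated file is harmless.
    """
    remaining = dict(settings)
    out = []
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Overwrite an existing value for this key.
            out.append("%s = %s" % (key, remaining.pop(key)))
        else:
            out.append(line)
    # Append whatever was not already present in the file.
    for key, value in remaining.items():
        out.append("%s = %s" % (key, value))
    return "\n".join(out) + "\n"
```

As a sanity check on the numbers: a 1280x1024 framebuffer at 4 bytes per pixel (assuming 32-bit colour) needs 5,242,880 bytes, so a vramSize of 16777216 (16 MiB) leaves plenty of room.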

It now logs in and starts Buildbot on boot, and is currently running unittests. I'm going to let things run overnight and if it looks good I'll be looking to apply these changes to all of the staging slaves next week.
Had some additional failures overnight. Some of them are related to the fact that there's no audio driver in the console session.

I tried installing http://www.rigexpert.net/gettingstarted/reaudio.htm, which helped, but some tests ended up hanging.

Then, I installed the demo of http://software.muzychenko.net/eng/vac.html and video tests started passing. The demo claims to be "feature limited", which I suspect means some recording features aren't available. More importantly, the demo doesn't seem to be time limited, so I think we're totally within our rights to use it for as long as we want. I'm going to leave moz2-win32-slave21 running builds and tests over the weekend to get some more results out of it.
Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only failure was the mochitest leak mentioned in comment #7. Given that, I think I'm ready to roll this out into the real staging environment. I'm tempted to adjust the mochitest leak threshold to cope with the failures for now...I'm not getting the impression it'll be easy to get energy on bug 477066 right now.
Attachment #362741 - Attachment is patch: false
Attached file registry keys needed for autologin (obsolete) —
Password removed.
(In reply to comment #10)
> Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only
> failure was the mochitest leak mentioned in comment #7. Given that, I think I'm
> ready to roll this out into the real staging environment. I'm tempted to adjust
> the mochitest leak threshold to cope with the failures for now...I'm not
> getting the impression it'll be easy to get energy on bug 477066 right now.

Making these changes in staging is fine. However, bug#477066 needs to be fixed (or the test disabled?) before we can make these changes to the production slaves.
Depends on: 477066
(In reply to comment #14)
> (In reply to comment #10)
> > Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only
> > failure was the mochitest leak mentioned in comment #7. Given that, I think I'm
> > ready to roll this out into the real staging environment. I'm tempted to adjust
> > the mochitest leak threshold to cope with the failures for now...I'm not
> > getting the impression it'll be easy to get energy on bug 477066 right now.
> 
> Making these changes in staging is fine. However, bug#477066 needs to be fixed
> (or the test disabled?) before we can make these changes to the production
> slaves.

I guess that's an option. But we have a --leak-threshold for Mochitest specifically so we can run tests that are known to cause leaks, and not turn the tree orange.
Here are more detailed instructions on how to deploy:
* Shut down VM, add the following lines to its vmx file:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Start the VM back up again, login as Administrator
* Download VNC from: http://realvnc.com/products/free/4.1/download.html
* Install with defaults
* When the post-install dialog pops up, set a password and turn off the Java viewer (configure -> 'Serve Java Viewer...')
* Start -> Run -> 'services.msc'
* Disable and turn off Windows Firewall
* Download and install http://software.muzychenko.net/vac409.zip
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362743, edit with proper password, import into registry.
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362741 to ~cltbld/start menu/programs/startup
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362742 to /d/mozilla-build
* Make sure the Buildbot slave is located in /e/builds/moz2_slave (if you have to rename the directory make sure to update buildbot.tac).
* Restart
* Login with VNC and set resolution to 1280x1024

From this point forward you should NOT be logging in as cltbld with RDP.
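For anyone who would rather generate the autologin .reg file per machine than hand-edit the attachment, here is a sketch. It assumes the attachment sets the standard Winlogon automatic-logon values (AutoAdminLogon, DefaultUserName, DefaultPassword); the attachment's exact contents aren't reproduced here, and the function name is hypothetical. Keep in mind the password ends up in the registry in plain text.

```python
WINLOGON_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
    r"\CurrentVersion\Winlogon"
)


def _reg_quote(value):
    """Escape backslashes and double quotes for a .reg string value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render_autologon_reg(username, password, domain=""):
    """Return the text of a .reg file enabling automatic logon.

    Only the text is produced; writing it to disk (and choosing an
    encoding regedit accepts) is left to the caller.
    """
    values = [
        ("AutoAdminLogon", "1"),
        ("DefaultUserName", username),
        ("DefaultPassword", password),
    ]
    if domain:
        values.append(("DefaultDomainName", domain))
    lines = ["Windows Registry Editor Version 5.00", "", "[%s]" % WINLOGON_KEY]
    lines += ['"%s"="%s"' % (name, _reg_quote(v)) for name, v in values]
    return "\r\n".join(lines) + "\r\n"
```

Calling `render_autologon_reg("cltbld", password)` with the real password gives text that can be saved and imported with regedit.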
One last thing, cltbld must be given permission to reboot the system:
* Start menu -> Run -> gpedit.msc
* Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment
* Double click 'Shut down the system', add cltbld to the list.
* Reboot for the changes to take effect.
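With that right granted, a process running as cltbld should be able to restart the machine itself. A small sketch of what issuing the reboot might look like, assuming the stock Windows `shutdown` tool with its `-r`, `-t` and `-f` switches; the helper names are illustrative, and `dry_run` defaults to True so nothing reboots by accident:

```python
import subprocess
import sys


def reboot_command(delay_seconds=0, force=True):
    """Build the argument list for an immediate (or delayed) restart."""
    cmd = ["shutdown", "-r", "-t", str(int(delay_seconds))]
    if force:
        # -f closes running applications without prompting.
        cmd.append("-f")
    return cmd


def reboot(dry_run=True):
    """Return the command, running it only on Windows when dry_run is False."""
    cmd = reboot_command()
    if dry_run or sys.platform != "win32":
        return cmd
    subprocess.check_call(cmd)
    return cmd
```

If the call fails for cltbld, the 'Shut down the system' policy above is the first thing to check.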
Removing blocking in favour of setting the leak threshold.
No longer blocks: 472517
Pretty simple patch, just allows you to pass a leak threshold on to the mochitest step.
Attachment #362779 - Flags: review?(catlee)
Pretty simple master-side patch. It enables reboots every 5 builds, just like Linux and Mac, and adds the 188 byte mochitest leak threshold to win32 builds.
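To illustrate the "reboot every 5 builds" behaviour (this is not the patch itself, just a sketch of the counting logic with a hypothetical class name):

```python
class RebootPolicy(object):
    """Track finished builds and report when a reboot is due."""

    def __init__(self, reboot_every=5):
        self.reboot_every = reboot_every
        self.builds_since_reboot = 0

    def build_finished(self):
        """Record one finished build; return True if it is time to reboot."""
        self.builds_since_reboot += 1
        if self.builds_since_reboot >= self.reboot_every:
            # The counter starts over once a reboot has been requested.
            self.builds_since_reboot = 0
            return True
        return False
```

With the default of 5, the fifth and tenth calls to `build_finished()` return True and all the others return False.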
Attachment #362781 - Flags: review?(catlee)
Nick pointed out to me yesterday that it would be better not to download RealVNC and the software audio driver from the internet every time we need them. I'm going to import them into the mofo repo for safekeeping and update my instructions.
Comment on attachment 362781
periodic reboots + leak threshold

I need to update the leak thresholds here.
Attachment #362781 - Flags: review?(catlee)
After examining the logs on staging-master I've noticed that sometimes we leak 188 bytes, and sometimes we leak 200. I guess this means the threshold needs to be 200, which kind of sucks since it means it's possible to miss another leak (albeit a small one). Is there a better way of dealing with this?
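To spell out why a threshold of 200 can hide a small new leak, here is a toy model of the check. It assumes a run fails only when the leaked byte count is strictly greater than the threshold; that comparison and the function names are assumptions for illustration, not the harness code.

```python
def leak_status(leaked_bytes, threshold_bytes):
    """Classify a run by how its leak compares with the threshold."""
    if leaked_bytes == 0:
        return "clean"
    if leaked_bytes <= threshold_bytes:
        return "tolerated"
    return "failure"


def hidden_headroom(baseline_leak, threshold_bytes):
    """Bytes of additional leak that would still go unnoticed."""
    return max(0, threshold_bytes - baseline_leak)
```

With the threshold at 188, the 200-byte runs fail. With it at 200, both the 188- and 200-byte runs are tolerated, but on a run that leaks 188 bytes a new leak of up to 12 bytes would also slip through.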
Attachment #362781 - Attachment is obsolete: true
Attachment #362904 - Flags: review?(ted.mielczarek)
Alright, those two packages are now checked into the mofo repo:
Checking in vac409.zip;
/mofo/ref-platforms/win32/vac409.zip,v  <--  vac409.zip
initial revision: 1.1
done
RCS file: /mofo/ref-platforms/win32/vnc-4_1_3-x86_win32.exe,v
done
Checking in vnc-4_1_3-x86_win32.exe;
/mofo/ref-platforms/win32/vnc-4_1_3-x86_win32.exe,v  <--  vnc-4_1_3-x86_win32.exe
initial revision: 1.1
done
Attachment #362779 - Flags: review?(catlee) → review+
Comment on attachment 362904
leak threshold for tm, 1.9.1, not for m-c

I am saddened, but bhearsum says he is looking into what patch fixed this on m-c.
Attachment #362904 - Flags: review?(ted.mielczarek) → review+
Attachment #362904 - Flags: checked‑in+
Attachment #362779 - Flags: checked‑in+
It's looking like moz2-win32-slave03 is able to run mochitests without leaking. Seems like there's some subtle difference between it and the other two. I'm going to try and track down what it is so the leak threshold isn't necessary.
So I misread before: moz2-win32-slave03 *and* 04 were passing all of the unittests. Only moz2-win32-slave21 was failing. The only appreciable difference I found was that a software audio driver I was testing was installed on it. After uninstalling that, the tests have started passing. I have no idea if this is coincidence or what; I'm not sure how this driver (which isn't a browser plugin AFAIK) could affect the tests. I don't see any suspicious checkins to 1.9.1, either.

I have some other things to do right now, so I'm just going to let this run in staging for a few days or a week and monitor it. If things stay green we can turn the leak threshold down to 0 and proceed.
Not a single run of 1.9.1 unittests on moz2-win32-slave21 since Friday. However, as of Friday, 6:30pm EST it was still leaking. I'd like one more run to confirm this before digging deeper...
moz2-win32-slave21 is still failing. As a last resort, I'm going to try recloning the VM and applying the changes exactly as I did to slave03 and 04. Maybe there's something strange from when I was testing RDP and other various things?
After recloning moz2-win32-slave21 it seems that the mochitest leak has gone away. I suspect something I did to it early on tripped the failure. I'm going to let it run for a day or two before declaring it gone for realz, though.
Disable the leak threshold on 1.9.1/tm, since we haven't seen the leak in forever.
Attachment #365216 - Flags: review?(ccooper)
Attachment #365216 - Flags: review?(ccooper) → review+
Comment on attachment 365216
backout leak threshold

changeset:   976:27c75f479ff3
Attachment #365216 - Flags: checked‑in+
I'm planning to roll this out on Monday, March 16th, starting in the morning (EDT). It's probably going to take half the day or so to fully deploy, but no downtime will be needed.
No longer blocks: 472517
Blocks: 472517
Updated deployment instructions:
* Shut down VM, add the following lines to its vmx file:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Start the VM back up again, login as Administrator
* Start menu -> Run -> gpedit.msc
* Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment
* Double click 'Shut down the system', add cltbld to the list.
* Reboot for the changes to take effect.
* Download VNC from: http://realvnc.com/products/free/4.1/download.html
* Install with defaults
* When the post-install dialog pops up, set a password and turn off the Java viewer (configure -> 'Serve Java Viewer...')
* Start -> Control Panel -> Windows Firewall
* Add TCP/5900 as an exception.
* Download and install http://software.muzychenko.net/vac409.zip
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362743, edit with proper password, import into registry.
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362741 to ~cltbld/start menu/programs/startup
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362742 to /d/mozilla-build
* Make sure the Buildbot slave is located in /e/builds/moz2_slave (if you have to rename the directory make sure to update buildbot.tac).
* Restart
* Login with VNC and set resolution to 1280x1024

From this point forward you should NOT be logging in as cltbld with RDP.
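After the restart, a quick way to confirm that VNC is reachable through the firewall exception is to try connecting to TCP/5900 from another machine. A minimal sketch; the helper name `port_is_open` is hypothetical, and it only shows that something is listening on the port, not that it is actually VNC:

```python
import socket


def port_is_open(host, port=5900, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve.
        return False
```

For example, `port_is_open("moz2-win32-slave21")` should return True once the slave is back up with the exception in place (assuming that hostname resolves from wherever the check is run).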
Attached file readable .reg file
Attachment #362743 - Attachment is obsolete: true
I got the last slave updated today. This is done!
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering