SETI@home Technical News Reports - 2000
This page contains the latest technical news on the astronomical and computer-system sides of SETI@home. We'll try to update it frequently.
December 12, 2000
We've figured out what the problem is, though it took a while. It turned out to be related to something else that was reported on the newsgroups: workunits with an angle range of 11.
For some reason the telescope was slewing at a high rate for several days. Of the last 1.6M workunits we generated, 780,000 were at an angle range of 11 to 12. These workunits complete in half the time that a normal workunit does. That means that our attempted connection rate went up by about 30%. Once these workunits disappear from the disk the servers should get back to normal.
Right now there are 285562 of these high angle range workunits on disk (out of 633196 total). I'd expect they will be gone by some time tomorrow.
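The "about 30%" figure can be roughly reproduced from the workunit counts above. The sketch below is only an estimate: it assumes each client connects once per completed workunit and that the number of active clients stays constant. The function name `connection_rate_increase` is made up for this illustration.

```python
def connection_rate_increase(fast_fraction: float, speedup: float = 2.0) -> float:
    """Relative rise in connection attempts when some workunits finish faster.

    Assumes one connection per completed workunit and a constant client
    population (both are assumptions, not measured facts).
    """
    # Mean completion time relative to a normal workunit (normal = 1.0).
    mean_time = (1.0 - fast_fraction) + fast_fraction / speedup
    # Connection rate scales inversely with mean completion time.
    return 1.0 / mean_time - 1.0


generated_fraction = 780_000 / 1_600_000  # high angle range share of recent workunits
on_disk_fraction = 285_562 / 633_196      # high angle range share still on disk

increase = connection_rate_increase(generated_fraction)
print(f"share of generated workunits: {generated_fraction:.1%}")
print(f"share of workunits on disk:   {on_disk_fraction:.1%}")
print(f"estimated rise in connections: {increase:.0%}")
```

With just under half of the workunits finishing in half the time, the estimate comes out a little above 30%, which is consistent with the figure quoted above.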
December 11, 2000
Lots of people have been reporting connection problems. We've looked at our server performance and don't see any apparent problems. We are running fairly close to our campus bandwidth limit, which could be causing some problems. The release of version 3.03 should help alleviate the problem.
November 28, 2000
We had a planned outage this morning to take two CPUs out of the science database server and place them in the user database server (which has been struggling as of late). This procedure took a bit longer than expected, since we had to transfer voltage regulators along with the CPUs - a detail that wasn't made clear until halfway through the outage.
Anyway, the patients have survived the operation, and in the next day or so (when data server traffic stabilizes) we'll see if this CPU donation helped break through some bottlenecks.
October 18, 2000
A few hours later... it went much better than expected. The drives were not actually dead but for some reason were marked as dead by the RAID controller. Two of the drives were a mirrored pair and thus that whole logical device was offline, bringing down the database. We were able to initiate a rebuild of the two "dead" drives, bringing them back online. There does not seem, at this point, to be any data corruption.
The science database experienced what may be multiple disk failures. We are attempting a "quick" (~2-4 hour) rebuild. Stay tuned.
October 16, 2000
A power outage on campus resulted in most of the Berkeley campus being disconnected from the rest of the Internet between the hours 0200 and 0500 UT. Power has been restored, but some campus systems are still out.
October 14, 2000
A problem on campus resulted in our connectivity to the outside world being reduced to 20 Mbps overnight last night. This resulted in numerous dropped connections and generally slow service.
October 10, 2000
There was a switch glitch at the Space Sciences Lab last night and we were off the air for a number of hours. This was fixed at 16:30 UT, but things will be slow as the backlog clears for a few more hours.
September 25, 2000
Last night (Sunday, September 24th) and a few nights ago (Thursday, September 21st) we experienced two sudden server outages, both due to running out of swap space. Users were getting "Unknown Error #1" messages during these times. We are re-tuning the server now.
September 7, 2000
There will be a 1 hour outage tomorrow at 19:00 UT in order to add additional disks to the science database.
August 28, 2000
There will be a planned 2 hour outage on 8/30/00, starting at 13:00 UT. While the main reason for this outage is unrelated to SETI@home (Space Sciences Lab is upgrading the UPS configuration on all network gear), we will be using the time for hardware and software upgrades of our own.
August 21, 2000
We are having a half-hour outage today to measure the database speed and back up unprocessed results, as we fell significantly behind over the weekend in inserting results into the database.
August 16, 2000
More outage news: We stopped the server shortly this morning to quickly clean up a table in the database which seemed to be impeding the performance of the data server and possibly sending false "duplicate result" error messages to our users.
And tomorrow morning (around 14:30 UT) there will be a campus router upgrade which will cause a campus-wide network outage. We have been told this outage will last no more than thirty minutes.
August 15, 2000
There were two unexpected back-to-back power outages this morning at around 10:00 PST. We have temporarily switched to backup power, but there may be another outage later today or tomorrow when we switch back to the regular power grid. UPDATE: This morning's power outage happened due to a switching problem at Lawrence Berkeley National Labs, on the same hill as the Space Sciences Lab. The situation has been corrected and we should not experience additional outages.
August 9, 2000
We had a lab-wide network outage this morning which went smoothly, during which we updated some tables in the database. Unfortunately, the repercussions of this weren't apparent until later this afternoon. In the meantime, no new user accounts could be created. The server needed some editing (which included a brief shutdown and recompilation), but now we're back to normal.
August 4, 2000
Individual user stat lookups have been turned back on. The compilation of group stats will hopefully be turned on soon.
There will be 2 planned outages next week. Both of these are UC Berkeley network outages unrelated to SETI@home. We will however be taking advantage of these outages for some systems maintenance. The first will begin 8/7/00 at 13:30 UT and should not last more than 30 minutes. The second will be on 8/9/00, beginning at 13:00 UT and should not last more than 2 hours.
August 3, 2000
Our databases are under major stress at this point and cannot handle the extra traffic of all the CGI programs. Under normal conditions, we receive as many as 5-10 user stats requests a second, 24 hours a day. If we turn the user stats program back on, our science database will not be able to keep up with the data server, and we could potentially lose science results.
As well, we are not running the programs that routinely update the stats pages (like domain/team/country totals, for example).
We understand that the statistics are a major part of the interest in SETI@home, and are currently doing our best to get the database back in working order.
August 1, 2000
We successfully migrated the science database to the new devices. However, we have run into a pre-existing Informix limitation on the number of "extents" that we can allocate to the spike table. We have created a secondary spike table so that results can be accepted, but this breaks our query code. Thus, CGIs will remain off until we can find a workaround, but the data server should be up. Please note that there may be scattered restarts of the server over the next couple of days.
A quite unrelated problem is that the Berkeley campus backbone is experiencing Internet connectivity problems which may make it appear that the data server is down.
July 28th, 2000
The extended outage on Monday, 7/31/00, will begin at 15:30 UT. It will last 16 hours or more. As it progresses, we will have a better feel for the total outage time and will update this page.
July 26th, 2000
Tomorrow, the router upgrade should happen. The outage is being advertised as starting at 13:30 UT and lasting 1/2 hour.
We will very likely have an extended outage on Monday, 7/31/00. We will post the time soon. This outage will last many hours, as it involves rebuilding the science database. When we rebuilt our database 2 weeks ago, expedience and available hardware forced us to rebuild in a less than optimal manner. Our entire db is now being driven off a single SCSI channel. Not very good! Monday's rebuild will allow us to spread the db over multiple PCI and SCSI channels. We will also arrange the disk arrays so that we can expand in a load balanced manner.
We apologize for the extended outage and for the inevitable but temporary sluggishness once we come back online. In the long run though, this should give the project a more responsive database with a larger capacity.
July 24th, 2000
Update: The outage for tomorrow has been cancelled by the building admin.
There will be a 1/2 hour outage tomorrow, 7/25/00, beginning at 13:30 UT. This outage is due to a router upgrade in the building where the server is housed.
July 13th, 2000
Yesterday we had to perform a complete restore of the science database and that had us offline for many hours. Here is what happened.
Because of the limitations of Informix (2 GB chunks) and the limitations of Solaris (7 partitions per disk) we had been limited to using 9 GB drives for the science database, and were rapidly approaching the number of disks our controllers could handle. As a workaround we were investigating using Veritas to get around these limitations (which would allow us to use 18 GB drives, in effect doubling our disk space).
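To see why those two limits favor 9 GB drives, here is a rough calculation. It assumes each Informix chunk lives on its own Solaris partition, which is our reading of the constraint rather than something stated above. The helper `usable_gb` is purely illustrative.

```python
CHUNK_LIMIT_GB = 2       # Informix chunk size limit cited in the text
PARTITIONS_PER_DISK = 7  # Solaris partition limit cited in the text


def usable_gb(drive_gb: int) -> int:
    """Space reachable on one drive, ASSUMING one chunk per partition."""
    return min(drive_gb, CHUNK_LIMIT_GB * PARTITIONS_PER_DISK)


for size in (9, 18):
    used = usable_gb(size)
    print(f"{size} GB drive: {used} GB usable, {size - used} GB stranded")
```

Under that assumption a single disk tops out at 14 GB of database space, so a 9 GB drive is fully usable while part of an 18 GB drive would sit idle without a volume manager.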
Before setting out to migrate the drives over to the new system we decided to perform some tests to make sure it would work. The information and advice we had was to create a separate database space on the new drives, so a failure wouldn't affect the existing database. Well, it turns out one of the tests did fail several days ago, but the online science database continued to operate just fine, as was predicted. Due to the failure the root chunk of the test database space was corrupted. This wasn't a problem until we restarted the database machine to bring a new tape drive on-line. After the reboot, Informix complained that it couldn't access the corrupted chunk and wouldn't allow inserts into any database, including those unrelated to the missing chunk. We couldn't remove the bad chunk because it was corrupted, and Informix couldn't fix the bad chunk because it too was corrupted. We couldn't restore the bad chunk from a backup because no backup of the bad chunk existed. So we were stuck with a database that was readable, but not writeable.
We eventually came to the conclusion that the only way to get back up in any reasonable amount of time was to restore the database to the new 18 GB disks using the last full backup. After a bit more than 12 hours the restore was complete and we restarted the server.
User stats shouldn't be affected, but science data that are more recent than the last full backup won't be in the new database. It looks like we've gone backwards on the graphs page. We've still got the missing science on the old 9 GB drives, but it'll take time and lots of Informix tech support to get it into the new database.
We've also got the problem that results will be coming back that don't match workunits in the database. We are stashing these until we figure out what to do with them. They'll probably have to sit on disk until we get the database merged again.
July 12th, 2000
We are fixing some bugs with the Informix database today, so the data server will be down until these are fixed.
July 2nd, 2000
On July 3rd, starting at 17:00 UT, we will have a server outage to add RAID hardware to the science database server. This outage should last less than 2 hours.
June 29th, 2000
We discovered a recent problem that is causing users to erroneously get messages that they are sending duplicate results. This has been fixed as of 4:45pm PDT, but the reasons why this was happening are currently unknown.
May 31st, 2000
We will be having a half hour outage today at 1PM PDT to test the redundant system disk on our web server. Both the web site and the data server will be down during this time.
May 30th, 2000
We're building another index today. Science processing temporarily halted.
May 22nd, 2000
We will have a 3 hour outage today for hardware and software upgrades. Unlike most server outages, this one will include our web server. The outage will begin at 1PM PDT.
May 17th, 2000
We've temporarily halted insertion of science data into the database in order to build some indices that are necessary for some new post processing software. No new spikes or gaussians will be added until the tables have been updated.
The May 1 changes to the user database have caused some problems with how rapidly we are able to update group and domain statistics. We're working on a solution, but don't have a deployment date.
May 6th, 2000
We had to rebuild some large indices in the science database, which took quite a long time. The server is up and all appears OK. Any time we have such a lengthy outage, response time will seem poor for a while. This is because the server must deal with a large backlog of connection attempts (clients retrying to connect). Once we get back to a normal connection rate, response will pick up.
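As a back-of-the-envelope illustration of why response stays poor after a long outage, the sketch below estimates how long a retry backlog takes to clear. It is a simplified model with made-up numbers: it assumes every attempt missed during the outage is retried once afterwards, that new attempts keep arriving at a constant rate, and that the server has a fixed hourly capacity. None of the values are measurements from our servers.

```python
def backlog_drain_hours(outage_hours: float, arrival_rate: float, capacity: float) -> float:
    """Hours needed to clear the backlog left by an outage.

    arrival_rate and capacity are connections per hour (hypothetical values).
    """
    if capacity <= arrival_rate:
        raise ValueError("backlog never clears unless capacity exceeds arrivals")
    backlog = outage_hours * arrival_rate
    # Only the headroom above normal arrivals goes toward clearing the backlog.
    return backlog / (capacity - arrival_rate)


# Hypothetical example: a 6 hour outage with 30% capacity headroom.
hours = backlog_drain_hours(outage_hours=6, arrival_rate=100_000, capacity=130_000)
print(f"backlog clears after about {hours:.0f} hours")
```

The point of the model is that recovery time depends on spare capacity rather than total capacity, so a server running close to its limit can take much longer than the outage itself to catch up.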
May 5th, 2000
SERVER OUTAGE: 12:30pm PDT - The data server is currently down due to unexpected problems with how the server communicates with our database. We have been debugging this since early this morning, and there is currently no definite end to this outage in sight. We'll update this message when a solution is nigh.
April 28, 2000
TODAY : 14:00 PDT. This will be a short outage to replace a SCSI controller card on the data server. New evidence suggests that the controller card is responsible for the recent server crashes.
Monday, May 1, 10:00 PDT. This will be a 4 hour outage for RAID card and drive replacements as well as database maintenance.
April 26, 2000
Sun helped us isolate the data server problem - one of the work unit drives had some file system inconsistencies. We cleaned up the drive, as well as incorporated a new feature that causes drives to be forcefully unmounted if there are any panics. Hopefully this will prevent the entire system from crashing due to the faults of a single drive.
April 25, 2000
The data server crashed around 5:30pm and the rebooting process, fraught with SCSI errors, took an inordinate amount of time. It crashed again later in the evening with similar problems.
April 5, 2000
The firmware upgrade was successful. We also took the opportunity to test the emergency boot disks on the data and Informix servers.
April 4, 2000
On April 5, at 13:00 PDT, there will be a 1 hour outage to upgrade the firmware on one of our RAID cards.
March 31, 2000
Yesterday evening we had one of our 16 workunit disks fail, putting the data server into a bad state. We re-routed the workunit traffic destined for this disk to another online drive.
March 29, 2000
We are back on line at 11:40 PST. The installation of the additional RAID devices for the science database went well.
March 28, 2000
On March 29, at 10:00 PST, there will be a server outage so that we can expand the size of our science database. This outage should not last more than 2 hours.
March 15, 2000
On March 16, from 03:00 to 07:00 PST, there may be intermittent network dropouts while campus reconfigures a number of routers.
March 6, 2000
Added "Last result returned" time to user stats page.
March 3, 2000
Yesterday, around 1:00pm PST, the University network lost much of its external connectivity. During this outage (which lasted about an hour), SETI@home traffic to/from the outside world dropped by as much as 50%. Central computing has bandaged the problem and is still determining its cause.
February 21, 2000
We fixed a bug in the HTTP headers being generated by the server. This may solve some proxy problems.
February 15, 2000
We had 2 unplanned outages this past weekend.
The first occurred on Friday evening. Our outgoing packet count rose rather suddenly to an alarming level and campus, seeing that we were interfering with the rest of the university, shut us down at the router. Shortly after midnight PST, campus turned us back on with the agreement that we would rate-limit ourselves. We did this by decreasing the number of simultaneous connections to the server. This kept traffic down at the expense of connection servicing.
We have looked at a number of reasons for the packet increase and discovered that some recent server mods rendered the server incompatible with some early but still usable Unix clients. This incompatibility was such that, apparently, the older clients saw their work units as corrupt and immediately requested another. The short term fix was to fall back to an earlier server. We have done this and connection servicing should now be fine, with no timeouts. We think that this will fix the problem and will be carefully watching the traffic stats to make sure.
The cost of this fix is that we lost some recent mods, some of which fixed an MS proxy problem. So some folks whose connection problems appeared to be fixed over the weekend are now back to not being able to connect. We felt that we had to fall back, as everyone was suffering with the restricted connectivity solution. We are working on a long term fix.
The second outage occurred on Sunday and was more mundane. It was due to a hardware problem with one of the SSL ethernet switches.
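For readers curious what capping simultaneous connections looks like in practice, here is a minimal sketch of the general technique. It is not the actual data server code: the class `ConnectionLimiter` and the limit of 2 are invented for the example, and refusing extra callers outright (rather than queuing them) is an assumption.

```python
import threading


class ConnectionLimiter:
    """Caps simultaneous connections; extra callers are refused, not queued."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False at once when every slot is taken.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Called when a connection finishes, freeing its slot.
        self._slots.release()


limiter = ConnectionLimiter(max_connections=2)
results = [limiter.try_acquire() for _ in range(3)]  # third attempt is refused
limiter.release()                                    # one connection finishes
results.append(limiter.try_acquire())                # a slot is free again
print(results)
```

A lower cap reduces outgoing traffic because fewer transfers run at once, but refused clients have to retry later, which is the "expense of connection servicing" trade-off mentioned above.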
January 21, 2000
We arrived this morning to find that the astrobiology lab next to the server closet had sprung a water leak and that there was standing water under all of our servers! Luckily nothing was damaged, but we brought everything down until the leak was under control. Our apologies to those participants who we cut off abruptly. All should be OK now.
January 13, 2000
Still experiencing problems on the science database server, we replaced the RAID card again, this time with no nasty side effects. Whether or not this fixes our problems remains to be seen.
January 6, 2000
The RAID card on the science database server has been acting up, and Sun came to replace it this afternoon. The outage should have been brief, but the server wouldn't reboot after swapping out the card. This card was replaced with a second new card to no avail. The problem was eventually tracked down to a suddenly bad SCSI cable. We should've been able to swap the cable with another, but this cable has a special extra-thin jack so as to fit on the rather cramped RAID card, and we had no spare cables of this type. So we had to take a standard cable to the machine shop and shave it down to make it fit.