SETI@home Technical News Reports - 2001
This page contains the latest technical news on the astronomical and computer-system sides of SETI@home. We'll try to update it frequently.
September 12, 2001
We have recently diagnosed a long-standing problem: certain work units never get results returned. There are at least two cases: 1) if a work unit produces a result file larger than 60 KB, the client silently discards the result file and fetches a new work unit; 2) if a work unit has corrupt data, the client immediately discards it. In both cases the user doesn't receive credit for processing the work unit.
Because our server never gets results for these work units, they remain on disk indefinitely, and are sent over and over again to clients. Users aren't getting credit for an increasing fraction of the work units they complete.
As a short-term solution, we are purging old work units from our server. This will hopefully reduce the fraction of "uncredited" work units close to zero. Longer-term, we will fix the client so that it handles large result files correctly. We hope to have this fix in the Unix and command-line versions soon, and in the Windows and Mac versions after that.
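The short-term purge described above could be sketched roughly as follows. This is only an illustration: the record layout, the 30-day cutoff, and the rule "old and never returned a result" are assumptions, not the actual server logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkUnit:
    # Hypothetical record layout, for illustration only.
    name: str
    created: datetime
    results_received: int  # number of results clients have returned


def purge_stale(workunits, now, max_age_days=30):
    """Split work units into (kept, purged).

    A unit is purged when it is older than the cutoff and no client
    has ever returned a result for it. Both conditions are assumed.
    """
    cutoff = now - timedelta(days=max_age_days)
    kept, purged = [], []
    for wu in workunits:
        if wu.created < cutoff and wu.results_received == 0:
            purged.append(wu)
        else:
            kept.append(wu)
    return kept, purged
```

Under these assumptions, a work unit that keeps being sent out but never comes back is eventually dropped from the queue instead of being handed to yet another client.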
July 10, 2001
We made additions to the user info table in our database, and this had the unexpected side effect of making new user signups impossible for the past 16-20 hours. New users were getting a cryptic "-36" database error instead of a new user account. Recompiling the data server seems to have cleared up this problem. Sorry for the confusion.
July 5, 2001
We've started the reload of the workunit table. With luck it'll be finished tomorrow. Because of the reload we've had to turn off result insertion and science database stats updates.
July 2, 2001
We need to do an unload and reload of the workunit table to fix still more bad pages in the science database. Because of that we need to turn off the splitters temporarily.
June 26, 2001
For the past few weeks we were having problems with the new web server connecting to our database. This is why user/team stats pages were being generated by a secondary web server.
This afternoon we brought down the data server briefly and rebooted the two web servers to fix this problem. The actual fix required us to fall back to an older version of the Solaris operating system. So far, so good. We are letting these changes incubate for a day before calling the operation a success.
June 20, 2001
Another busy morning:
We upgraded our version of Informix on the user database, mostly motivated by our need to get the new web server able to handle all the CGI functionality. We were having troubles with this recently (in case you haven't noticed) because there is a conflict between newer versions of Solaris and older versions of Informix. Anyway, the upgrade itself went well.
However, this didn't totally clear up the problems we were having - we still can't do database lookups on the new web server, which is why all CGIs are running on an auxiliary machine, iosef.
As well, after restarting the data server, we found it maxing out very early, to the point where it was dropping connections and failing when doing user database lookups. Some people may have seen erroneous "unknown user" errors. It turns out the old configuration file for Informix didn't exactly jibe with the new version. After a bit of research we were able to make Informix happy and resume normal operations.
At this point we turned the data server back on, as well as the user lookup CGIs. We are leaving the "View Recently Completed Results" function off, though, since this needs to connect to the science database which is currently turned off in order to allow us to build some desperately needed indexes. Building indexes will vastly speed up various database queries.
By the way, "View Recently Completed Results" is what used to be "View Last 10 Results." Why the change? In order to make the lookups much faster. This new method restricts the database searches to only the past few weeks, whereas the old method had to sift through and sort two years' worth of result data. Users were experiencing many dropped connections from these queries taking too long to process.
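The idea behind the faster lookup can be sketched as a time-bounded query. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (the real databases run Informix); the table name, column names, and the 21-day window are assumptions for illustration.

```python
import sqlite3

SECONDS_PER_DAY = 86400


def make_db():
    """Create an in-memory stand-in for the results table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, received INTEGER)"
    )
    # An index on (user_id, received) lets the lookup touch only one
    # user's recent rows instead of scanning the whole history.
    conn.execute("CREATE INDEX result_user_received ON result (user_id, received)")
    return conn


def recent_results(conn, user_id, now_ts, window_days=21):
    """Return (id, received) rows for one user within the recent window, newest first."""
    cutoff = now_ts - window_days * SECONDS_PER_DAY
    cur = conn.execute(
        "SELECT id, received FROM result "
        "WHERE user_id = ? AND received >= ? "
        "ORDER BY received DESC",
        (user_id, cutoff),
    )
    return cur.fetchall()
```

Bounding the search by date means the amount of data sorted per request depends on a user's recent activity rather than on two years of accumulated results.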
So, as it stands now, we are still working on 1) getting the new web server to contact the user database, and 2) getting the science database indexes built.
June 19, 2001
1) The troubles that were causing random power fluctuations (and the one long power outage on June 14th) have been fixed by the building electrician. This required a two hour outage this morning. Our data server and web servers were out of contact during this time. They are all currently back on line.
2) We turned the "Last 10 Workunits" back on yesterday, but due to a firewall snafu people were unable to look up any regular or detailed user stats. This was also fixed this morning.
3) The data server is still suffering from fallout of the lengthy science database restore this past month. The upshot of this is the data server occasionally stops working for short periods of time (usually about 15-20 minutes). This has been happening regularly for the past week as we've been diagnosing and fixing this problem.
4) Static stats pages and graphs haven't been regularly updated for the past 10 days. This is due to the frequent network/database outages and bugs introduced as we are altering the functionality of several machines here at the lab. We're on top of this as well.
June 16, 2001
Planned outages for the coming week:
June 14, 2001
We had an unexpected loss of power last night that resulted in an outage of several hours. The building electrician has found a fault in a power distribution panel. He has applied a temporary fix but will be replacing the panel soon, most likely Monday. This will result in an additional 2 hour outage, which we will announce on the main page.
June 13, 2001
Today we replaced two of the three RAID cards on the science database (the third card was replaced a couple of weeks ago). Hopefully this will reduce, if not eliminate, the problems we have been having with the science database RAID system.
June 11, 2001
The RAID cards have continued to give us problems. Fortunately the new disk architecture (with mirroring across controllers) has prevented downtime. Unfortunately, there has again been corruption of the index that controls the order in which workunits are sent. We rebuilt the index on Friday. We're considering turning off write caching on the disk controllers to prevent further problems of this sort. We're hoping that won't cause too much of a performance hit.
We've also (finally) found and eliminated the bug that caused the occasional flood of "duplicate work unit" messages.
June 5, 2001
This morning we replaced the old web server machine with a much better one. All was well until we realized the new machine couldn't talk to the user and science databases. We are still working on the problem, and the solution may involve having to install new versions of database software to resolve conflicts with the latest version of Solaris. In the meantime, we installed a second web server to handle most of the basic CGI calls and are making links to this auxiliary web server. This is a very temporary solution. Team/stats pages may not get updated with regularity during this time.
June 4, 2001
Over the weekend the web server stopped working. At this point we're not exactly sure why - most likely the process table filled up. The good news is we recently received a brand new dual processor Sun Ultra to replace the current web server. This switch will happen within the week. Please note that in the process of switching over user/team stats might get out of sync for a day or so (this will be strictly cosmetic and have nothing to do with the actual data in our database). As well, we might shut off user stats lookups for a short time. There will be notices on the front page as things progress.
June 1, 2001
What a week. We completed the restore of our online science database last week. However, in so doing, we also restored a corrupt index that we had recently repaired. So we had to re-repair it. We also needed to run some update jobs on the DB, as it was now inconsistent with both of our on-disk workunit and result queues. We are now running a final check to make sure that everything is consistent. The server is functioning normally, but the splitters are temporarily off. This means that the workunit queue is static, so there is a small chance that a fast user may get a duplicate workunit.
At the same time, a disk on our offline master science database crashed. We were able to swap the hardware and quickly restore this DB. We are now using this database to reject radio frequency interference (RFI) and look for persistent signals. SETI@home participants have produced a rich data set! Stay tuned.
We also had a security problem. A malicious person or persons obtained a number of user email addresses. There was no server break-in. The perpetrator made use of a hole in our client/server communications protocol. They obtained around 50,000 email addresses and posted these on a web site. We see this as a significant theft of our (and your) data and are pursuing legal action against this person or persons. If you think you have received email from the perpetrator, please go here. We closed the security hole, with the side effect that several fields in user_info.sah are now blank or zero. We realize that this is a problem for some very cool third party add-ons and are putting some of the fields back.
May 16, 2001
After several days of attempting to fix the damage to the database, we've come to the conclusion that a restore is the only remaining option. We're taking the opportunity to modify the server configuration for what we hope will be enhanced reliability. Malfunction of the RAID controllers has been the cause of most of our major outages. We've decided to stop using hardware RAID and move to a software RAID configuration. Software RAID will allow us to mirror and stripe across controllers, so the failure of a single controller should not shut us down or lead to an unrecoverable situation. We're in the process of reconfiguring now. The restore should start tomorrow. There are 13 tapes that need to be restored. Last time it took about 4 hours per tape, so that's a bit more than two days, assuming we don't sleep.
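The restore estimate above is simple arithmetic, spelled out here as a small sketch. The function name is made up for illustration; the inputs (13 tapes, about 4 hours each) are the figures quoted above, and the result assumes tapes are restored back to back with no gaps.

```python
def restore_estimate(tapes, hours_per_tape):
    """Return (total_hours, whole_days, leftover_hours) for a sequential restore."""
    total_hours = tapes * hours_per_tape
    days, hours = divmod(total_hours, 24)
    return total_hours, days, hours


# 13 tapes at about 4 hours each: 52 hours, i.e. 2 days and 4 hours.
print(restore_estimate(13, 4))  # (52, 2, 4)
```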
May 14, 2001
On Saturday, one of the RAID cards in the science database machine indicated that all of the drives attached to it had failed. Since then the server has been running as well as it can without the science database. We're working on fixing the problem. Hopefully it won't require a restore from backup.
May 11, 2001
Last night the server started generating spurious "duplicate result" messages. The cause appears to be related to a server revision made yesterday afternoon. We've reverted to an earlier server version and are investigating the problem.
April 30, 2001
Yesterday, the Informix engine running our science database hung, apparently due to resource contention. It did so again today. We are looking into the cause. We did take advantage of the downtime to upgrade the science DB machine (one of the Sun E450s). We added 2 more CPUs and another 0.5 GB of real memory.
April 5, 2001
We are now well into the data quality step of analyzing the results from the 80 million workunits thus far distributed. To check data quality, each work unit is analyzed by 2 or 3 (and sometimes more) clients and the redundant results are checked against one another. We choose one result out of this per-workunit result set and insert it into our new master science database. The master science database will thus have one and only one canonical result for each workunit analyzed. References to all participants who contributed results for each workunit are maintained on the online science database.
We call the program which chooses the canonical result the "redundancy checker" or "RC" for short. The RC chooses the canonical result based on agreement with other results from the same workunit set. Agreement is seen when signals contained in one result match value for value with signals contained in another result from the set. We don't insist on an exact match, as signals that really do agree may vary slightly because of floating point roundoff and the effects of differential chirping. We are pleased to find that only a tiny fraction of results do not pass the redundancy test. Look for a detailed discussion of the RC algorithm and statistics in a future science newsletter. As this is an ops news page, on to a discussion of our new server setup!
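To make the idea of "agreement within tolerance" concrete, here is a minimal sketch. It is not the actual RC algorithm, which is not described on this page. It assumes each result is a list of signals, each signal is a tuple of floats reported in the same order by every client, and a relative tolerance of 1e-5; all of these, along with the function names, are hypothetical.

```python
import math


def signals_match(a, b, rel_tol=1e-5):
    """True when two signals have the same fields and each value is close."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


def results_agree(r1, r2, rel_tol=1e-5):
    """True when two results contain matching signals, position by position."""
    return len(r1) == len(r2) and all(
        signals_match(a, b, rel_tol) for a, b in zip(r1, r2)
    )


def choose_canonical(results, rel_tol=1e-5):
    """Return the index of the first result that agrees with at least one
    other result from the same workunit, or None if no two results agree."""
    for i, candidate in enumerate(results):
        for j, other in enumerate(results):
            if i != j and results_agree(candidate, other, rel_tol):
                return i
    return None
```

In this sketch, two results that differ only by floating point roundoff still agree, while a result with no partner in the set is never chosen as canonical.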
SETI@home originally had 2 databases on independent servers to serve workunits to the end processors (clients), receive the results, and keep track of user statistics. We had intended to perform science analysis on the database that received and stored the results of the workunits from the end processors. After a while we found that it was not possible to do any analysis and also receive results effectively. One function would impact the other. So it was decided that we needed to replicate the on-line database to a separate server so that we could do off-line analysis without impacting on-line response times.
The on-line server is a SUN 450 Enterprise multi-processor configuration with 2 GB ECC RAM and around 400 GB of RAIDed disks. We use several SCSI controllers with 64 MB cache with RAID 1+0 configurations. We use Informix Dynamic Server software to implement our high performance databases. This on-line database accepts a number of redundant results for each workunit. Thus, the size of the on-line database could be reduced by 66% to 75% once the canonical results have been chosen.
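The 66% to 75% figure above is consistent with keeping one result out of three or four per workunit. A short sketch of that arithmetic follows, assuming every result takes roughly the same amount of space (an assumption made here for illustration, as is the function name).

```python
def reduction_fraction(results_per_workunit):
    """Fraction of result storage freed by keeping one result per workunit."""
    if results_per_workunit < 1:
        raise ValueError("need at least one result per workunit")
    return 1 - 1 / results_per_workunit


for n in (3, 4):
    # 3 results per workunit gives 66.7%, 4 gives 75.0%
    print(n, f"{reduction_fraction(n):.1%}")
```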
The new master science database is currently being loaded by the RC. The new server is based on a Compaq 570r with 2 CPUs and 1 GB ECC RAM running Solaris and Informix. It has 2 SCSI-2W controllers and 144 GB on 16 drives. We expect the major part of the process to load this server to take around 50 to 90 days. Then we will be able to run a periodic process to update it with new results from the on-line database.
We were able to define the database using a 'fragment' function for the tables. This allows Informix to distribute a table by row id across as many disk drives as the administrator thinks is needed. While this will not get the raw IO capacity of RAIDed configurations, it is possible to tune the IO capacity depending on the characteristics of the application and specific tables' needs.
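As a rough picture of what distributing a table by row id looks like, the sketch below assigns rows to fragments (one per disk drive) with a simple modulo rule. This is only an illustration of the concept; the placement rule and function names are assumptions and do not describe how Informix decides placement internally.

```python
from collections import Counter


def fragment_for_row(row_id, num_fragments):
    """Pick a fragment (disk) for a row using a simple modulo rule (assumed)."""
    return row_id % num_fragments


def distribution(row_ids, num_fragments):
    """Count how many rows land on each fragment."""
    counts = Counter(fragment_for_row(r, num_fragments) for r in row_ids)
    return [counts.get(i, 0) for i in range(num_fragments)]
```

With sequential row ids, a rule like this spreads rows evenly, so reads and writes for a busy table are shared across all the drives assigned to it, and the administrator can tune IO capacity by choosing how many fragments each table gets.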
Currently the Spike table (where we store all reported spikes) is the largest and most frequently accessed structure in the database. So particular care was taken to ensure that the Spike table and its indices would be spread over a number of exclusive disk drives and not have IO conflicts with other tables. We had some problems getting the indices allocated exactly where we wanted them, since Informix has a fixed location for the required/implicit indices. In the end, we ensured that the fixed location would have enough disk drives and access paths to minimize the IO queues during high volume activity. In practice, the new database has been performing quite well.
March 5, 2001
Around 11:00 GMT (3:00am PST) on Tuesday, February 27, 2001, network fibers were broken, cutting off the entire Space Sciences Laboratory and Lawrence Berkeley Labs from the internet. It turns out this was the work of vandals who cut the fiber in the process of gathering a bunch of expensive copper wire. Due to the difficulty in finding the break, and then repairing in the rain and mud (on a steep slope no less), the outage lasted over five days.
The SETI@home website and data server were inaccessible for the entire length of the outage. During this time we took the opportunity to take care of a few offline items, such as splitting our server equipment into two separate "data closets" - when put in one single closet, the equipment was heating up beyond spec.
We came back on line on Saturday, March 3rd, and rather quickly recovered from the backlog of clients trying to send results and get new workunits. We are operating normally now, though we are currently dropping a few connections a second, and are trying to determine why (possibly a disk mounting issue).
February 7, 2001
We had a planned outage this morning for two hours to rearrange the server closet in order to make room for a new database system. We took this opportunity to replace one of the old RAID cards on the science database server with a new one, as well as clean up the terrible nest of power/ethernet/scsi/serial cables behind the server table.
January 30, 2001
Due to a syntax error in an Apache .conf file, one disk partition filled up, causing a general slowdown of the web server and the data server last night. This was easily fixed, and plans are being made to reduce server dependencies as much as possible (by moving them onto bigger, separate disk partitions).
January 15, 2001
An article discussing some of the details of SETI@home has been published in the IEEE Computer Society magazine "Computing in Science and Engineering." It is available in HTML form from their web site http://www.computer.org.
January 8, 2001
Over the past several weeks, and especially last weekend, hardware problems on campus have been affecting the router which carries the SETI@home traffic. These problems cause large dips in our maximum bandwidth allowance, and therefore we've been dropping many connections to our server. Central campus is committed to fixing this problem as soon as possible.
Copyright ©2001 SETI@home