SETI@home Technical News Reports - 1999

This page contains the latest technical news on the astronomical and computer-system sides of SETI@home. We'll try to update it frequently.

December 28, 1999
Due to a screwup elsewhere on campus, there was a series of short power outages at the lab early this morning (starting around 7:00am PST). We waited until we were convinced that power was stable before getting all the machines back up and running (around 10:00am PST).

December 24, 1999
Sun engineers traced our recent server failures to two bad DIMM modules. We replaced these yesterday.

December 16, 1999
On Tuesday, 12/15, we came down for what we had thought would be a 3-hour outage to rebuild the science database. The rebuild was necessary because we had run out of database extents for the workunit table. The Informix default extent size was much too low for us and we ran out abruptly and without warning. After reloading the database from flat files, the rebuilding of indices then took much longer than expected. Finally, once the rebuild was complete, we suffered what appears to be a malfunction in one of our RAID controllers, making it necessary to restore the user root database from backup. Sun is currently looking at the RAID problem.

December 11, 1999
The data server automatically rebooted because of SCSI errors, and then failed on reboot since some drives needed to be cleaned up by hand. Being a weekend and all, nobody was at the lab, so it took an hour for somebody to come and babysit the cleanup procedure. The data server is now back online and doing fine.

December 7, 1999
We have successfully transferred the user database over to the final Enterprise 450, with the desired results. It has greatly sped up our handling of the user database portions of the server connections. In addition, we are finally clearing the backlog of stats (results received, CPU time, team accounting, country stats, etc.).
For an indication of how much faster the stats are getting added, look at the rate of change graphs for total CPU time and number of results received per day. Up until now these stats have been showing how fast our user database is. In a few days, after the backlog has cleared, they'll show how many results and how much CPU time we really get in a day.

December 6, 1999
We found that our science database was missing an index. That was slowing it down drastically and accounts for the low rates of spike and gaussian insertion. We fixed the problem. We expect the result file backlog to be gone in a couple of days.

December 1, 1999
Tomorrow, between 06:30 and 07:00 PST, network traffic from Space Sciences Lab (which houses SETI@home) will be moved to a new faster backbone on the UC campus net. There will be a short (minutes) outage during the cutover.

The science database server crashed at about 3 a.m. PDT. Unfortunately, it took us until nearly 9 a.m. to get things back up and running. We're back to normal and no data has been lost, but we expect quite a backlog of connections for the next several hours.

November 28, 1999
There have been several periods of server connection problems this weekend for as yet unidentified reasons. Being a holiday weekend, the bulk of the SETI@home team was not near Berkeley and many periods went unrecognized. We apologize for any inconvenience.

November 23, 1999
Another stats bug was fixed today. User URLs containing quotation marks were gumming up various stats pages. This is now checked before attempting to display. It will take a day or so for all affected pages to be cleaned up automatically.

November 19, 1999
The stats pages were taken temporarily offline to clean out stale files and fix a bug. File permissions were being set incorrectly on team stats pages after they were edited, making them impossible to open during daily page regeneration. This has been fixed.
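
As an illustration of the November 19 permissions bug, here is a minimal sketch in Python of one way to keep a stats page readable after it is rewritten. The function name, the file name, and the 0o644 mode are assumptions made for illustration; the entry does not say how the actual fix was made.

```python
import os
import stat
import tempfile

PAGE_MODE = 0o644  # assumed mode: owner read/write, everyone else read-only


def write_page(path: str, html: str, mode: int = PAGE_MODE) -> None:
    """Write a page, then set its permissions explicitly.

    Setting the mode after every write means the result does not
    depend on the umask of whichever process did the editing.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    os.chmod(path, mode)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        page = os.path.join(d, "team_stats.html")  # hypothetical file name
        write_page(page, "<html><body>team stats</body></html>")
        print(oct(stat.S_IMODE(os.stat(page).st_mode)))
```

With permissions set this way on every edit, a later regeneration pass running as a different user can still open the file for reading.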

November 18, 1999
During a planned server outage today we upgraded the SETI@home data server hardware. Previously the data server was a Sun Ultra10. This has been replaced by an Enterprise 450, recently donated by Sun Microsystems.

November 16, 1999
Due to hurricane preparation at Arecibo, the flat feed has been lowered and the receiver power is off. We took the SETI@home data recorder off line until the emergency is over.

November 7, 1999
We're running a test on an alternate workunit deletion policy. Because of this the splitters will be generating new work units at a lower rate especially near the beginning of the test. The test will last between 5 days and one week.

November 5, 1999
Up until today, we regenerated some of the stats pages every hour. This includes the totals, top 1000 users, cpu types, operating systems, platforms, countries, and venue pages. However, the database has gotten sufficiently large that it now takes over two hours to regenerate these pages, causing multiple stats-page-processes to start before others are finished. So now we're on a four hour schedule.

November 4, 1999
A fiber between campus and the Space Lab broke around 10:00am PST this morning. Campus fixed it within three hours. During this time, both our web server and data server were unreachable outside the lab.

We've cleaned up the top gaussians page a bit. The bulk of the top gaussians were due to a single user who apparently has a MIPS machine with a faulty floating point processor. (I don't know of any properly functioning floating point processor that can square a real number and get a negative result). The page now ignores results with that user/platform combination. Other MIPS results should not be affected.

November 2, 1999
Yesterday an unnamed film crew plugged a gigawatt of lighting equipment into one of our uninterruptible power supplies. Not unexpectedly, it was interrupted, temporarily bringing down the user database machine.
At about the same time on the science database machine one of the disks failed. Thanks to RAID, the hot swap disk kicked in. Unfortunately the hot swap also failed, revealing a more serious problem. It didn't automatically kick over to the second hot swap. Today we rebooted the science database machine which brought the second hot swap online. The RAID controller worked as advertised and the science database kept operating with just a momentary lapse and no data loss.

October 26, 1999
Changed the server listen queue size from 5 (SOMAXCONN... why?) to 128. This reduced the number of dropped connections considerably.

October 25, 1999
Once again the server was down for a scheduled outage to add memory. The data server now has one gigabyte of RAM. How much this enhances its performance remains to be seen.

Arecibo Observatory is back on the air - Hurricane Jose missed.

October 20, 1999
Arecibo Observatory went off the air in preparation for Hurricane Jose.

October 19, 1999
The server was down for a scheduled outage today to add memory. Unfortunately, things didn't go as planned. After replacing the memory we got numerous ECC messages and, eventually, much crashing. After removing the upgrade, the machine wouldn't power up. We eventually did a brain transplant from one of the splitter machines to get things back to where they were. So we're down a splitter for the time being. Later in the evening the server crashed a couple more times, but with different symptoms. We found out later this was due to a flakey switch. Generally speaking, this wasn't a good day for the server.

October 6, 1999
Due to a bug in the "credit transfer" page, we have taken it off line indefinitely. Several people exploited this facility to increase their stats. We have been logging all this activity and removed all the bogus stats, but will keep this feature turned off until we figure out how to fix it.

October 1, 1999
Two of our local ftp servers went down.
One refuses to power back on - it may need a new power supply, or we may pronounce it dead. The other might eventually be up and running again with some effort. Both of these machines are over 10 years old and have served on many different projects here at the Space Lab. We still have our third local ftp server, as well as support from cdrom.com, which has always successfully handled a huge chunk of the client downloads.

September 22, 1999
The server was down for about 20 minutes in order to incorporate shorter SCSI cables and another UPS into the database/data server structure. The shorter cables should reduce SCSI errors that could cause the data server to crash.

September 18, 1999
Thanks to Ron Hipschmann, we have a new page about radio frequency interference (RFI). Ron's original page included the phrase "naked eye", which caused it to be censored by some Surf-Watch type programs. We changed it to "unaided eye".

September 17, 1999
We've had server performance problems in the past week. Some of these have involved the UCB network, which is beyond our control. However, the continued increase in usage seems to be once again pushing the limits of our servers. There are two bottlenecks: the data server and the Informix server for the user database.

The data server runs many (256 to 512) instances of our server program. It's on a machine that originally had 256 MB of RAM. This machine was thrashing (excessive paging); it took us a while to figure this out. We increased the RAM to 512 MB, and are ordering another 512 MB. Thrashing, however, is still a problem, and we eventually plan to upgrade this host to one with a memory capacity of 4 GB or more.

The Informix server has two problems. It isn't providing quite the throughput we need, and it does lengthy (20-30 second) "checkpoints" during which no transactions are handled. We have fiddled with many configuration parameters, with mixed results. We're trying to get Informix to help us with this.
Our long-term plan is to upgrade this machine to a machine with more CPU power, more memory, and a disk controller with a large write cache.

Intel has donated a couple of machines to SETI@home. Originally we planned to use these as splitters. However, we may end up using them as servers (see above).

September 8, 1999
The new server configuration is finally online. The "science" part of the database (tapes, workunits, spikes, gaussians) has been moved to the 450. Several tables were frozen while this was in progress; that's why our statistics flatlined for a few days. Most of our software had to be modified to use the correct database.

Over the past weekend we experienced a combination of server crashes and university-wide network problems. The network problems were resolved on Monday. We have installed the most up to date kernel patch to our data server. In addition, we are exploring possible SCSI bus issues, also on the data server. Because of the lengthy outages, we had a backlog of connection requests all day yesterday which resulted in widespread inability to connect. At this point however, things appear stable and all connections are succeeding.

September 6, 1999
One of the 16 disks storing WUs failed late last night; the server host crashed and failed to reboot. We got it running again around noon, minus the failed disk.

September 5, 1999
Various software issues made it impossible to configure our disk storage on the 450 as RAID 5. We're running it as mirrored disks instead. There were some glitches (e.g. 2 GB file size limit) in transferring the science database to the 450, but this is now done. When everyone's back from Labor Day holidays we'll bring up the new configuration for real.

September 3, 1999
All of the server machines are now up in our new machine room and the RAID on the 450 has been configured. We will soon be moving the science database over to this machine.

August 31, 1999
We're juggling our server machines to move them off desktops and into a separate machine room. Possible brief server outages over the next few days.

Continued difficulties configuring the 450+RAID for Informix.

The apparent decrease in number of new results over the last week, and the associated decrease in added CPU time, new spikes and new gaussians, was due to a decrease in performance of the database server. The decrease in performance was due to tweaks we hoped would increase performance, but apparently didn't. The decrease resulted in a backlog of files waiting to be processed. We've undone the damage, and hope the backlog will disappear soon.

August 27, 1999
A problem with our network connection to the UCB campus has caused a high packet-loss rate, and hence a general performance problem, over the last day or two. The problem has now been fixed.

August 18, 1999
Added audio files (.wav, .au) to web site.

August 17, 1999
Added a histogram of number of results returned per user.

We're having delays in setting up Informix on the 450 with the RAID. Informix is sending out someone later this week.

Temporarily put the Enterprise 450 on splitter duty, hence the rise in the splitting rate.

August 15, 1999
Put the poll mechanism online at about 1:30 AM. Within 10 minutes there were a dozen responses.

August 12, 1999
User-account creation hasn't worked for the last 24+ hours. (We added a field to the database but forgot to recompile all programs.)

August 11, 1999
The Informix server software has been installed on the 450 server and is working OK using a single disk. Tomorrow we'll try it out with the RAID.

Added an alphabetical list of groups overall and within categories. It will be available soon.

Implemented a new "poll" mechanism. Will test it for a day or two, then go online.

Changed time display from "hours (years)" to simply years if it's more than one year.
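
The August 11 change to the time display could be sketched as follows in Python. This is only an illustration: the function name, the 365-day year, and the number of decimal places are assumptions, and the real stats code is not shown on this page.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def format_cpu_time(hours: float) -> str:
    """Format CPU time in the style described in the August 11 entry.

    More than one year: show years only.
    Otherwise: keep the older "hours (years)" form.
    """
    years = hours / HOURS_PER_YEAR
    if years > 1.0:
        return f"{years:.3f} years"
    return f"{hours:.1f} hours ({years:.3f} years)"


if __name__ == "__main__":
    print(format_cpu_time(17520))  # two 365-day years of CPU time
    print(format_cpu_time(876))    # well under a year
```

The first call prints "2.000 years" and the second prints "876.0 hours (0.100 years)".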

August 6, 1999
The replacement of the circuit breaker went smoothly and SETI@home is back online.

August 5, 1999
More on the power outage: We've just learned that the power to the Space Sciences Laboratory will be taken down tomorrow August the 6th, at noon PDT. The power will be down for approximately two hours. During those hours, SETI@home's servers will be offline and unavailable.

August 4, 1999
We have just received word that power to the Space Sciences Laboratory building will be down for a period of two hours sometime this Thursday or Friday in order to replace a faulty elevator circuit breaker. We are currently working to find out the exact time and date which we will post to this page once we know the specifics. During this power outage all of SETI@home's servers, including the data server and the web server, will be unreachable.

July 30, 1999
Server upgrade update: we have received the RAID controller board for the new Sun 450, and have set up about 150 GB of disks as a RAID-5 array. This will be used for the science part of the database.

There have been intermittent hardware problems with a couple of the machines running the splitter; that's why our rate of splitting has been going up and down.

Fiddled with gnuplot to make nicer-looking graphs.

A few ISPs were rejecting mail sent from SETI@home, so some users were unable to get their passwords. We reconfigured our mail server, so these users should now be able to receive our e-mails.

July 27, 1999
We have added a new section to the web site showing a sky map of work units generated along with line graphs of some of the project's interesting statistics.

July 23, 1999
A power outage on the Berkeley campus brought down our network between the hours of 2:30pm and 5:00pm.

July 21, 1999
Added "Last 24 Hours" column to Totals page under statistics; added FLOPs row (based on 2 TeraFLOPs per result). Added top 100 users under Locations (home, work etc.).

July 17, 1999
We are awaiting the arrival of a RAID controller card and database server software for our Sun 450 server. When these items arrive we'll do a major (and hopefully final) upgrade on our server architecture.

July 15, 1999
The SETI@home/SERENDIP receiver at Arecibo Observatory, which was struck by lightning, has been repaired by the Observatory staff. The receiver is back on line and we are collecting data.

July 12, 1999
We recalculated the scores for each group, so each group now has the credit of all its current members. We did this to fix the problem created by the Mac bug that processed workunits in 10 seconds. We corrected the scores of the users that were affected by this bug, but because people can switch from one group to another there is no record of which group they were in when the bug was inflating their scores. So, we just started over.
This is a one-time event, however. From now on, group scores will be calculated as they were before.

Fixed the group creation problem and removed directory browsing from the /stats and /stats/team directories. Also, the box that contains the group name in the "Top 100" is now only 100 pixels wide.

July 9, 1999
We modified the personal statistics CGI so that it now states your rank based upon percentage, in addition to place.

July 8, 1999
One of the splitters has died a horrible death, putting us back down to two. Tech support has been called...

We modified the "Top 100" Group listings so that HTML tags in the Group's name are no longer interpreted as HTML, but are printed as-is. You can still add HTML to your Group's page, however.

July 7, 1999
We've found and fixed the bug in the Mac 1.05 version. We're testing a new version and will release it probably tomorrow. The bug introduced a lot of bad data into our accounting database; some people got credit for thousands of results that took 1 or 2 CPU seconds to compute. In the interest of fairness, we're going to attempt to "undo" these credits, including the totals under teams, CPUs, countries, etc. So some people may see their totals go down.

July 6, 1999
The personal statistics look-up now has your credit, your group, groups you have founded and shows your overall SETI@home rank. You can access these stats via either the "User Account Area" or the "Current Statistics" pages.

July 4, 1999
Someone broke into our web server and replaced our home page with a picture of Alf. This understandably caused concern about the security of our FTP servers (separate machines from the web server). These are highly protected, but we will double-check our security mechanisms.

July 3, 1999
The SETI@home clients for MIPS/Irix (5.3, 6.2 and 6.4) had a problem affecting the data analysis due to a bug in the source code. They have been removed from the download page, and results from these versions are being discarded.

July 2, 1999
At 1 AM this morning we noticed a bug in the new Mac client (1.05); it sometimes returns results after only a few CPU seconds. The problem apparently goes away if you reboot. We are trying to fix this bug, and will hopefully have a new Mac version soon. In the meantime, we have reverted to the previous version (1.0) for downloading, and we changed the server to ignore results generated in less than 10 CPU minutes.

July 1, 1999
Due to a burst steam pipe in the vicinity of the campus-wide domain servers and routers, the entire Berkeley campus spent most of the day disconnected from the Internet. Sorry for any inconvenience.

June 30, 1999
The garbage collector, the server, and the splitters have reached an equilibrium state. Work units are being generated as fast as they are being deleted, and they are being deleted about as fast as results are coming back.

June 29, 1999
The third fast splitter machine is now online. We've taken the slow one offline, and transferred the tape it's working on to the new machine.

June 28, 1999
We added a mechanism that allows the founding member of a group to edit the group's information, such as the name, description, url and type. To edit a group, go to the group's page and click on "Edit this Group".

We now have 750,000 work units on disk. The splitters are running at about 140,000 work units per day, a bit slower than our first estimate. A third fast splitting machine should come on-line this week. We're approaching the disk free-space threshold where the garbage collector will kick in and start deleting work units for which results have been received, or which have been sent to multiple users.

The SETI@home server has been moved to a new, faster machine. The 360 GB of work unit storage, which was previously divided between two machines, is now consolidated here. The server was down for a couple of hours today while we made these changes.
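
The garbage-collection policy described in the June 28 entry can be sketched roughly as follows, in Python. The record fields, the work unit names, the 10% free-space threshold, and the reading of "multiple users" as two or more sends are all assumptions made for illustration; the collector's real parameters are not given on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkUnit:
    name: str
    times_sent: int        # how many users this work unit was sent to
    results_received: int  # how many results have come back for it


def deletable(wu: WorkUnit, min_sends: int = 2) -> bool:
    """A work unit may go if a result came back or it went to multiple users."""
    return wu.results_received > 0 or wu.times_sent >= min_sends


def collect(units: List[WorkUnit], free_fraction: float,
            threshold: float = 0.10) -> List[WorkUnit]:
    """Return the work units to delete.

    Nothing is selected while free disk space is at or above the threshold.
    """
    if free_fraction >= threshold:
        return []
    return [wu for wu in units if deletable(wu)]


if __name__ == "__main__":
    units = [
        WorkUnit("wu_a", times_sent=1, results_received=1),
        WorkUnit("wu_b", times_sent=3, results_received=0),
        WorkUnit("wu_c", times_sent=1, results_received=0),
    ]
    print([wu.name for wu in collect(units, free_fraction=0.05)])
    print([wu.name for wu in collect(units, free_fraction=0.50)])
```

With 5% of the disk free, the sketch selects wu_a and wu_b; with 50% free it selects nothing, since the collector only kicks in below the threshold.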

June 27, 1999
We modified the SETI@home server to use a fixed set of processes, each with a persistent database connection. This should make the server somewhat more efficient, because it eliminates a process fork and the creation of a database connection. It also eliminates the need for a separate process to handle user accounting updates.

June 26, 1999
The fast splitters finished tapes 09ja99aa and 02mr99aa. The fast splitters started on tapes 10ja99aa and 01mr99ab (not to be confused with 01mr99aa). The slow splitter is still chugging away on 01mr99aa.

June 24, 1999
As a result of a server bug (now fixed) credits for the last 24 hours were lost.

We modified the stats pages so that HTML tags in the User's name are no longer interpreted as HTML, but are printed as-is. Also, the mailto tag was added to the User's email address when displayed in stats.

June 23, 1999
Two new fast machines are splitting data now. The effective rate should be near 180,000 work units per day.

Modified the SETI@home server to detect when a user is repeatedly sending us the same result.

June 22, 1999
We cross-check interesting results by analyzing the work unit ourselves. We did this for the work unit containing the highest-score Gaussian, and didn't get the same result, so we removed that result from the database.

New hardware from Sun (3 Ultra-10 workstations and 6 DLT drives) has arrived. We are configuring the machines.

June 21, 1999
We are adding mechanisms for group leaders to edit their group information and to merge their group with another group.

June 20, 1999
The modified accounting scheme is in place and seems to be working; credits will now appear synchronously in the screensaver.

Writing work units from WS4 to WS3 (via NFS) caused WS3 to get NFS timeouts and run out of processes. Splitting is turned off until tomorrow.
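
The June 24 stats-page change (printing HTML tags in a user's name as-is and wrapping the email address in a mailto link) might look something like this sketch, which uses Python's standard `html.escape`. The function name and the output layout are hypothetical; this is not the project's actual stats code.

```python
import html


def render_user(name: str, email: str) -> str:
    """Return an HTML fragment that shows the name literally, linked by mailto."""
    safe_name = html.escape(name)                # '<' becomes '&lt;', and so on
    safe_email = html.escape(email, quote=True)  # also safe inside an attribute
    return f'<a href="mailto:{safe_email}">{safe_name}</a>'


if __name__ == "__main__":
    print(render_user("<b>Alice</b>", "alice@example.com"))
```

A browser displays the escaped name as the literal text `<b>Alice</b>` rather than as bold text, which matches the "printed as-is" behavior described in the entry.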

June 17, 1999
We changed the way accounting is done so that the User Info fields in the screensaver will be updated immediately after a result is sent (this has become our #1 user complaint). When a result is received, the user record is updated immediately (not by the process that handles the request, but by a dedicated process using a shared-memory scheme similar to that used for work units). The offline accounting program (which has now almost caught up, by the way) still handles other credits.

June 16, 1999
Due to popular demand, we reset the group credits (CPU time, #results) to the sum of the credits of all people belonging to the group.

June 15, 1999
Accounting for groups has not been working correctly. Credit was being added when people originally joined, but not for their work thereafter. This is now fixed, but we can't restore the missing credits.

We've decided to stop doing accounting of "work units sent" for anything other than individual users. This will be removed from the Statistics pages soon.

The work unit pipeline is now fully operational again. Users will be seeing data recorded on a variety of dates over the last few months.

June 14, 1999
As a result of the inode problem we've made some changes to the server architecture. The idea is a sort of triage:
1) time-critical things: do them right away;
2) necessary but not time-critical things: queue them up; increase efficiency by merging database updates;
3) other things: apply all remaining resources, but do not queue.
Here's a summary of our server functions and how we're handling them:

June 12, 1999
Result files have been building up and we've been running out of inodes on that file system. This tends to bring down the data server.

June 11, 1999
We are adding a mechanism for users to edit their accounts (email address, name, etc.) through a web interface.

June 9, 1999
We now have a fourth Sun workstation, WS4, a Sun Sparc 20. We moved the DLT drive to WS4 and are running the Splitter again.

We plan to modify the SETI@home server to use the database more efficiently, querying it for a thousand or so work units at a time to send.

June 7, 1999
We have formulated a plan to use the new Sun hardware when it arrives later this month. In a nutshell:

June 4, 1999
We learned that the SETI@home server has been repeatedly issuing the same 115 work units for the last few days. This was fixed immediately, and we put a notice on the web site. We're back to issuing our existing 31,415 work units. The problem was due to a subtle operating system bug, but we feel bad that we didn't catch it sooner.

June 1, 1999
Sun Microsystems has offered a substantial donation of new hardware: 4 workstations, 4 DLT drives, a large amount of disk storage, and a 450 4-CPU Enterprise Server. The workstations and tape drives are scheduled to arrive 6/15, and the 450 is slated for end of June.

May 29, 1999
The receiver front end at Arecibo is not working; engineers are working to fix it. Fortunately we still have a large backlog of tapes starting from December 1998.

May 28, 1999
Quantum has donated about 360 GB of disk drives.

May 27, 1999
To address performance problems, we have changed the way our server works. Before, when a user requested a work unit, the SETI@home server queried the database server to find the best work unit to send, and it updated the "# work units sent" field of the user, their country, their team, etc. This was overloading the database.

In our new scheme, the SETI@home server avoids using the database as much as possible. It generates "flat files" that record work units sent and results received. An off-line program processes these files and updates the accounting records in our database. As a result of this "delayed accounting", users won't see credit for their results immediately on the web site or in their screensaver. There will be a delay, depending on how far behind the off-line processing gets.

We changed the word "Team" to "Group" on the web site because the SETI Institute has a fundraising program called "Team SETI", and there was concern about confusion.

May 25, 1999
Our fix to the team mechanism has ANOTHER loophole. We changed the mechanism again (this time for sure??).
We didn't zero out group membership this time.

May 24, 1999
Our Team mechanism has a problem: it's possible to add any user to your team. Some teams have used this loophole to "hijack" users with lots of CPU time. We fixed this (we thought) and emptied out all teams.

May 23, 1999
We added the "Team" mechanism.

May 22, 1999
Because of the unexpectedly large number of users, our server has become overloaded. Clients are unable to connect and get error messages. We're getting thousands of email messages every day about this. To reduce the CPU load on WS3 we have temporarily stopped running the Splitter. We have 31,415 work units on disk right now, and we'll issue these until we can start generating new ones.

May 17, 1999
The official project launch. Our server system consists of three Sun workstations:
The Splitter, which is CPU-intensive, can generate work units only at about 35% of the rate at which data arrives. Once we get more tape drives and more CPUs, we'll be able to keep up with the flow of data. Until then, we'll accumulate a backlog of tapes.

An important component of SETI@home, Phase 2 Processing, is under development. This phase involves sifting through our database of Spike and Gaussian records, searching for events that are identical in space and frequency but different in time (typically by several months).

May 14, 1999
We added www.cdrom.com as an FTP mirror site (thanks, guys!).

May 13, 1999
The Windows and Mac versions of the client were put online today. This happened to coincide with a national TV news story, and immediately our FTP server was overwhelmed. We hastily borrowed several old Sparc IPCs and set them up as FTP servers, but this didn't solve the problem.