I'm looking for amusing stories of sysadmin accidents: deleting the CEO's email, formatting the wrong hard drive, and so on.
I'll add my own story as an answer.
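I had a lot of fun discovering the difference between the Linux "killall" command (kills all processes matching the specified name, useful for stopping zombies) and the Solaris "killall" command (kills all processes and halts the system, useful for stopping a production server in the middle of peak hours and getting all your coworkers to laugh at you for a week).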
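I was responsible for what was, at the time, a Netscape product. While playing around in the admin forms (it was a web-based interface), I found a big (and I swear it was red) button that said Delete User Database. No problem, I thought. Let's see what options it gives me when I hit it. If there are no options, surely there will be a confirmation prompt.
Yeah, no confirmation. No options. No more users.
So I went over to Mr. Solaris Sysadmin and said I desperately needed a restore from tape, to which he replied, "I'm not backing that box up."
"Uh, come again?" I retorted.
"I'm not backing that box up. It's on my list of things to add to the backup rotation, but I haven't gotten around to it yet."
"This server has been in production for nearly 8 months!" I screamed.
He shrugged and replied, "Sorry."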
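Many years ago, the company I worked for had a client who backed up their NT 4.0 server every night to a Jaz drive (like a high-capacity Zip disk).
We set up a batch file that ran overnight as a scheduled job. Every morning they would collect last night's disk from the drive, and before leaving in the evening they would insert the next disk in the sequence.
Anyway, the batch file looked something like this (the Jaz drive was drive F:)...
@echo off
F:
deltree /y *.*
xcopy <important files> F:
Anyway, one night they forgot to put the disk in. The change to drive F: failed (no disk in the drive), and the batch file carried on running. The default working directory for the batch file? C:. It was the first time I had ever seen a backup routine destroy the server it was backing up.
I learned a little something about system administration (and exception handling) that day.
Jim.
PS: The fix? "deltree /y F:\*.*".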
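The failure mode here (a destructive command running in whatever directory happens to be current) can be guarded against by checking the target before wiping it. Below is a minimal Python sketch of that idea, not the batch file the client actually ran. The function name `wipe_backup_target` and the `.backup_target` marker file are assumptions made up for this illustration: the marker would be created once on every backup disk, so a missing disk makes the check fail instead of redirecting the delete.

```python
import shutil
from pathlib import Path


def wipe_backup_target(target: Path, marker_name: str = ".backup_target") -> int:
    """Delete the contents of `target`, but only if it is clearly a backup disk.

    `marker_name` is a hypothetical convention: an empty file placed on every
    backup disk. No disk in the drive means no marker, so nothing is deleted.
    """
    target = target.resolve()
    if not target.is_dir():
        raise RuntimeError(f"backup target {target} is not available")
    if not (target / marker_name).is_file():
        raise RuntimeError(f"{target} has no {marker_name} file; refusing to wipe")

    removed = 0
    for entry in target.iterdir():
        if entry.name == marker_name:
            continue  # keep the marker so the next run can verify the disk
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

The important design choice is the same as in the PS above: the destructive step is always given an explicit, verified path and never relies on the current working directory.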
root@dbhost# find / -name core -exec rm -f {} \;
Me: "You can't get in? OK, what's the database name?"
Cu: "Core."
Me: "Oh."
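Matching on the name "core" alone is what bit here. One possible safeguard, sketched below in Python, is to check that a candidate really is a core dump before touching it, and to list candidates rather than delete them. The sketch assumes the dumps are ELF core files (ELF magic bytes, `e_type` equal to 4), which holds on Linux and Solaris but is an assumption about this system; the function names are made up for the example.

```python
import os
import struct
from pathlib import Path
from typing import List

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value that marks a core file in the ELF header


def looks_like_core_dump(path: Path) -> bool:
    """Return True only for regular files whose ELF header says 'core file'."""
    if path.is_symlink() or not path.is_file():
        return False
    try:
        with path.open("rb") as fh:
            header = fh.read(18)
    except OSError:
        return False  # unreadable files are left alone
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return False
    byte_order = "<" if header[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    (e_type,) = struct.unpack(byte_order + "H", header[16:18])
    return e_type == ET_CORE


def find_core_dumps(root: Path) -> List[Path]:
    """List candidate core dumps under `root` without deleting anything."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = Path(dirpath) / name
            if name == "core" and looks_like_core_dump(candidate):
                hits.append(candidate)
    return sorted(hits)
```

Reviewing the returned list before deleting anything would have shown the database files long before the customer called.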
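I love how everyone qualifies their story with "when I was young/green", as if they would never do it again. Accidents can happen to even the most seasoned professionals.
My own worst moment was so bad that I still get palpitations thinking about it...
We had a SAN with production data on it. Critical to the company. My "mentor" decided to extend a partition to free up some disk space. Can you see where this is heading? He said the SAN software could do it live, during production hours, and nobody would notice. Alarm bells should have been ringing, but they were conspicuously silent. He said he had done it "plenty of times before" with no problems. But here's the thing: he got ME to click the button that said "Are you sure?"! As I was new to the company, I assumed this guy knew what he was talking about. Big mistake. The good news was that the LUN got extended. The bad news was... well, I knew there was bad news when I started seeing disk write errors on the Windows boxes.
I'm glad I was wearing brown pants.
We had to explain why 1TB of data had disappeared at lunchtime. That was a really, really bad day.
It's actually a good principle: before you do something you have doubts about, imagine having to explain it to management if it goes wrong. If you can't think of a good answer to explain your actions, then don't do it.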
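One morning, during working hours, Nagios started saying that it couldn't connect to a non-critical server. OK, walk over to the server room. It's an old server, a Dell 1650 bought in '02, and we knew the 1650s had always had hardware problems. The PFY jabs the power button. Nothing. He hits it again and holds it for five seconds to "force power on"... which overrides the BMC's error protection, since without a DRAC there is no way to check the BMC logs unless power is going to the chassis.
The machine starts POST, then dies again. I'm standing over it and go, "I smell smoke." We pull the server out on its rails, and one of the power supplies feels warm, so the PFY pulls it and is about to close the box up. I say, "No, that's not power supply smoke, that's motherboard smoke."
We open the case again and look for the source of the burning smell. It turns out that an inductor coil and a capacitor thingy had blown off the voltage regulator on the motherboard, spraying molten copper and capacitor goop over everything, shorting out a bunch of stuff, and basically making a big mess.
The worst part for me was realizing that I had smoked enough hardware to recognize the difference between the smell of a burnt motherboard and that of a burnt power supply.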
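Three days ago (really), I was remotely logged in to a school server, installing Service Pack 2 on a Windows Server 2008 file server.
I decided to schedule the required reboot for late at night, when teachers wouldn't be logged in finishing their year-end report cards. I typed something like:
at 23:59 "shutdown -r -t 0"
...which might have worked just fine.
But then I second-guessed myself. Was my "shutdown" syntax correct? I tried to view the usage help by typing
shutdown /h
...and immediately lost my RDP connection. Panicking, I Googled the syntax. A quick search revealed that the Server 2008 version of shutdown includes a /h switch, which (as you might have guessed) hibernates the machine.
Teachers started calling me within minutes to say they couldn't open or save the report cards they had been working on. Since I was off-site and the server room was locked, I had to call the school principal directly and walk her through powering the machine back on.
Today I brought homemade cookies in for everyone as a form of apology.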
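At a previous job, we had a great home-grown system that logged and archived every single piece of mail that entered, left, or stayed within the company.
Blew away your whole mailbox? No problem! Looking for a mail someone sent you a week/month/year ago, but you can't remember who sent it or what the subject was? No problem! We'll redeliver everything from February to a special folder for you.
At some point, the company's CEO needed to monitor mail going between a competitor and an internal salesperson. So we set up a script that ran every night and delivered the previous day's relevant mail to the CEO. No problem!
About a month later, word came down from on high of a double-plus-urgent problem. It seemed that while the CEO was reading through the list of mails sent to $OTHERCOMPANY, he came across this one:
To: somebody@$OTHERCOMPANY
From: CEO
Subject: CEO has read your message (subject line here)
Naturally, the CEO, being an important person, was too busy to click on all those "Send Read Receipt" dialogs in Outlook, and had configured his client to just send them all. One of the messages caught by the monitoring filter had a read receipt request set. Guess what Outlook did? It certainly blew the cover off the "secret" monitoring.
Our next task: add rules to the mail filter to block outgoing read receipts from the CEO to that company. Yeah, it was the easiest way. 🙂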
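Ah, this was about 10 years ago, when I was still getting my feet wet. I had the joy of installing battery backups on all the programmers' computers. They also wanted software loaded that would warn of a power outage and shut the machine down properly.
So I set it up on my own computer first, of course, to test everything and make sure it all worked. I disconnected the power cord and the message came up on my screen: "External power lost, beginning system shutdown".
So I thought, hey, cool, it worked. But for some strange reason that I don't even remember, it sent that message out as a network message, so all 200+ computers in the company received it, and over 100 of those users were programmers.
Yeah, talk about a mass freak-out!
I kept my head down in that place for a while!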
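I often used the "sys-unconfig" command on Solaris machines to reset a machine's name service, IP address, and root password. I was on a user's system, logged in to the building's install server to look something up (as root), and then, forgetting that I was logged in to another machine (non-descriptive "#" prompt), I ran the "sys-unconfig" command.
# sys-unconfig
WARNING
This program will unconfigure your system. It will cause it to revert to a "blank" system - it will not have a name or know about other systems or networks.
This program will also halt the system.
Do you want to continue (y/n) ? y
Connection closed
#
That "Connection closed" message slowly turned into panic... which machine had I been logged in to when I ran that command?
The worst part wasn't the hard time my coworkers gave me; it was that I did the same thing again a month later.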
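A bare "#" prompt gives no hint about which machine is about to be wiped. One cheap safeguard is to wrap destructive commands in a check that the current host is the one you think it is. The Python sketch below is only an illustration of that idea; `require_host` is a made-up helper, and it compares short host names only, which assumes short names are unique at the site.

```python
import socket
from typing import Optional


def require_host(expected: str, actual: Optional[str] = None) -> None:
    """Raise unless we are running on the host the operator named.

    `actual` defaults to this machine's host name; it is a parameter only so
    that the check can be exercised without changing hosts.
    """
    actual = socket.gethostname() if actual is None else actual
    # Compare short names so "install01" matches "install01.example.com".
    if actual.split(".")[0].lower() != expected.split(".")[0].lower():
        raise RuntimeError(
            f"refusing to continue: this is {actual!r}, not {expected!r}"
        )
```

A wrapper script would call `require_host("name-the-operator-typed")` before running anything irreversible, forcing the operator to state the target explicitly.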
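I have a pretty good one. Admittedly, it was before my time as a sysadmin, but it's still tech related, so I figured I'd add it.
Back in the day, I worked as a satcom/wideband tech for the US Air Force. Fresh out of tech school, I found myself stationed in South Korea. Shortly after arriving on station, an opportunity arose to travel down south with the "big guys" who had been there a while and actually work on some real-world (i.e. "production") equipment.
I went down with the crew, an eager young tech, quite excited to get my hands on actual equipment that was passing live military voice and data traffic.
To start me off slowly, they handed me a manual, turned to the preventive maintenance section, and pointed me in the direction of four racks filled with several large digital multiplexers. The equipment was easy enough; we had covered the same equipment in tech school.
The first page of the manual read: "Apply power to the digital multiplexer. Turn both rear switches to the ON position and wait for the equipment to power up, then begin tests." I looked up, and power was already applied!
I was in a quandary. Unsure how to proceed, I shot my best "Umm, kinda lost here" look at the senior tech.
He looked at me and laughed. "No, no, it's OK. You can ignore that part of the checklist." Then, noticing the look on my face (since we were taught in school to NEVER ignore any part of any checklist, and that doing so would bring certain death and destruction), he put on a serious face and said, "Ignore ONLY that part. Follow the rest of it to the letter!"
Dutifully, I ran through the multi-step PM instructions, happy as a clam and proud that they were letting such a low-ranking (albeit smart) tech do this important work.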
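Somewhere between the fifth and sixth preventive maintenance checklist on these huge multiplexers, I started to notice an increased level of activity around me. Phones were ringing and people were moving quickly. Quizzical looks were being exchanged.
Finally, a group of people ran up to me, led by the senior tech who had brought me down.
"Hey! We're seeing huge outages in data traffic, and we've isolated/traced the path back to the racks you're working on! Are you seeing anything weird..."
(At that point he was cut off by another troubleshooter, who had made his way to the first group of multiplexers I had performed PMs on.)
"Holy cow! They're off! He's turning them off!!!!"
I watched as they hurriedly performed the first step in the manual, "Turn both rear switches to the ON position..." When the senior tech was done, he came over to me and asked incredulously what I had been thinking, turning off critical pieces of equipment.
Scared out of my wits, I handed him the checklist I had been given, swearing that I hadn't deviated from it at all. I had followed it "to the letter", just as he had instructed.
After a moment he laughed and pointed out where the problem lay.
In the manual, the LAST step in the preventive maintenance checklist was:
"Record the final probe reading, wipe down the front panel, removing all dust and particles, then turn both rear power switches to the OFF position."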
🙂
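This is a sysadmin accident of sorts, in that sysadmins occasionally have to haul a large number of machines from point A to point B (and A and B always seem to be separated by several flights of stairs in a building with no elevator). On one of the day's trips, I stopped three flights up from the basement loading level to chat with someone coming down, rested the full-size tower I was carrying on the handrail of the open stairwell, and... well, you guessed it... slightly lost my grip on it. It plunged straight down the well, and when it reached the bottom, well... not so much functional! Total salvageable parts: two sticks of RAM, one floppy drive, and one ISDN card (God bless the Hermstedt engineers!). Everything else was cracked, rattling, or smashed to tiny bits.
By the grace of God, nobody was walking underneath, and fortunately I was the boss, so I got to keep my job. I felt really sick for an hour or so, though.
Moral: gravity always wins!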
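I was reloading a system for someone, and during the manual backup process I asked him, "Do you use any other programs?" and "Is there anything else important on the computer?"
He said "no" SEVERAL times.
I was convinced, and formatted the drive.
About 30 minutes later he said, "Oh my God," and put both hands on his head.
It turned out he had been working on a book in a specialized program for over 10 years. This was back when programs used to keep user data in their Program Files directory, and I had missed it.
Whhhhooooops.
He wasn't mad at me, but it was a sobering feeling.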
My personal favorite isn't mine, and I'm VERY glad of that. Take a look here.
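This didn't happen to me, but...
I worked at a company that made software which ran on Linux machines provided by the client. We would essentially "take over" the machines, completely reconfigure them to our specs, and do all the management and monitoring. In effect, we were a team of 10-15 sysadmins managing thousands of servers for hundreds of customers. Mistakes were bound to happen.
One of our team found some problems on a server (a backup server, I believe) and decided he should run fsck on it. He stopped all the relevant services, made sure the system had a recent backup, and then ran fsck, but it complained that the filesystem was mounted. Since we were remote and had no out-of-band access (DRAC, iLO, etc.), he couldn't unmount it to run fsck, but he was pretty sure that it was safe to fsck a mounted filesystem if you were careful.
He decided to try it himself first, running fsck on his own root partition, with predictable results: he corrupted his root partition and could no longer boot.
Confused, he went over and talked to our team lead. The lead said he was pretty sure you couldn't do that, and the team member said, "Sure you can!", took the lead's keyboard, and showed him that you could, by running fsck on the lead's root partition. Which completely corrupted HIS root partition.
End result? No customer data was lost, thanks to the team member's testing. Two days of employee productivity were lost, but that is worth far less than the data on the customer's machine. And for the record? You can run fsck on a mounted drive, but only to verify data, not to repair it. That was the team member's mistake.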
–
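To add my own story: I was working at the same company and trying to reset a user's password. Our system refused to let me set it to the password he needed, because it kept track of old password hashes and refused to let you reuse a password. The mechanism was simple: it validated the password against the most recent hashes in the database.
(For the record, it needed to be the old password because it was a shared account, and making sure everyone knew the new password was impractical.)
I decided to just go into the user database and delete the new records so that it would use the old one. It's all just SQL (running an ancient version of Sybase), so it would be easy. First, I had to find the records: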
SELECT * FROM users_passwords WHERE username='someuser';
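I found the old record he wanted to keep; there were two more ahead of it. I decided to be clever and just delete anything newer than the old record. Looking at the result set, I saw that the old password was ID #28 in the database and the new ones were ID #several-thousand (it was a very busy system). That's simple, all the rows to remove were > 28, so: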
DELETE FROM users_passwords WHERE id > 28;
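There's nothing worse than doing some simple row pruning and seeing "212,500 rows affected". Fortunately, we had two master database servers (with the user IDs), and Sybase (at least our version) didn't support automatic replication, so the old records were not automatically wiped on the other server. It was a trivial matter to get a dump of the users_passwords table and re-import it. Still, a pretty big "oh crap!" moment.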
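The DELETE above is missing the username condition that the SELECT had, and nothing stopped it once it matched far more rows than expected. Here is a minimal sketch of both safeguards, using Python's built-in sqlite3 as a stand-in for Sybase (an assumption; the real system was not SQLite). The table and column names `users_passwords`, `username`, and `id` come from the story, while the function name and the `max_rows` limit are made up for the example.

```python
import sqlite3


def delete_newer_passwords(conn: sqlite3.Connection, username: str,
                           keep_id: int, max_rows: int) -> int:
    """Delete one user's password rows newer than `keep_id`.

    The delete runs inside a transaction and is rolled back if it touches
    more than `max_rows` rows, so a too-broad match never gets committed.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            "DELETE FROM users_passwords WHERE username = ? AND id > ?",
            (username, keep_id),
        )
        if cur.rowcount > max_rows:
            raise RuntimeError(
                f"{cur.rowcount} rows matched, expected at most {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()  # undo the delete on any failure, including the limit
        raise
```

In the story, two rows were expected, so a call like `delete_newer_passwords(conn, "someuser", 28, max_rows=2)` would have either removed exactly those rows or left the table untouched.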
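Another favorite of mine:
When setting up a computer and a local laser printer, I had the bright idea of plugging them both into the computer's UPS. Ever tried printing to a local laser printer when it's plugged into a desktop UPS? Well, if you don't know, it tends to draw all the amps... which reboots the computer... and the print job never finishes...!
Ever got the call: "Every time I print, it reboots my computer and doesn't print!"?
Oops!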
JFV
A DELETE statement without a WHERE clause, on the client's live customer database.
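Typing kill 1 as root. init and all her children died. And all of their children. And so on. Oops.
What I had meant to type was kill %1
After I realized what I had done, I ran to the control panel of a large wool-sorting machine and hit the emergency stop button. That halted the machine, which mattered because I had just killed the software that controlled it.
We were in the middle of a power outage and saw that the UPS was running at 112% of its configured load. We were running on the generator at the time, so this wasn't much of a problem.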
So we went around pulling backup power cables to reduce the power usage on that UPS (we had two, one much larger than the other). We got to the network switch which ran the server room (this was the server room with all the internal servers for the company, with the customer facing servers in another server room). The switch was a large enterprise class switch with three power supplies in it. The supplies were N+1 so we only needed two in order to run the switch.
We picked a cable and pulled it out. Unfortunately for us, the other two were plugged into a single power strip, which promptly blew as the load went up on the two power supplies plugged into it. The sysadmin then panicked and plugged the third cable back in. The switch tried to fire up, putting the entire load of the switch onto the single power supply. Instead of shutting down, the power supply exploded in a shower of sparks not 12 inches from my face, sending me jumping back into the rack of servers.
Out of instinct I tried to jump to the side, but unfortunately on my left was a wall, and to my right was a very large 6'4" facilities guy. I somehow managed to jump over him, or possibly through him, bouncing off the Compaq racks (the ones with the thin mesh fronts) without putting a hole in the rack and without touching the facilities guy.
At some point in my career, a legal investigation at the company I was working for placed a requirement on us that all email be kept from "this day" forward, until told otherwise. After about a year of storing daily full backups of our Exchange environment (1TB nightly), we started to run out of space.
The Exchange admins suggested that we only keep every 8th copy of the email. To do this, we had them restore a day's worth of the Exchange databases, extract the email they needed (specific people flagged for investigation), and re-archive it. They did this for every 8th day of email across all of our backups. The 8th day was chosen because Exchange had a parameter set whereby "deleted items" are kept in the database for 8 days.
After they would finish each archive, I would go back through and delete any backups which were older than what they had archived.
TSM does not have an easy way to do this, so you have to manually delete objects from the backup database.
I wrote a script which would delete all backups older than some date, by way of a date calculation using the difference between today and the date in question. One day I had to delete about a month's worth of backups, except that when I made the date calculation I made a typo, entered the date as 7/10/2007 instead of 6/10/2007, and ran the script. I accidentally deleted an entire extra month's worth of data, which was part of a very important lawsuit.
After that, I added some steps to the script to confirm that you wanted to delete the data, and show you what it was going to delete…
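The "show what will be deleted, then confirm" step described above might look something like the following Python sketch. It is not the original script and does not talk to TSM: backups are modelled as a plain mapping from a name to a date, and `delete` is whatever callable actually removes one backup. All names here are hypothetical.

```python
from datetime import date
from typing import Callable, Dict, List


def plan_deletions(backups: Dict[str, date], cutoff: date) -> List[str]:
    """Return the names of backups strictly older than `cutoff`, oldest first."""
    old = [name for name, taken in backups.items() if taken < cutoff]
    return sorted(old, key=lambda name: (backups[name], name))


def delete_with_confirmation(backups: Dict[str, date], cutoff: date,
                             delete: Callable[[str], None],
                             ask: Callable[[str], str] = input) -> List[str]:
    """Print the deletion plan and delete only after an explicit 'yes'."""
    plan = plan_deletions(backups, cutoff)
    print(f"Cutoff {cutoff.isoformat()}: "
          f"{len(plan)} of {len(backups)} backups would be deleted")
    for name in plan:
        print(f"  {name}  ({backups[name].isoformat()})")
    if ask("Delete these backups? Type 'yes' to continue: ").strip() != "yes":
        return []  # anything other than 'yes' deletes nothing
    for name in plan:
        delete(name)
    return plan
```

With a mistyped cutoff of 2007-07-10 instead of 2007-06-10, the printed plan would list a month of extra backups, giving the operator a chance to notice before anything is removed.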
Luckily, they never even used any of the data we worked so hard to preserve, and I still have my job.
After a long day of performance tracing and tuning a huge mainframe (you know, the beasts that take a couple of hours before all the standby backup sites have agreed that it is indeed booted up again and fully synced), I stretched my fingers, contentedly typed shutdown -p now at my laptop prompt, closed the lid, and yanked the serial cable out of the mainframe, anticipating a nice cold glass of lager.
Suddenly I heard the deafening sound of a mainframe spinning down, while my laptop was still happily displaying X.
While waiting for the machine to come fully online again, I decided I had time to get ACPI working on my laptop, so that I would never again be tempted to shut it down from the command line.
This accident didn't happen… but it's worth mentioning:
I was sent to a heavily-used data center to conduct bandwidth tests on a new circuit. I got to the demarc room/IDF, found a spot on one of the racks for my test router, made my connections, and started the tests. Unfortunately, I completely failed to notice the in-production border router not only being exactly on the next rack (almost at the same level), but that it was also the same make and model as my testing router.
When the test was done, I began pressing the power switch to the off position (…imagine it in slow motion…) and, I swear, just as I was applying pressure it dawned on me that the router I was about to turn off was the one in production. My heart stopped and I almost… well, use your imagination.
I left the data center's MDF looking spooked and pale, but at the same time glad I still had a job!
I deleted someone's account by mistake; I got the names mixed up with the one I was supposed to delete. Oops.
The cool part is they never knew what happened. I got the call that they couldn't log in, and the penny dropped about the account I had deleted.
While on the phone with them, I quickly re-created their account, re-attached their old mailbox to it (thankfully Exchange doesn't delete mailboxes right away) and pointed it back to their old user files.
Then I blamed them for forgetting their password which I had just reset for them 🙂
I accidentally installed a tar.gz file in the wrong place on my Gentoo Linux box, and it left files all over the place. This must have been around 1999; I was 19 at the time (thanks for the comments below).
Being the geek that I am, I decided to try to script myself out of the work of going manually through each file.
So I tried:
tar --list evilevilpackage.tar.gz | xargs rm -rf
It didn't take me very long to notice that tar also listed all the directories the program was using. Those included /usr, /var, /etc, and a few others that I didn't really want gone.
CTRL-C! CTRL-C! CTRL-C! Too late! Everything gone, reinstall time. Fortunately the box didn't contain anything important.
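The trap is that an archive listing includes directory entries such as usr/ and etc/, and rm -rf happily removes those whole trees. A more cautious approach builds the removal list from non-directory members only and refuses any path that escapes the install root. The Python sketch below uses the standard tarfile module; the function name and the idea of an explicit `install_root` are assumptions for the example, and it only returns the list without deleting anything.

```python
import os
import tarfile
from pathlib import Path
from typing import List


def files_installed_by(archive: Path, install_root: Path) -> List[Path]:
    """List the files (never directories) an archive would have unpacked."""
    install_root = install_root.resolve()
    targets = []
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # skip usr/, etc/ and friends: shared directories stay
            target = Path(os.path.normpath(install_root / member.name))
            if install_root not in target.parents:
                raise ValueError(f"{member.name!r} points outside {install_root}")
            targets.append(target)
    return targets
```

Reviewing the returned paths and then removing them one file at a time leaves every shared directory in place, even when the package dropped files into it.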
As a smallish part of my former life I administered the company's file server, a NetWare 4.11 box. It hardly EVER needed any input at all, but if it did, you opened up a remote console window.
Being used to DOS, when I was finished I would naturally type "Exit". On NetWare, "exit" is the command to shut down the OS. Luckily, it won't let you shut down unless you first "Down" the server (make it unavailable to the network/clients). So when you type "Exit" in the console, it helpfully says, "You must first type "Down" before you can exit".
Ask me how many times I 1: typed "exit" in the console session and 2: obediently typed "Down" and then "Exit" so I could "finish what I was trying to do".
And then the phone starts ringing…..
LOL
Another story that didn't happen (phew):
We were doing incremental backups religiously every day to a tape drive.
We happened to write a tape containing data to ship to someone else. They said 'we can't read your tape'. In fact, neither could we. Or any tape in fact.
We bought another tape drive and held our breath until we installed it.
Moral of the story. Always make sure you test your backups.
The last place I worked, my co-worker had his kids with him in the server room (why? I have NO IDEA!).
He made sure that they were far away from the servers and explained to his 5-year-old that he shouldn't touch ANY of the servers and ESPECIALLY none of the power switches.
In fact, he had them right near the door… (can you see where this is going…?)
The boy didn't touch any of the server power buttons… No, that would be entirely too easy to explain. Instead he hit the BIG RED BUTTON that was near the door… The button that shuts down power to the ENTIRE SERVER ROOM!!!
Phone lines immediately started to light up wondering why Exchange, File Servers, etc. weren't available… Imagine trying to explain THAT to the CEO!
-JFV
I once had a fight with the APC UPS monitoring software. Being a small company, we had a couple of small-ish UPSes, and various servers were set up to monitor them. Most of the servers were Linux, but a few were running Windows, so they were the ones used, because the APC software is Windows only.
However, the APC software at the time was hard-coded to assume that the UPS it is talking to is also powering the PC it's running on! This was not the case for this server, but I discovered that too late to tell it to halt. Also unfortunately, the lead programmer was demonstrating the company product to a partner: it was a web-based app, running on the same server I didn't want the APC software to shut down...
I was giving a new sysadmin a tour of a Service Manager app. I said "if you ever needed to stop this service you would click this button, but you should never do it during the day." You would never believe how sensitive her mouse button was!
Two minutes later the service had started up again, and no-one seemed to notice.
Tripping over a tower server that was wedged behind a rack and hitting my head on the back of the main Cisco router on my way down, thus revealing how loosely the power cords were actually seated in the power supplies on the front of the Catalyst 6500.
Yeah. We've got a hardhat on a hook in the server room now. With my name on it.