Monday, July 30, 2018

Help! PACS is Down!

The question is not whether your PACS will go down, but when, how often, and how you prepare for it. In other words, how can you minimize the panic factor?

Downtime is generally understood as the period during which a system is non-functional or cannot work. Note that it does NOT include the time that a system is potentially slowing down while it auto-repairs a failure, such as a disk crash in a redundant RAID configuration, or a server failure where a mirrored server automatically takes over. It does not include scheduled downtime either, which is used for software upgrades and maintenance.

According to Mike Cannavo, aka “the PACSman,” a typical RFP for a PACS system requires an uptime of the mystical “five nines,” which is 99.999 percent, or about 5 minutes of downtime per year. However, I have yet to see a PACS that is down only 5 minutes/year. A more realistic number, according to Mike, is 99.5%, which equates to about 44 hours per year, including scheduled downtime. It is also critical to define the level of system performance that constitutes a “downtime.” For example, I would consider image retrieval slowing down to one minute a downtime, while others might not.
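To make the arithmetic behind those percentages explicit, here is a minimal Python sketch. The function name `annual_downtime_hours` is my own for illustration, and it assumes a 365-day year of 8,760 hours; it is not taken from any RFP template or vendor tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours per year a system may be down at a given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two uptime figures quoted above.
    for pct in (99.999, 99.5):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Running it confirms the figures in the text: “five nines” leaves a little over 5 minutes per year, and 99.5% leaves just under 44 hours.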

What is a typical amount of downtime? I usually ask my PACS system analyst (SA) students when their system was last down, and I have found that it ranges widely, from one student whose system went down once a week (I would not want to be on call for that system) to those who can’t even remember the last time it was down, as it was several years ago. Based on their feedback, once every six months appears to be the norm, and the average downtime seems to be a couple of hours. That works out to a few hours of unscheduled downtime per year, which is consistent with Mike’s more realistic number once scheduled downtime is added.

Which measures can you implement to take the panic out of an unscheduled downtime?

1.      Have well-defined downtime procedures that are visible, and train all users on how to use them. The procedures depend on the user, so have a little “cheat-sheet” at their desk telling them what to do. For example, for a technologist at a modality, it might say “select alt PACS” to send images to; for a radiologist it would say “select alt PACS worklist,” “text PACS SA,” or “use web viewer,” etc. And as mentioned, train the users so they know what to do.
2.      Have a test system. Surprisingly enough, when I did a poll, I found out that only about two-thirds of users have a test system in place. Not only should there be a test PACS but also a test worklist provider, voice recognition system, and any other critical component. The test system is used to test updates, including patches, to train users in new features, and most importantly to provide “life-support” while the system is undergoing scheduled maintenance or experiencing an unscheduled downtime.
3.      Use mirroring. This is different from having a test system: a mirrored system is a fully functional, operational duplicate of the main system, preferably at a different location. For North Texas, where I am based, that means sufficiently far away that a tornado would not hit both centers. For southern Louisiana, it would mean in another state not subject to the same hurricane or flooding. For California, that would mean not on the same fault line subject to an earthquake.
4.      Test your downtime backup. How do you know if your backup solution works? You’ll have to test it, which is a legal requirement in the state of Texas for all government/state institutions. For example, at UT Southwestern in Dallas, they run their orders from an external system once a year to show it can be done.
5.      Have an alternate workflow for critical areas. One of my students told me that he burns a CD for all cases in the OR and sends them up to the location every day, just in case the system goes down. The same can be done on-demand for critical cases in the ICU in case PACS (or the network) is down. Alternatively, one could burn CDs in the ER for reading at a stand-alone station in radiology.
6.      Have a dual source for the information. Many hospitals used to have a separate web server that stored a copy of the images in a web-accessible format that could be viewed from any PC in case the PACS was down. Unfortunately, from a redundancy perspective, many of these web servers have gone away as PACS systems have integrated them into their main archive. The trend to have a separate VNA as an enterprise archive, however, gives back that duplication.
7.      Have more than one access point. In addition to having multiple sources of the information, having multiple access points is just as important, such as the capability to look at images on PCs, tablets, or even a phone, not necessarily with the same quality but good enough for an emergency. This is not unheard of: I know of a surgeon who takes pictures with his phone from his view station and shares them with his surgical team on a regular basis.
8.      Reboot and re-initialize on a regular basis. In the early days of Windows implementations there were quite a few hiccups, and I remember that we were able to reduce the downtime significantly by auto-rebooting each computer every night at midnight. Software is sometimes funny; there can be “loose threads,” unclaimed or unreleased blocks of memory, or multiple unnecessary background processes running that could impact performance or reliability, all of which are simply cleaned up by a reboot.
9.      Be aware of external factors. One of the most common reasons for system downtime is people cutting through cables, or wiring closets that are shared with plumbing. This is especially common when there are multiple campuses, where someone could be digging a hole somewhere and impacting power or network availability. Even air-conditioning can bring a system down. Just last week a major brand-new facility here in the north Dallas metroplex had to shut down its server room as the A/C was down. And, for some reason, architects like to position IT in the basement, which obviously is the worst place for flooding and water-pipe breaks. Ideally it would be best to locate the server room on the top floor of a building, but realistically that is in many cases prime real estate.
10.   Constantly monitor your weak points and critical components. When I visited a PACS SA room not too long ago, I saw a monitor on one of the desktops that was scrolling a set of what looked like text strings. When I asked, the SA told me that these were his RIS feeds containing all the orders for his PACS. He had no clue as to how to interpret the HL7 order messages, but he knew that as soon as the scrolling stopped he had a problem, as orders were not coming in anymore. As most of you know, in a typical-size department, a one-hour RIS downtime results in a full day of fix-ups at the PACS back-end, so he was very keen to monitor that data stream non-stop.
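The “watch the scrolling screen” idea in item 10 can also be automated. Below is a minimal Python sketch of a watchdog that flags a feed as stalled when no message has arrived for a set period. The class name `FeedWatchdog`, its methods, and the five-minute threshold are all hypothetical choices of mine for illustration, not part of any RIS or PACS product; hooking `message_received()` into a real HL7 interface is left out on purpose.

```python
import time


class FeedWatchdog:
    """Flags a feed as stalled when no message arrives within max_silence_s seconds."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock          # injectable so the logic can be tested without waiting
        self._last_seen = clock()    # treat start-up as the first sign of life

    def message_received(self) -> None:
        """Call this whenever an order message arrives on the feed."""
        self._last_seen = self._clock()

    def seconds_silent(self) -> float:
        """Seconds elapsed since the last message (or since start-up)."""
        return self._clock() - self._last_seen

    def is_stalled(self) -> bool:
        """True once the feed has been silent longer than the allowed threshold."""
        return self.seconds_silent() > self.max_silence_s


if __name__ == "__main__":
    # Simulated clock so the example runs instantly; the 300-second
    # threshold is an assumed value, to be tuned to your order volume.
    now = [0.0]
    watchdog = FeedWatchdog(max_silence_s=300, clock=lambda: now[0])

    now[0] = 120.0
    watchdog.message_received()      # an order arrives two minutes in
    now[0] = 500.0                   # then nothing for 380 seconds
    if watchdog.is_stalled():
        print(f"Feed silent for {watchdog.seconds_silent():.0f} s: alert the PACS SA")
```

In practice the threshold would depend on how busy the department is: a feed that is normally quiet overnight needs a longer allowed silence than one in a busy daytime department.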

Being down does not have to result in panic. If proper procedures and methods are in place, everyone knows what to do, and you as an imaging and IT professional have time to fix the problem and get the system back up and running. In addition, having the right infrastructure, architecture, and tools is essential. But system reliability is a factor too: if your system is down once a week, you might want to look for another vendor.