jrochkind1 4 days ago

I don't know how long that failure mode has been in place or if this is relevant, but it makes me think of analogous times I've encountered similar:

When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everything was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than to risk a catastrophe.

In that situation, switching back to yesterday's workflow is something that won't interrupt much.

A couple decades -- or honestly even just a couple years -- later, that same failure-handling behavior, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.

The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault-handling, but there may be other cases) that is totally reasonable when put in place, but has not aged well, and nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.

  • ronsor 3 days ago

    > The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault-handling, but there may be other cases) that is totally reasonable when put in place, but has not aged well, and nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.

    The solution to this is to trigger all functionality periodically and randomly to ensure it remains tested. If you don't test your backups, you don't have any.
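
    A minimal sketch of the "trigger it periodically and randomly" idea (Python; run_fallback and verify are hypothetical hooks, not any real system's API):

      import logging
      import random

      def maybe_run_fallback_drill(probability, run_fallback, verify):
          # Occasionally exercise the rarely used fallback path so it stays tested.
          # run_fallback/verify stand in for whatever the real system's
          # manual-procedure automation and output checks would be.
          if random.random() < probability:
              logging.info("fallback drill: exercising the backup path")
              result = run_fallback()    # run the rarely used path for real
              if not verify(result):     # check it still produces usable output
                  logging.error("fallback drill failed: the backup path has rotted")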

    • jvanderbot 3 days ago

      That is "a" solution.

      Another solution that is very foreign to us in sweng, but is common practice in, say, aviation, is to have that fallback plan in a big thick book, and to have a light that says "Oh it's time to use the fallback plan", rather than require users to diagnose the issue and remember the fallback.

      This was one of the key ideas in the design of critical systems*: Instead of automating the execution of a big branching plan, it is often preferable to automate just the detection of the next desirable state, then let the users execute the transition. This is because, if there is time, it allows all users to be fully cognizant of the inner state of the system and the reasons for that state, in case they need to take over.
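
      A toy illustration of that split (Python; the field names and threshold are made up, not from the book): automate only the detection and the recommendation, and leave executing the transition to the operators.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Recommendation:
            current_state: str
            suggested_state: str
            reason: str

        def advise(telemetry: dict) -> Optional[Recommendation]:
            # Detect the next desirable state; never transition automatically.
            if telemetry.get("fault_rate", 0.0) > 0.01:
                return Recommendation(
                    current_state="automated",
                    suggested_state="manual fallback (per the procedure book)",
                    reason="fault rate above threshold",
                )
            return None  # no light to turn on; keep running as-is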

      The worst of both worlds is to automate yourself into a corner, gunk everything up, and then require the user to come in and do a back-breaking cleanup just to get to the point where they can diagnose this. My factorio experiences mirror this last case perfectly.

      * "Joint Cognitive Systems" - Hollnagle&Woods

      • Thorrez 3 days ago

        Isn't what you're proposing exactly what led to this being a major problem? The automated systems disabled themselves, so people had to use the manual way, which was much less efficient, and 1,500 flights had to be cancelled.

        • thereddaikon 3 days ago

          They are referring to air crew procedures, not ATC. When the crew of an aircraft encounter a failure that doesn't have a common simple response, they consult a procedure book. This is something professional crews are well acquainted with and used to. The problem in the article was with the air traffic control system. They did not have a proper fallback procedure and it caused major disruptions.

          • jrochkind1 2 days ago

            Their fallback procedure was "do it the manual way". Just like for the pilots. They thought it was a proper one...

            • jvanderbot 2 days ago

              This entire subthread is more a response to the suggestion that "The" solution is fuzzing your entire stack to death.

    • ericjmorey 3 days ago

      Which company deployed a chaos monkey daemon on their systems? Seemed to improve resiliency when I read about it.

      • philsnow 3 days ago

        At Google, the global Chubby cell had gone so long without any downtime that people were starting to assume that it was just always available, leading to some kind of outage or other when the global cell finally did have some organic downtime.

        Chubby-SRE added quarterly synthetic downtime of the global cell (iff the downtime SLA had not already been exceeded).
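
        Roughly this shape, I'd guess (a sketch, not Google's actual tooling): only inject the synthetic outage if organic downtime hasn't already eaten the quarter's budget.

          def should_inject_downtime(allowed_downtime_s, organic_downtime_s, drill_s):
              # Burn error budget artificially only if real outages left room for it.
              return (allowed_downtime_s - organic_downtime_s) >= drill_s

          # e.g. a 99.99% quarterly SLA allows roughly 13 minutes of downtime
          should_inject_downtime(13 * 60, organic_downtime_s=300, drill_s=240)  # True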

        • kelnos 3 days ago

          For those of us who haven't worked at Google, what's "Chubby" and what's a "cell"?

          • philsnow 3 days ago

            Ah, chubby is a distributed lock service. Think “zookeeper” and you won’t be far off.

            https://static.googleusercontent.com/media/research.google.c... [pdf]

            Some random blog post: https://medium.com/coinmonks/chubby-a-centralized-lock-servi...

            You can run multiple copies/instances of chubby at the same time (like you could run two separate zookeepers). You usually run an odd number of them, typically 5. A group of chubby processes all managing the same namespace is a “cell”.

            A while ago, nearly everything at Google had at least an indirect dependency on chubby being available (for service discovery etc), so part of the standard bringup for a datacenter was setting up a dc-specific chubby cell. You could have multiple SRE-managed chubby cells per datacenter/cluster if there was some reason for it. Anybody could run their own, but chubby-sre wasn’t responsible for anybody else’s, I think.

            Finally, there was a global cell. It was both distributed across multiple datacenters and also contained endpoint information for the per-dc chubby cells, so if a brand new process woke up somewhere and all it knew how to access was the global chubby cell, it could bootstrap from that to talking to chubby in any datacenter and thence to any other process anywhere, more or less.

            ^ there’s a lot in there that I’m fuzzy about, maybe processes wake up and only know how to access local chubby, but that cell has endpoint info for the global one? I don’t think any part of this process used dns; service discovery (including how to discover the service discovery service) was done through chubby.

            • starspangled 3 days ago

              Not trying to "challenge" your story, and it's an interesting anecdote in context. But if you have time to indulge me (and I'm not a real expert at distributed systems, which might be obvious) -

              Why would you have a distributed lock service that (if I read right) has multiple redundant processes that can tolerate failures... and then require clients to tolerate outages? Isn't the purpose of this kind of architecture so that each client doesn't have to deal with outages?

              • saalweachter 3 days ago

                Because you want the failure modes to be graceful and recovery to be automatic.

                When the foundation of a technology stack has a failure, there are two different axes of failure.

                1. How well do things keep working without the root service? Does every service that can be provided without it still keep going?

                2. How automatically does the system recover when the root service is restored? Do you need to bring down the entire system and restore it in a precise order of dependencies?

                It's nice if your system can tolerate the missing service and keep chugging along, but it is essential that your system not deadlock on the root service disappearing and stay deadlocked after the service is restored. At best, that turns a downtime of minutes into a downtime of hours, as you carefully turn down every service and bring them back up in a carefully prescribed order. At worst, you discover that your system that hasn't gone down in three years has acquired circular dependencies among its services, and you need to devise new fixes and work-arounds to allow it to be brought back up at all.

              • praptak 3 days ago

                First, a global system with no outages (say the gold standard of 99.999% availability) is a promise which is basically impossible to keep.

                Second, a global system being always available definitely doesn't mean it is always available everywhere. A single datacenter or even a larger region will experience both outages and network splits. It means that whatever you design on top of the super-available global system will have to deal with the global system being unavailable anyway.

                TLDR is that the clients will have to tolerate outages (or at least frequent cut-offs from the "global" state) anyway, so it's better not to give them false promises.

          • nine_k 3 days ago

            Replace this with "API gateway cluster", or basically any simple enough, very widely used service.

      • RcouF1uZ4gsC 3 days ago

        The same company that was in the news recently for screwing up a livestream of a boxing match.

        • chrisweekly 3 days ago

          True, but it's the exception that proves the rule; it's also the same company responsible for delivering a staggeringly high percentage of internet video, typically without a hitch.

          • tovej 3 days ago

            That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.

            When used like this it just confuses the reader with rhetoric. In this case Netflix is just bad at live streaming; they clearly haven't done the necessary engineering work on it.

            • nine_k 3 days ago

              The fact that Netflix surprised so many people by an exceptional technical issue implies that as a rule Netflix delivers video smoothly and at any scale necessary.

              • chrisweekly 2 days ago

                Yes! THIS is precisely what I meant in my comment.

            • jjk166 2 days ago

              That's also not what "an exception proving the rule" means. The term comes from a now mostly obsolete* meaning of "prove": "to test or trial" something. So the idiom properly means "the exception puts the rule to the test." If there is an exception, it means the rule was broken. The idiom has taken on the opposite meaning due to its frequent misuse, which may have started out tongue in cheek but now is used unironically. It's much like using "literally" to describe something which is figurative.

              * This is also where we get terms like bulletproof - in the early days of firearms people wanted armor that would stop bullets from the relatively weak weapons, so armor smiths would shoot their work to prove them against bullets, and those that passed the test were bullet proof. Likewise alcohol proof rating comes from a test used to prove alcohol in the 1500s.

            • lelanthran 3 days ago

              > That's not what an exception proving a rule means. It has a technical meaning: a sign that says "free parking on sundays" implies parking is not free as a rule.

              So the rule is "Free parking on Sundays", and the exception that proves it is "Free parking on Sundays"? That's a post-hoc (circular) argument that does not convince me at all.

              I read a different explanation of this phrase on HN recently: the "prove" in "exception proves the rule" has the same meaning as the "prove" (or "proof") in "50% proof alcohol".

              AIUI, in this context "proof" means "tests". The exception that tests the rule simply shows where the limits of the rules actually are.

              Well, that's how I understood it, anyway. Made sense to me at the time I read the explanation, but I'm open to being convinced otherwise with sufficiently persuasive logic :-)

              • taejo 3 days ago

                The rule is non-free parking. The exception is Sundays.

              • lucianbr 3 days ago

                The meaning of a word or expression is not a matter of persuasive logic. It just means what people think it means. (Otherwise using it would not work to communicate.) That is why a dictionary is not a collection of theorems. Can you provide a persuasive logic for the meaning of the word "yes"?

                https://en.wikipedia.org/wiki/Exception_that_proves_the_rule

                Seems like both interpretations are used widely.

              • tsimionescu 3 days ago

                The origin of the phrase is the aphorism that "all rules have an exception". So, when someone claims something is a rule and you find an exception, that's just the exception that proves it's a real rule. It's a joke, essentially, based on the common-sense meaning of the word "rule" (which is much less strict than the mathematical word "rule").

              • seaal 3 days ago

                50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.

                • lelanthran 3 days ago

                  > 50% proof alcohol? That isn’t how that works. It’s 50% ABV aka 100 proof.

                  50% proof wouldn't be 25% ABV?

                  • oofabz 3 days ago

                    Since 50% = 0.5, and proof doesn't take a percentage, I believe "50% proof" would be 0.25% ABV.

          • eru 3 days ago

            Yes, though serving static files is easier than streaming live.

      • bitwize 3 days ago

        The chaos monkey is there to remind you to always mount a scratch monkey.

      • amelius 3 days ago

        "Your flight has been delayed due to Chaos Monkey."

        • nine_k 3 days ago

          This means a major system problem. The point of the Chaos Monkey is that the system should function without interruptions or problems despite the activity of the Chaos Monkey. That is, it keeps the system in such a shape that it can absorb and overcome a rate of failure higher than the "naturally occurring" rate.

        • bigiain 3 days ago

          "My name is Susie and I'll be the purser on your flight today, and on behalf of the Captain Chaos Monkey and the First Officer Chaos Monkey... Oh. Shit..."

  • pj_mukh 3 days ago

    "When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan"

    Pretty sure this is exactly what happened with Cruise in San Francisco: cars would just stop and await instructions, causing traffic jams. The city got mad, so they added a "pullover" mechanism. Except then the "pullover" mechanism ended up dragging someone who had been "flung" into the car's path by a hit-and-run driver.

    The real world will break all your test cases.

  • telgareith 3 days ago

    Dig into the OpenZFS 2.2.0 data loss bug story. There was at least one ticket (in FreeBSD) where it cropped up almost a year prior and got labeled "look into later," but it got closed.

    I'm aware that closing "future investigation" tickets when they no longer seem to be an issue is common. But it shouldn't be.

    • Arainach 3 days ago

      >it shouldn't be

      Software can (maybe) be perfect, or it can be relevant to a large user base. It cannot be both.

      With an enormous budget and a strictly controlled scope (spacecraft) it may be possible to achieve defect-free software.

      In most cases it is not. There are always finite resources, and almost always more ideas than it takes time to implement.

      If you are trying to make money, is it worth chasing down issues that affect a minuscule fraction of users and that take engineering time which could be spent on architectural improvements, features, or bugs affecting more people?

      If you are an open source or passion project, is it worth your contributors' limited hours, and will trying to insist people chase down everything drive your contributors away?

      The reality in any sufficiently large project is that the bug database will only grow over time. If you leave open every old request and report at P3, users will grow just as disillusioned as if you were honest and closed them as "won't fix". Having thousands of open issues that will never be worked on pollutes the database and makes it harder to keep track of the issues which DO matter.

      • Shorel 3 days ago

        I'm in total disagreement with your last paragraph.

        In fact, I can't see how it follows from the rest.

        Software can have defects, true. There are finite resources, true. So keep the tickets open. Eventually someone will fix them.

        Closing something for spurious psychological reasons seems detrimental to actual engineering and it doesn't actually avoid any real problem.

        Let me repeat that: ignoring a problem doesn't make it disappear.

        Keep the tickets open.

        Anything else is supporting a lie.

        • lmm 3 days ago

          > There are finite resources, true. So keep the tickets open. Eventually someone will fix them.

          Realistically, no, they won't. If the rate of new P0-P2 bugs is higher than the rate of fixing being done, then the P3 bugs will never be fixed. Certainly by the time someone gets around to trying to fix the bug, the ticket will be far enough out of date that that person will not be able to trust it. There is zero value in keeping the ticket around.

          > Anything else is supporting a lie.

          Now who's prioritising "spurious psychological reasons" over the things that actually matter? Closing the ticket as wontfix isn't denying that the bug exists, it's acknowledging that the bug won't be fixed. Which is much less of a lie than leaving it open.

          • YokoZar 3 days ago

            Every once in a while I get an email about a ten plus year old bug finally getting fixed in some open source project. If it's a good bug accurately describing a real thing, there's no reason to throw that work away rather than just marking it lower priority.

            • lmm 3 days ago

              > If it's a good bug accurately describing a real thing, there's no reason to throw that work away rather than just marking it lower priority.

              Perhaps. But the triage to separate the "good bugs accurately describing real things" from the chaff isn't free either.

              • mithametacs 3 days ago

                Storage is cheap, database indexes work. Just add a `TimesChecked` counter to your bug tracker.

                Now when considering priorities, consider not just impact and age, but also times checked.
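
                A sketch of the idea (illustrative field names and weights, not any particular tracker):

                  def triage_score(bug):
                      # Illustrative only: bump the counter each time a human looks at
                      # the bug, then fold it into the ranking next to impact and age,
                      # so repeatedly skipped bugs drift down instead of being deleted.
                      bug["times_checked"] = bug.get("times_checked", 0) + 1
                      return bug["impact"] * 10 + bug["age_days"] / 365 - bug["times_checked"]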

                It's less expensive than going and deleting stuff. Unless you're automating deletions? In which case... I don't think I can continue this discussion.

                • lmm 3 days ago

                  In what world is adding a custom field to a bug tracker and maintaining it cheaper than anything? If someone proves out this workflow and releases a bug tracker that has the functionality built in then I'll consider adopting it, but I'm certainly not going to be the first.

                  • mithametacs 3 days ago

                    Wild.

                    Okay, keep deleting bug tickets.

                • Arainach 2 days ago

                  We already have a TimesChecked counter:

                  If (status != Active) { /* timesChecked > 0 */ }

          • lesuorac 3 days ago

            But then when somebody else has the issue they make a new bug and any data/investigation from the old one is basically lost.

            Like what's wrong with having 1000 open bugs?

            • Arainach 3 days ago

              It becomes functionally impossible to measure and track tech debt. Not all of those issues are tech debt - things which will never be fixed don't matter.

              Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?

              If you keep a clean work tracker where things that are open mean work that should get done: absolutely

              If you just track everything ever and started this release with 1100 issues: no, not necessarily.

              But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.

              Work planning, triage, and other team tasks are not magic and voodoo, but in my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.

              • lesuorac 3 days ago

                > Not all of those issues are tech debt - things which will never be fixed don't matter.

                Except that I've spent a good amount of time fixing bugs originally marked as `won't fix` because they actually became uh "will fix" (a decade later; lol).

                > Put another way: you're working on a new version of your product. There are 900 issues in the tracker. Is this an urgent emergency where you need to shut down feature work and stabilize?

                Do you not prioritize your bugs?

                If the tracker is full of low priority bugs then it doesn't block the release. One thing we do is even if the bug would be high priority; if it's not new (as-in occurs in older releases) it doesn't (by default) block the next release.

                > But wait, of those 900 issues are there any that should block the release? Now you have 900 to go through and determine. And unless you won't fix some of them you'll have the same thing in a few months.

                You should only need to triage the bug once. It should be the same amount of work to triage a bug into low-priority as it is to mark it as `won't fix`. With (again) the big difference being that if a user searches for the bug they can find it and ideally keep updating the original bug, instead of making a dozen new ones that need to be de-duplicated and triaged, which is _more work_, not _less work_, for triagers.

                > Work planning, triage, and other team tasks are not magic and voodoo, but in my experience the same engineers who object to the idea of "won't fix" are the ones who want to just code all day and never have to deal with the impact of a huge messy issue database on team and product.

                If your idea of the product is ready for release is 0 bugs filed then that's something you're going to want to change. Every software gets released with bugs; often known bugs.

                I will concede that if you "stop the count" or "stop testing" then yeah you'll have no issues reported. Doesn't make it the truth.

              • kelnos 3 days ago

                That's absurd. Closing those issues doesn't make them go away, it just causes you to forget them (until someone else reports them again and someone creates a new issue, losing all the previous context). If you just leave them open, some will eventually get fixed, many will not, and that's fine.

                The decision between feature work vs. maintenance work in a company is driven by business needs, not by the number of bugs open in the issue tracker. If anything, keeping real bugs open helps business leaders actually determine their business needs more effectively. Closing them unfixed is the equivalent of putting your head in the sand.

            • lmm 3 days ago

              > But then when somebody else has the issue they make a new bug and any data/investigation from the old one is basically lost.

              You keep the record of the bug, someone searching for the symptoms can find the wontfix bug. Ideally you put it in the program documentation as a known issue. You just don't keep it open, because it's never going to be worked on.

              > Like what's wrong with having 1000 open bugs?

              Noise, and creating misleading expectations.

          • kelnos 3 days ago

            > If the rate of new P0-P2 bugs is higher than the rate of fixing being done, then the P3 bugs will never be fixed.

            That's quite a big assumption. Every company I've worked at where that was the case had terrible culture and constantly shipped buggy crap. Not really the kind of environment that I'd use to set policy or best practices.

            • lmm 3 days ago

              If you're in the kind of environment where you fix all your bugs then you don't have a ballooning bug backlog and the problem never arises. I've worked in places that fixed all their bugs, but to my mind that was more because they didn't produce the kind of product that has P3 bugs than because they had better culture or something.

        • Arainach 3 days ago

          It's not "spurious psychological reasons". It is being honest that issues will never, ever meet the bar to be fixed. Pretending otherwise by leaving them open and ranking them in the backlog is a waste of time and attention.

          • ryandrake 3 days ago

            I've seen both types of organizations:

            1. The bug tracker is there to document and prioritize the list of bugs that we know about, whether or not they will ever be fixed. In this world, if it's a real issue, it's tracked and kept while it exists in the software, even though it might be trivial, difficult, or just not worth fixing. There's no such thing as closing the bug as "Won't Fix" or "Too Old". Further, there's no expectation that any particular bug is being worked on or will ever be fixed. Teams might run through the bug list periodically to close issues that no longer reproduce.

            2. The bug tracker tracks engineering load: the working set of bugs that are worthy of being fixed and have a chance to be fixed. Just because the issue is real, doesn't mean it's going to be fixed. So file the bug, but it may be closed if it is not going to be worked on. It also may be closed if it gets old and it's obvious it will never get worked on. In this model, every bug in the tracker is expected to be resolved at some point. Teams will run through the bug list periodically to close issues that we've lived with for a long time and just won't be fixed ever.

            I think both are valid, but as a software organization, you need to agree on which model you're using.

          • gbear605 3 days ago

            There have been a couple times in the past where I’ve run into an issue marked as WONT FIX and then resolved it on my end (because it was luckily an open source project). If the ticket were still open, it would have been trivial to put up a fix, but instead it was a lot more annoying (and in one of the cases, I just didn’t bother). Sure, maybe the issue is so low priority that it wouldn’t even be worth reviewing a fix, and this doesn’t apply for closed source projects, but otherwise you’re just losing out on other people doing free fixes for you.

          • exe34 3 days ago

            it's more fun/creative/CV-worthy to write new shiny features than to fix old problems.

            • astrange 3 days ago

              I think a more subtle issue is that fixing old bugs can cause new bugs. It's easier to fix something new, for instance because you understand it better. At some point it can be safest to just not touch something old.

              Also, old bugs can get fixed by accident / the environment changing / the whole subsystem getting replaced, and if most of your long tail of bugs is already fixed then it wastes people's time triaging it.

              • wolrah 2 days ago

                > I think a more subtle issue is that fixing old bugs can cause new bugs.

                Maybe it's years of reading The Old New Thing and similar, maybe it's a career spent supporting "enterprise" software, but my personal experience is that fixing old bugs causing new bugs happens occasionally; far more often, fixing old bugs reveals many more old bugs that always existed but were never previously triggered, because the software was "bug compatible" with the host OS, because assumptions were made that since old versions never went outside a certain range no newer versions ever would, and/or because software just straight up tinkered with internal structures it never should have been touching and which were later legitimately changed.

                Over my career I have chased down dozens of compatibility issues between software packages my clients used and new versions of their respective operating systems. Literally 100% of those, in the end, were the software vendor doing something that was not only wrong for the new OS but was well documented as wrong for multiple previous releases. A lot of blatant wrongness was unfortunately tolerated for far too long by far too many operating systems, browsers, and other software platforms.

                Windows Vista came out in 2006 and every single thing that triggered a UAC prompt was a thing that normal user-level applications were NEVER supposed to be doing on a NT system and for the most part shouldn't have been doing on a 9x system either. As recently as 2022 I have had a software vendor (I forget the name but it was a trucking load board app) tell me that I needed to disable UAC during installs and upgrades for their software to work properly. In reality, I just needed to mount the appropriate network drive from an admin command prompt so the admin session saw it the same way as the user session. I had been telling the vendor the actual solution for years, but they refused to acknowledge it and fix their installer. That client got bought out so I haven't seen how it works in 2024 but I'd be shocked if anything had changed. I have multiple other clients using a popular dental software package where the vendor (famous for suing security researchers) still insists that everyone needs local admin to run it properly. Obviously I'm not an idiot and they have NEVER had local admin in decades of me supporting this package but the vendor's support still gets annoyed about it half the time we report problems.

                As you might guess, I am not particularly favorable on Postel's Law w/r/t anything "big picture". I don't necessarily want XHTML style "a single missing close tag means the entire document is invalid" but I also don't want bad data or bad software to persist without everyone being aware of its badness. There is a middle ground where warnings are issued that make it clear that something is wrong and who's at fault without preventing the rest of the system from working. Call out the broken software aggressively.

                tl;dr: If software B depends on a bug or unenforced boundary in software A, and software A fixing that bug or enforcing that boundary causes software B to stop working, that is 100% software B's problem and software A should in no way ever be expected to care about it. Place the blame where it belongs, software B was broken from the beginning we just hadn't been able to notice it yet.

      • mithametacs 3 days ago

        Everything is finite including bugs. They aren’t magic or spooky.

        If you are superstitious about bugs, it's time to triage. Absolute, complete disagreement with your direction.

        • jfactorial 3 days ago

          > Everything is finite including bugs.

          Everything dies including (probably) the universe, and shortly before that, our software. So you're right, the number of bugs in a specific application is ultimately finite. But most of even the oldest software still in use is still getting regular revisions, and if app code is still being written, it's safe to assume bugs are still being created by the fallible minds that conceived it. So practically speaking, for an application still in-development, the number of bugs, number of features, number of lines of code, etc. are dynamic, not finite, and mostly ever-increasing.

      • acacar 3 days ago

        No, uh-uh. You can't sweep a data loss bug under the rug, under any circumstances, especially in a filesystem. Curdle someone's data just once and they'll never trust you again.

      • gpderetta 3 days ago

        the CADT model of software engineering.

  • sameoldtune 3 days ago

    > switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.

    Catastrophe is most likely to strike when you try to fix a small mistake: pushing a hot-fix that takes down the server; burning yourself trying to take overdone cookies from the oven; offending someone you are trying to apologize to.

  • crtified 3 days ago

    Also, as codebases and systems get more (not less) complex over time, the potential for technical debt multiplies. There are more processing and outcome vectors, more (and different) branching paths. New logic maps. Every day/month/year/decade is a new operating environment.

    • mithametacs 3 days ago

      I don’t think it is exponential. In fact, one of the things that surprises me about software engineering is that it’s possible at all.

      Bugs seem to scale log-linearly with code complexity. If it’s exponential you’re doing it wrong.

  • InDubioProRubio 3 days ago

    Reminds me of the switches we used to put into production machines that could self destroy

    if -- well defined case --
        ...
    else
        Scream
        while true do
            Sleep(forever)

    Same for the default branch of a switch.

    Basically, for every known unknown it's better to halt and let humans drive the fragile machine back into safe parameters - or expand the program.

    PS: Yes, the else - you know what the else is: it's the set of !(well-defined conditions). And it's ever-changing, if the well-defined if condition changes.

  • agos 3 days ago

    Erlang was born out of a similar problem with telephone switches: if a terminal sends bogus data or otherwise crashes, it should not bring everything down, because the subsequent reconnection storm would be catastrophic. So "let it crash" would be a very reasonable approach to this challenge, at least for fault handling.

  • akavel 3 days ago

    Take a look at the book "Systemantics: How Systems Work and Especially How They Fail" - a classic, with more observations like this.

  • wkat4242 3 days ago

    It's been in place for a while; it happens every few months.

jp57 4 days ago

FYI: nm = nautical miles, not nanometers.

  • joemi 3 days ago

    It's quite amusing that they used the incorrect, lowercase abbreviation for "nautical mile" which means something else ("nanometer") in an article about a major issue caused by two things sharing the same abbreviation.

  • cduzz 3 days ago

    I was wondering; it seemed like if the two airports were 36000 angstroms apart (3600 nanometers), it'd be reasonable to give them the same airport code since they'd be pretty much on top of each other.

    I've also seen "DANGER!! 12000000 μVolts!!!" on tiny little model railroad signs.

    • atonse 3 days ago

      That's so adorable (for model railroads)

  • andyjohnson0 3 days ago

    Even though I knew this was about aviation, I still read nm as nanometres. Now I'm wondering what this says about how my brain works.

    • lostlogin 3 days ago

      It says ‘metric’. Good.

      • jp57 3 days ago

        Though one could argue that the (original) definition of a meter and the definition of a nautical mile are equally arbitrary and yet similarly earth-based.

        Originally 1 meter was one ten-millionth of the distance over the surface of the earth from the equator to the pole.

        One nautical mile is the length of one arc-minute of latitude along a meridian (about 1.85 km).

        • eru 3 days ago

          Yes, the nautical mile is actually less arbitrary than the 'normal' mile.

          > Originally 1 meter was one ten-millionth of the distance over the surface of the earth from the equator to the pole.

          Even more originally, they wanted to use the length of a pendulum that takes one second to swing. But they discovered that this varies from place to place. So they came up with the newer definition based on the size of the earth. And just like with all the subsequent redefinitions (like the one based on the speed of light etc), the new length of the metre matches the old length of the metre:

          > [The] length of the string will be approximately 993.6 millimetres, i.e. less than a centimetre short of one metre everywhere on Earth. This is because the value of g, expressed in m/s^2, is very close to π^2.

          The definitions matching is by design, not an accident.

          See https://en.wikipedia.org/wiki/History_of_the_metre and https://en.wikipedia.org/wiki/Seconds_pendulum

          If you want something less arbitrary, you can pick 'Natural Units': https://en.wikipedia.org/wiki/Natural_units

        • lmm 3 days ago

          They were. The difference is one is part of a standard system of units with sensible relations between different units in the system, and the other isn't; the specifics aren't what's important, the relations are.

      • tialaramex 3 days ago

        Indeed. There are plenty of things in aviation where they care so much about compatibility that something survives decades after it should reasonably be obsolete and replaced.

        Inches of mercury, magnetic bearings (the magnetic poles move! but they put up with that) and gallons of fuel, all just accepted.

        Got a safety-of-life emergency on an ocean liner, oil tanker or whatever? Everywhere in the entire world mandates GMDSS which includes Digital Selective Calling, the boring but complicated problems with radio communication are solved by a machine, you just need to know who you want to talk to (for Mayday calls it's everyone) and what you want to tell them (where you are, that you need urgent assistance and maybe the nature of the emergency)

        On an big plane? Well good luck, they only have analogue radio and it's your problem to cope with the extensive troubles as a result.

        I'm actually impressed that COSPAS/SARSAT wasn't obliged to keep the analogue plane transmitters working, despite obsoleting (and no longer providing rescue for) analogue boat or personal transmitters. But on that, at least, they were able to say no, if you don't want to spend a few grand on the upgrade for your million dollar plane we don't plan to spend billions of dollars to maintain the satellites just so you can keep your worse system limping along.

        • rounce 3 days ago

          > Inches of mercury, magnetic bearings (the magnetic poles move! but they put up with that) and gallons of fuel, all just accepted.

          Here in Europe we use hectopascals for pressure, as does pretty much everywhere else. It’s important to have a magnetic bearing in case your glass dies and you’re reliant on a paper map and compass, if you didn’t plan with magnetic bearings you’d be screwed if this happened in an area of high magnetic variation.

        • mnw21cam 3 days ago

          Air pressure is reported in hectopascals and fuel quantity in kilograms (or tons). It's only in America where this isn't the case. We're still using feet for altitude in most places though (it's mainly Russia that uses metres).

          • p_l 2 days ago

            Feet are "optionally accepted" due to USA being China of aviation and flooding the market with feet-marked hardware

    • skykooler 3 days ago

      "Hacker News failure caused by two units 12 orders of magnitude apart sharing 2-letter code"

      • Nevermark 3 days ago

        We better get this sorted out before open source manned Mars missions.

        That all programming languages, down to statically typed assembly, don’t support something as simple to validate as unit consistency says something strange about how the science of replacing unreliable manual processes with automated systems is really bad at the practice of replacing its own risky manual processes with automated systems.

        If numeric types just required a given unit, without even supporting automated conversions, it would make incorrectly unit-ed/scaled literals vastly less likely.
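
        A sketch of the "unit-tagged numbers, no automatic conversions" idea in Python (type names made up; a static checker such as mypy is assumed):

          from dataclasses import dataclass

          @dataclass(frozen=True)
          class NauticalMiles:
              value: float

          @dataclass(frozen=True)
          class Nanometers:
              value: float

          def check_separation(distance: NauticalMiles) -> bool:
              # Only NauticalMiles is accepted; there is no implicit conversion.
              return distance.value > 0.1

          check_separation(NauticalMiles(3600.0))    # fine
          # check_separation(Nanometers(3600.0))     # rejected by the type checker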

        • eru 3 days ago

          > That all programming languages, down to statically typed assembly, don’t support something as simple to validate as unit consistency [...]

          Many programming languages are flexible and strong enough to support this. We just don't do it by default, and you'd need libraries.

          Btw, units by themselves are useful, but not enough. Eg angular momentum and energy have the same units of Newton * metre, but adding them up is not recommended.

          • jjk166 2 days ago

            > Eg angular momentum and energy have the same units of Newton * metre, but adding them up is not recommended.

            The unit of angular momentum is kg.m^2.s^-1, you're thinking of torque. Although even then we distinguish the Newton meter (Nm) from the Joule (J) even if they have the same dimensionality.

            • eru 2 days ago

              Thanks, yes, I meant torque.

              Well, 1 J = 1 Nm; the differentiation you mention only helps humans a bit, but would be really hard to make work for a computer.

              • jjk166 2 days ago

                I don't think it would be too difficult for a computer to handle. We already deal with situations like float(1) = int(1), it doesn't seem any harder to handle torque(1) = energy(1).

                • eru a day ago

                  The problem is that you want torque to not be equal to energy.

                  Specifically you have:

                  torque = force * distance

                  energy = force * distance

                  The only difference being that in the former the force is perpendicular to the distance, and in the latter it's in line with the distance.

                  A vector based system could distinguish the two, but you don't always want to deal with vectors in your computations. (And I'm fairly sure there are problems where even using vectors ain't enough to avoid this problem.)

        • astrange 3 days ago

          Some languages do; F# and Ada have units.

          I agree no sexy languages have it, and almost all languages have terrible support or anti-support for correctness in numerical programming. It's very strange.

          (By anti-support I mean things that waste your time and make it harder. For instance, a lot of languages think "static typing" means they need to prevent you from doing `int a,b; short c = a * b;` even if this is totally well-defined.)

          • KerrAvon 3 days ago

            `short c = a * b` can be both well-defined and a serious bug if (a * b) doesn't fit in a `short`. Whether it _is_ a bug depends on what you're doing.

            Swift has units as part of the standard library. In the sense that matters here, Rust and C++ could also have units. It requires a level of expressiveness in the type system that most modern languages do have, if you put it to use.

            • astrange 3 days ago

              It can be a serious bug if it overflows and you didn't intend it to happen or you expected overflow behavior to do something different. But that's also true of `int c = a*b`, and yet that's not a compiler error.

              int/short should be thought of as storage size optimizations for memory. They're very bad ways to specify the correct range of values for a variable.

              (Ada has explicitly ranged integers though!)

          • rkagerer 2 days ago

            Whenever it's vague, I include the units as part of the parameter name (eg. delay_ms). Not perfect, but for practical purposes it helps. It's simple and I'm not sure why there aren't more people/libraries doing this.

    • krick 3 days ago

      I wouldn't have guessed until I read the comments. My assumption was somebody just mistyped km and somehow nobody cared to fix it.

    • jug 3 days ago

      Yeah, I went into the article thinking this because I expected someone had created waypoints right on top of each other and in the process also somehow generated the same code for them.

  • QuercusMax 3 days ago

    Ah! I thought this was a case where the locations were just BARELY different from each other, not that they're very far apart.

  • barbazoo 4 days ago

    Given the context, I'd say NM actually https://en.wikipedia.org/wiki/Nautical_mile

    • jp57 4 days ago

      I was clarifying the post title, which uses "nm".

      • pvitz 3 days ago

        Yes, it looks like they should have written "NM" instead of "nm".

        • andkenneth 3 days ago

          No one is using nanometers in aviation navigation. Quite a few aviation systems are case insensitive or all caps only so you can't always make a distinction.

          In fact, if you say "miles", you mean nautical miles. You have to use "sm" to mean statute miles if you're using that unit, which is often used for measuring visibility.

          • ianferrel 3 days ago

            Sure, but I could imagine some kind of software failure caused by trying to divide by a distance that rounded to zero because the same location was listed in two databases at almost but not exactly the same position. In fact I did when I first read the headline, then realized that it was probably nautical miles.

            That would be roughly consistent with the title and not a totally absurd thing to happen in the world.

            • scarlehoff 3 days ago

              This is exactly what I thought when I first read the title.

          • jp57 3 days ago

            Indeed, having locations internally represented in software with a resolution of nanometers is as ridiculous as having your calendar's internal times represented as milliseconds since some arbitrary moment more than fifty years ago!

          • anigbrowl 3 days ago

            Indeed, but you can easily imagine a software glitch over what looks like a single location but which the computer sees as two separate ones.

    • rob74 3 days ago

      Yes. And, to quote the Wikipedia article: "Symbol: M, NM, or nmi". Not nm (as used in the title, but the article also uses it).

  • dietr1ch 4 days ago

    Thanks, from the title I was confused on why there was such a high resolution on positions.

  • hughdbrown 3 days ago

    Wow, I read this article because I could not understand how two labeled points on an air path could be 3600 nanometers apart. Never occurred to me that someone would use 'nm' to mean nautical miles.

  • ikiris 3 days ago

    Nanometers would be a very short flight.

    • cheschire 3 days ago

      I could imagine conflict arising when switching between single and double precision causing inequality like this.

  • ainiriand 3 days ago

    Exactly the first thing that came to my mind when I saw that abbreviation.

  • larsnystrom 3 days ago

    Ah, yes, like when people put in extraordinary amounts of effort to avoid sending a millibit (mb) of data over the wire.

  • animal531 3 days ago

    And for those like myself wondering how much 3600nm is, it is of course 0.0036mm

  • endoblast 3 days ago

    We all need to stop using abbreviations, in my opinion.

    EDIT: I mean the point of abbreviations is to facilitate communication. However with the world wide web connecting multiple countries, languages and fields of endeavour there are simply too many (for example) three letter acronyms in use. There are too many sources of ambiguity and confusion. Better to embrace long-form writing.

  • fabrixxm 3 days ago

    It took me a while, tbh..

  • noqc 3 days ago

    man, this ruins everything.

  • snakeyjake 4 days ago

    [flagged]

    • jp57 4 days ago

      Sorry no. One interpretation, especially on this site, is that the problem was some kind of database bug, maybe where the same location was entered twice and a tiny location error ended up creating two locations.

      I expect that out of any random sample of 500 million literate and mentally healthy English speakers, more than 450 million are totally unaccustomed to thinking about nautical miles, ever. Even people in science, or who might have dealt with nanometers in school, do not typically think about nautical miles unless they are sailors or airplane pilots.

      • snakeyjake 4 days ago

        [flagged]

        • s_tec 3 days ago

          I assumed the post title meant nanometers. Why? Floating-point rounding bugs. A nanometer is about 9e-15 degrees of latitude, which is right about where a double-precision floating point number runs out of digits. So, if a piece of software uses exact `==` equality, it could easily have a bug where two positions 3600 nanometers apart are seen as being different, even though they should be treated as the same.
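
          For example (Python; numbers are illustrative): two latitudes roughly 3600 nanometers apart are distinct doubles, so exact equality calls them different places, while a tolerance-based comparison treats them as one point.

            import math

            DEG_PER_NANOMETER = 1.0 / 111_320e9       # ~9e-15 degrees of latitude per nm

            lat_a = 51.4775                           # some waypoint latitude
            lat_b = lat_a + 3600 * DEG_PER_NANOMETER  # ~3600 nm further north

            print(lat_a == lat_b)                            # False
            print(math.isclose(lat_a, lat_b, abs_tol=1e-9))  # True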

          • SilasX 3 days ago

            Thank you. People can be very bad about judging which scenarios are truly implausible.

            Here’s a previous thread where someone thought it was absurd that there could exist native English speakers who don’t regularly go shopping, and treated that supposed impossibility as a huge “checkmate”!

            https://news.ycombinator.com/item?id=32625340

        • pests 3 days ago

          I assumed nanometers.

          I see it every day in chip design and a few other fields. I was confused how an issue could be caused by the points being so close.

          I only realized NM means nautical miles due to these comments.

    • monktastic1 4 days ago

      It doesn't "imply" nanometers, it literally reads nanometers. The standard abbreviation for nautical miles is "NM."

    • contravariant 4 days ago

      Well, I certainly read nanometers, but I have to admit I may not be mentally well.

    • ortusdux 4 days ago

      Well TIL I must be illiterate or mentally unwell.

    • ajford 3 days ago

      I too first read it as nanometers. Perhaps it's just those with engineering/science backgrounds who have far more experience with nanometers than nautical miles?

      I've never had cause to see the abbreviated form of nautical miles, but I know nanometers. Also, given purely the title I could see it being some kind of data collision due to precision errors between two locations that should be the same airport but perhaps came from two different sensors.

    • remram 4 days ago

      I assumed it probably wasn't nanometers but I have no idea what a nautical mile is.

      Turns out 3600 NM ≈ 6667 km

    • moomin 4 days ago

      The question isn't really how many would think it meant nanometers, it's how many would recognise it as nautical miles.

      • RandomThoughts3 3 days ago

        I used to work on military boats combat systems. I scanned the title, read it nanometers, thought it was weird, mentally corrected to nautical miles being familiar with the context and cursed the person who wrote the title for the incorrect capitalisation.

        So, yes, people who have worked in a related field get it. Still annoying though.

      • briandear 3 days ago

        In aviation? Literally everyone. Context clues should make it obvious.

        • ajford 3 days ago

          And how many would assume that the incorrect unit abbreviation would mean something else? If I said Kbps and KBps, those are entirely different units of measure. NM and nm are VASTLY different, and unless you are already familiar with measurements in aviation and know it's nautical miles, your first instinct is gonna be to read it as the unit the abbreviation actually denotes.

          The article title is talking about location data and computers, and I've seen many people forget floating-point precision when comparing values and get bitten by tiny differences at the 10^-9 scale or smaller. That seems just as plausible at the outset as non-unique location designations in what the average person would assume to be a dataset that's intentionally unique and unambiguous.

        • JadeNB 3 days ago

          > In aviation? Literally everyone. Context clues should make it obvious.

          But this is Hacker News, not Aviation News, and there are plenty of people, like me, who might find this interesting but aren't in aviation. I also thought it meant nanometers.

        • mulmen 3 days ago

          The context is a bug. The bug could have been using nanometers when nautical miles were intended.

    • sbelskie 4 days ago

      Can’t claim to be “mentally well”, whatever that might mean, but I definitely read it as nanometers.

    • briandear 3 days ago

      In aviation? You know how big an aircraft is right?

      • JadeNB 3 days ago

        > In aviation? You know how big an aircraft is right?

        Exactly, which makes the headline particularly intriguing—how could a 3600 nanometer difference matter? The standard resolution, which I pursued, is to read the article to find out, but it doesn't mention the distance at all.

FateOfNations 7 days ago

Good news: the system successfully detected an error and didn't send bad data to air traffic controllers.

Bad News: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system since it was running the same code).

  • wyldfire 4 days ago

    > The system can't recover from an error in an individual flight plan, bringing the whole system down with it

    From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.

    The systems outside of the scope of this one failed to preserve a uniqueness guarantee that was depended on by this system. Was that dependency correctly identified as one that was the job of System X and not System Y?

    • akira2501 3 days ago

      > obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights?

      Flights are tracked by radar and by transponder. The appropriate thing to do is just flag the flight with a discontinuity error but otherwise operate normally. This happens with other statuses like "radio failure" or "emergency aircraft."

      It's not something you'd see on a commercial flight, but on a private IFR flight (one with a flight plan) you can actually cancel your IFR plan mid-flight and revert to VFR (visual flight rules) instead.

      Some flights take off without an IFR clearance as a VFR flight, but once airborne, they call up ATC and request an IFR clearance already en route.

      The system is vouchsafing where it does not need to.

      • cryptonector 3 days ago

        The appropriate thing to do was to reject the flight plan (remember, the flight plan is processed before the flight starts, and anyway there were hours over the U.S. in which the flight could have been diverted if manual resolution was not possible), not to let the flight continue with the apparent discontinuity, nor to shut down the whole system.

        • CPLX 2 days ago

          > flight plan is processed before the flight starts

          Not necessarily. Also they regularly change while the flight is in the air.

      • CPLX 3 days ago

        This isn’t quite correct. The oceanic tracks do not have radar coverage and aren’t actively controlled.

        There are other fail safe methods of course all the way up to TCAS, but it’s not great for an oceanic flight to be outside of the system.

    • martinald 3 days ago

      Yes I agree. From what I understand, the reason the system crashed wasn't the duplicate code itself; it was that the plan had the plane time-travelling, which suggests very serious corruption.

      • kevin_thibedeau 3 days ago

        Waves hand... This is not the SQL injection you're looking for. It's just a serious corruption.

    • outworlder 3 days ago

      > From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.

      Flagging the error is absolutely the right way to go. It should have rejected the flight plan, however. There could be issues if the flight was allowed to proceed and you now have an aircraft you didn't expect showing up.

      Crashing is not the way to handle it.

    • aftbit 3 days ago

      It seems fundamentally unreasonable for the flight processing system to entirely shut itself down just because it detected that one flight plan had corrupt data. Some degree of robustness should be expected from this system IMO.

      • mannykannot 3 days ago

        It does not seem reasonable when you put it like that, but when could it be said with confidence that it affected just one flight plan? I get the impression that it is only in hindsight that this could be seen to be so. On the face of it, this was just an ordinary transatlantic flight like thousands of others, with no reason to think there was anything unusual about it to make it more vulnerable than the rest - and really, there was not; it just had an unlucky combination of parameters.

        In general, the point where a problem first becomes apparent is not a guideline to its scope.

        Air traffic control is inherently a coordination problem dependent on common data, rules and procedures, which would seem to limit the degree to which subsystems can be siloed. Multiple implementations would not have helped in this case, either.

        • cryptonector 3 days ago

          Shutting down a flight control system might have other knock-on effects on flight safety. Even if it merely only grounded flights not yet in the air, the resulting confusion might lead to manual mistakes and/or subsequent air lane congestion that might cause collisions.

          • mannykannot 2 days ago

            Every option had risks associated with it, and they are hard to assess until you know how deep the problem goes.

        • MBCook 3 days ago

          I think you’re on the right track, I assume it’s safety.

          If one bad flight plan came in, what are the chances other unnoticed errors may be getting through?

          Given the huge danger involved with being wrong, shutting down with a “stuff doesn’t add up, no confidence in safe operation” error may be the best approach.

      • HeyLaughingBoy 3 days ago

        It depends on what the potential outcomes are.

        I've worked on a (medical, not aviation) system where we tried as much as possible to recover from subsystem failures or at least gracefully reduce functionality until it was safe to shut everything down.

        However, there were certain classes of failure where the safest course of action was to shut the entire system down immediately. This was generally the case where continuing to run could have made matters worse, putting patient safety at risk. I suspect that the designers of this system ran into the same problem.

    • cryptonector 3 days ago

      There is no need to shut down the whole system just because of one flight plan that the system was able to reject. Canceling (or forcing manual updates to) one flight plan is a lot better than canceling 1,500 flights.

steeeeeve 4 days ago

You know there's a software engineer somewhere that saw this as a potential problem, brought up a solution, and had that solution rejected because handling it would add 40 hours of work to a project.

  • nightowl_games 3 days ago

    I don't know that, and I don't like this assumption that only 'managers' make mistakes, or that software engineers are always right. I think it's needlessly adversarial, biased and largely incorrect.

    • elteto 3 days ago

      Agreed. And most of the people with these attitudes have never written actual safety critical code where everything is written to a very detailed spec. Most likely the designers of the system thought of this edge case and required adding a runtime check and fatal assertion if it was ever encountered.

    • zer8k 3 days ago

      Spoken like a manager.

      Look, when you're barking orders at the guys in the trenches who, understandably in fear for their jobs, do the stupid "business-smart" thing, then it is entirely the fault of management.

      I can't tell you how many times just in the last year I've been blamed-by-proxy for doing something that was decreed upon me by some moron in a corner office. Everything is an emergency, everything needs to be done yesterday, everything is changing all the time because King Shit and his merry band of boot-licking middle managers decide it should be.

      Software engineers, especially ones with significant experience, are almost surely more right than middle managers. "Shouldn't we consider this case?" is almost always met with some parable about "overengineering" and followed up by a healthy dose of "that's not AGILE". I have grown so tired of this and thanks to the massive crater in job mobility most of us just do as we are told.

      It's the power imbalance. In this light, all blame should fall on the manager unless it can be explicitly shown to be a developer problem. The adage "those who can, do, and those who can't, teach" applies equally to management.

      When it's my f@#$U neck on the line and the only option to keep my job is to do the stupid thing, you can bet I'll do the stupid thing. Thank god there's no malpractice law in software.

      Poor you - only one of our jobs is getting shipped overseas.

      • nightowl_games 3 days ago

        Wow that was adversarial. You are making an assumption about me that is wrong. I'm a high level engineer and have written an absolute boat load of code over my career. I've never been a manager.

      • astrange 3 days ago

        I don't think either of your jobs are getting shipped overseas.

        > Thank god there's no malpractice law in software.

        This is aerospace, there are such things, but there's also blameless postmortems. And it happened overseas!

      • kortilla 3 days ago

        Your attitude is super antagonistic and your relationship with management is not representative of the industry. I recommend you consider a different job or if this pattern repeats at every job that you reflect on how you interact with managers to improve.

  • CrimsonCape 3 days ago

    C dev: "You are telling me that the three digit codes are not globally unique??? And now we have to add more bits to the struct?? That's going to kill our perfectly optimized bit layout in memory! F***! This whole app is going to sh**"

    • throw0101a 3 days ago

      > C dev: "You are telling me that the three digit codes are not globally unique???

      They are understood not to be. They are generally known to be regionally unique.

      The "DVL" code is unique with-in FAA/Transport Canada control, and the "DVL" is unique with-in EASA space.

      There are pre-defined three-letter codes:

      * https://en.wikipedia.org/wiki/IATA_airport_code

      And pre-defined four-letter codes:

      * https://en.wikipedia.org/wiki/ICAO_airport_code

      There are also five-letter names for major route points:

      * https://data.icao.int/icads/Product/View/98

      * https://ruk.ca/content/icao-icard-and-5lnc-how-those-5-lette...

      If there are duplicates there is a resolution process:

      * https://www.icao.int/WACAF/Documents/Meetings/2014/ICARD/ICA...

      • skissane 3 days ago

        > They are understood not to be. They are generally known to be regionally unique.

        Then why aren’t they namespaced? Attach to each code its issuing authority, so it is obvious to the code that DVL@FAA and DVL@EASA are two different things?

        Maybe for backward compatibility/ human factors reasons, the code needs to be displayed without the namespace to pilots and air traffic controllers, but it should be a field in the data formats.
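
        As a minimal sketch of what that could look like (the type and field names are illustrative, not taken from any real system), the namespace can live in the data model while pilots and controllers still see only the bare code:

            use std::fmt;

            // Issuing authority acts as the namespace.
            #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
            enum Authority { Faa, Easa, TransportCanada }

            // A waypoint code is only meaningful together with its authority,
            // so DVL@FAA and DVL@EASA can never compare equal.
            #[derive(Clone, PartialEq, Eq, Hash, Debug)]
            struct WaypointCode {
                code: String,        // e.g. "DVL"
                authority: Authority,
            }

            impl fmt::Display for WaypointCode {
                // The display form stays the familiar three letters;
                // the namespace only exists in the data format.
                fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                    write!(f, "{}", self.code)
                }
            }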

      • CrimsonCape 3 days ago

        It seems like tasking a software engineer to figure this out when the industry at large hasn't figured this out just isn't fair.

        The best I can see (using Rust) is a hashmap keyed on UTF-8 strings, where every code in existence gets inserted into the map along with an enum struct based on the code type. Then you are forced to switch over each enum case and handle it, no matter what kind of regional code it is.

        It becomes apparent that the problem must be handled with app logic earlier in the system; to query a database of codes, you must also know which code and "what type" of code it is. Users are going to want to give the code only, so there's some interesting misdirection introduced; the system has to somehow fuzzy-match the best code for the itinerary. Correct me if I'm wrong, but the above seems like a mandatory step in solving the problem, and it would have caught the exception.
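
        A rough sketch of that shape (the names and variants are invented for illustration, not from any aviation system): the map yields every registered meaning of a code, and the enum forces an exhaustive match:

            use std::collections::HashMap;

            // Every kind of code the system might see, tagged by issuer/type.
            #[derive(Debug, Clone)]
            enum Code {
                IataAirport { name: String },
                IcaoAirport { name: String },
                FaaWaypoint { lat: f64, lon: f64 },
                EasaWaypoint { lat: f64, lon: f64 },
            }

            // The same three-letter string can legitimately be registered several times.
            type CodeIndex = HashMap<String, Vec<Code>>;

            // Exhaustive match: adding a new variant forces every caller to decide
            // how to handle it, instead of silently assuming one kind of code.
            fn describe(code: &Code) -> String {
                match code {
                    Code::IataAirport { name } => format!("IATA airport {name}"),
                    Code::IcaoAirport { name } => format!("ICAO airport {name}"),
                    Code::FaaWaypoint { lat, lon } => format!("FAA waypoint at {lat},{lon}"),
                    Code::EasaWaypoint { lat, lon } => format!("EASA waypoint at {lat},{lon}"),
                }
            }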

        I echo other comments that say that there's probably 60% more work involved than your manager realizes.

      • marcosdumay 3 days ago

        Hum... Does somebody have a list of foreign local codes sharing the same space as the local ones?

        I assumed IATA messed up; now I'm wondering how that even happens. It's not even easy to discover the local codes of remote aviation authorities.

        • skissane 3 days ago

          > I assumed IATA messed up,

          This isn’t IATA. IATA manages codes used for passenger and cargo bookings, which are distinct from the codes used by pilots and air traffic control that we are talking about here, ultimately overseen by ICAO. These codes include a lot of stuff which is irrelevant to passengers/freight, such as navigation waypoints and military airbases (which normally would never accept a civilian flight, but could still be used for an emergency landing; plus, civilian and military ATC coordinate with each other to avoid conflicts).

          • marcosdumay 3 days ago

            Hum, no. It's not about ICAO codes. It's about local UK codes, and IATA ones used by mistake.

            • skissane 3 days ago

              The code that caused the issue is DVL, which isn’t a “local UK code”, it is a code used by the FAA for a location in the US and a code used by EASA for a location in France. And I didn’t say ICAO issued the codes, I said the process of issuing them by regional/national aviation authorities is “ultimately overseen by ICAO”, which I believe is correct.

  • ryandrake 4 days ago

    ... or there's a software engineer somewhere who simply assumed that three letter navaid identifiers were globally unique, and baked that assumption into the code.

    I guess we now need a "Falsehoods Programmers Believe About Aviation Data" site :)

    • metaltyphoon 4 days ago

      Did aviation software for 7 years. This is 100% the first assumption about waypoint / navaid when new devs come in.

    • SoftTalker 3 days ago

      And this is why you always use surrogate keys and not natural keys. No matter how much you convince yourself that your natural key is unique and will never change, if a human created the value then a human can change the value or create duplicates, and eventually will.
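
      A minimal sketch of the idea (names are illustrative only): every other record references a generated, opaque ID, and the human-visible code is just an attribute with no uniqueness assumption:

          // Surrogate key: generated by the system, never reused, carries no meaning.
          #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
          struct WaypointId(u64);

          struct Waypoint {
              id: WaypointId,     // what other records reference internally
              code: String,       // "DVL" -- human-entered, may be duplicated or renamed
              authority: String,  // "FAA", "EASA", ...
              lat: f64,
              lon: f64,
          }

          struct FlightPlanLeg {
              from: WaypointId,   // unambiguous even if two waypoints share a code
              to: WaypointId,
          }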

      • dx034 3 days ago

        But that wouldn't help you here. The flight plan will come in with the code and you'll still have to resolve that to your keys.

        • jjk166 2 days ago

          Sure, but that means you are putting in the infrastructure to resolve it, instead of assuming there will never be a need to.

    • MichaelZuo 4 days ago

      Or even more straightforward, just don’t believe anyone 100% knows what they are doing until they exhaustively list every assumption they are making.

      • madcaptenor 3 days ago

        Even more straightforward, just don’t believe anyone 100% knows what they are doing.

      • gregmac 4 days ago

        Which also means never assume the exhaustive list is 100%.

        • Filligree 3 days ago

          I wouldn't be able to produce such a list, even for areas where I totally do know everything that would be on the list.

        • MichaelZuo 3 days ago

          Bingo, without some means of credible verification, then assume it’s incomplete.

    • em-bee 3 days ago

      or falsehoods programmers believe about global identifiers

Jtsummers 4 days ago

There's been some prior discussion on this over the past year, here are a few I found (selected based on comment count, haven't re-read the discussions yet):

From the day of:

https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)

Discussions after:

https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)

https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)

https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)

  • perihelions 4 days ago

    There's also a much larger one,

    https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)", 446 comments)

    • mstngl 3 days ago

      I remembered this extensive article immediately (only that I'd read it, not what it said or where to find it). Thanks for saving me from endlessly searching for it.

jmvoodoo 4 days ago

So, essentially the system has a serious denial of service flaw. I wonder how many variations of flight plans can cause different but similar errors that also force a disconnect of primary and secondary systems.

Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption"

The assumption that a failure to interpret a plan must indicate a serious coding error seems to be the root cause, but it's hard to say for sure.
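
As a rough sketch of the "reject the individual plan" behaviour (all names hypothetical, and glossing over the real safety analysis): a malformed plan becomes one rejection flagged for manual review, while the rest of the queue keeps flowing:

  struct RawPlan { id: u64, text: String }
  struct Route;                              // parsed result, details elided

  enum Outcome {
      Accepted(u64, Route),
      RejectedForManualReview(u64, String),  // flagged to a human, system stays up
  }

  // Hypothetical stand-in for the real interpretation step.
  fn interpret(plan: &RawPlan) -> Result<Route, String> {
      if plan.text.is_empty() { Err("empty plan".into()) } else { Ok(Route) }
  }

  fn process_batch(plans: &[RawPlan]) -> Vec<Outcome> {
      plans.iter().map(|p| match interpret(p) {
          Ok(route) => Outcome::Accepted(p.id, route),
          // One bad plan is one rejection, not a process-wide abort.
          Err(reason) => Outcome::RejectedForManualReview(p.id, reason),
      }).collect()
  }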

  • mjevans 3 days ago

    Rejecting the flight plan would be the last-resort option, but that is where it should have gone when no other options remained, rather than a total shutdown.

    CORRECTING the flight plan, by first promoting the exit/entry points for each autonomous region along the route, validating only the entry/exit list, and then the arcs within, would be the least errant method.

    • d1sxeyes 3 days ago

      You can’t just reject or correct the flight plan, you’re a consumer of the data. The flight plan was valid, it was the interpretation applied by the UK system which was incorrect and led to the failure.

      There are a bunch of ways FPRSA-R can already interpret data like this correctly, but there were a combination of 6 specific criteria that hadn’t been foreseen (e.g. the duplicate waypoints, the waypoints both being outside UK airspace, the exit from UK airspace being implicit on the plan as filed, etc).

      • sandos 2 days ago

        If this is the case, then every system along the flightplan should pre-validate it before it gets accepted at the source?

        • d1sxeyes 2 days ago

          That adds a huge amount of overhead in terms of messaging and possibly adds delays in flight plans being accepted.

          The way this is supposed to work is that downstream systems should accept valid flight plans.

          I would say that it’s not the upstream system’s responsibility to reject valid flight plans because of implementation details on a downstream system.

    • mcfedr 3 days ago

      Rejecting the plan surely should have come many places before shutting down the whole system!

convivialdingo 3 days ago

I guarantee that piece of code has a comment like

  /* This should never happen */
  if (waypoints.matchcount > 2) {
  • crubier 3 days ago

    Possibly even just

        waypoint = waypointsMatches[0]
    
    Without even mentioning that waypointsMatches might have multiple elements.

    This is why I always consider [0] to be a code smell. It doesn't have a name afaik, but it should.
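
    One way to make the smell visible (just a sketch, nothing to do with the actual incident code) is to match on the slice so the zero- and many-match cases have to be written out:

        struct Waypoint;  // placeholder for the real type

        fn pick_waypoint(matches: &[Waypoint]) -> Result<&Waypoint, String> {
            match matches {
                [] => Err("no matching waypoint".into()),
                [only] => Ok(only),
                // This arm cannot be silently forgotten.
                _ => Err(format!("ambiguous: {} matching waypoints", matches.len())),
            }
        }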

    • CaptainFever 3 days ago

      Silently ignoring conditions where there are multiple or zero elements?

  • dx034 3 days ago

    From the text it sounds like it looked up whether a code was in the flight plan and at which position it was in the plan. It never looked up two codes or assumed there could only be one, just comparing how the plan was filed.

    I'm sure there'd be a better way to handle this, but it sounds to me like the system failed in a graceful way and acted as specified.

  • gitaarik 3 days ago

    Don't you mean > 1 ?

GnarfGnarf 3 days ago

Funny airport call letters story: I once headed to Salt Lake City, UT (SLC) for a conference. My luggage was processed by a dyslexic baggage handler, who sent it to... SCL (Santiago, Chile).

I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.

  • watt 3 days ago

    Why not pop in to a shop, get another pair of pants?

amiga386 3 days ago

This is old news, but what's new news is that last week, the UK Civil Aviation Authority openly published its Independent Review of NATS (En Route) Plc's Flight Planning System Failure on 28 August 2023 https://www.caa.co.uk/publication/download/23337 (PDF)

Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:

• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.

• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.

• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.

• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.

• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.

• The status of the data within the AMS-UK during the period of the incident was not clearly understood.

• There was a lack of clear documentation identifying system connectivity.

• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."

WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?

EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server. "

  • dx034 3 days ago

    > The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.

    Which shows that sometimes, remote isn't a viable option. If you have very critical infrastructure, it's advisable to have people physically very close to the data center so that they can access the servers if all other options fail. That's valid for aviation as well as for health care, banks, etc. Remote staff just isn't enough in these situations.

    • jjk166 2 days ago

      Or you configure your infrastructure to be remotely operated.

  • mcfedr 3 days ago

    But no mention of this insane failure mode? If the article is to be believed

    • amiga386 3 days ago

      I'm not sure what you think is the insane failure mode?

      The UK is part of the IFPS Zone, centrally managed by EUROCONTROL using AFTM. IFPS can accept/reject IFR flight plans, but the software at NATS can't. By the time NATS gets the flight plan, it has already been accepted. All their software can do is work out which parts enter the UK's airspace. If it's a long route, the plane has already taken off.

      NATS aren't even thinking of a mixed-mode approach (for IFR flight plans) where they have both automated processing and manual processing of things the automated processing can't handle. They don't have a system or processes capable of that. And until this one flight, they'd never had a flight plan the automated system couldn't handle.

      The failures here were:

      1) a very unlikely edge case whose processing was specified, but wasn't implemented correctly, in the vendor's processing software

      2) no test case for the unlikely edge case because it was really _that_ unlikely, all experts involved in designing the spec did not imagine this could happen

      3) they had the same vendor's software on both primary and secondary systems, so failover failed too; a second implementation might have succeeded where the first failed, but no guarantees

      4) they had a series of incident management failures that meant they failed to fix the broken system within 4 hours, meaning NATS had to switch to manual processing of flight plans

      • mcfedr a day ago

        But that's the thing: part of the plan was "in a bad case we can switch to manual processing", but at no point did anyone think to suggest manually processing the one failing plan.

        I work with great QAs all day, and if one of them heard that there are duplicate area codes, there would be a bunch of test cases appearing with all the possible combinations.

sam0x17 4 days ago

I've posted this here before, but they really need globally unique codes for all the airports, waypoints, etc, it's crazy there are collisions. People always balk at this for some reason but look at the edge cases that can occur, it's crazy CRAZY

  • crote 3 days ago

    Coming up with a globally unique waypoint system is trivial. Convincing the aviation industry to spend many hundreds of millions of dollars to change a core data type used in just about every single aviation-related system, in order to avoid triggering rare once-a-decade bugs? That's a lot harder.

    • lostlogin 3 days ago

      > That's a lot harder.

      I wonder what 1,500 cancelled flights and 700,000 disrupted passengers adds up to in cost? And that’s just this one incident.

      • amiga386 3 days ago

        ...an incident where they didn't parse the data as other systems already parsed the data.

        It sounds like the solution is better validation and test suites for the existing scheme, not a new less-ambiguous scheme

junon 3 days ago

For the people skimming the comments and are confused: 3600nm here is nautical miles, not nanometers.

My first thought was that this was some parasitic capacitance bug in a board design causing a failure in an aircraft.

fyt2024 3 days ago

Is nm the official abbreviation for nautical miles? I assume it is nautical miles. For me it is nanometers.

  • andkenneth 3 days ago

    Contextually no one is using nanometers in aviation nav applications. Many aviation systems are case insensitive or all caps only so capitalisation is rarely an important distinction.

    • joemi 3 days ago

      Similarly, no pilot in the Devil’s Lake region is using DVL to mean Deauville, and vice versa. :)

IlliOnato 3 days ago

What brought me to read this article was a confusion: how can two locations related to air traffic be 3600 nanometers apart? Was it two points within some chip, or something?

Only part way into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...

Still, it turned out to be an interesting read)

NovemberWhiskey 3 days ago

So, exactly the same airline (French Bee) and exactly the same route (LAX-ORY) and exactly the same waypoint (DVL) as last September, resulting in exactly the same failure mode:

https://chaos.social/@russss/111048524540643971

Time to tick that "repeat incident?" box in the incident management system, guys.

  • riffraff 3 days ago

    It's an article about that same accident

    • NovemberWhiskey 3 days ago

      D'oh - the flight number was different and the lack of year on the post had me thinking it was just the same again.

jll29 3 days ago

Unique IDs that are not really unique are the beginning of all evil, and there is a special place in hell for those that "recycle" GUIDs instead of generating new ones.

Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)

At least artificial technical names/labels should be globally unambiguous.

cbhl 3 days ago

Hmm, is this the same incident which happened last year? Or is this a new incident?

From Sept 2023 (flightglobal.com):

- https://archive.is/uiDvy

- Comments: https://news.ycombinator.com/item?id=37430384

Also some more detailed analysis:

- https://jameshaydon.github.io/nats-fail/

- Comments: https://news.ycombinator.com/item?id=37461695

  • javawizard 3 days ago

    First sentence of the article:

    > Investigators probing the serious UK air traffic control system failure in August last year [...]

_pete_ 4 days ago

The DVL really is in the details.

  • spatley 3 days ago

    Har! Should have seen that one coming :)

tempodox 4 days ago

When there's no global clearing house for those identifiers, maybe namespaces would help?

Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?

  • buildsjets 3 days ago

    Every aircraft I’ve ever flown as either Pilot in Command or required crewmember, and also every marine navigation system I have used in my life has displayed distance information as nm, Nm, or NM, interchangeably. I have never been confused by this, and I have never seen any other crew be confused. I have not ever seen any version of nmi used, in any variation of capitalization. This includes Boeing flight decks, Airbus flight decks, general aviation Garmin equipment, and a few MIL aircraft. And some boats.

  • bigfatkitten 3 days ago

    The reason idents for radio navaids (VOR/NDB) are only three characters is because they are broadcast via morse code. They need to be copyable by pilots who are otherwise somewhat busy and not particularly proficient in Morse. For this purpose, they only need to be unique to that frequency within plausible radio range.

    'nm' and 'NM' are the accepted abbreviations for nautical miles in the aviation industry, whether official or not.

mkj 3 days ago

Sounds like the kind of thing fuzzing would find easily, if it was applied. Getting a spare system to try it on might be hard though.
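
The harness side is small, at least; a sketch with cargo-fuzz / libFuzzer, where parse_flight_plan is a hypothetical stand-in for the real FPRSA-R interpretation code and the only property checked is "never crash on arbitrary input":

  // fuzz/fuzz_targets/flight_plan.rs  (run with `cargo fuzz run flight_plan`)
  #![no_main]
  use libfuzzer_sys::fuzz_target;

  fuzz_target!(|data: &[u8]| {
      if let Ok(text) = std::str::from_utf8(data) {
          // Any Err is fine; a panic or abort is a finding.
          let _ = fprsa::parse_flight_plan(text);
      }
  });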

chefandy 3 days ago

As an aside, that site's cookie policy sucks. You can opt out of some, but others, like "combine and link data from other sources", "identify devices based on information transmitted automatically", "link different devices" and others can't be disabled. I feel bad for people that don't have the technical sophistication to protect themselves against that kind of prying.

mirages 3 days ago

"and it generated a critical exception error. This caused the FPRSA-R primary system to disconnect, as designed,"

"As designed" here sounds like a big PR move to hide the fact that they let an uncaught exception crash the entire software...

How about: don't trust your inputs, guys?

mmaunder 3 days ago

There’s little to no authentication on filing flight plans which makes this a potentially bigger problem. I’m sure it’s fixed but the mechanism that caused the failure is an assertion that fails by disconnecting the critical systems entirely for “safety”. And the backup failed the same way. Bet there are similar bugs.

cryptonector 3 days ago

> Just 20s elapsed between the receipt of the flightplan and the shutdown of both FPRSA-R systems, causing all automatic processing of flightplan data to cease and forcing reversion to manual procedures.

That's quite a DoS vulnerability...

polskibus 3 days ago

I would’ve thought that in the flight industry they got the „business key” uniqueness right ages ago. If a key is multi-part, then each check should check all parts, not just one. Alternatively, force all airport codes to be globally unique.

klysm 2 days ago

I’m curious what part of the code rejected the validity of the flight plan. I’m also curious what keys are actually used for lookups when they aren’t unique?

whiteandmale 3 days ago

"What are these? Airports for ants?" I would HN dudes expect to fix the headline regarding SI / nautical units. Sloppy copy.

ggm 3 days ago

Could you front-end the software with a proxy which bounces code-collision requests, and limit the damage to the specific route rather than the entire system's integrity?

This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes to remove three-letter code collisions.

Gate the problem. Require routing for TLA collisions to be done by hand, or be fixed in post into two paths which avoid the collision (insert an intermediate waypoint).

  • kccqzy 3 days ago

    The low cost fix is to fix the FPRSA-R software. The front-end proxy cannot easily detect code collisions because such collisions are not readily apparent. The ICAO flight plan allows omitting waypoints, and it was one of the omitted waypoints that collided with a non-omitted waypoint. If you have taken the trouble of introducing such sophisticated code-collision detection in the front-end proxy, you might as well apply the same fix to the real FPRSA-R software.

    • ggm 3 days ago

      My primary fear was that the compliance issues of re-coding the real s/w would drive this into the ground, whereas the compliance burden for a front-end fix might be lower. You could hash-map the collisions seen before, drive them to a "don't do that", and reduce the risk to new ones.

      "we fall over far less often now"

      Thank you for neg votes kind strangers. Remember ATC is rife with historical kludges, including using whiteout on the giant green screens to make the phosphor get ignored by the light gun. This is an industry addicted to backwards compatibility to the point that you can buy dongle adapters for dot-matrix printers at every gate, such that they don't have to replace the printer but can back-end the faster network into it.

      Cf. rebooting a 787 inside the maximum-days-without-a-reboot limit.

aeroevan 3 days ago

What's crazy is that this hasn't happened before; waypoints that share a name aren't uncommon.

dboreham 3 days ago

Headline still hasn't been fixed? (Correct abbreviation is NM).

entropyie 3 days ago

Initially read this as 3600 nanometres... :-)

Optimal_Persona 7 days ago

Well, 3600 billionths of a meter IS kinda close...just sayin'

  • bilekas 4 days ago

    I was thinking the same and thinking that’s a super weird edge case to happen. I’m obviously tired.

  • dh2022 7 days ago

    I read it the same way….

mjan22640 3 days ago

The title sounds like an AMD cpu issue.

craigds 3 days ago

oh nautical miles !

not nanometres as you might assume from being used to normal units

ipunchghosts 4 days ago

Title should be nmi

  • yongjik 3 days ago

    NGL, two locations 3600 non-maskable interrupts apart would have been a much more interesting story.

    • astrange 3 days ago

      It's like doing the Kessel run in less than twelve parsecs.

  • jordanb 4 days ago

    I do a lot of navigation and have never seen nautical miles abbreviated as "nmi."

    • lxgr 4 days ago

      I bet not everybody on here does, so picking the unambiguous unit sign would definitely avoid some double-takes.

  • buildsjets 3 days ago

    Maybe that is true in your industry. It is not true in my industry. NM is the legally accepted abbreviation for nautical miles when used in the context of aircraft operations.

    • joemi 3 days ago

      Still, either "nmi" or "NM" would be better than the current and less correct "nm", even if "nm" is what's used in the article.

  • barbazoo 4 days ago

    The unit of "nm" is common among pilots but yeah technically it should be "NM".

  • Andys 3 days ago

    Non-maskable Interrupt?

jojohohanon 3 days ago

Is it just me or was it basically impossible to decipher what those three letter codes were?

Joel_Mckay 3 days ago

In other news, goat carts are still getting 100 furlong–firkin–fortnight on dandelions.

=3

hobs 4 days ago

People posting on this forum saying "ah well software's failure case isn't as bad"

> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.

  • egypturnash 4 days ago

    Zero fatalities though. You could do a lot worse for a massive air traffic control failure.

    • lxgr 4 days ago

      Unfortunately shutting down air traffic generally does not result in zero excess deaths: https://pmc.ncbi.nlm.nih.gov/articles/PMC3233376/

      • d1sxeyes 3 days ago

        Your source says “the fatality rate did not change appreciably”.

        • lxgr 3 days ago

          Injuries did increase, though, and I can't think of a plausible mechanism that would somehow cap expected outcomes at "injury but not death".

          • d1sxeyes 3 days ago

            So we were talking about excess deaths, which means that supporting your argument with a paper that argues that a previous finding of excessive deaths was flawed is probably not the strongest argument you could make.

            Increased number of injuries but not deaths could be, for example, (purely making things up off the top of my head here) due to higher levels of distractedness among average drivers due to fear of terrorism, which results in more low-speed, surface-street collisions, while there’s no change in high speed collisions because a short spell of distractedness on the highway is less likely to result in an accident.

            • lmm 3 days ago

              > which results in more low-speed, surface-street collisions, while there’s no change in high speed collisions because a short spell of distractedness on the highway is less likely to result in an accident.

              That's not a remotely plausible model though. There are recorded cases of e.g. 1.6 seconds of distractedness at high speed causing a fatal collision. Anything that increases road injuries is almost certainly also increasing deaths in something close to proportion, but a given study size obviously has a lot more power to detect injuries than deaths.

              • d1sxeyes 3 days ago

                Yeah, I was not really trying to argue that that was actually the case, so won’t waste time trying to defend the merits of the model I pulled out of my arse.

                Alternatively then, perhaps safety developments in cars made them safer to drive around the same time? Or maybe advances in medicine made fatal crashes less likely? Or perhaps there’s some other explanation that doesn’t immediately spring to mind, it’s irrelevant.

                The only point I’m really making is that the data OP referred to does not show an increase in excess deaths, and in fact specifically fails to find this.

                • lmm 3 days ago

                  > The only point I’m really making is that the data OP referred to does not show an increase in excess deaths, and in fact specifically fails to find this.

                  That data absolutely does show an increase in deaths, it's right there in the table. It fails to find a statistically significant increase in deaths. The most plausible explanation for that is that the study is underpowered because of the sample size, not that some mystical effect increased injuries without increasing deaths.

                  • d1sxeyes 3 days ago

                    > That data absolutely does show an increase in deaths

                    Please be careful about removing the word 'excess' here. The word 'excess' is important, as it implies statistical significance (and is commonly understood to mean that - https://en.wikipedia.org/wiki/Excess_mortality).

                    I didn't argue that there was no change in the number of deaths, and I did not say that the table does not show any change in the number of deaths.

                    If your contention is that the sample size is too small, we can actually look at the full population data: https://en.wikipedia.org/wiki/Motor_vehicle_fatality_rate_in...

                    Notably, although 2002 had a higher number of fatalities, the number of miles traveled by road also increased. However, it represents a continuation of a growing trend since 1980 which continued until 2007, rather than being an exceptional increase in distance travelled.

                    Also, while 2002 was the worst year since 1990 for total fatalities, 2005 was worse.

                    Fatalities per 100 000 population in 2002 was 14.93, which was around a 1% worsening from the previous year. But 2002 does not really stand out, similar worsenings happened in 1988, 1993, 1994, 1995, 2005, 2012, 2015, and 2016 (to varying degrees).

                    One other observation is that in 2000, there was a population increase of 10 million (from 272 million to 282 million), while other years on either side are pretty consistently around 3 million. I'm not sure why this is the case, but if there's a change in the denominator in the previous year, this is also interesting if we're making a comparison (e.g. maybe the birth rate was much higher due to parents wanting 'millennial babies', none of whom are driving and so less likely to be killed in a crash; again, just a random thought, not a real argument from my side that I would try to defend).

                    The reason 'statistical significance' is important is because it allows us to look at it and say 'is there a specific reason for this, or is this within the bounds of what we would expect to see given normal variation?'. The data don't support a conclusion that there's anything special about 2001/2 that caused the variation.

                    • lmm 2 days ago

                      So what is your actual model here? Do you believe that the rate of injuries didn't actually go up (despite the statistically significant evidence that it did), that some mysterious mechanism increased the rate of injuries without increasing the rate of deaths, or that the rate of deaths really did increase, but not by enough to show in the stats (which are much less sensitive to deaths than to injuries, just because deaths are so much rarer)? I know which of those three possibilities I think is more plausible.

                      • d1sxeyes 2 days ago

                        We are talking about excess deaths, not injuries. It feels like the direction you are taking is going further and further away from the original point, and you are misrepresenting what I have said to fit your argument.

                        I did not say that injuries did not go up. I did not say that the rate of deaths did not go up. I did not say that the rate of deaths did not “show in the stats” (unless by that you mean “was not statistically significant”).

                        I don’t need a model for the injuries question because that isn’t the point we’re arguing, but I might suggest something like “after 9/11, people took out more comprehensive health insurance policies, and so made claims for smaller injuries than they would have in previous years”.

                        A suitable model for explaining excess deaths might be something like “after 9/11, people chose to drive instead of fly, and driving is more dangerous per mile than flying is, resulting in excess deaths”. I’m not sure if that is your exact model, but it’s typically what people mean, happy for you to correct.

                        The problem with that model is that there’s no statistically significant increase in miles driven either. I can’t think of a model which would explain a higher fatality/injury rate per mile driven.

                        Out of interest, what would be your model for explaining why there were more injuries per mile?

                        If you found a single story of someone deciding to drive instead of take a plane in October 2001 because of 9/11, and that person died in a car crash, would that be enough for you to be satisfied that I am wrong?

                        • lmm 2 days ago

                          > The problem with that model is that there’s no statistically significant increase in miles driven either. I can’t think of a model which would explain a higher fatality/injury rate per mile driven.

                          > Out of interest, what would be your model for explaining why there were more injuries per mile?

                          Well, is there a statistically significant difference in the injuries per mile? Or even a difference at all? That the difference in injuries was statistically significant and the difference in miles driven wasn't does not imply that the former changed by a larger proportion than the latter.

                          Pretty much everyone including all of the papers we've talked about assumes there was an increase in driving. Do you actually think driving didn't increase? Or is this just another area where these concepts of "statistically significant" and "excess" are obscuring things rather than enlightening us?

                          > If you found a single story of someone deciding to drive instead of take a plane in October 2001 because of 9/11, and that person died in a car crash, would that be enough for you to be satisfied that I am wrong?

                          I'm interested in knowing whether deaths actually increased and by how much; for me statistical significance or not is a means to an end, the goal is understanding the world. If we believe that driving did increase, then I don't think it's a reasonable null hypothesis to say that deaths did not increase, given what we know about the dangers of driving. Yes, it's conceivable that somehow driving safety increased by exactly the right amount to offset the increase in driving - but if that was my theory, I would want to actually have that theory! I can't fathom being satisfied with the idea that injuries somehow increased without increasing deaths and incurious not only about the mechanism, but about whether there really was a difference or not.

                          All of these things - miles driven, injuries, deaths - should be closely correlated. If there's evidence that that correlation actually comes apart here, I'm interested. If the correlations hold up, but the vagaries of the statistics are such that the changes in one or two of them were statistically significant and the other wasn't, meh - that happens all the time and doesn't really mean anything.

                          • d1sxeyes 2 days ago

                            The study does not quote miles driven in their specific sample so I can’t talk to this or answer your question directly, and the national numbers don’t cover injuries. The numbers in the survey are also a bit vague about injuries, and include “possible injuries”, “injury with severity unknown”, and they are taken separately from incapacitating injury data. The data on injuries is objectively lower quality than the data on deaths, and anyway deaths was the focus of the original question, so I’d rather avoid getting sidetracked.

                            To respond to your question (admittedly not to answer directly), nationally there was indeed an increase in miles driven, an additional 59 billion miles in 2002 compared to 2001, and indeed there was an increase in deaths. I would also expect an increase in injuries as well.

                            Looking at this in isolation, you can say “oh so because of 9/11, people drove 59 billion more miles which resulted in more deaths and injuries”, but in my opinion real question if you want to understand the world better is “how many more miles would folks have driven in 2002 compared to 2001 in case 9/11 never happened”.

                            We obviously can’t know that, but we can look at data from other years. For example, from 1999 to 2000, the increase in miles driven was 56 billion, from 2000 to 2001, the increase was 50 billion and from 2002 to 2003 the increase was 34 billion, from 2003-2004 the increase was 75 billion.

                            Miles driven, injuries, deaths, are indeed all closely correlated. But so is population size, the price of oil, and hundreds of other factors. If your question is “did more people die on the roads in 2002 than in 2001”, the answer is yes. Again, I assume that the same is also true of injuries although I can’t support that with data.

                            That wasn’t OP’s assertion though, OP’s assertion was that closing down airspace does not lead to zero excess deaths. My argument is that the statistics do not support that conclusion, and that the additional deaths in 2002 cannot rigorously be shown even to be unusually high, let alone caused by 9/11.

                            • lmm a day ago

                              > That wasn’t OP’s assertion though, OP’s assertion was that closing down airspace does not lead to zero excess deaths. My argument is that the statistics do not support that conclusion, and that the additional deaths in 2002 cannot rigorously be shown even to be unusually high, let alone caused by 9/11.

                              What we can show rigorously and directly is a tiny subset of what we know. If your standard for saying that event x caused deaths is that we have a statistically significant direct correlation between event x and excess deaths that year, you're going to find most things "don't cause deaths". Practically every dangerous food contaminant is dangerous on the basis that it causes an increase in x condition and x condition is known to be deadly, not because we can show directly that people died from eating x. Hell, even something like mass shootings probably aren't enough deaths to show up directly in death numbers for the year.

                              I think it's reasonable to say that something we reasonably believe causes deaths that would not have occurred otherwise causes excess deaths. If you actually think the causal chain breaks down - that 9/11 didn't actually lead to fewer people flying, or that actually didn't lead to more people driving, or that extra driving didn't actually lead to more deaths - then that's worth discussing. But I don't see any value in applying an unreasonably high standard of statistical proof when our best available model of the world suggests there was an increase in deaths and it would actually be far more surprising (and warrant more study) if there wasn't such an increase.

                              • d1sxeyes 18 hours ago

                                > If your standard for saying that event x caused deaths is that we have a statistically significant direct correlation between event x and excess deaths that year

                                It isn't.

                                > Hell, even something like mass shootings probably aren't enough deaths to show up directly in death numbers for the year.

                                Yes, there's (probably) no statistically significant link between mass shootings and excess deaths (maybe with the exception of school shootings and excess deaths in the population of school children). But you can directly link 'a shooting' and 'a death', so you don't need to look at statistics to work out if it's true that shootings cause deaths. Maybe if mass shootings started to show up in excess deaths numbers, we'd see a different approach to gun ownership, but that's a separate discussion. You can't do the same with 'closing airspace' and 'deaths on the road'.

                                When you're looking at a big event like this, there's a lot of other things that can happen. People could be too scared to take trips they might otherwise have taken (meaning a reduction in mileage as folks are not driving to the airport), or the opposite, 9/11 might have inspired people to visit family that they otherwise might not have visited (meaning in increase in mileage). Between 2000 and 2003, the price of gas went down, which might have encouraged people to drive more in general, or to choose to drive rather than fly for financial reasons (although if you wanted to mark that down as 'caused by 9/11' that's probably an argument you could win). You can throw ideas out all day long. The way we validate which ideas have legs is by looking at statistical significance.

                                Here's some more numbers for you, in 2000, there were 692 billion revenue passenger miles flown in the US. In 2002, it was 642 billion. So we can roughly say that there were 50 billion fewer miles flown in 2002. But the actual number of miles driven in 2002 was 100 billion higher than in 2000 (and note, this is vehicle miles, not passenger miles, whereas the airline numbers are counting passengers). So clearly something else is at play, you can't (only) attribute the increase in driving to people driving instead of flying.

                                > If you actually think the causal chain breaks down - that 9/11 didn't actually lead to fewer people flying, or that actually didn't lead to more people driving, or that extra driving didn't actually lead to more deaths - then that's worth discussing

                                I believe that the causal chain does exist but it's weakened at every step. Yes, I think 9/11 led to fewer people flying, but I think that only a proportion of those journeys were substituted for driving, and some smallish percentage of that is further offset by fewer journeys to and from the airport. I think that the extra driving probably did lead to a small number of additional deaths, but again this question of 'is one additional death enough for you to think I'm wrong' comes back.

                                If I throw aside all scientific ways of looking at it, my belief is that in terms of direct causation, probably, in the 12 months following 9/11, somewhere between 50 and 500 people died on the roads who would not have died on the roads if 9/11 had not happened. But a lot of those were travelling when the airspace was not closed.

                                If we look at the number of people who died because they made a trip by car that they would have otherwise made by plane but couldn't because US airspace was closed (i.e. on 9/11 itself and during the ground stop on the 12th), you're looking at what I believe to be a very, very small number of people, maybe even zero.

    • hobs 4 days ago

      It's true, not saying they did a bad job here, just that even minor problems in your code can be exacerbated into giant net effects without you even considering it.

  • lallysingh 3 days ago

    The software needs a way to reject bad plans without falling over.