A Waymo and a Tesla (driving for Lyft) pass on the street
Brad Templeton
A hot topic in self-driving is the comparison between Waymo, the robotaxi leader, and Tesla, the innovative EV company with the system it calls Supervised Full Self-Driving, which it promises will soon become an unsupervised self-driving and robotaxi system. To most in the industry, Waymo is the clear leader and Tesla is barely in the game, but many Tesla fans believe Tesla will leap ahead and win this race.
In order to compare the two, it’s important to figure out just how to judge the quality of a self-driving system. Has it reached the “bet your life” reliability required to operate with nobody inside and to put people in the back seat? This is not a new question. Since 2016, before FSD was even spoken of by Tesla, I’ve been writing articles about this, arguing that figuring out how to prove you’ve made a working self-driving car is the single biggest blocker to success. Building one is, of course, hard, but proving you’ve done it, to yourself, to your lawyers, to your board, to your customers, and to the public and the government, is the essential next step. Many others have also written in detail about this task.
The big problem is this. You can’t tell a car is good at all from taking a ride in a prototype car. You can’t tell much even from taking 100 rides or 1,000 rides. You definitely can’t tell anything by watching a video of a ride. For most people though, that’s the only thing they can do, so they naturally want to conclude something after a ride.
You can tell a car is bad from a ride. If it makes any safety mistake at all, something that might cause contact with another vehicle or object, you’ve learned quickly that it’s almost surely an immature prototype. So you can tell a car is bad, but annoyingly, if the ride is perfect, you learn nothing about how good it is, just that it has a slightly better chance of not being dreadful.
A good self-driving car, one that’s ready to be on the roads unsupervised, needs to do even better than the average human. But the average human only has one significant crash in their lifetime. They may have 3-5 minor dings, the kind that don’t get reported to police or insurance, but a large fraction of human drivers never have a police-reported crash in their 60 or so years of driving.
What that means is that in order to declare a car is at human level, sort of ready but not that impressive, you would have to ride in it for your whole lifetime of road travel. Nobody can do that. Indeed, no car design or software would ever last that long. Even after that whole lifetime you would only suspect it’s as good as a human driver; you wouldn’t be sure. To declare it impressive, you would need to ride for over 4 lifetimes.
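To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The specific figures (about 500,000 miles in a driving lifetime, one significant crash per human lifetime) are my own illustrative assumptions, and the bound comes from the statistical “rule of three”: observing zero crashes over N lifetimes of riding only shows the crash rate is below about 3/N per lifetime.

```python
# Back-of-the-envelope sketch of the "lifetimes of riding" argument, using
# assumed figures: roughly 500,000 miles driven over a ~60-year driving life,
# and about one significant crash per human driving lifetime.
LIFETIME_MILES = 500_000
HUMAN_CRASHES_PER_LIFETIME = 1.0

def crash_rate_upper_bound(flawless_miles):
    """95% upper bound on crashes per lifetime after observing zero crashes.

    Uses the statistical "rule of three": zero events over n units of
    exposure bounds the rate at roughly 3/n with 95% confidence.
    """
    lifetimes_observed = flawless_miles / LIFETIME_MILES
    return 3.0 / lifetimes_observed

for lifetimes in (1, 2, 4):
    bound = crash_rate_upper_bound(lifetimes * LIFETIME_MILES)
    verdict = ("still can't rule out worse than human"
               if bound > HUMAN_CRASHES_PER_LIFETIME
               else "finally provably better than human")
    print(f"{lifetimes} lifetime(s) of flawless riding: "
          f"rate < {bound:.2f}/lifetime -- {verdict}")
```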
Our instinct, of course, is that we can judge from just a single drive, or a year of driving. That’s because we can and do judge humans that way. A driving examiner will go out with a teenager for 30 minutes and hand them a license. We understand how humans drive, what sort of mistakes they make, and of course we have no other choice. Indeed, almost every driver who ever crashed passed that test. We don’t expect a computer to be able to drive at all, so when it does, we’re immediately impressed. We’re wrong.
Lots of Data
The only way to actually judge a self-driving system is with lots and lots of aggregate data from thousands of cars driving millions of miles, many human lifetimes. There we can track injuries, property crashes, infractions of the law and bad road citizenship, and get enough data to come to mostly confident conclusions. We can also test special cases on test tracks. We can, and do, also test the cars in simulators, where we can run billions of virtual miles and millions of strange and dangerous situations, with 1,000 variations of each. This is what teams do, though sadly, only real-world testing provides the true answer, and even then it is slow to do so. They don’t just do a simple calculation either; safety research and the calculation of risk is a field of its own, with journals and conferences and lots of debate on the right way to do it.
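As a purely hypothetical illustration of the “1,000 variations of each” idea, a simulation sweep might randomize the parameters of a single scenario. The scenario, parameter names, and ranges below are invented and don’t reflect any company’s actual tooling.

```python
# Hypothetical sketch: generating randomized variations of one simulated
# scenario (an unprotected left turn with an oncoming car). The parameters
# and ranges are invented for illustration; real simulation pipelines are
# far more elaborate.
import random

def make_variations(n=1000, seed=0):
    rng = random.Random(seed)
    variations = []
    for _ in range(n):
        variations.append({
            "oncoming_speed_mph": rng.uniform(20, 55),  # speed of the other car
            "gap_seconds": rng.uniform(2.0, 8.0),       # time gap offered to the AV
            "occlusion": rng.random() < 0.3,            # is the view partially blocked?
            "pedestrian_present": rng.random() < 0.2,   # someone in the crosswalk?
            "road_friction": rng.uniform(0.4, 1.0),     # wet vs. dry pavement
        })
    return variations

scenarios = make_variations()
print(len(scenarios), "variations of one left-turn scenario, e.g.", scenarios[0])
```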
The problem is, only the companies have this aggregate data. Most of them keep it close to the vest, unfortunately. We can only get clues from how they act.
Imagine you had to compare two cars: one causes a crash every 3 months, the other every 200 years. You could take them out for dozens of drives, and they would likely all be flawless. You might be highly impressed, but you would be unable to tell the difference between them from your personal experience. Yet one is ready to be a robotaxi; the other will kill you soon if you try.
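A quick calculation shows why dozens of rides can’t separate them. Purely for illustration, assume typical private-car usage of about 12,000 miles a year, so “a crash every 3 months” is roughly one per 3,000 miles and “every 200 years” is roughly one per 2.4 million miles, and suppose you take 50 rides of 5 miles each:

```python
# Illustrative sketch: why dozens of flawless rides can't distinguish a
# dangerous prototype from a deployable robotaxi. Figures are assumptions:
# ~12,000 miles/year of driving, so "a crash every 3 months" is about one
# per 3,000 miles and "every 200 years" is about one per 2.4 million miles.
from math import exp

MILES_PER_YEAR = 12_000
bad_car_miles_per_crash = MILES_PER_YEAR / 4      # one crash every 3 months
good_car_miles_per_crash = MILES_PER_YEAR * 200   # one crash every 200 years

def p_flawless(miles_ridden, miles_per_crash):
    """Probability of seeing zero crashes, modeling crashes as a Poisson process."""
    return exp(-miles_ridden / miles_per_crash)

rides, miles_per_ride = 50, 5
total = rides * miles_per_ride
print(f"Bad car:  P(50 flawless rides) = {p_flawless(total, bad_car_miles_per_crash):.2f}")
print(f"Good car: P(50 flawless rides) = {p_flawless(total, good_car_miles_per_crash):.4f}")
# Both are likely to look perfect over 250 miles, yet one is ~800x riskier.
```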
For several years, Waymo and Cruise and other companies tested their cars with a professional human “safety driver” supervising the car. The safety driver took over if the car did something scary. This worked very well, but once they got confident enough to take the safety drivers out, a whole bunch of new problems started showing up, particularly in areas like interactions with emergency crews and difficult traffic situations. The supervising human drivers had been doing their job, stopping the cars from getting too deep into those situations. It took Waymo over five years from when they first tested with no human on board until they were ready to scale. There was so much left to learn, and prove. Cruise never made it. (GM just absorbed the remains of Cruise, laying off 50% of the staff on Feb 4.) While a lot of their downfall came from a culture of secrecy that led to a brief cover-up of the details, it all centered on an incident that would not have happened with a safety driver, itself triggered by an automatic behaviour (“pull over immediately after a crash”) probably put in to handle other mistakes that were only being discovered in the absence of safety drivers.
Speed of improvement
At first, your vehicle improves rapidly. Once it gets more mature, improvements are hard to notice
Brad Templeton
A factor people can notice from individual riding experiences is the level of improvement in a system. Here, again, our intuitions mislead us. If a system appears to be improving, particularly by obvious jumps, that’s actually a sign that the system is still immature and has a long way to go. A mature system will only be experiencing small, incremental improvements that aren’t generally visible to an individual observer, only in the statistics. If you come away from a drive especially impressed by how much better the vehicle is, that means both that it was pretty poor before and that it almost certainly didn’t just leap to a state of near-perfection. You’ve got years to go.
The typical safety and quality progress of a self-driving car looks like the graph above. Early on, progress is fast and improvements come quickly. Over time, they come much more slowly, with much more work. Your rate of improvement will show you where you are on the path.
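One way to see why later improvements stop being visible from the passenger seat: as the incident rate falls, the chance of noticing anything on a single ride quickly becomes negligible, so each further 10x improvement feels the same to a rider even though it matters enormously in the statistics. The ride length and per-release rates below are invented for illustration.

```python
# Illustrative sketch (invented numbers): how per-ride impressions stop
# tracking real progress as a system matures.
from math import exp

RIDE_MILES = 10  # assumed length of a typical demo ride

# Hypothetical "releases", each 10x better than the last (miles per noticeable mistake)
releases = [100, 1_000, 10_000, 100_000]

for miles_per_mistake in releases:
    p_notice = 1 - exp(-RIDE_MILES / miles_per_mistake)  # Poisson model of mistakes
    print(f"1 mistake per {miles_per_mistake:>7,} miles -> "
          f"{p_notice:.2%} chance a rider sees anything on one ride")
# Early jumps (~10% -> 1%) are easy to feel; later jumps (0.1% -> 0.01%) are invisible.
```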
One thing we still don’t quite know how to measure is the risk of fatalities. So far, there hasn’t been one. An early Uber test vehicle did tragically hit and kill a pedestrian, but only because a poorly managed and negligent safety driver was watching TV rather than doing her job. The prototype was performing as prototypes are expected to do. We can’t measure this because humans only cause a fatality about every 80 million miles, or 2 million hours of driving. That’s over 150 lifetimes of driving, which is what we should hope for, but it makes testing very challenging. A single software version is unlikely to stay in service long enough to accumulate that much driving. If it did, Tesla might have a shot at driving that many miles, but it would be supervised, with no good way to know what would have happened without the interventions.
A controversial study from Rand argued that testing was impossible because you would need over a billion miles to get decent statistical certainty. If you’ve driven 100 million miles and had one fatality, as humans would, you don’t really know if you just had some luck and will actually have two, or even three, on average. Fortunately you don’t need such certainty, because the odds of that bad luck are constrained, and the alternative of waiting assures that humans will be out there driving, and we know how many crashes and fatalities they will have. Better to “take the risk” and deploy with less certainty, because the alternative is millions of deaths at the hands of human drivers while you wait to confirm you weren’t just unusually lucky.
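To illustrate the statistical problem the Rand study was pointing at: if you observed exactly one fatality in 100 million miles, an exact Poisson confidence interval still leaves the true rate anywhere from far better than a human to several times worse. The sketch below is my own illustration of that point, not the study’s methodology.

```python
# Illustrative sketch of the statistical problem: an exact 95% Poisson
# confidence interval on the fatality rate after observing 1 fatality in
# 100 million miles of driving. My own illustration, not Rand's method.
from scipy.stats import chi2

observed_fatalities = 1
miles = 100e6

# Exact Poisson (Garwood) interval on the expected count over this exposure
lower = chi2.ppf(0.025, 2 * observed_fatalities) / 2
upper = chi2.ppf(0.975, 2 * (observed_fatalities + 1)) / 2

print(f"95% CI on fatalities per 100M miles: {lower:.2f} to {upper:.2f}")
# Roughly 0.03 to 5.6 -- anywhere from far safer than a human driver
# (about 1 per ~80-100M miles) to several times more dangerous.
```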
Safety experts keep researching just how to measure the quality and risk levels of self-driving cars. They hope to be able to map from minor mistakes, which are more frequent for both humans and robots, to the more serious mistakes we actually care about. This is difficult, because robots don’t work the way humans do. One approach Waymo has tried is to ask automobile insurance experts to calculate the number of liability insurance claims that would come from their driving, and compare it to the numbers humans generate. Insurance actuaries understand the claims that come from human driving, and their cost, very well. Waymo likes the results, which showed them having 1/6th as many claims, and yet they still haven’t felt confident enough to take the general public on highways in their service areas.
Once again, proving that a self-driving car is ready is hard. You also need to prove it everywhere you want it to drive. You don’t want to bet your company, and customers don’t want to bet their lives, on a car driving unsupervised on a road it’s never been tested on before. That means the road ahead is, quite literally, very long. The first people any self-driving team has to prove success to is themselves, and eventually their bosses and lawyers and board. That’s because they’ll be betting the project, maybe the company, on getting this right. If people didn’t know this before Cruise got pulled from the roads and shut down, they know it now.
There is no Claim, only Do
Most companies are still very secretive about their internal numbers. They would prefer not to air their dirty laundry, and to disclose only the minimum required by law. Some will publish the details later, once they know they are good. We’ve never seen anybody willingly publish bad news. Instead, we can only look at what they do rather than what they say.
For example, being willing to remove that safety driver is a huge deal. It shows that their own internal numbers (and they have all the numbers) say the risk is acceptable. Likewise letting the public into their vehicles, particularly without any advance agreement not to talk about the experience. Videos from companies and carefully planned demo rides on fixed routes don’t show much. Rides where the customer picked the route say a lot more. Letting riders put videos up in public also says the company is confident the result won’t be too embarrassing.
Cruise also published some data, including some third-party analysis, but not nearly as much as Waymo. This is the bar to which companies should be held, because only analyzing lots of data lets us really figure out what’s going on. There have been legal attempts to require companies to report “disengagements” or crash events, but they have not helped that much, because companies still find ways to keep things they don’t like out of the data, or to hide them within it. We still have some way to go before we can make judgments. In China, we also have to deal with a different style of reporting from the press and even on social media, which can leave information very filtered. Companies must understand that it’s important for them that the public feels the vehicles are safe, and has confidence in that belief. The companies are, not incorrectly, scared of the public’s irrational fears, but keeping the truth under wraps isn’t the answer either. Tesla has notoriously released only highly misleading data, possibly deliberately so, leaving people with only personal driving experience as their guide, even though that’s almost useless. Much more needs to be done.