Beavers and Bias Down at the Data Lake
My kids like to play this game called “Would you rather...” - offer up an impossible choice and then proceed to tease the other person for their answer.
Would you rather have a butt on your forehead or go to school naked?
Would you rather kiss your brother or eat a poop sandwich?
The questioner sets the stakes, and bounds the possible answers. Just by agreeing to play, you’ve already lost. And it’s hilarious for the elementary school set - to divide the world into buttheads and poop eaters. The classification lasts as long as their attention span.
But when the same kinds of rigged questions get asked in the real world - a real world driven by data collection and analysis - the results can be devastating.
Through the looking glass
In the digital world, you are your data. When we represent a “person” online or in our applications, we encode them as a series of data points. A username. A birthdate. A password. A time since last login. A purchase history. And who you digitally become is only as rich and nuanced as the data allows. Those who design the data systems frame the questions. And those who foot the bill decide what gets done with the answer.
In a world increasingly governed by software applications, the story of our digital selves is written by the companies who build these tools. They decide how many drop-down choices we’ll have when choosing a gender. They decide if our purchase history gets paired with household income, or Facebook likes. Maybe they collect your legal name, but not what you’d prefer to be called. Or perhaps they collect your marital status, not whether you have any children or aging parents who depend on you.
They may leave things out, or add things in, warping the final picture of you in their systems. Many render us as nothing more than a pixelated data snapshot of our hands resting on our wallets. But especially when we talk about healthcare companies, the data paints a fun-house mirror version of ourselves that we’re not always privy to.
So who cares, right? It’s not me. It’s just data - just some neutral facts about where you click and how long you spend on a page.
Except, there’s no such thing as neutral facts. And there’s definitely no such thing as neutral data.
There’s no such thing as neutral data
All data is the result of choice. What and how to measure, who enters the information, and who has the power to change it. Shall we measure someone’s height? If so, will we trust them to enter the information on their own? Should that measurement be in feet? Inches? Or perhaps we only care if someone is “tall”, “average” or “short”? And what about age? Do we care if someone is 4’11” when they are 10 years old? What about when they’re 60?
All these choices subtly change what meaning we can make from the data that are collected. And all of these choices are made by humans; sometimes very experienced, careful or thoughtful humans. Other times amateur, careless or greedy humans.
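If you build software, this might look familiar. Here’s a hypothetical sketch in Python (every field name and category below is invented for illustration) of two ways to model the same person, and how each one quietly decides what questions can ever be asked of the data later:

```python
from dataclasses import dataclass
from enum import Enum

# Schema A: height kept as a raw measurement, with context preserved.
# Later, an analyst can still ask "is 4'11" short for a 10-year-old?"
@dataclass
class PersonA:
    height_cm: float        # exact measurement, unit chosen up front
    age_years: int          # lets height be interpreted in context
    self_reported: bool     # records *who* entered the number

# Schema B: height pre-bucketed into someone's idea of categories.
# The nuance is gone before a single row is ever analyzed.
class HeightBucket(Enum):
    SHORT = "short"
    AVERAGE = "average"
    TALL = "tall"

@dataclass
class PersonB:
    height: HeightBucket    # whose definition of "tall"? Decided here, silently.
```

Schema B isn’t wrong, exactly. It just made someone’s definition of “tall” permanent, long before any analyst showed up.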
These choices then go on to shape what an application (and often an organization) knows about you. They become a part of the institutional memory. A shared understanding of who a customer or target or patient might be, replete with any biases, shortcuts or omissions made by the original framers of the question. Even very well-intentioned people make leaps of logic that can have unintended consequences.
A recent example that made headlines was a major algorithm used by hospitals and insurance companies that was shown to be biased against Black people. The tool was designed to predict who might need extra care and attention in managing complex medical conditions. But instead of looking at people’s underlying conditions, the algorithm just looked at how much a person was spending each year on healthcare. The thinking was that spending was a good stand-in for the amount of healthcare you actually need. Because no one ever skimps on necessary care just because it's too expensive, right? Yeah...
The tool turned out to be biased against Black patients, drastically underestimating their risk for complex conditions. Why? Probably for a couple of reasons. Race is correlated with lower income, and poorer people use fewer health services, even if they have insurance. There is also a well-established history of the medical system abusing, mistreating and ignoring illness in Black bodies. So it kinda makes sense why Black folks might have problems trusting the healthcare system, and may be hesitant to seek out medical care.
The data collected about healthcare spending was probably accurate; the interpretation of what it meant was not.
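To see how a proxy like that falls apart, here’s a toy sketch with made-up numbers. It’s not the actual hospital algorithm, which was proprietary and far more complicated; it just shows what happens when “spending” stands in for “need”:

```python
# Toy illustration: two patients with identical underlying health needs,
# but very different access to (and trust in) the healthcare system.
patients = [
    {"name": "patient_1", "chronic_conditions": 3, "annual_spend_usd": 12_000},
    {"name": "patient_2", "chronic_conditions": 3, "annual_spend_usd": 4_000},
]

def risk_score_by_spend(patient):
    # The proxy: "more spending means more need." Cheap to compute,
    # but it silently assumes everyone with the same need spends the same.
    return patient["annual_spend_usd"] / 1_000

def risk_score_by_condition(patient):
    # A need-based alternative: look at the conditions themselves.
    return patient["chronic_conditions"] * 4

for p in patients:
    print(p["name"], risk_score_by_spend(p), risk_score_by_condition(p))

# The spend-based score ranks patient_2 as far lower risk,
# even though their medical needs are identical.
```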
Even data collection is not a neutral act
A physician friend of mine who works with kids in juvenile detention centers was grateful to see that these facilities are now screening for history of sexual or physical trauma, so that kids can get connected to the services they need. Apparently, it had been a fight to get even these two tiny checkboxes added to the intake forms. But oftentimes she would see a form marked with “no past trauma”, where later the form noted “history of gunshot wound”. What?! There are only two ways a person gets shot: by accident or on purpose. Both of those are traumatic!
Why would someone have recorded “no past trauma” for a gunshot victim? Who was the grown-up who didn’t bother to ask a follow-up question? Were they recording the information in an application that didn’t allow them to easily see these two data points in contradiction? How many clicks and drop-downs were they required to move through to complete a single intake? The tools you use to record information can have profound effects on the quality of the data you collect, and on the mental health of the collector.
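For what it’s worth, contradictions like that aren’t hard to catch in software. Here’s a minimal sketch, with hypothetical field names standing in for a real intake form, of the kind of consistency check that could prompt a follow-up question before the record ever gets saved:

```python
# Hypothetical intake record; the field names are invented for illustration.
intake = {
    "past_trauma": False,
    "injuries": ["history of gunshot wound"],
}

TRAUMATIC_INJURIES = ("gunshot", "stab", "burn")

def flag_contradictions(record):
    """Return warnings when the checkboxes contradict the recorded history."""
    warnings = []
    if not record["past_trauma"]:
        for injury in record["injuries"]:
            if any(term in injury.lower() for term in TRAUMATIC_INJURIES):
                warnings.append(
                    f"'No past trauma' recorded alongside: {injury!r}. "
                    "Prompt the interviewer to follow up."
                )
    return warnings

print(flag_contradictions(intake))
```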
Just as we know texting and driving is a dangerous practice, so is trying to have a thoughtful interview while clicking through a crappy interface with tiny fonts and a cluttered layout. People make mistakes, or use up so much extra mental energy trying to do these two tasks at once that they quickly become stressed out and exasperated. Bad interface design is one of the main drivers of an epidemic of burnout in medical professionals. An epidemic so bad, it’s now considered a public health crisis.
And then there’s the mindset of the people we’re asking to share data with us. Going back to the gunshot example, there are a number of reasons that young person might have answered “no” to a question about past trauma. Perhaps that kid didn’t trust the individual filling out the paperwork. Or maybe they come from a background where getting shot just isn’t exceptional. Maybe they didn’t want to look weak as they were booked into an infamously violent prison. Or maybe they were just hungry and wanted to get the interview over with, and disclosing a trauma would only have dragged out the process.
When all these complex social forces meet in a moment of data collection, all that remains is the checked box. The drop-down selected. The parts of the story allowed by those who designed the system.
Data in the wild
Data do not exist in isolation. They sit next to each other on our spreadsheets. They divide and multiply each other into new kinds of data points like credit scores and body-mass index (BMI). And increasingly, they leave the applications where they were originally collected and recombine with data from other sources.
Yes, that’s right. Your digital self is running around the internet, having unprotected data sex with tons of other versions of “you”. You, the occasional gym member. You, the Amazon Prime subscriber. You, the single city dweller interested in travel. You, the outgoing, funny, recently STI-tested person on the dating app. Your digital self is swimming in literal data lakes of information, collected with or without your knowledge.
Companies then “enrich” these lakes with boatloads of data purchased from places like Experian (yeah, the credit score people) so that suddenly they can predict what Buzzfeed articles you’ll like based on your household income. And your local supermarket can push you coupons based on how often you use your fitness tracker. It doesn’t matter if you never took the active step to link these apps or websites to share information about you - it happens passively. All behind the scenes.
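Mechanically, this kind of “enrichment” is usually just a join on whatever identifier the parties have in common, often an email address or a hashed version of one. A minimal sketch with made-up data, assuming pandas is available:

```python
import pandas as pd

# Data you gave one company: a supermarket loyalty program.
purchases = pd.DataFrame({
    "email": ["you@example.com"],
    "weekly_grocery_spend": [87.50],
})

# Data purchased from a broker: income and app-usage estimates.
broker = pd.DataFrame({
    "email": ["you@example.com"],
    "est_household_income": [72_000],
    "fitness_tracker_syncs_per_week": [5],
})

# One line, and two versions of "you" that never asked to meet
# become a single, richer profile, whether or not you ever agreed
# to link these accounts.
profile = purchases.merge(broker, on="email", how="left")
print(profile)
```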
So even well-structured, accurately collected and safely stored data can end up blended with dirty data from other sources. Before we decide to act on our data, we have to look at the story they tell together, and be willing to question the narrator.
Every part of the data ecosystem matters
Data in and of itself is not good or bad. It’s powerful. As we continue to build the digital world around us, we have to grapple with the fact that collecting, filtering and interpreting all this data is not a task we should take lightly. Like beavers building dams across streams of information, our applications change the environments around them.
Beavers, in their effort to put a bear-proof roof over their heads, end up flooding the surrounding woodlands with a “beaver pond”, fundamentally changing the ecosystem for every creature in the area. Now depending upon who you are, this pond is either a godsend or a disaster. Frogs? Love it. Deer? On board. Clean water activists? Two thumbs up. Farmers? Eh... it depends. Road and bridge maintenance crews? Not so much.
Let me be clear. I love beavers. I feel a deep kinship with these critters as engineers, unlikely woodland power brokers, and Wynona’s problematic BFF. And while beavers may get the pass for not thinking ahead to the consequences of their actions, I can’t do the same for those of us who collect, analyze, secure and make use of digital data.
More and more, we turn decisions over to complex algorithms and artificial intelligence: who receives extra medical attention, who gets approved for a mortgage, which neighborhoods require extra policing. And as the saying goes, it’s garbage in/garbage out with these tools.
Shoddy data undermines our confidence in all data, even the good stuff. David Gershgorn points out, “If the data are flawed, missing pieces, or don’t accurately represent a population of patients, then any algorithm relying on the data is at a higher risk of making a mistake.” And guess what? Most of the large health datasets used to train AI are notoriously biased, extremely male and extremely white. As Gershgorn argues, we need better teachers for our AI creations, especially if they are going to become our doctors.
So what now?
We have to be good stewards of the knowledge we generate. And we have to build reliable portraits of ourselves in digital form. What can be dismissed as unconscious bias on an interpersonal scale can become catastrophic when amplified through the institutional memory of an organization and its software.
We need to remember:
Data is a selective story
Who makes those choices? Who frames the questions?
Data must be gathered
Who does the asking? Who provides the answers?
Data needs integrity
Who checks for accuracy? What assumptions are they making?
Data shapes analysis
Who makes the meaning? What are they incentivized to find?
If you consider yourself a maker, build with care.
If you find yourself becoming a user, share with caution.
Catch you all down at the data lake :)