

Dataviz Number-Crunching: Can You Racially Profile by Last Name?  


14 March 2016, 18:52

Couple things — I have hideous migraines, so I’m not going to do much new stuff this week.

That said, I have about 20 posts in development right now. So they’re getting popped!

Also, I recently had an article on data visualization published: The Why of Data Visualization. Because the Society of Actuaries is a bit prickly about the whole internet thing, I was not allowed to embed some internet graphics.

Here’s one that got cut from that article:

I can do what I want on LinkedIn, so you can see some of the excised graphics here.

But let’s get into a current issue, and maybe look at some visualizations.


I’ve been seeing this since last year: Bank CEO reveals how Obama administration shook him down:

The former CEO of Ally Financial Inc. says the Obama administration abused its power by holding the bank’s business hostage in order to coerce a record settlement of “trumped-up” racism charges and push profit-killing new regulations on the entire auto-lending industry.

The huge $100 million deal has spooked several other major lenders into resolving similar race-bias charges and offering below-market rates to minorities for car loans.

Michael A. Carpenter, who helmed Detroit-based Ally from 2009 to 2015, complained in an exclusive interview that Obama’s powerful consumer watchdog agency threatened to derail the bank’s efforts to obtain key regulatory approvals if it didn’t agree to settle the allegations out of court.
Since the 2013 deal, the Consumer Financial Protection Bureau has accused the industry’s biggest lenders of ripping off black and other minority customers by charging them higher interest rates than whites.

But Carpenter says there’s no merit to the accusations. He suggests they’re merely a means to a political end: forcing car dealers to abandon discretionary pricing and equalizing credit outcomes, regardless of borrower creditworthiness. He warns moving to flat-rate financing, as the administration wants, would limit the industry’s ability to make a profit and cover risk and would be like “signing our own death warrant.”

Of the dozens of cases prosecuted, none has been based on actual discrimination complaints. All the cases hinge on statistics generated by the flawed disparate-impact methodology.

Let me guess: they did not actually control for credit rating.

Carpenter says the CFPB tried to hide its metrics for the analytical screens it used to determine discrimination. After Ally obtained them and ran the numbers itself, it found that non-discriminatory factors — such as credit differences, vehicle type purchased, trade-ins and down payments — explained virtually all the disparities in the loan data. It also found that many white borrowers were charged dealer rates that were higher than the white average, further undercutting the administration’s case.

Yup. I bet if you look at white people with poor credit scores, they also get quoted bad rates.


Let me give some “equal time” to the other side of the debate:

Worried about the use of big data for corporate gain? Look no further than the credit scoring system in the US, which has profound impact on our daily lives and is a source and perpetuator of systemic racial injustice.

In response to aggressive marketing by the “big three” multinational credit bureaus – Equifax, Experian and TransUnion – employers, landlords and insurance companies now use credit reports and scores to make decisions that have major bearing on our social and economic opportunities. These days, your credit history can make or break whether you get a job or apartment, or access to decent, affordable [auto/home] insurance and loans.

Credit reports and scores are not race neutral. Rather, they embed existing racial inequities in our credit system and economy – to the point that a person’s credit information serves as a proxy for race.

Hey, maybe this author is onto something here, especially if one racial group is penalized compared to others.

Let’s check out the heart of her argument:

People and communities of color have been disproportionately targeted for high-cost, predatory loans, intrinsically risky financial products that predictably lead to higher delinquency and default rates than non-predatory loans. As a consequence, black people and Latinos are more likely than their white counterparts to have damaged credit.

This firmly-entrenched two-tiered financial system has had devastating consequences for entire neighborhoods of color. Starting in the 1990s, financial institutions began flooding historically-redlined neighborhoods with predatory mortgages that ultimately led to the meltdown of the global economy. Waves of foreclosures hammered neighborhoods of color for more than a decade before the crash and black and Latino Americans bore the brunt of the ensuing foreclosure crisis, recession and spiking unemployment. Droves of people turned to high-rate credit cards to cover even basic expenses, contributing to the consumer debt crisis and spawning a bottom-feeding debt-buying industry that purchases old debts on the cheap and then uses the courts to extract judgments disproportionately from people and communities of color. These judgments are then listed in their credit reports, which also brings down their credit scores, in turn limiting a whole range of opportunities.

Although Wall Street is no longer pumping toxic mortgages into black and Latino neighborhoods, people and neighborhoods of color continue to reel from the foreclosure crisis, which many predict is far from over. Meanwhile, racially discriminatory and subprime auto lending are on the rise, payday lenders continue to extract billions of dollars from low-wage workers, and student loan debt has surpassed the trillion dollar mark. One in five Americans has unpaid medical debt, with more than half of all African-Americans and Latinos carrying medical debt on their credit cards. By definition, people who take payday loans and have uninsured medical debt are struggling, and are likely to miss payments. Missed payments translate into decreased credit scores.

….so it’s racist if you ignore that the credit scores are reflecting actual, bad credit behavior.

Yes, people got a bunch of loans shoved at them, but they didn’t have to take the money. Yes, I know it’s tough when people are saying “Take the money! Take that trip to Florida! Worry about repaying tomorrow!”, but it’s also difficult to step away from yummy food, which is a biological imperative.

We still know nobody is making me shove oodles of cheese in my mouth (but I love it sooooo much).

Anyway, there are ways to build up credit and to borrow money as a poorer person. Many immigrants manage to do this. I understand people need education, but people also need to know that there are consequences to their poor fiscal choices.

(Chicago, I’m looking at you)

I have little sympathy for Yelp girl, which is similar to many others. I don’t agree with the piece I linked, but whatever. I still can’t scrape up much sympathy. She could work a nastier job for more money, if she’d like.

But the question becomes: if the lending outfits didn’t compile racial data on potential borrowers, how did the feds accuse them of racism?


Last November, I saw this piece in the WSJ:

Revolt Against Racial Auto Profiling
To Team Obama, if your name is ‘Johnson’ you must be black.

The progressive left hasn’t suffered many defeats in the Obama era, so the occasional setback for the cause of bigger and unlawful government is worth celebrating. Last week we told you about House Democrats preparing to rebel against an egregious abuse by Obama Administration regulators. The rebellion succeeded, and the result is that a veto-proof House majority has rebuked the Consumer Financial Protection Bureau.

By a vote of 332-96, lawmakers voted to roll back the bureau’s campaign to prevent car dealers from negotiating rates on auto loans. The feds have been justifying their power grab—and extracting settlements from the banks that provide auto financing—by claiming that dealers are discriminating against minority borrowers. But the bureau isn’t presenting actual victims who have suffered harm. The regulators are simply guessing the race of borrowers based on their last names and addresses in the loan files and then claiming racism if the people they guessed were minorities seemed to be paying higher rates.

The same was noted in the more recent article.

Worse, the CFPB could never identify the alleged 235,000 Ally minority “victims” harmed by loan mark-ups. The auto industry does not report borrower race, so the CFPB resorted to guessing race by last name and zip code, a so-called “proxy” method that’s wildly inaccurate and often misidentifies whites as black.

As a result, Carpenter says he wouldn’t be surprised if as much as 20% of the checks the government is now mailing out are actually going to Caucasians.

Ooooh, some sweet racial justice!

I do know that people in the U.S. aren’t evenly distributed by race/ethnicity. If you go to the racial dot map, you’ll see results from the 2010 Census, and you’ll see high percentages of black people in the Southeast, Asians bunched up near cities on the coasts, and whites and hispanics everywhere.

You can also figure out where the major cities are by dot density.

Play around with it, it’s fun.

Check out this buzzfeed piece that shows how segregated many major cities are.

So sure, guessing people’s race by address isn’t necessarily a bad approach, given that people really are physically segregated by race.

But what about the last name issue?


Given naming conventions, it’s not surprising that names are pretty concentrated, though there can be a “long tail”. The U.S. Census Bureau shared some last name data, with racial breakdown, from the 2000 Census. I pulled the “full” data set, which has all the last names where at least 100 people had them.

Let me just do a table to start with. Top 10 names, and I’ll break it out by: non-hispanic white, non-hispanic black, hispanic (any race), and other. You can look at my underlying number-crunching here.


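| Last Name | Non-hispanic white | Non-hispanic black | Hispanic (any race) | Other |
|---|---|---|---|---|
| Smith | 73% | 22% | 2% | 3% |
| Johnson | 62% | 34% | 1% | 3% |
| Williams | 49% | 47% | 2% | 3% |
| Brown | 61% | 35% | 2% | 3% |
| Jones | 58% | 38% | 1% | 3% |
| Miller | 86% | 10% | 1% | 2% |
| Davis | 65% | 31% | 1% | 3% |
| Garcia | 6% | 0% | 91% | 3% |
| Rodriguez | 6% | 1% | 93% | 1% |
| Wilson | 70% | 25% | 2% | 3% |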

Most people keep forgetting that (non-hispanic) white people are still the majority in the whole of the United States. There are loads of reasons that other races/ethnicities are over-represented in media, and it’s mainly because they’re younger and more urban. But the black population is about 12-13% of the population (and has been for a long time), and the Hispanic population in 2000 was about the same size as the non-Hispanic black population.

If you look at last names, most people wouldn’t be surprised that Garcia and Rodriguez were mainly Hispanic people.

However, did you know that those with the last names Johnson, Williams, Brown, and Jones were disproportionately black? Heck, even those with the last name Smith are more often black than the overall population.
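To make “disproportionately” concrete, here is a quick sketch that divides each name’s black share (from the table above) by the overall black share. The 12.5% baseline is my assumption, taken as the midpoint of the 12-13% range, and `representation_ratio` is just a helper name made up for illustration.

```python
# Non-hispanic black share (%) for the top 10 last names, copied from the table above.
BLACK_SHARE_PCT = {
    "Smith": 22, "Johnson": 34, "Williams": 47, "Brown": 35, "Jones": 38,
    "Miller": 10, "Davis": 31, "Garcia": 0, "Rodriguez": 1, "Wilson": 25,
}

# Assumption: midpoint of the 12-13% range quoted in the text.
OVERALL_BLACK_SHARE_PCT = 12.5


def representation_ratio(name_share_pct, overall_share_pct=OVERALL_BLACK_SHARE_PCT):
    """Return how many times over (or under) the overall share a name's share is."""
    return name_share_pct / overall_share_pct


ratios = {name: representation_ratio(pct) for name, pct in BLACK_SHARE_PCT.items()}

# Names whose black share exceeds the assumed overall share.
over_represented = sorted(name for name, ratio in ratios.items() if ratio > 1)

for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {ratio:.2f}x")
```

Anything above 1.00x is over-represented relative to that baseline. Of the top ten, only Miller, Garcia, and Rodriguez fall below it.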

How common is this?


Let’s look at a visualization of the whole data set.

I took the 160K+ last names, ranked them by popularity, grouped them by 10K last names, and gave an ethnic breakout by percentage.
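For the curious, the grouping step can be sketched roughly like this. Caveats: the record layout (`name`, `count`, per-group `pct`) and the function name are hypothetical stand-ins rather than the actual Census column names, the three toy records are invented purely to show the mechanics, and I am assuming each bucket’s breakout is weighted by headcount rather than being a simple average of the percentages.

```python
from collections import defaultdict


def bucket_breakout(records, bucket_size):
    """Rank names by count, group them into buckets of `bucket_size` names,
    and return each bucket's headcount-weighted percentage per group."""
    ranked = sorted(records, key=lambda r: r["count"], reverse=True)
    results = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        total = sum(r["count"] for r in bucket)
        people = defaultdict(float)
        for r in bucket:
            for group, pct in r["pct"].items():
                # Convert each name's percentage back into an estimated headcount.
                people[group] += r["count"] * pct / 100
        results.append({
            "ranks": (start + 1, start + len(bucket)),
            "total": total,
            "pct": {g: 100 * n / total for g, n in people.items()},
        })
    return results


# Made-up toy records (NOT Census figures), just to exercise the function.
toy = [
    {"name": "Aaa", "count": 1000, "pct": {"white": 70, "black": 20, "hispanic": 5, "other": 5}},
    {"name": "Bbb", "count": 3000, "pct": {"white": 10, "black": 0, "hispanic": 90, "other": 0}},
    {"name": "Ccc", "count": 500, "pct": {"white": 90, "black": 5, "hispanic": 0, "other": 5}},
]

for b in bucket_breakout(toy, bucket_size=2):
    print(b["ranks"], b["total"], {g: round(p, 1) for g, p in b["pct"].items()})
```

With the real file you would use a bucket size of 10,000. Because the weighting is by headcount, a handful of very common names can dominate a bucket.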

It’s a really boring graph. I can see that the first 10K matches the overall stats… which tells me there’s a lot of weighting on those top 10K last names. Let’s check it out:

Uh yeah.

So I started taking finer and finer slices to see if I could find interesting patterns, and I didn’t see anything interesting until I got to the top 100:

You can see there’s some “lumpiness” in there. There are reasons that Hispanics and Blacks have higher concentrations in fewer last names than “White” people, and it’s primarily due to patriarchal naming conventions as well as the history of these groups in the U.S.

That said, let’s do a scatterplot of last names. To make it simpler, I’m going to plot the number of non-hispanic white people with that last name vs. everybody else.

Each point represents a different last name. The red line is a demarcation of the overall population ratio. Points above that line are biased towards non-white groups.
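If you want to reproduce that demarcation yourself, it is a line through the origin whose slope is the overall ratio of everybody else to non-hispanic white. Here is a minimal sketch. The 69% white share is a round figure I am assuming for illustration, and the two example points are per-1,000-holder counts built from the Williams (49% white) and Miller (86% white) rows of the first table.

```python
def population_line_slope(white_share):
    """Slope of the line y = slope * x, where x = non-hispanic white count and
    y = everybody else, for an overall white share strictly between 0 and 1."""
    return (1 - white_share) / white_share


def is_above_line(white_count, other_count, white_share):
    """True if a name has more non-white holders than the overall ratio predicts."""
    return other_count > population_line_slope(white_share) * white_count


# Assumption: roughly 69% of the overall population is non-hispanic white.
WHITE_SHARE = 0.69

# Per 1,000 people with the name: (white, everybody else).
print("Williams above line:", is_above_line(490, 510, WHITE_SHARE))
print("Miller above line:  ", is_above_line(860, 140, WHITE_SHARE))
```

A name sits above the line exactly when its non-white share is higher than the overall non-white share, so the check does not depend on how common the name is.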

You can see there’s a “line” of points well above the population average line.

What are those specific names?


We’ve already seen a couple of those names — Garcia and Rodriguez — which are heavily Hispanic.

What about those other points wayyyy off the population line?

Top Black Names, U.S., 2000:

| Last Name | Total Number Black | Percentage Black | Percentage White |
|---|---|---|---|
| Williams | 716,704 | 47% | 49% |
| Johnson | 627,720 | 34% | 62% |
| Smith | 527,993 | 22% | 73% |
| Jones | 514,167 | 38% | 58% |
| Brown | 476,702 | 35% | 61% |
| Jackson | 353,179 | 53% | 42% |
| Davis | 329,957 | 31% | 65% |
| Thomas | 271,273 | 38% | 56% |
| Harris | 247,092 | 42% | 54% |
| Robinson | 221,835 | 44% | 51% |

Note that while all of these are disproportionately Black compared to the overall population (12-13%), in most cases there are more white people with that last name.

That’s not the case with the top Hispanic last names:

| Last Name | Total Number Hispanic | Percentage Hispanic | Percentage White |
|---|---|---|---|
| Garcia | 779,412 | 91% | 6% |
| Rodriguez | 745,530 | 93% | 6% |
| Martinez | 710,896 | 92% | 6% |
| Hernandez | 662,648 | 94% | 5% |
| Lopez | 568,768 | 92% | 6% |
| Gonzalez | 561,795 | 94% | 5% |
| Perez | 447,729 | 92% | 6% |
| Sanchez | 404,972 | 92% | 6% |
| Ramirez | 364,364 | 94% | 4% |
| Torres | 296,424 | 92% | 6% |

So, you can pretty much characterize “Hispanic last names”, but it’s a bit more iffy for trying to pin down non-white in general.
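To put a rough number on “iffy”, here is a sketch that backs out each name’s implied total from the two tables above (group count divided by group percentage) and asks: if you guessed the group for everyone holding one of these ten names, how often would you be right? The percentages in the tables are rounded, so the implied totals are only approximate, and `guess_precision` is a made-up helper name.

```python
# (name, number in group, percent in group), copied from the two tables above.
TOP_BLACK = [
    ("Williams", 716_704, 47), ("Johnson", 627_720, 34), ("Smith", 527_993, 22),
    ("Jones", 514_167, 38), ("Brown", 476_702, 35), ("Jackson", 353_179, 53),
    ("Davis", 329_957, 31), ("Thomas", 271_273, 38), ("Harris", 247_092, 42),
    ("Robinson", 221_835, 44),
]
TOP_HISPANIC = [
    ("Garcia", 779_412, 91), ("Rodriguez", 745_530, 93), ("Martinez", 710_896, 92),
    ("Hernandez", 662_648, 94), ("Lopez", 568_768, 92), ("Gonzalez", 561_795, 94),
    ("Perez", 447_729, 92), ("Sanchez", 404_972, 92), ("Ramirez", 364_364, 94),
    ("Torres", 296_424, 92),
]


def guess_precision(rows):
    """Fraction of people holding any of these names who are actually in the group.

    Each name's total headcount is implied: group count / (group percent / 100).
    """
    in_group = sum(count for _, count, _ in rows)
    implied_total = sum(count / (pct / 100) for _, count, pct in rows)
    return in_group / implied_total


print(f"Top black names:    {guess_precision(TOP_BLACK):.0%}")
print(f"Top Hispanic names: {guess_precision(TOP_HISPANIC):.0%}")
```

On these numbers, the guess is right for a bit over a third of people with the top black names, versus more than 90% for the top Hispanic names. This looks at last names only and ignores address, so treat it as a rough illustration rather than a replication of anyone’s actual method.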

This is a fairly old data set, and the percentage of the Hispanic population has increased since 2000. The black population is at about the same level.

That said, if the government tried to identify black people specifically by last name and location… I wouldn’t be the least surprised if more white people got that “restitution” than actual black people. Mainly because there are still more white people than black people (by quite a lot) even in areas where there are relatively high numbers of black people.

You get these sorts of absurdities from statistical “profiling”.
