Big Data | 8 December 2020

Marty Kagan

CEO, Hydrolix


We have all heard the stories about AI at some large company turning out to be biased, racist, or gender-discriminatory. Computers are only as smart as we make them. They learn from the data we give them, so if the data is biased, why wouldn’t the answer be biased too? As humans, we are wired for confirmation bias: we look for evidence that proves our theory or idea correct. There’s nothing wrong with this, but it is something we are all guilty of because of how we are wired biologically.

Programmers usually write algorithms independently of any particular data set, so a machine learning algorithm works the same way each time, regardless of what data is fed to it. The algorithm is not biased with regard to the data. Whatever bias it has comes from how it was written; data bias enters through the data we choose to give it.
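As a minimal sketch of that point, the toy "learner" below runs exactly the same code on two data sets and reaches opposite conclusions. The function name `train_majority_classifier`, the labels, and the 60/40 and 20/80 splits are all hypothetical, chosen only for illustration; a real machine learning algorithm is far more involved.

```python
from collections import Counter


def train_majority_classifier(labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _count = Counter(labels).most_common(1)[0]
    return lambda: most_common_label


# Two hypothetical training sets: same code path, different data.
representative = ["approve"] * 60 + ["reject"] * 40
skewed = ["approve"] * 20 + ["reject"] * 80

print(train_majority_classifier(representative)())  # approve
print(train_majority_classifier(skewed)())          # reject
```

The code never changes between the two runs. Only the data does, and the answer follows the data.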

Statisticians discuss this all the time: how do you avoid sample bias? Think of a telephone survey. You have biased your data toward people who have a phone number. That isn’t much of a problem in 2020. However, for the survey to be administered, a person must ANSWER the phone. Who answers unknown numbers anymore? I don’t. So your survey will be biased toward people who answer calls from numbers they do not recognize. Some estimates say as many as 90% of people don’t answer such calls.
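The effect is easy to see in a small simulation. The sketch below assumes, purely for illustration, that 10% of people answer unknown numbers and that they hold a different opinion from those who screen their calls. The function name `simulate_phone_survey` and every rate in it are made-up assumptions, not measured values.

```python
import random


def simulate_phone_survey(population_size=100_000, seed=0):
    """Compare true support in a population with support among people who pick up."""
    rng = random.Random(seed)

    # Hypothetical parameters, chosen only for illustration.
    answer_rate = 0.10         # share of people who answer unknown numbers
    support_if_answers = 0.70  # support among people who pick up
    support_if_screens = 0.40  # support among people who screen calls

    total_support = 0
    respondents = 0
    respondent_support = 0
    for _ in range(population_size):
        answers = rng.random() < answer_rate
        support_rate = support_if_answers if answers else support_if_screens
        supports = rng.random() < support_rate
        total_support += supports
        if answers:
            # Only people who pick up ever make it into the survey data.
            respondents += 1
            respondent_support += supports

    return total_support / population_size, respondent_support / respondents


true_rate, surveyed_rate = simulate_phone_survey()
print(f"true support:     {true_rate:.3f}")
print(f"surveyed support: {surveyed_rate:.3f}")
```

Under these assumptions the survey reports support close to the 70% found among people who pick up, while support across the whole population is much lower. Nothing is wrong with the arithmetic; the sample simply never included the people who did not answer.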

When we collect data, we collect what we think is important, based on our own observations. If we had a security incident that lasted for 378 days, we might decide we need to keep 378 days of data. If we have an issue with DNS exfiltration, we might start monitoring that data closely. In making those choices, we miss a much bigger picture. Our data bias comes from our observable experiences.

Why do we have to make these choices? Because data is expensive to ingest, store, query, and archive. We must be certain that there is value in that data before we invest in the resources we need to use it.

