Sometimes, interviewing a data scientist is a lot like being a data scientist. Whichever side of the conversation you’re on, you’re trying to detect and isolate the signal that can lead you to the insight.
Chief of Staff
Learning that Zohar Vittenberg, Explorium’s Chief of Staff, studied theater in high school is the sort of enticing outlier that can send a conversation in unexpected directions. If we followed that datum down a few rabbit holes, we might derive a theory that his youthful interest in drama and stagecraft is at the root of his current professional interest in natural language processing. You might start to wonder if he’s ever tempted to ask ChatGPT to write a few scenes that he could rehearse with other theater kids who grew up to be data scientists.
Entertaining. Interesting. But, in data terms, noisy, and therefore not particularly insightful about his career in data science and external data.
Fortunately, a conversation with Zohar is high signal as well as highly engaging.
Math was always his favorite subject, from high school through his Master’s degree in computer science. His university studies were part of an elite program for recruits who have demonstrated outstanding academic ability in the sciences and leadership potential. After graduating, he was assigned to be a data scientist in the military intelligence unit.
“Intelligence is basically data, right? Then it’s about how you manage to make smart decisions and predictions from the data. Data science has a lot of math at its basic level, because we use math to optimize how the computer learns to solve a problem. We’re trying to solve a mathematical equation to get the best outcome, and it has a lot of applications in the real world that are really easy to see. It’s the best combination of math and real world application.”
Computer science is math. Intelligence is data. Data science leverages computers to solve math problems derived from intelligence to solve real world problems.
Every time Zohar makes another connection, we get more signal leading us to his work with external data and machine learning models.
Which comes first: Entity resolution or the external data cloud?
Tracing concepts back to first principles and atomic components is necessary for understanding how Explorium’s external data cloud works.
As any data scientist knows, the early steps in merging and joining data from two or more sources entail creating agreement between data types, formats and names. The big question is how you can confirm that what appear to be two entities (e.g., two companies) are really the same entity, in which case the objects representing them should be merged, or truly different entities, in which case they should remain distinct objects. This distinction is critical in the process of connecting internal and external datasets to develop a single comprehensive view of an entity, and it enables companies to deduplicate data within and across their systems of record.
“This process is called entity resolution (ER). It is essential in ensuring that when someone queries our external data cloud with the name of a company, we have the right identifiers to retrieve and deliver the right information about that company.
“To do this via the platform, we first have to identify that this is the company that they want. Only then do we provide the signals associated with that company in the data cloud. The first part is where we invested most in terms of our technology to make sure that we do this matching process accurately.”
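To make the matching step concrete, here is a minimal sketch of entity resolution on company records. It is an illustration only, not Explorium's method: the record fields (`name`, `domain`), the suffix list, the 0.9 similarity threshold and the greedy grouping are all assumptions chosen for the example, and production systems rely on far richer identifiers and models.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "company", "limited"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def same_entity(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Decide whether two company records refer to the same entity."""
    # Assumption: a shared web domain is treated as a strong identifier.
    if a.get("domain") and a.get("domain") == b.get("domain"):
        return True
    # Otherwise fall back to string similarity on the normalized names.
    score = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return score >= threshold


def resolve(records: list[dict]) -> list[list[dict]]:
    """Greedily group each record with the first cluster whose seed it matches."""
    clusters: list[list[dict]] = []
    for rec in records:
        for cluster in clusters:
            if same_entity(cluster[0], rec):
                cluster.append(rec)
                break
        else:
            # No existing cluster matched, so this record starts a new entity.
            clusters.append([rec])
    return clusters


records = [
    {"name": "Acme, Inc.", "domain": "acme.com"},
    {"name": "ACME Inc", "domain": None},
    {"name": "Globex Corporation", "domain": "globex.com"},
]
clusters = resolve(records)
```

In this toy input, the two Acme records normalize to the same name and end up in one cluster, while Globex stays separate. Only once records are grouped this way does it make sense to attach signals to "the" company.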
Constructing an external data cloud is a “chicken and egg” problem. A functional external data cloud requires precise entity resolution. But attaining that level of entity resolution requires the scale and scope of data that only an external data source can provide.
“Every company that wants to do this will either use a solution like ours or try to develop it on their own. And it takes years, and a lot of money goes into it. So, how?”
Practical abstraction: Separate the data and platform R&D groups
“Usually, companies have just one R&D group and you do everything there. But with an external data cloud, the data needs to get its own unique attention. That’s why we created the data R&D team, in parallel to the platform R&D team.
“The data R&D team is in charge of everything surrounding the data. It starts from onboarding new types of datasets: researching relevant providers, vetting some of them and choosing the best one, adding them to our catalog and maintaining them on an ongoing basis. Our customers know they can trust the data they get from us. The data team also determines the data types, the standards, entity resolution, security and privacy protocols, data updates and much more.”
Is ChatGPT a dark cloud over the external data cloud?
Over the last month, whenever anyone asks “How?”, one of the first responses is “ChatGPT!” Which parts of Zohar’s job, or Explorium’s business, could ChatGPT replace?
“I think it could make our lives easier or make our jobs easier. But it’s not going to replace us completely because you still need to do things to make sure the data is accurate. If everything on the internet could now be generated, then that includes the mistakes, which would be optimized to look like something that is real. I don’t see the machines having the required skill to resolve that because the machines are the ones generating the data they would be checking.
“Explorium will still need both the platform R&D and the data R&D teams behind everything that we continue to do. A company out there could write a nice query and get information back from ChatGPT, which is cool. But if you want to query information, put it automatically inside your CRM, and automatically make decisions that lead to, say, making a loan offer, then you need to make sure that the data is reliable, in the right format, that it’s validated, tested and responsive to the user’s demands.
“That’s what we build, and there’s a long way until we can trust AI enough to replace us.”
Perhaps someday Zohar and other chiefs of staff will manage AI and robots along with humans. For the foreseeable future, though, intelligence remains a human phenomenon. And as Zohar said, “intelligence is data.” That’s a pretty strong signal for the future of humans building and interacting with the external data cloud.