Roberto Rocha

I know we should practice Hanlon's Razor whenever possible, but sometimes I wonder if government data is made messy on purpose to make analysis harder.

These are just SOME of the ways Hewlett-Packard appears in federal contracts data.

And it doesn't include all variations of HP.

@robroc I work in IT in large enterprises and I can assure you: consistency is HARD. I am not even surprised at what you have to deal with. That is also why I have lots of respect for what you are doing 😉👍🏻

@marcoponton

From your experience, how hard is it to create a data entry validation system so there are no duplicates added?

Every time I ask a question on Stack Overflow, it suggests a ton of similar questions that have already been asked in different words. I know it's possible, but is it complicated to apply?

@robroc I read “data entry” so I assume you mean “how hard would it be to prevent this situation at the source”. Unfortunately, since the issue is the human factor, it is harder than it should be. It is super easy to prevent duplicates with consistent data, but if someone enters “IBM” instead of the original vendor name of “International Business Machines” that was created 15 years ago in the original contract record, it’s hard for software to prevent this. (1/7)

@robroc Even a good UI does not solve this. An as-you-type combo box usually won’t trigger after a single “I” is typed, and typing “IB” is not going to show “International …”. Even “did you mean XYZ” logic (string similarity) won’t work here unless the original record is somehow “aliased” to IBM. All of these consistency guardrails are also often missing in enterprise software 🤢. (2/7)
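A minimal sketch of the aliasing point, assuming a hypothetical canonical vendor list and alias table (`CANONICAL`, `ALIASES`, and the `suggest` helper are illustrative names, not part of any real system). It uses Python's standard `difflib` to show that plain string similarity scores “IBM” very low against “International Business Machines”, so a “did you mean” check only finds it when an alias exists.

```python
from difflib import SequenceMatcher

# Hypothetical canonical vendor names and known aliases (illustrative only).
CANONICAL = ["International Business Machines", "Hewlett-Packard"]
ALIASES = {"ibm": "International Business Machines", "hp": "Hewlett-Packard"}


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest(entry, threshold=0.8):
    """Return the canonical name for an entry, or None if nothing is close."""
    key = normalize(entry)
    if key in ALIASES:  # the alias table catches abbreviations
        return ALIASES[key]
    best = max(CANONICAL, key=lambda c: similarity(entry, c))
    return best if similarity(entry, best) >= threshold else None


if __name__ == "__main__":
    print(similarity("IBM", "International Business Machines"))
    print(suggest("IBM"))
    print(suggest("Hewlett Packard"))
```

The threshold of 0.8 is an arbitrary starting point; in practice it would need tuning against real vendor data.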

@robroc Finally, the source of the information is also another human. Often the people doing the data entry receive inconsistent data themselves (e.g. one day the statement of work says “IBM”, the next “IBM Inc.”, and the next still says IBM but the address is different, so they create a new record and, that day, they typo it as IMB or use a different case like Ibm… see what I mean?). (3/7)

@robroc I worked somewhere I won’t name, where selecting a vendor for a vendor-supported configuration item (CI) in ServiceNow produced 6-7 variations of a name in a list of, say, 20 entries for the SAME vendor, with only THAT information for me to select it. Not having more info, I simply selected the first one with the spelling I preferred. I was the one doing the data entry from a list generated by others doing data entry, who got their info from other people and documents. (4/7)

@robroc It’s ugly AF but it’s the reality. I don’t see this issue getting resolved in large enterprises and governments any time soon. So, in your profession when consistency matters to ensure data is correct and meaningful for reporting, you are pretty much screwed. (5/7)

@robroc Now on your end of the problem, I am not aware of any generic tool that can deduplicate entries based on say different name spellings but I did not search for one either. Given you probably spend lots of time on these issues already, you could build a “simple” tool based on string similarity using your existing data sets and keep building on it so it keeps getting better. (6/7)
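One way such a “simple” similarity-based tool could start is sketched below, using only Python's standard library. The function names, the greedy grouping strategy, and the 0.85 threshold are all assumptions made for illustration; a real tool would need tuning against actual contract data.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and replace anything that is not a letter or digit with a space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def cluster_names(names, threshold=0.85):
    """Greedily group names whose normalized form is close to a cluster's first member."""
    clusters = []  # each item: (normalized representative, [original names])
    for name in names:
        key = normalize(name)
        for representative, members in clusters:
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((key, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    vendors = ["Hewlett-Packard", "HEWLETT PACKARD", "Hewlett Packard Co.", "IBM", "Ibm"]
    for group in cluster_names(vendors):
        print(group)
```

Comparing only against the first member of each cluster keeps the sketch short. Very short names are a weak spot: a transposition such as “IMB” scores below this threshold against “IBM” and would land in its own group.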

@robroc Alternatively, with all the new AI tools coming out, that might be the way to go instead, but I am not much involved in that field. Have you looked at ChatGPT? (Sorry, I know, I am tired of hearing about it too). I have seen people get amazing results with things I thought would not be possible. You might be able to leverage it somehow. (7/7)

@robroc I sure hope I did not completely go off track with these replies 😆

@marcoponton
No it's super interesting to get that perspective, thank you.

The best tool to harmonize these kinds of spelling variations is OpenRefine. It even lets you compare strings to a canonical list on Wikidata and OpenCorporates.
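For a rough idea of how this kind of harmonizing can work, here is a sketch of key-collision clustering, loosely modeled on the “fingerprint” keying idea that OpenRefine's clustering feature is known for. This is not OpenRefine's code: the function names are made up for illustration, and replacing punctuation with a space is an assumption chosen so that hyphenated names collide with spaced ones.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Build a key that ignores case, accents, punctuation, and word order."""
    text = unicodedata.normalize("NFKD", name.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes a space
    tokens = sorted(set(text.split()))  # unique tokens in sorted order
    return " ".join(tokens)


def cluster_by_key(names):
    """Group names that share a fingerprint; return only groups with duplicates."""
    buckets = defaultdict(list)
    for name in names:
        buckets[fingerprint(name)].append(name)
    return [group for group in buckets.values() if len(group) > 1]


if __name__ == "__main__":
    vendors = ["Hewlett-Packard", "HEWLETT PACKARD", "Packard, Hewlett", "IBM"]
    print(cluster_by_key(vendors))
```

Key collision only catches variations in case, punctuation, and word order. Abbreviations and typos still need similarity matching or a reference list.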

@robroc Nice! Will add to my arsenal. Thanks for the tip.

@robroc wow! But unfortunately not surprising. But that’s what makes data analytics hard.