
Monday, 27 November 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping of your data. There is, of course, room for ungoverned data in a quarantine area, but if you want to make use of the structured and especially the semi-structured and unstructured data, you’d better govern the influx of data before your data lake becomes a swamp that produces no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising, etc. In short: harmonise the data with the target model, which can be one or more unrelated tables or a set of data marts, to produce meaningful analytical data.
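
To make the profiling step a bit more tangible, here is a minimal sketch in Python. The tables, column names and the outlier threshold are assumptions made up for this illustration; a real profiling job would run against your own source tables and target model.

```python
from statistics import median

# Hypothetical sample rows; table and column names are illustrative only.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 120.0},
    {"order_id": 11, "customer_id": 2, "amount": 95.0},
    {"order_id": 12, "customer_id": 9, "amount": 110.0},   # unknown customer
    {"order_id": 13, "customer_id": 3, "amount": 105.0},
    {"order_id": 14, "customer_id": 1, "amount": 5000.0},  # suspicious amount
]


def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose foreign key has no match in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


def outliers(rows, column, threshold=5.0):
    """Flag rows far from the median, measured in units of the
    median absolute deviation (MAD). The threshold is an assumption."""
    values = [row[column] for row in rows]
    med = median(values)
    mad = median(abs(value - med) for value in values)
    if mad == 0:
        return []
    return [row for row in rows if abs(row[column] - med) / mad > threshold]


print(orphaned_rows(orders, customers, "customer_id"))
print(outliers(orders, "amount"))
```

The first check reports order 12, which points to a customer that does not exist; the second reports order 14, whose amount sits far away from the rest.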

Semi-structured data demand a data pipeline that can combine the structured aspects of clickstream or log file analysis with the less structured parts like search terms. The pipeline also takes care of matching IP addresses with geolocation data, which must be kept current since ISPs sometimes sell blocks of IP ranges to colleagues abroad.
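
A small sketch of such a pipeline step, under a few assumptions: the log line follows the Common Log Format, the search terms travel in a query parameter called "q", and the geolocation table is a hypothetical two-row lookup using documentation IP ranges. None of these details come from a specific product.

```python
import ipaddress
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical geolocation table: network block -> country code.
# In practice this table needs regular refreshing, since blocks change hands.
GEO_BLOCKS = {
    ipaddress.ip_network("192.0.2.0/24"): "BE",
    ipaddress.ip_network("198.51.100.0/24"): "NL",
}

# Structured part: one line in the Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def country_for(ip_text):
    """Look up the country of an IP address in the geolocation table."""
    ip = ipaddress.ip_address(ip_text)
    for network, country in GEO_BLOCKS.items():
        if ip in network:
            return country
    return None


def parse_line(line):
    """Turn one log line into a record with search terms and country."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # candidate for the quarantine area
    record = match.groupdict()
    query = parse_qs(urlsplit(record["url"]).query)
    # Less structured part: the search terms ("q" is an assumed parameter name).
    record["search_terms"] = query.get("q", [""])[0].lower().split()
    record["country"] = country_for(record["ip"])
    record["status"] = int(record["status"])
    return record


line = ('192.0.2.17 - - [27/Nov/2023:10:15:32 +0100] '
        '"GET /search?q=Data+Governance HTTP/1.1" 200 5120')
record = parse_line(line)
print(record["search_terms"], record["country"])
```

Lines that do not match the expected pattern return None here, so they can be routed to quarantine instead of polluting the governed zone.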

Unstructured data like text files from social media, e-mails, blog posts, documents and the like need more complex treatment. It’s all about finding structure in these data. Preparing these data for text mining means a lot of disambiguation steps to get from text input to meaning output:

Tokenization of the input is the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
Normalisation of the input: separating prefixes and/or suffixes from the root morpheme to arrive at the base form, e.g. unnatural -> nature
Lemmatisation: reducing certain word forms to their lemma, e.g. the infinitive of a conjugated verb
Part-of-speech tagging: tagging words with their grammatical function: verb, adjective, …
Parsing: analysing words as a function of their position and type
Checking for modality and negations: “could”, “should”, “must”, “maybe”, etc. express modality
Word sense disambiguation: “very” can be both a positive and a negative term in combination with whatever follows
Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
Co-reference resolution: determining when two or more expressions in a sentence refer to the same object, e.g. “Bert bought the book from Alice but she warned him he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.
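
The first few steps in the list above can be sketched with nothing but the Python standard library. This is a toy illustration, not a production pipeline: the suffix list, the modal words and the negation words are tiny hand-picked assumptions, and the later steps (part-of-speech tagging, semantic role labelling, co-reference resolution) require a proper NLP library and trained models.

```python
import re

MODAL_WORDS = {"could", "should", "must", "may", "might", "maybe"}
NEGATIONS = {"not", "no", "never", "n't"}
SUFFIXES = ("ing", "ed", "s")  # deliberately tiny, illustrative list


def tokenize(text):
    """Split text into lower-case word and number tokens; "n't" is split off."""
    text = text.lower().replace("n't", " n't")
    return re.findall(r"n't|[a-z]+|\d+", text)


def stem(token):
    """Strip one known suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bigrams(tokens):
    """Return the 2-grams (pairs of neighbouring tokens)."""
    return list(zip(tokens, tokens[1:]))


def analyse(text):
    """Run tokenization, stemming and the modality/negation checks."""
    tokens = tokenize(text)
    return {
        "tokens": tokens,
        "stems": [stem(token) for token in tokens],
        "bigrams": bigrams(tokens),
        "modal": [token for token in tokens if token in MODAL_WORDS],
        "negated": any(token in NEGATIONS for token in tokens),
    }


result = analyse("You shouldn't ignore the warnings")
print(result["tokens"])
print(result["modal"], result["negated"])
```

Even this toy version shows why governance matters: every word list in it is a piece of reference data somebody has to own and maintain.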

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies to represent your “knowledge inventory” as a reference model for the text analytics. This takes a lot of resources and careful analysis to make sure you come up with a complete, yet practical set of tools to support named entity recognition.
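
As a rough illustration of how a thesaurus can support named entity recognition, consider the sketch below. The Concept structure (preferred label, alternative labels, broader concept, category) is an assumption loosely inspired by common thesaurus conventions, and the three entries are hypothetical sample data; this is not the metamodel of any particular tool.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """One thesaurus entry (illustrative structure)."""
    preferred_label: str
    category: str                     # e.g. "ORGANISATION", "LOCATION"
    alt_labels: list = field(default_factory=list)
    broader: Optional[str] = None     # preferred label of the parent concept


# Hypothetical sample entries.
THESAURUS = [
    Concept("Microsoft", "ORGANISATION", ["MSFT", "Microsoft Corporation"]),
    Concept("Collibra", "ORGANISATION"),
    Concept("London", "LOCATION", broader="United Kingdom"),
]


def build_index(concepts):
    """Map every label, preferred or alternative, to its concept."""
    index = {}
    for concept in concepts:
        for label in [concept.preferred_label, *concept.alt_labels]:
            index[label.lower()] = concept
    return index


def recognise(text, index):
    """Return (matched text, preferred label, category) for each hit.
    Longer labels are tried first so the most specific label wins."""
    labels = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        concept = index[match.group(0).lower()]
        hits.append((match.group(0), concept.preferred_label, concept.category))
    return hits


index = build_index(THESAURUS)
print(recognise("MSFT and Collibra met in London", index))
```

Note how the alternative label "MSFT" resolves to the preferred label "Microsoft": that mapping is exactly the kind of knowledge that has to be governed.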

The conclusion is straightforward: the less structure is predefined in your data, the more effort in data governance is needed.

An example of a thesaurus metamodel

[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl

Saturday, 11 November 2023

Start with Defining Coherent Business Concepts

Below is a diagram describing the governance process of defining and implementing business concepts in a data mesh environment. The business glossary domain is the user facing side of a data catalogue whereas the data management domain is the backend topology of the data catalogue. It describes how business concepts are implemented in databases, whether in virtual or persistent storage.

But first and foremost: it is the glue that holds any dispersed data landscape together. If you can govern the meaning of any data model, any implementation of concepts like PARTY, PARTY ROLE, PROJECT, ASSET and PRODUCT to name a few, the data can be anywhere, in any form but the usability will be guaranteed. Of course, data quality will be a local responsibility in case global concepts need specialisation to cater for local information needs.

Business perspective on defining and implementing a business concept for a data mesh

FAQ on this process model

Why does the process owner initiate the process?

The reason is simple: process owners have a transversal view on the enterprise and are aware that the organisation needs shareable concepts.

Do we still need class definitions and class diagrams in data lakehouses?

Yes, since a great deal of data is still in a structured “schema on write” form and even unstructured or “schema on read” data may benefit from a class diagram that creates order in, and comprehension of, the underlying data. Even streaming analytics use some tabular form to make the data exploitable.

What is the role of the taxonomy editor?

He or she will make sure the published concept is in sync with the overall knowledge categorisation, providing “the right path” to the concept.

Is there always need for a physical data model?

Sure, any conceptual data model can be physically implemented via a relational model, a NoSQL model in any of the flavours or a native graph database. So yes, if you want complete governance from business concept to implementation, the physical model is also in scope.
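
To illustrate the point, here is a minimal sketch that implements the concepts PARTY and PARTY ROLE once as relational tables (using SQLite) and once as a nested document, the way a document store might hold it. The column names, the role values and the sample party are assumptions for this example only.

```python
import json
import sqlite3

# Relational implementation of the concepts PARTY and PARTY ROLE.
# Column names and role values are illustrative assumptions.
DDL = """
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE party_role (
    party_id  INTEGER NOT NULL REFERENCES party(party_id),
    role_name TEXT NOT NULL,
    PRIMARY KEY (party_id, role_name)
);
"""

connection = sqlite3.connect(":memory:")
connection.executescript(DDL)
connection.execute("INSERT INTO party VALUES (1, 'Acme Retail')")
connection.executemany(
    "INSERT INTO party_role VALUES (?, ?)",
    [(1, "CUSTOMER"), (1, "SUPPLIER")],
)


def party_as_document(conn, party_id):
    """Render the same concept as a nested document."""
    name = conn.execute(
        "SELECT name FROM party WHERE party_id = ?", (party_id,)
    ).fetchone()[0]
    roles = [
        row[0]
        for row in conn.execute(
            "SELECT role_name FROM party_role "
            "WHERE party_id = ? ORDER BY role_name",
            (party_id,),
        )
    ]
    return {"party_id": party_id, "name": name, "roles": roles}


document = party_as_document(connection, 1)
print(json.dumps(document, indent=2))
```

Both forms carry the same business meaning; governing the concept means you can trace either implementation back to the same definition.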

Any questions you might have?

Drop me a line or reply in the comments.

The next blog article Best Practices in Defining a Data Warehouse Architecture will focus on the place of a data warehouse in a data mesh.

Tuesday, 31 October 2023

Defining a Data Mesh

Zhamak Dehghani coined the concept of a data mesh in 2019. The data mesh is characterised by four important aspects:

Data is organised by business domain;
Data is packaged as a product, ready for consumption;
Governance is federated;
A data mesh enables self-service data platforms.

Below is an example of a data mesh architecture. The HQ of a multinational food marketer is responsible for the global governance of customers (i.e. retailers and buying organisations), assets (but limited to the global manufacturing sites), products (i.e. the composition of global brands) and competences that are supposed to be present in all subsidiaries.

The metamodels are governed at the HQ and data for the EMEA Branch are packaged with all the necessary metadata needed for EMEA Branch consumption. These data products are imported in the EMEA Data Mesh where they will be merged with EMEA level data on products (i.e. localised and local brands), local competences, local customer knowledge and local assets like vehicles, offices…

Example of a data mesh architecture, repackaging data from the HQ Domains into an EMEA branch package

The data producer’s domain knowledge and place in the organisation enable the domain experts to set data governance policies focused on business definitions, documentation, data quality, and data access, i.e. information security and privacy. This “data packaging” enables self-service use across an organisation.
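
What such a package could look like is sketched below. The DataProduct class, its fields and the sample content are hypothetical; the sketch only illustrates that definitions, quality rules and access policy travel together with the data.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A data set packaged with the metadata a consumer needs (illustrative)."""
    name: str
    domain: str
    owner: str
    definitions: dict       # business term -> agreed definition
    quality_rules: dict     # column -> function returning True for valid values
    allowed_consumers: set  # who may read the product
    rows: list = field(default_factory=list)

    def read(self, consumer):
        """Hand out the rows only to consumers the producing domain allows."""
        if consumer not in self.allowed_consumers:
            raise PermissionError(f"{consumer} may not read {self.name}")
        return list(self.rows)

    def quality_issues(self):
        """Return (row number, column) pairs that violate a quality rule."""
        return [
            (number, column)
            for number, row in enumerate(self.rows)
            for column, is_valid in self.quality_rules.items()
            if not is_valid(row.get(column))
        ]


# Hypothetical product published by the HQ for the EMEA branch.
global_brands = DataProduct(
    name="global_brands",
    domain="Products",
    owner="HQ product management",
    definitions={"brand": "A product line marketed under one name worldwide"},
    quality_rules={"brand": lambda value: bool(value)},
    allowed_consumers={"EMEA"},
    rows=[{"brand": "Brand A"}, {"brand": ""}],
)

print(global_brands.read("EMEA"))
print(global_brands.quality_issues())
```

The consumer does not have to ask the producer what “brand” means or whether the data can be trusted; the answers are part of the package.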

This federated approach allows for more flexibility compared to central, monolithic systems. But this does not mean traditional storage systems, like data lakes or data warehouses, cannot be used in a mesh. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories, connected via a conceptual layer and preferably governed via a powerful data catalogue.

The data mesh concept is easily compared to microservices, which helps business audiences understand its use. Because this distributed architecture is particularly helpful in scaling data needs across complex organizations like multinationals, government agencies and conglomerates, it is by no means a useful solution for SMEs or even for larger companies that sell a limited range of products to a limited type of customers.

In the next blog Start with defining coherent business concepts we will illustrate a data governance process, typical for a data mesh architecture.

…

Tuesday, 24 October 2023

Why Data Governance is here to stay

More than a fairly stable Google Trend Index, the proof that data governance issues won’t go away is the fact that “Johnny-come-lately-but-always-catches-up-in-the-end” Microsoft is seriously investing in its data governance software. After leaving the playing field to innovators like Ataccama, Alation, Alex Solutions and Collibra, Microsoft is ramping up the functionality of its data catalogue product, Purview.

Google Trend Index on "Data Governance"

The reason for this is twofold: the emerging multicloud architectures as well as the advent of the data mesh architecture driving new data ecosystems for complex data landscapes.

Without firm data governance processes and software supporting these processes, the return on information would produce negative figures.

In the next blog Defining a Data Mesh I will define what a data mesh is about and in the following blog articles I will suggest a few measures needed to avoid data swamps. Stay tuned!

Monday, 27 June 2016

Why Master Data Management is Not Just a Nice-to-have…

Sometimes the ideas for a blog just land on your desk without any effort. This time, all the effort was made by one of the world’s largest fast-moving consumer goods companies with 355,000 employees worldwide.

But this is not a guarantee for smart process and data management as the next experience from yours truly will illustrate.

The Anamnesis

One rainy day, the tenth of May, I received a mail piece with a nice promotional offer: buy a coffee machine for one euro when you order your exquisite cups online. On rainy days you take more time to read junk mail and sometimes you even respond to it. So I surfed to their website and filled out the order form. After I had entered the invoice data (VAT number, invoice address, …) an interesting question popped up:

Is your delivery address different from your invoice address?

As a matter of fact it was: it was the holiday season and the office was closed for a week, but I was at a customer’s site and thought it would be a good idea to have it delivered there.

So I ticked the box and filled in the delivery address. That’s when the horror started.

Because, when I hit the order button, there was no feedback after saving, no chance to check the order and wham, there came the order confirmation by e-mail.

Oops: the delivery address and the invoice address were switched. Was this my fault or a glitch in the web form? Who cares, best practice in e-commerce is to leave the option of changing order details and even cancelling the order, right? Wrong. There was no way of changing the order; all I could do was call the free customer service number and hope to have the switch undone.

10th May, Call to Consumer Service Desk #1

IVR: “Choose 2 if this is your first order”

Me: “2”

Client service agent: “What is your member number?”

Me: “I don’t have a member number since this is my first order. It’s about order nr NAW19092…”

Client service agent: “Hmmm, we can’t use the order number to find your data. What is your postcode and house number?”

Me: “This is tricky since I want to switch delivery address with the invoice address. You know what, I’ll give you both”.

Client service agent: “Can’t find your order”

So, I am completely out of the picture: I cannot be found via the company, the address or the order number, let alone via a unique identifier like the VAT number.

Client service agent: “Please send a mail to our service e-mail address, yyy@zzz.com.”

Me: (sends e-mail) Result: no receipt confirmation, no answer from this e-mail address. Great customer experience, guys!

10th May, Call to Consumer Service Desk #2

Client service agent: “Oh Sir, you are calling the consumer line, you should dial YYY/YYYYYY for the business customers”

Me: “But that’s the only phone number on your website and the order confirmation???!!!”

10th May, 2 PM, Call to Business Customer Service #3

Client service agent: “Let me check if I can find your order”… (2’ wait time) “Yes, it’s here. How can I help you?”

Me: “I want to switch the invoice with the delivery address”

Client service agent: “OK Sir, done”

11th May: The delivery service provider sends a message that the delivery is due at the original address from the order.

No switch had been made…

Call to DPD? Too late… these guys were too efficient…

The Diagnosis, What Else?

Marketing didn’t have a clue about the order flow and launched a promotion without an end-to-end view on the process which resulted in a half-baked online order process: no reviewing of the order possible, no feedback and the wrong customer service number on the order confirmation.

Data elements describing CUSTOMER, ORDER and PRODUCT may or may not be conformed (hard to validate from the outside) but they are certainly locked in functional silos: consumers and companies.

Customer service has no direct connection to the delivery process.

The shipping company (DPD) provided the best possible service under the circumstances.
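
What was missing in this story is, at its core, a cross-reference: whatever identifier the customer can offer should lead to the same customer record, whether the call lands at the consumer desk or at the business desk. Below is a minimal sketch of such an index. The class, the identifier types and all sample values are hypothetical; they say nothing about how the company in question runs its systems.

```python
class CustomerIndex:
    """Cross-reference from any known identifier to one master customer record."""

    def __init__(self):
        self._records = {}  # master id -> customer record
        self._xref = {}     # (identifier type, value) -> master id

    def register(self, master_id, record):
        """Store the master record for a customer."""
        self._records[master_id] = record

    def link(self, master_id, id_type, value):
        """Attach one more identifier to an existing master record."""
        if master_id not in self._records:
            raise KeyError(f"unknown master id {master_id}")
        self._xref[(id_type, value)] = master_id

    def find(self, id_type, value):
        """Return the master record for an identifier, or None if unknown."""
        master_id = self._xref.get((id_type, value))
        return None if master_id is None else self._records[master_id]


# Hypothetical sample data.
index = CustomerIndex()
index.register("C-1", {"name": "Example BV", "segment": "business"})
index.link("C-1", "vat_number", "BE0123456789")
index.link("C-1", "order_number", "ORD-0001")
index.link("C-1", "postcode_house_number", ("1000", "12"))

print(index.find("order_number", "ORD-0001"))
print(index.find("member_number", "unknown"))
```

With an index like this, the agent in call #1 could have found the order by its number, by the VAT number or by either address.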

And this is only a major global player! Can you imagine how lesser gods screw up their online experience?

Yes, it can get worse!

One of my clients called me in on a project that was under way and was seriously going south.

What happened? The organisation had developed a back office application to support a public agenda of events. As a customer of this organisation you could contact the front desk who would then log some data in the back office application and wrap up the rest of the process via e-mail. Each co-worker would use his own “data standards” in Outlook so every event had to be handled by the initial co-worker if the organisation wanted to avoid mistakes. No wonder some event logging processes sometimes took quite a while when the initiator was on a holiday or on sick leave…

A few months later (keep that in mind) the organisation decided to push the front desk work to the web and guess what? Half the process flow and half the data couldn’t be supported by the back office application because the business logic applied by the front desk worker wasn’t analysed when developing the back office app.

Siloed application development can lead you to funny (but unworkable) products

So, please, all you folks out there, invest some money in an end-to-end analysis of the process and the master data. It’s a fraction of the building cost and it will save you tons of money and ill will with customers, coworkers and suppliers.

Friday, 20 May 2016

Afterthoughts on Data Governance for BI

Why Business Intelligence needs a specific approach to data governance

During my talk at the Data Governance Conference, at least one member of my audience was paying attention and asked me a pertinent question: “Why should you need a separate approach for data governance in Business Intelligence?”

My first reaction was “Oops, I’ve skipped a few stages in my introduction…” So here’s an opportunity to set things right.

Some theory, from the presentation

At the conference, I took some time to explain the matrix below.

the relevance of data for decision making

Data portfolio management as presented at the 2016 data governance conference in London

If you analyse the nature of the data present in any organisation, you can discern four major types.

Let’s take a walk through the matrix, using an ice cream producer as an example.

Strategic Data: this is critical to future strategy development; both forming and executing strategy are supported by the data. By definition almost, strategic data are not in your process data or at best are integrated data objects from process data and/or external data. A simple example: (internal) ice cream consumption per vending machine matched with (external) weather data and an (external) count of competing vending machines and other competing outlets create a market penetration index which in its turn has a predictive value for future trends.

Turnaround Data: these data are critical to future business success, but today’s operations do not support them, so new operations will be needed to execute. E.g.: new insulation methods and materials make ice cream fit for e-commerce. The company needs to assess the potential of this new channel as well as the potential cannibalizing effect of the substitute product. In case the company decides not to compete in this segment, what are the countermeasures to ward off the competition? Market research will produce the qualitative and quantitative data that need to be mapped on the existing customer base and the present outlets.

Factory Data: this is critical to existing business operations. Think of the classical reports, dashboards and scorecards. For example: sales per outlet type in value and volume, inventory turnover… all sorts of KPIs marketing, operations and finance want every week on their desk.

Support Data: these data are valuable but not critical to success. For instance reference data for vending locations, ice cream types and packaging types for logistics and any other attribute that may cause a nuisance if it’s not well managed.
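
To show why the strategic data example above cannot come from process data alone, here is a toy calculation that integrates one internal and two external sources. The formula is purely hypothetical (units sold per warm day, discounted by the number of competing outlets) and all figures are invented; the point is the integration, not the metric.

```python
# Internal process data: units sold per vending machine (invented figures).
internal_sales = {"VM-01": 420, "VM-02": 300}

# External data: warm days at each location and competing outlets nearby
# (invented figures; "warm day" is an assumed notion for this sketch).
warm_days = {"VM-01": 14, "VM-02": 20}
competing_outlets = {"VM-01": 2, "VM-02": 0}


def penetration_index(machine_id):
    """Hypothetical index: units per warm day, divided by 1 + competitors."""
    units_per_warm_day = internal_sales[machine_id] / warm_days[machine_id]
    return units_per_warm_day / (1 + competing_outlets[machine_id])


for machine_id in sorted(internal_sales):
    print(machine_id, penetration_index(machine_id))
```

Neither the weather nor the competitor count lives in the order or inventory systems, which is exactly why governing process data alone leaves the strategic quadrant uncovered.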

If you look at the process data as the object of study in data governance, they fall entirely in the last two quadrants.

They contribute to decision making in operational, tactical and strategic areas but they do not deliver the complete picture, as the examples clearly illustrate. There are a few other reasons why data governance in BI needs special attention. If you need to discuss this further, drop me a line via the Lingua Franca contact form.