Data modeling problem: Events in Progress in Human Resources data – Dealing with employees who change teams

I have a data modeling conundrum. It concerns records in a dimension that have start- and enddates.

I’ve heard people refer to this type of problem as the “events in progress”-problem. My specific version of this problem goes like this:

Imagine you have a sales fact table called FactSales with a date, an employee who made the sale and the quantity that was sold. It would look like this:

Date | Employee | Sales

The fact table above can be joined with a employee dimension called DimEmployee. This dimension looks like this:

Employee | Team | TeamStartDate | TeamEndDate

Using SQL, you could join FactSales with DimEmployee table using the column “Employee”.

However, sometimes employees have more than one record in DimEmployee. This would lead to unwanted doubling of sales records. You can bypass this problem in SQL by using a multi-column join like below:

SELECT * FROM FactSales

LEFT JOIN DimEmployee ON FactSales.Employee = DimEmployee.Employee

                      AND FactSales.Date BETWEEN DimEmployee.TeamStartDate AND DimEmployee.TeamEndDate

This would lead to One Big Table with the correct team in each record.

However, imagine you are not using SQL, but you are using Power BI. You want to create a report showing the amount of sales per employee, including in which team the employee was at the time of the sale. A matrix visual in Power BI could look like this:

Team | Employee | Sales

1. Using Power BI however, we run into a problem. If we were to load the tables as they are, a many-to-many relationship would be needed. I try to avoid many-to-many relationships in my semantic models because they could lead to ambiguity. Now, as I see it, there are several approaches to solve this problem:

You filter the DimEmployee table: you only take the most recent records. This solution is far from ideal because now, only the most recent team will be present for each employee; NOT the team that the employee was a part of at the time of the sale. Not a big fan of this solution, but I’ve seen it too many times in the wild to not mention it.

2. You create a composite key. First you extend DimEmployee by creating a record for each day between start- and enddate that an employee was a part of a certain team. Then you create a new column containing a concatenation of the new column and the column “Employee”. You create the same new composite key column (concatenation of columns Date and Employee) in the FactSales table. Then, you create a relationship between the two newly minted columns. This approach could work, but you table will start exploding when there are a lot dates between the start- and enddate. Especially when you use dates like 9999-12-31 as enddates for current statuses.

3. You denormalize the data. You basically do the same thing in Power Query as we did using the SQL statement above.

Now if you are using another approach than the ones listed above, I would love to hear about it.

Leave Comment

Het e-mailadres wordt niet gepubliceerd. Vereiste velden zijn gemarkeerd met *