Package Concepts 🧠¶

Info

Let's continue to explore on how Travis solved his problems at work 💡

Travis Umbrella Thinks Differently¶

Check What You Have¶

Let's take a look at the data that Travis is able to gather before starting to compile everything. He currently has a copy of staff data, checklist data (contains abbreviations references used firm wide) and the raw data from the stock markets.

Note: The below list shows only a portion of the data

Staff DataChecklist 1Checklist 2

LAN ID	Name	Email	Department
R2J0PNN64H944P26B	Vanessa Sanchez	scottkyle@example.com	Google
95F33XSJ5HEK4WCMX	Jordan Harris	uvasquez@example.net	Google
4F244SMY5SPZPD94K	Jennifer Jimenez	nathan14@example.net	Google

Department	Dept Code
Google	GOOG
Apple	AAPL
Microsoft	MSFT
Amazon	AMZN

cube	cube_name
791745	The Midnight Serenade
692791	Starlight Dreams
568621	City of Whispers
580051	Lost in Time

Here is an overview of the raw data.¹ he collects from the stock markets:

NASDAQNYSESGXSSETSE

Department	Dept	User ID	Name	Cube
Google	GOOG	R2J0PNN64H944P26B	NaN	969122
Google	NaN	95F33XSJ5HEK4WCMX	Jordan Harris	676021
NaN	GOOG	4F244SMY5SPZPD94K	Jennifer Jimenez	322610

Department	Dept	User ID	Name
Microsoft	NaN	UJHCDGW5XVREEHTJ6	Danielle Knox
NaN	MSFT	J9NMAJ9NXRDB8H9VV	Todd Rosales
Microsoft	MSFT	LUSH4YFM198L0T17C	NaN

Department	Dept	User ID	Name
Netflix	NFLX	987E0H186K7MVTWMC	NaN
Netflix	NaN	P6X9P4F89XJVX49B5	Diamond James
NaN	NFLX	5KVGRVCD65TK6HUEZ	Maria Green

Department	Dept	User ID	Name	Disable Flag *
Spotify	NaN	81524ER47JWRG1LJ5	Aaron Brooks	Y
Snapchat	SNAP	ZH09GRB41XCBUY5UH	NaN	Y
NaN	SNAP	UNX8J7UK4SR394FXW	Valerie Thompson	Y

Department	Dept	User ID	Name	Cube	Status
Slack	NaN	WJYSG3S324FG716B2	Mary Barry		*ENABLED
Square	SQ	5CJLUSHR0ZB7WC1N7	NaN		*ENABLED

Identify The Pattern¶

Travis is quick to find patterns in his collected raw data, he noticed that although there are empty cells about the staff name, department and department code, the staff ID is always recorded whenever transaction is performed. Of course, with the staff data on his hand, he can then learn about the stock that the staff used.

Moreover, although staff data does not tell him about the department code, but it does tell him the staff department name where he can then make reference to his checklist data for the department code. Well, he can also reference the cube code on his checklist for the name of the code.

Solution¶

He formulates a plan for his reporting workflows and this algorithm is able to process a majority of his datasets:

---
title: Algorithm1
---
    flowchart TD
    subgraph LOOP
        subgraph for_each_file
        direction TB
        A[Match stock market data IDs with staff data IDs] --> B[Conditional Filtering]
        B[Conditional Filtering] --> C[Update the staff name and department]
        C[Update the staff name and department] --> D[Match Department with Department Code]
        D[Match Department with Department Code] --> E[Match Cube with Cube Name]
        end
    end

But there are still some datasets that still not able to process by his program, so Travis decides to devise a separate algorithm for some non-routine datasets.

Some Worse Case¶

Warning

Real world data does not appear to be that clean, let's take a look at some non-routine cases

Inconsistent Data Entry¶

Travis noticed that on one of the data he collected from HKSE the name of the staff does not match with the company staff directory, for example "Muhamad Ali bin Abu" is named as "Muhamad Ali Abu"; "Lucas Den Shi Ki" is named as "Lucas Shi Ki Den". And in the dataset, there is no IDs column and only name column:

HKSEActual Name

Name
Den Shi Ki Lucas
Muhamad Ali Abu
Jef A_DriaN

Name
Lucas Shi Ki Den
Muhamad Ali bin Abu
Jef Adrian

To resolve this, he has to rely on a metric, called the Levinshtein ratio that computes the difference between two sequences. Generally if two sequences are exactly the same, the Levinshtein ratio will be 100 and when they gradually differ more in terms of arrangement and alphabets the ratio will decrease.

What he will do then is he will match all the names in the staff data against the similar names, if the Levinshtein ratio is higher than 90, then the similar names will be replaced with the names in the staff data.

---
title: Algorithm2
---
    flowchart TD
    subgraph LOOP
        subgraph for_each_staff_name
        direction TB
        A[Compute Levinshtein ratio against all other names] --> B[Update other names if ratio > 90]
        end
    end
    B[Update other names if ratio > 90] --"when_done"--> C[Back to Algorithm1 but with name matching to update IDs]

Finalizing Report¶

Time For Export¶

With all the data being cleaned and labelled, Travis can now start to prepare for his reports. Since all the data are stored as one dataset, he would need to export it as individual Excel files for each stock category:

---
title: Algorithm3
---
    flowchart TD
    subgraph LOOP
    A[Filter by department] --> B[Filtered data with one department]
    B[Filtered data with one department] --> C[Export to Excel]
    end

Now Travis gets all the files he wants, but these files are raw, he'll use VBA to add some formatting to them so that they look nice before submitting to his boss. The good news is he don't really need to do a loop to apply the macros he wrote because the macro can run on the entire specify folder:

---
title: Algorithm4
---
    flowchart TD
    A[Get folder that stores all the Excel files] --> B[Apply macros for each file]

The Output¶

Here is the output that Travis will get for each of his stock category:

Image title — Before using macros and after using macros

Info

That's the end for the interesting story with Travis. Check out the next page and see the functions that powered Travis' workflows.

The raw data shown are generated, and it might look clean to work with, and real world data is often messier and not well organized. The only clean data that Travis have were the staff and checklist data. ↩

Last update: 2023-09-07
Created: 2023-09-03