Data Cleaning Utilities

utils.data_cleaning.clean_dataframe_columns(df: DataFrame) → DataFrame

Clean and sanitize the column names of a pandas DataFrame.

This function performs the following operations:

1. Strips leading/trailing whitespace from each column name.
2. Removes any non-ASCII characters from column names.

Parameters:

df (pandas.DataFrame) – The input DataFrame whose columns need to be cleaned.

Returns:

The same DataFrame with cleaned column names.

Return type:

pandas.DataFrame
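
The two operations described for clean_dataframe_columns can be sketched as follows. This is a minimal illustration, not the module's actual source: the name clean_columns_sketch and the sample column names are hypothetical, and modifying the frame in place is an assumption based on the "same DataFrame" wording.

```python
import pandas as pd


def clean_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_dataframe_columns."""
    # Step 1: strip surrounding whitespace.
    # Step 2: drop every character that cannot be encoded as ASCII.
    df.columns = [
        str(col).strip().encode("ascii", errors="ignore").decode("ascii")
        for col in df.columns
    ]
    # Assumption: the input frame is modified in place and returned.
    return df


# Made-up column names: one with padding and an accented character.
df = pd.DataFrame({"  Nom\u00e9 ": [1], "Sector ": ["IT"]})
cleaned = clean_columns_sketch(df)
print(list(cleaned.columns))
```

Note that non-ASCII characters are dropped rather than transliterated in this sketch, so an accented letter disappears instead of becoming its unaccented form.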

utils.data_cleaning.remove_initial_stage_candidates(df: DataFrame) → DataFrame

Removes candidates who are only present in the earliest stages of the selection process and have no sector information. These are typically low-signal entries that did not progress in the recruitment pipeline.

A candidate row is removed only if all of the following hold:

- It is the only row for that candidate ID.
- The ‘Candidate State’ is one of [‘imported’, ‘first contact’, ‘in selection’].
- The ‘Sector’ is missing (NaN).

Parameters:

df (pandas.DataFrame) – The input DataFrame containing candidate records.

Returns:

Cleaned DataFrame with early-stage, low-information candidates removed.

Return type:

pandas.DataFrame
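
The three removal conditions can be expressed as boolean masks that are combined with a logical AND. The sketch below is illustrative only: remove_initial_stage_sketch is a hypothetical name, the candidate ID column is assumed to be called ‘ID’, and the state values are assumed to match exactly (no case or whitespace normalization).

```python
import numpy as np
import pandas as pd

# State values taken from the description above.
EARLY_STATES = ["imported", "first contact", "in selection"]


def remove_initial_stage_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for remove_initial_stage_candidates."""
    # Condition 1: the candidate ID appears on exactly one row.
    single_row = ~df["ID"].duplicated(keep=False)
    # Condition 2: the state is one of the early stages.
    early_state = df["Candidate State"].isin(EARLY_STATES)
    # Condition 3: no sector information.
    no_sector = df["Sector"].isna()
    # Drop a row only when all three conditions hold.
    return df[~(single_row & early_state & no_sector)]


# Made-up sample data.
df = pd.DataFrame(
    {
        "ID": [1, 2, 2, 3, 4],
        "Candidate State": [
            "imported", "imported", "hired", "first contact", "in selection",
        ],
        "Sector": [np.nan, np.nan, np.nan, "IT", np.nan],
    }
)
result = remove_initial_stage_sketch(df)
print(result["ID"].tolist())
```

In this sample, candidates 1 and 4 are dropped. Candidate 2 survives because it has more than one row, and candidate 3 survives because its sector is known.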

utils.data_cleaning.remove_not_hired_valid_candidates(df: DataFrame, state_order: List[str], event_order: List[str], feedbacks_to_remove: List[str]) → DataFrame

Removes candidates who were not hired and whose most recent event has a feedback value or event type indicating that they are no longer progressing in the recruitment process.

The function performs the following steps:

1. Strips whitespace and converts the relevant columns to lowercase.
2. Applies sorting logic to ensure candidate events are processed in the correct order.
3. Identifies candidates who have invalid feedback or event types in their last event and are not hired.
4. Removes these candidates from the DataFrame.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing candidate records with columns such as ‘Candidate State’, ‘event_type__val’, and ‘event_feedback’.
state_order (list of str) – The order in which the ‘Candidate State’ values should be sorted.
event_order (list of str) – The order in which the ‘event_type__val’ values should be sorted.
feedbacks_to_remove (list of str) – The feedback values that indicate candidates should be removed if they are not hired.

Returns:

The cleaned DataFrame with invalid candidates removed.

Return type:

pandas.DataFrame
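
The steps above might look roughly like the following. This is a sketch under several assumptions that the description does not confirm: the candidate ID column is named ‘ID’, "hired" means that at least one row for the candidate has the state ‘hired’, the values in feedbacks_to_remove are already lowercase, and only the feedback of the last event is checked (the rule for invalid event types is not specified here, so it is left out). The function name and the sample values are hypothetical.

```python
import pandas as pd


def remove_not_hired_sketch(df, state_order, event_order, feedbacks_to_remove):
    """Hypothetical stand-in for remove_not_hired_valid_candidates."""
    out = df.copy()
    # Step 1: normalize the text columns.
    for col in ["Candidate State", "event_type__val", "event_feedback"]:
        out[col] = out[col].str.strip().str.lower()

    # Step 2: order each candidate's rows by state, then by event type.
    state_rank = {value: i for i, value in enumerate(state_order)}
    event_rank = {value: i for i, value in enumerate(event_order)}
    out["_state_rank"] = out["Candidate State"].map(state_rank)
    out["_event_rank"] = out["event_type__val"].map(event_rank)
    out = out.sort_values(["ID", "_state_rank", "_event_rank"])

    # Step 3: inspect the last event of every candidate.
    last = out.groupby("ID").tail(1)
    bad_last = last["event_feedback"].isin(feedbacks_to_remove)
    # Assumption: a candidate counts as hired if any row says 'hired'.
    hired_ids = set(out.loc[out["Candidate State"] == "hired", "ID"])
    drop_ids = set(last.loc[bad_last, "ID"]) - hired_ids

    # Step 4: remove all rows of the flagged candidates.
    kept = out[~out["ID"].isin(drop_ids)]
    return kept.drop(columns=["_state_rank", "_event_rank"])


# Made-up sample data with inconsistent casing and padding.
df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 3],
        "Candidate State": [
            "Hired ", "in selection", " In Selection", "first contact",
            "in selection",
        ],
        "event_type__val": [
            "interview", "cv request", "interview", "interview", "interview",
        ],
        "event_feedback": ["Not interested", "ok", "Not Interested ", "ok", "ok"],
    }
)
result = remove_not_hired_sketch(
    df,
    state_order=["first contact", "in selection", "hired"],
    event_order=["cv request", "interview"],
    feedbacks_to_remove=["not interested"],
)
print(result["ID"].tolist())
```

Candidate 2 is removed because its last event has a listed feedback value and it was never hired. Candidate 1 has the same feedback on its last event but is kept because it reached the ‘hired’ state.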

utils.data_cleaning.sort_group(group: DataFrame, state_order: List[str], event_order: List[str]) → DataFrame

Sorts a group of candidate records by the ‘Candidate State’ and ‘event_type__val’ columns, based on provided orderings for both states and event types.

The sorting is done first by ‘Candidate State’ according to the state_order, and then by ‘event_type__val’ according to the event_order.

Parameters:

group (pandas.DataFrame) – A group of rows with the same ‘ID’, representing the different stages in the recruitment process for a single candidate.
state_order (list of str) – The predefined order for sorting the ‘Candidate State’ column.
event_order (list of str) – The predefined order for sorting the ‘event_type__val’ column.

Returns:

The same group sorted by ‘Candidate State’ and ‘event_type__val’.

Return type:

pandas.DataFrame
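
One way to sort by a predefined order rather than alphabetically is to use ordered categoricals as temporary sort keys. The sketch below shows that technique. It is not necessarily how sort_group is implemented, and the name sort_group_sketch, the helper columns and the sample values are hypothetical.

```python
import pandas as pd


def sort_group_sketch(group, state_order, event_order):
    """Hypothetical stand-in for sort_group using ordered categoricals."""
    keyed = group.assign(
        # Temporary keys whose sort order follows the supplied lists.
        # Values missing from a list become NaN and sort last.
        _state=pd.Categorical(
            group["Candidate State"], categories=state_order, ordered=True
        ),
        _event=pd.Categorical(
            group["event_type__val"], categories=event_order, ordered=True
        ),
    )
    # Sort by state first, then by event type, and drop the helper columns.
    return keyed.sort_values(["_state", "_event"]).drop(
        columns=["_state", "_event"]
    )


# Made-up rows for a single candidate.
group = pd.DataFrame(
    {
        "ID": [7, 7, 7],
        "Candidate State": ["hired", "in selection", "in selection"],
        "event_type__val": ["interview", "interview", "cv request"],
    }
)
sorted_group = sort_group_sketch(
    group,
    state_order=["in selection", "hired"],
    event_order=["cv request", "interview"],
)
print(sorted_group)
```

Here the two ‘in selection’ rows come first, ordered by event type, and the ‘hired’ row comes last.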

utils.data_cleaning.split_duplicate_ids_by_invariant_columns(df: DataFrame, invariant_columns: List[str]) → DataFrame

Split duplicate IDs in a DataFrame when invariant columns vary within the same ID group.

This function ensures that each group of rows sharing an ID also shares identical values in the specified invariant columns. If not, the ID is suffixed to distinguish subgroups.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing duplicate IDs.
invariant_columns (list of str) – Columns that should have the same values across all rows for a given ID. If differences are found, new IDs will be generated to separate subgroups.

Returns:

A DataFrame with updated IDs and no unintended merges across differing invariant data.

Return type:

pandas.DataFrame
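
The splitting behaviour can be illustrated with the sketch below. The suffix format ("_1", "_2", ...), the conversion of IDs to strings, the ‘ID’ column name, the function name and the birth_year column are all assumptions made for illustration, since the description does not specify them.

```python
import pandas as pd


def split_ids_sketch(df, invariant_columns):
    """Hypothetical stand-in for split_duplicate_ids_by_invariant_columns."""
    out = df.copy()
    # Assumption: IDs become strings so that a suffix can be appended.
    out["ID"] = out["ID"].astype(str)

    # Label every distinct (ID, invariant values) combination.
    combo = out.groupby(
        ["ID"] + invariant_columns, sort=False, dropna=False
    ).ngroup()
    # Within each ID, number the combinations 0, 1, 2, ... by first appearance.
    sub = combo.groupby(out["ID"]).transform(lambda s: pd.factorize(s)[0])
    # Count how many distinct combinations each ID has.
    n_sub = combo.groupby(out["ID"]).transform("nunique")

    # Suffix only the IDs whose invariant columns actually vary.
    suffixed = out["ID"] + "_" + (sub + 1).astype(str)
    out["ID"] = out["ID"].where(n_sub == 1, suffixed)
    return out


# Made-up sample data: ID 1 has two different birth years.
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2],
        "birth_year": [1990, 1990, 1985, 2000, 2000],
        "event": ["a", "b", "c", "d", "e"],
    }
)
result = split_ids_sketch(df, invariant_columns=["birth_year"])
print(result["ID"].tolist())
```

ID 1 is split into two subgroups because its rows disagree on birth_year, while ID 2 is left unchanged because its rows agree.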