BaseDataProcessor

The BaseDataProcessor class contains general operations for data processing. It forms the basis for the TwitterDataProcessor class. The BaseDataProcessor class is used within the TwitterAPI class for processing data that is not specific to any social media platform. Methods exist for calculating descriptive metrics for numeric and datetime values as well as calculating the intersection and difference of sets.

This class can also be used to process previously collected data.

Initialization

If you want to use this class for data processing or other package components, follow the steps below.

Import the BaseDataProcessor class from the process module:

from pysna.process import BaseDataProcessor
# init instance
data_processor = BaseDataProcessor()

Then call any of its methods:

sets = [{1, 2, 4}, {1, 3, 5}, {1, 8, 9}]
# calculate intersection of all sets
data_processor.intersection(sets)

Methods

The methods of this class use functions from NumPy and built-in Python operations.

calc_descriptive_metrics

Calculates descriptive metrics of a given data set. Returns the given data set with appended statistical metrics. Input data set must be a dictionary containing numeric values.

Function:

BaseDataProcessor.calc_descriptive_metrics(data: Dict[str | int, Number])

The following metrics are calculated:

Max Value
Min Value
Mean Value
Median
Standard Deviation
Variance (population variance, the default of np.var)
Range (Max - Min)
Interquartile Range
Mean Absolute Deviation

These metrics can help to interpret the data more accurately and save the time otherwise needed to evaluate it manually.

All metrics are calculated and stored under a new key 'metrics' in the input dictionary. The input data is enriched with the statistical metrics in place, so there is no need to add them to the dictionary after calculation.
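To illustrate the shape of the result, the following standalone sketch mimics a subset of the metrics using only Python's statistics module. The helper name describe and the sample values are hypothetical; this is not the package's implementation, which relies on NumPy.

```python
import statistics
from numbers import Number
from typing import Dict, Union


def describe(data: Dict[Union[str, int], Number]) -> dict:
    """Hypothetical helper: attach a few descriptive metrics under 'metrics'."""
    # reject the input if any value is non-numeric
    if not all(isinstance(value, Number) for value in data.values()):
        raise ValueError("Only numeric values are allowed.")
    numerics = list(data.values())
    mean = statistics.mean(numerics)
    # store the metrics in the input dictionary itself
    data["metrics"] = {
        "max": max(numerics),
        "min": min(numerics),
        "mean": mean,
        "median": statistics.median(numerics),
        "range": max(numerics) - min(numerics),
        # mean absolute deviation: average distance from the mean
        "mad": statistics.mean(abs(value - mean) for value in numerics),
    }
    return data


# hypothetical follower counts keyed by account name
followers = {"account_a": 2, "account_b": 4, "account_c": 4, "account_d": 10}
result = describe(followers)
print(result["metrics"])
```

Because the dictionary is modified in place, followers itself carries the 'metrics' key after the call.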

Source Code

from numbers import Number
from typing import Dict

import numpy as np

def calc_descriptive_metrics(self, data: Dict[str | int, Number]) -> dict:
    """Calculates descriptive metrics of a given data set.

    Args:
        data (Dict[str  |  int, Number]): Data dictionary containing numeric values.

    Raises:
        ValueError: If non-numeric values are contained in the data dictionary.

    Returns:
        dict: Input data dictionary containing descriptive metrics.

    Metrics:
        - Max Value
        - Min Value
        - Mean Value
        - Median
        - Standard Deviation
        - Variance (population variance, the default of np.var)
        - Range (Max - Min)
        - Interquartile Range
        - Mean Absolute Deviation
    """
    if not all(isinstance(value, Number) for value in data.values()):
        raise ValueError("Only numeric values are allowed.")
    # extract numeric values by iterating over data dict with iterable items
    numerics = list(data.values())
    # init empty dict to store descriptive metrics
    metrics = dict()
    # calc max
    metrics["max"] = max(numerics)
    # calc min
    metrics["min"] = min(numerics)
    # calc mean
    metrics["mean"] = np.array(numerics).mean()
    # calc median
    metrics["median"] = np.median(numerics)
    # calc standard deviation
    metrics["std"] = np.std(numerics)
    # calc variance
    metrics["var"] = np.var(numerics)
    # calc range
    metrics["range"] = max(numerics) - min(numerics)
    # calc interquartile range
    metrics["IQR"] = np.subtract(*np.percentile(numerics, [75, 25]))
    # calc mean absolute deviation
    metrics["mad"] = np.mean(np.absolute(np.array(numerics) - np.mean(numerics)))
    # add metrics
    data["metrics"] = metrics
    return data

calc_datetime_metrics

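Calculates descriptive metrics on datetime objects. The function takes a dictionary with datetime values and returns it with the metrics appended. The values must be timezone-aware: the mean and median are converted back to timezone-aware (UTC) datetimes before the deviations are computed, and Python raises a TypeError when naive and aware datetimes are subtracted from each other.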

Function:

BaseDataProcessor.calc_datetime_metrics(dates: Dict[str, datetime])

The following metrics are calculated:

Mean
Median
Max
Min
Time Span (in days, seconds, and microseconds)
Deviation from mean (in days and seconds). Negative values indicate below average, positive ones above average.
Deviation from median (in days and seconds). Negative values indicate below median, positive ones above median.

The metrics help to analyze the creation dates of social media accounts or posts. They were chosen based on the typical behavior of social bots, which are often created within a short period, and can help to assess whether an investigated account is likely to be a social bot.

All metrics are calculated and stored under a new key 'metrics' in the returned dictionary, so the data is enriched with statistical metrics without a separate step after calculation. The datetime values themselves are converted to ISO 8601 strings in the output.
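The mean and deviation calculation can be sketched with the standard library alone. The account names and dates below are hypothetical, and the snippet only mirrors the approach of averaging timestamps; it does not call the package.

```python
from datetime import datetime, timedelta, timezone

# hypothetical account creation dates; timezone-aware so they can be
# subtracted from the timezone-aware mean below
dates = {
    "account_a": datetime(2023, 1, 1, tzinfo=timezone.utc),
    "account_b": datetime(2023, 1, 2, tzinfo=timezone.utc),
    "account_c": datetime(2023, 1, 6, tzinfo=timezone.utc),
}

# average the POSIX timestamps, then convert back to a datetime
timestamps = [dt.timestamp() for dt in dates.values()]
mean_datetime = datetime.fromtimestamp(sum(timestamps) / len(timestamps), tz=timezone.utc)

# deviation of each date from the mean, split into days and seconds
deviation_from_mean = {
    key: {"days": (dt - mean_datetime).days, "seconds": (dt - mean_datetime).seconds}
    for key, dt in dates.items()
}
print(mean_datetime.isoformat())
print(deviation_from_mean)

# negative timedeltas are normalized so that only 'days' is negative
half_day_early = timedelta(hours=-12)
print(half_day_early.days, half_day_early.seconds)  # -1 43200
```

When reading the deviations, keep this normalization in mind: days=-1 together with seconds=43200 means 12 hours before the reference point, not a full day.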

Source Code

from datetime import datetime, timezone
from typing import Dict

import numpy as np

def calc_datetime_metrics(self, dates: Dict[str, datetime]) -> dict:
    """Calculates descriptive metrics on datetime objects.

    Args:
        dates (Dict[str, datetime]): Dictionary containing identifiers as keys and datetime objects as values.

    Returns:
        dict: Input dates with added datetime metrics.

    Metrics:
        - Mean
        - Median
        - Max
        - Min
        - Time Span (in days, seconds, and microseconds)
        - Deviation from mean (in days and seconds). Negative values indicate below average, positive ones above average.
        - Deviation from median (in days and seconds). Negative values indicate below median, positive ones above median.
    """
    # use the datetime's timestamp to make them comparable
    timestamps = [dt.timestamp() for dt in dates.values()]
    # calc mean of creation dates
    total_time = sum(timestamps)
    mean_timestamp = total_time / len(timestamps)
    # convert mean timestamp back to datetime object with timezone information
    mean_datetime = datetime.fromtimestamp(mean_timestamp, tz=timezone.utc)

    # calculate time differences to mean datetime of every creation date
    time_diffs_mean = {key: {"days": (dt - mean_datetime).days, "seconds": (dt - mean_datetime).seconds} for key, dt in dates.items()}

    # find the median of the timestamps
    median_timestamp = np.median(timestamps)
    # Convert median timestamp back to datetime object
    median_datetime = datetime.fromtimestamp(median_timestamp, tz=timezone.utc)

    # calculate time differences to median timestamp of every creation date
    time_diffs_median = {key: {"days": (dt - median_datetime).days, "seconds": (dt - median_datetime).seconds} for key, dt in dates.items()}

    # calc range of creation dates
    max_date, min_date = max(dates.values()), min(dates.values())
    time_span = max_date - min_date

    # convert creation dates to isoformat for readability
    dates = {key: dt.isoformat() for key, dt in dates.items()}

    # add metrics to output
    dates["metrics"] = dict()
    dates["metrics"]["deviation_from_mean"] = time_diffs_mean
    dates["metrics"]["deviation_from_median"] = time_diffs_median
    dates["metrics"]["time_span"] = {"days": time_span.days, "seconds": time_span.seconds, "microseconds": time_span.microseconds}
    dates["metrics"]["mean"] = mean_datetime.isoformat()
    dates["metrics"]["median"] = median_datetime.isoformat()
    dates["metrics"]["max"] = max_date.isoformat()
    dates["metrics"]["min"] = min_date.isoformat()
    return dates

intersection

Calculates the intersection of multiple sets. This function takes in a list of sets and returns their intersection.

Function:

BaseDataProcessor.intersection(iterable: List[set])

This function is used, for example, to get the follower IDs that multiple social media accounts have in common. Each set contains the follower IDs of one account.
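As a minimal standalone sketch of this use case, the snippet below applies the same set.intersection idiom to made-up follower IDs. The function name common_elements is a hypothetical stand-in, not part of the package.

```python
from typing import Iterable, List


def common_elements(iterable: Iterable[Iterable]) -> List:
    """Hypothetical stand-in mirroring the documented intersection logic."""
    # cast every item to a set, then intersect all of them at once
    return list(set.intersection(*map(set, iterable)))


# hypothetical follower IDs of three accounts
followers = [{1, 2, 4}, {1, 2, 5}, {1, 2, 9}]
common = common_elements(followers)
# sets are unordered, so sort the result for a stable display
print(sorted(common))  # [1, 2]
```

With this idiom an empty input list raises a TypeError, because set.intersection needs at least one set to operate on.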

Source Code

from typing import List

def intersection(self, iterable: List[set]) -> list:
    """Calculates the intersection of multiple sets.

    Args:
        iterable (List[set]): List containing sets.

    Returns:
        list: Intersection set cast to a list.
    """
    intersection = set.intersection(*map(set, iterable))
    return list(intersection)

difference

Calculates the difference of multiple sets. The function takes in a dictionary containing identifiers (e.g., account IDs) as keys and the sets as values. For each key, it returns the elements of that key's set that do not appear in any of the other sets.

Function:

BaseDataProcessor.difference(sets: Dict[int | str, set])

This function is used to calculate the difference of followers of the specified social media accounts. In this context, the account IDs are stored as dictionary keys and their follower IDs as values.
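The following standalone sketch shows the expected result on made-up follower IDs. The name unique_elements is hypothetical, and the sketch subtracts the union of all other sets in one step rather than looping over them pairwise; the result is the same.

```python
from typing import Dict, Hashable, List


def unique_elements(sets: Dict[Hashable, set]) -> Dict[Hashable, List]:
    """Hypothetical stand-in: elements of each set found in no other set."""
    differences = {}
    for key, values in sets.items():
        # union of all sets that belong to the other keys
        others = set().union(*(v for k, v in sets.items() if k != key))
        differences[key] = list(set(values) - others)
    return differences


# hypothetical follower IDs keyed by account name
followers = {"account_a": {1, 2, 4}, "account_b": {1, 3, 5}, "account_c": {1, 8, 9}}
unique = unique_elements(followers)
# sort each list for a stable display, since sets are unordered
print({key: sorted(ids) for key, ids in unique.items()})
```

Follower 1 follows all three accounts and therefore appears in none of the results.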

Source Code

from typing import Dict

def difference(self, sets: Dict[int | str, set]) -> dict:
    """Calculates the difference of multiple sets.

    Args:
        sets (Dict[int | str, set]): Dictionary containing sets where keys are identifiers.

    Returns:
        dict: Individual difference of each set that was provided.
    """
    # init empty dict to store individual differences for each set
    differences = dict()
    for key, values in sets.items():
        differences[key] = list(set(values))
        for other_key, other_values in sets.items():
            if key != other_key:
                differences[key] = list(set(differences[key]) - set(other_values))
    return differences