TwitterDataProcessor

The TwitterDataProcessor class processes Twitter-specific data and the corresponding data dictionaries (i.e., user or tweet data dictionaries). The TwitterAPI class uses it as a component through composition.

This class can also be used to process previously collected data. It requires no authentication for the Twitter platform and, thus, can be used in isolation.

Compared with the package's other classes, this class has a single, separate concern: processing Twitter-related data.

Initialization

To use this class for data processing or within other package components, follow the steps below.

Import the TwitterDataProcessor class from the process module and create an instance:

from pysna.process import TwitterDataProcessor

data_processor = TwitterDataProcessor()

Then invoke a function:

tweet = "Savage Love 🎶 #SavageLoveRemix"
data_processor.clean_tweet(tweet)

Methods

extract_followers

Extract IDs, names, and screen names from a user's followers. This function takes in a Tweepy user object from the v1 API version and returns a dictionary containing the extracted information.

Function:

TwitterDataProcessor.extract_followers(user_object: tweepy.User)

This function will return a dictionary of the form

{"followers_ids": [],
"followers_names": [],
"followers_screen_names": []}

NOTE: This function needs a recently fetched Twitter user object from the API v1, because it calls the object's follower lookup methods. User objects that were stored (e.g., with the pickle module) and reloaded for later analysis will lead to an error.

Source Code
def extract_followers(self, user_object: tweepy.User) -> Dict[str, str | int]:
    """Extract IDs, names, and screen names from a user's followers.

    Args:
        user_object (tweepy.User): Tweepy User Object.

    Returns:
        Dict[str, str | int]: Dictionary containing IDs, names, and screen names.
    """
    info = {"followers_ids": list(), "followers_names": list(), "followers_screen_names": list()}
    # extract follower IDs
    info["followers_ids"] = user_object.follower_ids()
    # extract names and screen names
    for follower in user_object.followers():
        info["followers_names"].append(follower.name)
        info["followers_screen_names"].append(follower.screen_name)
    return info
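The shape of the returned dictionary can be tried out without Twitter credentials. The following is a minimal sketch, assuming a hypothetical stand-in object that offers only the two methods the function calls (follower_ids() and followers()). The function is a standalone copy of the method logic above, and the follower data is made up.

```python
from types import SimpleNamespace


def extract_followers(user_object) -> dict:
    """Standalone copy of the method logic shown above (no class needed)."""
    info = {"followers_ids": [], "followers_names": [], "followers_screen_names": []}
    # extract follower IDs
    info["followers_ids"] = user_object.follower_ids()
    # extract names and screen names
    for follower in user_object.followers():
        info["followers_names"].append(follower.name)
        info["followers_screen_names"].append(follower.screen_name)
    return info


# Hypothetical stand-in for a live tweepy.User object (assumption: only the
# two methods used above are needed). All follower data is invented.
_followers = [
    SimpleNamespace(id=1, name="Alice", screen_name="alice"),
    SimpleNamespace(id=2, name="Bob", screen_name="bob"),
]
fake_user = SimpleNamespace(
    follower_ids=lambda: [f.id for f in _followers],
    followers=lambda: list(_followers),
)

result = extract_followers(fake_user)
print(result)
```

With a real user object, both calls go through the Twitter API, which is why the object has to be freshly fetched.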

extract_followees

Extract IDs, names, and screen names from a user's followees (i.e., the accounts the user follows). This function takes in a Tweepy user object from the v1 API version and returns a dictionary containing the extracted information.

Function:

TwitterDataProcessor.extract_followees(user_object: tweepy.User)

This function will return a dictionary of the form

{"followees_ids": [],
"followees_names": [],
"followees_screen_names": []}

NOTE: This function needs a recently fetched Twitter user object from the API v1, because it calls the object's friends() method. User objects that were stored (e.g., with the pickle module) and reloaded for later analysis will lead to an error.

Source Code
def extract_followees(self, user_object: tweepy.User) -> Dict[str, str | int]:
    """Extract IDs, names, and screen names from a user's followees.

    Args:
        user_object (tweepy.User): Tweepy User Object.

    Returns:
        Dict[str, str | int]: Dictionary containing IDs, names, and screen names.
    """
    info = {"followees_ids": list(), "followees_names": list(), "followees_screen_names": list()}
    # extract IDs, names and screen names
    for followee in user_object.friends():
        info["followees_ids"].append(followee.id)
        info["followees_names"].append(followee.name)
        info["followees_screen_names"].append(followee.screen_name)
    return info

clean_tweet

Utility function that cleans tweet text by removing links, mentions, and special characters using a simple regular expression. It takes in the raw text of a tweet and returns the cleaned text.

Function:

TwitterDataProcessor.clean_tweet(tweet: str)

This function is applied before sentiment detection: detect_tweet_sentiment cleans the tweet first and then determines its sentiment. Both functions are used in combination within the TwitterAPI class.

Source Code
def clean_tweet(self, tweet: str) -> str:
    """Utility function to clean tweet text by removing links, special characters using simple regex statements.

    Args:
        tweet (str): Raw text of the Tweet.

    Returns:
        str: Cleaned Tweet
    """
    return " ".join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
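To see what the regular expression removes, the one-liner above can be run on its own. This is a small sketch using only the standard library. The function is a standalone copy of the method, and the names PATTERN, cleaned_emoji, cleaned_link, and cleaned_underscore are introduced here for illustration.

```python
import re

# Same pattern as in the source code above. It matches mentions, any
# character that is not an ASCII letter, digit, space, or tab, and links
# of the form scheme://...
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"


def clean_tweet(tweet: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(re.sub(PATTERN, " ", tweet).split())


# Emoji and the hash sign are dropped; the hashtag text itself is kept.
cleaned_emoji = clean_tweet("Savage Love 🎶 #SavageLoveRemix")
# Links, mentions, and punctuation are dropped.
cleaned_link = clean_tweet("Check https://t.co/abc @user!")
# The mention pattern stops at the underscore, so "name" survives.
cleaned_underscore = clean_tweet("@user_name hi")

print(cleaned_emoji)       # Savage Love SavageLoveRemix
print(cleaned_link)        # Check
print(cleaned_underscore)  # name hi
```

The last case shows a limitation of the pattern: screen names containing an underscore are only partly removed.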

detect_tweet_sentiment

Utility function to classify the sentiment of a passed tweet using the VADER sentiment analyzer. English Tweets only. The function takes in the text of a tweet, cleans it of special characters, links, emojis, etc., and returns the tweet sentiment together with the polarity scores.

Function:

TwitterDataProcessor.detect_tweet_sentiment(tweet: str)

For sentiment detection, the VADER sentiment analyzer is used because it turned out to be more accurate for tweets than the NLTK sentiment analyzers.

The function will return a dictionary containing the label of the sentiment (i.e., positive, neutral, or negative) and the polarity scores:

{"label": label,
"polarity_scores": polarity_score}

Source Code
def detect_tweet_sentiment(self, tweet: str) -> dict:
    """Utility function to classify sentiment of passed tweet using the VADER sentiment analyzer. English Tweets only.

    Args:
        tweet (str): The raw text of the Tweet.

    Returns:
        dict: The sentiment label of the Tweet (either positive, neutral, or negative) and the polarity scores.
    """
    # create VADER instance
    analyser = SentimentIntensityAnalyzer()
    # get polarity scores from cleaned tweet
    polarity_scores = analyser.polarity_scores(self.clean_tweet(tweet))
    # define label
    if polarity_scores["compound"] >= 0.05:
        label = "positive"
    elif polarity_scores["compound"] <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    # return label and polarity scores
    return {"label": label, "polarity_scores": polarity_scores}
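The labeling step depends only on the compound score, which VADER normalizes to the range from -1 to 1. The sketch below isolates that threshold logic so it can be checked without installing the analyzer. The helper name label_from_compound is hypothetical and not part of the package.

```python
def label_from_compound(compound: float) -> str:
    """Map a compound score to a sentiment label (hypothetical helper)."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


# Scores at exactly +/-0.05 already count as positive/negative.
labels = [label_from_compound(score) for score in (0.6, 0.05, 0.0, -0.05, -0.4)]
print(labels)  # ['positive', 'positive', 'neutral', 'negative', 'negative']
```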

calc_similarity

This function is used to calculate the similarity between multiple user or tweet objects. The function takes in either a list of user objects or a list of public tweet metrics as well as a features list. Either user objects or tweet metrics need to be provided, not both.

The user objects must be recently fetched from the Twitter API v1. A stored object (e.g., one saved with the pickle Python module) will not have the properties this function needs, and an error will be raised.

The similarity is calculated based on a feature vector containing numeric values. Thus, the features on which the similarity is computed must be chosen from the numeric attributes of the user or tweet objects.

The similarity of two feature vectors is measured by the distance between them: the second-order vector norm (L2 norm) of their difference, which is equivalent to the Euclidean distance. The numpy.linalg.norm function is used for this computation. The smaller the distance, the more similar the two vectors are.
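Written out for two feature vectors x and y with n numeric features each, the distance is the following. The worked numbers are made up for illustration (e.g., two accounts with 100 and 103 followers and 50 and 54 followees).

```latex
d(x, y) = \lVert x - y \rVert_2 = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

% Worked example with x = (100, 50) and y = (103, 54):
d(x, y) = \sqrt{(100 - 103)^2 + (50 - 54)^2} = \sqrt{9 + 16} = 5
```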

The function determines the distance between each distinct pair of user or tweet objects. For instance, when three user objects for the Twitter accounts 12355, 734231, and 9083468 are provided, the following output will be generated:

{(12355, 734231): 4567.098,
(12355, 9083468): 5980.076,
(734231, 9083468): 8763.32}

The output dictionary uses the distinct pairs of objects, given as tuples, as keys. The distance for each pair is given as the dictionary value. The output is sorted in ascending order, so the first entry holds the minimal distance and, thus, the most similar pair.
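The pairing and sorting described above can be sketched with the standard library alone. This is not the package's implementation: math.dist stands in for numpy.linalg.norm, and the user records and the feature names followers_count and friends_count are made-up assumptions.

```python
import math
from itertools import combinations

# Invented user records; the feature names are assumptions for illustration.
users = [
    {"id": 1, "followers_count": 100, "friends_count": 50},
    {"id": 2, "followers_count": 103, "friends_count": 54},
    {"id": 3, "followers_count": 400, "friends_count": 450},
]
features = ["followers_count", "friends_count"]

distances = {}
# combinations() yields every distinct pair exactly once
for a, b in combinations(users, 2):
    vec_a = [a[f] for f in features]
    vec_b = [b[f] for f in features]
    # math.dist computes the Euclidean distance between two points
    distances[(a["id"], b["id"])] = math.dist(vec_a, vec_b)

# sort pairs by distance, smallest (most similar) first
result = dict(sorted(distances.items(), key=lambda item: item[1]))
print(result)
```

In this toy data, users 1 and 2 form the closest pair, so the key (1, 2) comes first in the result.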

Function:

TwitterDataProcessor.calc_similarity(user_objs: List[dict] | None = None, tweet_metrics: List[Dict[int, dict]] | None = None, *, features: List[str])

Args:

  • user_objs (List[dict] | None, optional): List of serialized Twitter user objects from Twitter Search API v1. Defaults to None.
  • tweet_metrics (List[Dict[int, dict]] | None, optional): List of public Tweet metrics as dictionaries with Tweet IDs as keys. Defaults to None.
  • features (List[str]): Features that should be contained in the feature vector. Features have to be numeric and must belong to the respective object (i.e., user or tweet.)

The features that can be passed in the features list are listed in the detailed attribute descriptions of the compare_tweets and compare_users functions. Note that the source code below reads tweet_metrics through its keys() and values() methods, so it expects a single dictionary that maps Tweet IDs to metric dictionaries.

The design of this function allows Twitter users or tweets to be compared based on the available metrics. The implementation was inspired by the characteristics of social bots on Twitter, which often have a similar number of followers or followees and whose tweets often receive a similar number of likes. Thus, the calculated distances might help to identify bot-like behavior of Twitter accounts as well as deviations from normal Twitter accounts. If the distance between two accounts is small, they are likely to behave similarly on Twitter (i.e., they may be candidates for a closer bot analysis).

Source Code
def calc_similarity(self, user_objs: List[dict] | None = None, tweet_metrics: List[Dict[int, dict]] | None = None, *, features: List[str]) -> dict:
    """Calculates the euclidean distance of users/tweets based on a feature vector. Either user objects or Tweet metrics must be specified, not both.

    Args:
        user_objs (List[dict] | None, optional): List of serialized Twitter user objects from Twitter Search API v1. Defaults to None.
        tweet_metrics (List[Dict[int, dict]] | None, optional): List of public Tweet metrics as dictionaries with Tweet IDs as keys. Defaults to None.
        features (List[str]): Features that should be contained in the feature vector. Features have to be numeric and must belong to the respective object (i.e., user or tweet).

    Raises:
        ValueError: If both 'user_objs' and 'tweet_metrics' or neither of them were provided.
        AssertionError: If a non-numeric feature was provided in the 'features' list.

    Returns:
        dict: Unique pair of users/tweets containing the respective euclidean distance. Sorted in ascending order.
    """
    # init empty dict to store distances
    distances = dict()
    # if users and tweets were provided
    if user_objs and tweet_metrics:
        raise ValueError("Either 'user_objs' or 'tweet_metrics' must be specified, not both.")
    # if only user_objs were provided
    elif user_objs:
        # iterate over every unique pair
        for i in range(len(user_objs)):
            for j in range(i + 1, len(user_objs)):
                # get user objects for each pair
                user_i = user_objs[i]
                user_j = user_objs[j]
                # build feature vector
                vec_i = np.array([user_i[feature] for feature in features])
                vec_j = np.array([user_j[feature] for feature in features])
                # feature vectors have to contain numeric values
                assert all(isinstance(feat, Number) for feat in vec_i), "only numeric features are allowed"
                assert all(isinstance(feat, Number) for feat in vec_j), "only numeric features are allowed"
                # calc euclidean distance
                distances[(user_i["id"], user_j["id"])] = np.linalg.norm(vec_i - vec_j, ord=2)
    elif tweet_metrics:
        # iterate over every unique pair
        for i in range(len(tweet_metrics)):
            for j in range(i + 1, len(tweet_metrics)):
                # get Tweet objects for each pair
                tweet_i = list(tweet_metrics.values())[i]
                tweet_j = list(tweet_metrics.values())[j]
                # build feature vector
                vec_i = np.array([tweet_i[feature] for feature in features])
                vec_j = np.array([tweet_j[feature] for feature in features])
                # feature vectors have to contain numeric values
                assert all(isinstance(feat, Number) for feat in vec_i), "only numeric features are allowed"
                assert all(isinstance(feat, Number) for feat in vec_j), "only numeric features are allowed"
                # calc euclidean distance
                distances[(list(tweet_metrics.keys())[i], list(tweet_metrics.keys())[j])] = np.linalg.norm(vec_i - vec_j, ord=2)
    # if none was provided
    else:
        raise ValueError("Either 'user_objs' or 'tweet_metrics' must be provided.")
    # sort dict in ascending order
    sorted_values = dict(sorted(distances.items(), key=operator.itemgetter(1)))
    return sorted_values