Utility Functions
Some utility functions were implemented to support exporting and importing collected data in different formats. These functions make comparisons over time possible, as they can be combined with specific arguments of the main functions of the TwitterAPI class.
Besides the methods bound to the TwitterAPI, TwitterDataFetcher, and TwitterDataProcessor classes, standalone utility functions were developed as part of the utils module of the package. Since they are not tied to any class or platform, they are not limited to Twitter data and can also be used for data from other social media platforms. Functions were designed for exporting to CSV and JSON files as well as for appending new observations to existing files.
All functions can be imported by running:
from pysna.utils import *
or by naming the desired functions in the import statement.
In the following, functions for internal usage as well as user functions are presented.
Internal Utility Functions
The following functions are used internally at different places in the code. They are not intended to be used directly by users. Often, they are designed to be helper functions for contributing developers.
strf_datetime
Converts datetime objects to their string representation. The default format is %Y-%m-%d %H:%M:%S, which returns a datetime string like 2023-03-10 09:16:12.
Function:
strf_datetime(date: datetime, format: str = "%Y-%m-%d %H:%M:%S")
Besides the datetime object, this function takes in a date format string. Any format other than the default one can be passed in using the format argument.
This function is used internally to convert a Unix timestamp to a readable format, for instance for the return_timestamp argument of the four main functions of the TwitterAPI class.
Source Code
from datetime import datetime


def strf_datetime(date: datetime, format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Convert datetime object to string representation.
    Args:
        date (datetime): Input datetime object
        format (str, optional): Datetime string format. Defaults to "%Y-%m-%d %H:%M:%S".
    Returns:
        str: String representation of input datetime in the given format.
    """
    return date.strftime(format)
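A minimal usage sketch (the timestamp value is illustrative, and the exact output depends on the local timezone):

from datetime import datetime

# convert an illustrative Unix timestamp to a datetime object first
ts = datetime.fromtimestamp(1678437372)

strf_datetime(ts)                     # e.g., '2023-03-10 09:16:12'
strf_datetime(ts, format="%d.%m.%Y")  # e.g., '10.03.2023'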
_tuple_to_string (private)
This function serializes tuple-keys of dictionaries to a string representation. A tuple-key will obtain a leading __tuple__ string and be decomposed into its list representation. The function is private, as it is not intended for external use by the package user.
The reason this function was implemented is that an export to the JSON format is not possible with tuples as keys. Some of the four main functions of the TwitterAPI class generate tuples as dictionary keys (e.g., when the relationship between two Twitter users or tweets is investigated). The JSON format does not support tuples for serialization and thus needs a list representation of Python tuples. This function iterates recursively through the entire dictionary provided to it and exchanges tuple-keys for their string representation. Thus, even in nested dictionaries, all tuple-keys are found and converted.
Function:
_tuple_to_string(obj: Any)
For instance, a tuple-key like ("WWU_Muenster", "goetheuni") will be encoded to __tuple__['WWU_Muenster', 'goetheuni']. The JSONEncoder class from the json Python module can then serialize this key as a string.
This function is used within the export_to_json
function to serialize tuples inside the data dictionary.
To avoid manipulating the object passed in, a deep copy of the object is made at the beginning, before conversion.
Source Code
import copy
from typing import Any


def _tuple_to_string(obj: Any) -> Any:
    """Serialize tuple-keys to string representation. A tuple will obtain a leading '__tuple__' string and be decomposed into list representation.
    Args:
        obj (Any): Typically a dict, tuple, list, int, or string.
    Returns:
        Any: Input object with serialized tuples.
    Example:
        A tuple ("WWU_Muenster", "goetheuni") will be encoded to "__tuple__['WWU_Muenster', 'goetheuni']".
    """
    # deep copy object to avoid manipulation during iteration
    obj_copy = copy.deepcopy(obj)
    # if the object is a dictionary
    if isinstance(obj, dict):
        # iterate over every key
        for key in obj:
            # set for later to avoid modification in later iterations when this var does not get overwritten
            serialized_key = None
            # if key is a tuple
            if isinstance(key, tuple):
                # stringify the key
                serialized_key = f"__tuple__{list(key)}"
                # replace old key with encoded key
                obj_copy[serialized_key] = obj_copy.pop(key)
            # if the key was modified
            if serialized_key is not None:
                # do it again for the next nested dictionary
                obj_copy[serialized_key] = _tuple_to_string(obj[key])
            # else, just do it for the next dictionary
            else:
                obj_copy[key] = _tuple_to_string(obj[key])
    return obj_copy
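A short usage sketch with made-up data, illustrating how a tuple-key is encoded while nested values stay untouched:

# made-up comparison result with a tuple-key
data = {("WWU_Muenster", "goetheuni"): {"common_followers": 42}}

_tuple_to_string(data)
# {"__tuple__['WWU_Muenster', 'goetheuni']": {'common_followers': 42}}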
_string_to_tuple (private)
This function converts serialized tuples back to their original representation. Serialized tuples need to have a leading __tuple__ string. The function is private, as no external use by the package user is intended.
It does the opposite of the _tuple_to_string function. Since tuple-keys were decomposed into a string representation by the export_to_json function, these tuples need to be recovered when the data is imported again. Therefore, this function is used internally within the load_from_json function. It iterates recursively through the entire JSON data being loaded and decodes any serialized key with a leading __tuple__ string back into the corresponding Python tuple representation.
Function:
_string_to_tuple(obj: Any)
To avoid manipulating the object passed in, a deep copy of the object is made at the beginning, before conversion.
Source Code
import copy
from typing import Any


def _string_to_tuple(obj: Any) -> Any:
    """Convert serialized tuples back to original representation. Tuples need to have a leading "__tuple__" string.
    Args:
        obj (Any): Typically a dict, tuple, list, int, or string.
    Returns:
        Any: Input object with recovered tuples.
    Example:
        An encoded tuple "__tuple__['WWU_Muenster', 'goetheuni']" will be decoded to ("WWU_Muenster", "goetheuni").
    """
    # deep copy object to avoid manipulation during iteration
    obj_copy = copy.deepcopy(obj)
    # if the object is a dictionary
    if isinstance(obj, dict):
        # iterate over every key
        for key in obj:
            # set for later to avoid modification in later iterations when this var does not get overwritten
            serialized_key = None
            # if key is a serialized tuple starting with the "__tuple__" prefix
            if isinstance(key, str) and key.startswith("__tuple__"):
                # decode it to a tuple
                serialized_key = tuple(key.split("__tuple__")[1].strip("[]").replace("'", "").split(", "))
                # if every entry is a number in string representation
                if all(entry.isdigit() for entry in serialized_key):
                    # convert to integers, e.g., to recover IDs
                    serialized_key = tuple(map(int, serialized_key))
                # replace old key with decoded key
                obj_copy[serialized_key] = obj_copy.pop(key)
            # if the key was modified
            if serialized_key is not None:
                # do it again for the next nested dictionary
                obj_copy[serialized_key] = _string_to_tuple(obj[key])
            # else, just do it for the next dictionary
            else:
                obj_copy[key] = _string_to_tuple(obj[key])
    # if a list was found, recover tuples in its items as well
    elif isinstance(obj, list):
        obj_copy = [_string_to_tuple(item) for item in obj]
    return obj_copy
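A short sketch with made-up keys, showing that serialized keys are decoded back to tuples and that purely numeric entries are recovered as integers:

_string_to_tuple({"__tuple__['WWU_Muenster', 'goetheuni']": "friends"})
# {('WWU_Muenster', 'goetheuni'): 'friends'}

# numeric tuple-keys (e.g., user IDs) are converted back to integers
_string_to_tuple({"__tuple__[123, 456]": "ids"})
# {(123, 456): 'ids'}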
User Utility Functions
These functions are designed for external usage by the package user. They allow export to JSON or CSV formats as well as appending new observations to existing files. For the JSON format specifically, a function was designed to load and recover a saved Python dictionary from a JSON file.
All user utility functions can be imported by running
from pysna import *
as they are part of the import-all shortcut.
export_to_json
Export dictionary data to JSON file. Tuple-keys are encoded to strings.
Function:
export_to_json(data: dict, export_path: str, encoding: str = "utf-8", ensure_ascii: bool = False, *args)
Args:
- data (dict): Data dictionary that should be exported.
- export_path (str): Export path including file name and extension.
- encoding (str, optional): Encoding of JSON file. Defaults to "utf-8".
- ensure_ascii (bool): Whether to convert characters to ASCII. Defaults to False.
Other encodings can be specified by overwriting the default of the encoding argument. The ensure_ascii argument is forwarded to the json.dump function from the json Python module. Additional positional arguments can be passed to json.dump via the *args argument of this function.
If a tuple-key is present in the input data dictionary, an error is raised during serialization, since JSON does not support tuple encoding. Therefore, the TypeError or json.JSONDecodeError is caught and the data dictionary is preprocessed by the internal _tuple_to_string function: all tuple-keys inside the data dictionary are converted to a string representation, and the export is retried with the serialized tuples.
Any exported dictionary will be exported to a JSON file of the form:
{
"data": [
...
]
}
Thus, the dictionary is stored inside the list under the data key. This allows appending new entries to the same file (for more information, see the append_to_json function).
Reference: https://docs.python.org/3/library/json.html
Source Code
import json


def export_to_json(data: dict, export_path: str, encoding: str = "utf-8", ensure_ascii: bool = False, *args):
    """Export dictionary data to JSON file. Tuple-keys are encoded to strings.
    Args:
        data (dict): Data dictionary
        export_path (str): Export path including file name and extension.
        encoding (str, optional): Encoding of JSON file. Defaults to "utf-8".
        ensure_ascii (bool): Whether to convert characters to ASCII. Defaults to False.
    """
    try:
        with open(export_path, "w", encoding=encoding) as jsonfile:
            # add 'data' key in order to append additional dicts to the same file, if it does not exist yet
            if "data" not in data:
                serialized_data = {"data": [data]}
            else:
                serialized_data = data
            # dump to json
            json.dump(serialized_data, jsonfile, indent=4, ensure_ascii=ensure_ascii, *args)
    except IOError as e:
        raise e
    # usually raised when a tuple cannot be serialized
    except (TypeError, json.JSONDecodeError):
        # serialize tuples
        data = _tuple_to_string(data)
        # retry
        export_to_json(data=data, export_path=export_path, encoding=encoding, ensure_ascii=ensure_ascii)
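A minimal usage sketch (file name and data are made up); the tuple-key triggers the caught TypeError and is serialized automatically on the retry:

# made-up result with a tuple-key, e.g., from a user comparison
result = {("WWU_Muenster", "goetheuni"): {"relationship": "following"}}

# first dump attempt fails on the tuple-key, then the export is retried
# with the tuple-key encoded by _tuple_to_string
export_to_json(result, "comparison.json")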
append_to_json
Append a dictionary to an existing JSON file. Tuple-keys are encoded to strings.
Function:
append_to_json(input_dict: Dict[str, Any], filepath: str, encoding: str = "utf-8", **kwargs)
Args:
- input_dict (Dict[str, Any]): Dictionary containing new data that should be added to the file.
- filepath (str): Absolute or relative filepath including the file extension, depending on the current working directory.
- encoding (str, optional): The encoding of the file. Defaults to "utf-8".
The function takes in a data dictionary containing the data that should be added to an existing file. Tuple-keys are encoded to strings using the _tuple_to_string function: if a tuple-key inside the dictionary is detected during serialization, the corresponding TypeError and/or json.JSONDecodeError is caught, _tuple_to_string is invoked, and the export is retried. The filepath of the existing JSON file must be provided including the file extension. Encodings other than UTF-8 can also be specified. Keyword arguments can be passed to the underlying json.load and json.dump functions via the **kwargs argument.
The provided input data dictionary will be appended to the list under the data key of the JSON file. Hence, the existing file must be of the form:
{
"data": [
...
]
}
Source Code
import json
from typing import Any, Dict


def append_to_json(input_dict: Dict[str, Any], filepath: str, encoding: str = "utf-8", **kwargs):
    """Append a dictionary to an existing JSON file. Tuple-keys are encoded to strings.
    Args:
        input_dict (Dict[str, Any]): Dictionary containing new data that should be added to file.
        filepath (str): Absolute or relative filepath including the file extension. Depending on the current working directory.
        encoding (str, optional): The encoding of the file. Defaults to "utf-8".
    NOTE: Existing JSON file needs a 'data' key.
    Raises:
        KeyError: If the existing file does not contain a 'data' key.
    """
    # load file from path
    with open(filepath, "r", encoding=encoding) as input_file:
        f = json.load(input_file, **kwargs)
    # existing file should have a "data"-key and a list to append to
    if "data" not in f.keys():
        raise KeyError("The file to be extended must contain the key 'data'.")
    else:
        try:
            # serialize tuples if any exist
            input_dict = _tuple_to_string(input_dict)
            # append new dict to file
            f["data"].append(input_dict)
            with open(filepath, "w", encoding=encoding) as jsonfile:
                json.dump(f, jsonfile, indent=4, **kwargs)
        except IOError as e:
            raise e
        # usually raised when a tuple cannot be serialized
        except (TypeError, json.JSONDecodeError):
            # serialize tuples
            input_dict = _tuple_to_string(input_dict)
            # retry
            append_to_json(input_dict=input_dict, filepath=filepath, encoding=encoding, **kwargs)
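A minimal usage sketch (file name and observations are made up), illustrating how repeated runs build a time series in one file:

# first run: create the file with the 'data' key
export_to_json({"run": 1, "followers": 100}, "timeline.json")

# later runs: append new observations to the same file
append_to_json({"run": 2, "followers": 104}, "timeline.json")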
load_from_json
Load Python dictionary from JSON file. Tuples are recovered.
Function:
load_from_json(filepath: str, encoding: str = "utf-8", **kwargs)
Args:
- filepath (str): Path to JSON file.
- encoding (str, optional): Encoding of file. Defaults to "utf-8".
The function recovers a JSON-serialized dictionary containing tuple-keys. For this, the internal _string_to_tuple function is used. The user gets back the complete Python dictionary as it was before its export to JSON.
Source Code
import json


def load_from_json(filepath: str, encoding: str = "utf-8", **kwargs) -> dict:
    """Load Python dictionary from JSON file. Tuples are recovered.
    Args:
        filepath (str): Path to JSON file.
        encoding (str, optional): Encoding of file. Defaults to "utf-8".
    Returns:
        dict: Python dictionary containing (deserialized) data from JSON file.
    """
    # read from filepath
    with open(filepath, "r", encoding=encoding) as jsonfile:
        f = json.load(jsonfile, **kwargs)
    if "data" in f:
        entries = [_string_to_tuple(entry) for entry in f["data"]]
        f = {"data": entries}
    else:
        # try to deserialize if any tuples were found in the file
        f = _string_to_tuple(f)
    return f
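A short round-trip sketch (file name and data are made up), showing that tuple-keys survive export and import:

export_to_json({("user_a", "user_b"): "friends"}, "relations.json")

load_from_json("relations.json")
# {'data': [{('user_a', 'user_b'): 'friends'}]}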
export_to_csv
Besides the JSON export, a CSV export option is provided by this function. Dictionary data is exported to CSV files using the Pandas package.
Function:
export_to_csv(data: dict, export_path: str, encoding: str = "utf-8", sep: str = ",", **kwargs)
Args:
- data (dict): Data dictionary (nested dictionaries are not allowed).
- export_path (str): Export path including file name and extension.
- encoding (str, optional): Encoding of CSV file. Defaults to "utf-8".
- sep (str, optional): Value separator for CSV file. Defaults to ",".
- kwargs: Keyword arguments for pd.DataFrame.to_csv. See the reference below for further details.
The function will raise a ValueError
if a nested dictionary was provided.
This function was designed to allow exporting a simple one-level dictionary to a more readable format than JSON. However, it is highly recommended to use the JSON export function instead.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
Source Code
import pandas as pd


def export_to_csv(data: dict, export_path: str, encoding: str = "utf-8", sep: str = ",", **kwargs):
    """Export dictionary data to CSV file.
    Args:
        data (dict): Data dictionary (nested dictionaries are not allowed)
        export_path (str): Export path including file name and extension.
        encoding (str, optional): Encoding of CSV file. Defaults to 'utf-8'.
        sep (str, optional): Value separator for CSV file. Defaults to ",".
        kwargs: Keyword arguments for pd.DataFrame.to_csv. See references below for further details.
    Raises:
        ValueError: If nested dictionary was provided.
        IOError: If export fails due to bad input.
    References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
    """
    # catch nested dict
    if any(isinstance(data[key], dict) for key in data.keys()):
        raise ValueError("'data' dictionary must not contain nested dictionaries. Use JSON export instead.")
    try:
        # convert dict to pandas DataFrame
        f = pd.DataFrame(data, index=[0])
        # export data frame
        f.to_csv(export_path, encoding=encoding, sep=sep, index=False, **kwargs)
    except IOError as e:
        raise e
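A minimal usage sketch with a made-up one-level dictionary:

flat = {"screen_name": "WWU_Muenster", "followers": 12000, "verified": True}

export_to_csv(flat, "users.csv")
# users.csv:
# screen_name,followers,verified
# WWU_Muenster,12000,True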
append_to_csv
Append a dictionary to an existing CSV file.
Function:
append_to_csv(data: dict, filepath: str, encoding: str = "utf-8", sep: str = ",")
Args:
- data (dict): Dictionary containing new data that should be added to the file.
- filepath (str): Absolute or relative filepath including the file extension, depending on the current working directory.
- encoding (str, optional): Encoding of CSV file. Defaults to "utf-8".
- sep (str, optional): Value separator for CSV file. Defaults to ",".
This function was designed to allow appending a simple one-level dictionary to an existing CSV file. However, it is highly recommended to use the JSON export function instead.
References:
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Source Code
import pandas as pd


def append_to_csv(data: dict, filepath: str, encoding: str = "utf-8", sep: str = ","):
    """Append a dictionary to an existing CSV file.
    Args:
        data (dict): Dictionary containing new data that should be added to file.
        filepath (str): Absolute or relative filepath including the file extension. Depending on the current working directory.
        encoding (str, optional): Encoding of CSV file. Defaults to 'utf-8'.
        sep (str, optional): Value separator for CSV file. Defaults to ",".
    Raises:
        ValueError: If nested dictionary was provided.
        IOError: If export fails due to bad input.
    References:
        - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
        - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
    """
    # catch nested dict
    if any(isinstance(data[key], dict) for key in data.keys()):
        raise ValueError("'data' dictionary must not contain nested dictionaries. Use JSON export instead.")
    try:
        # read existing file
        f = pd.read_csv(filepath, sep=sep, encoding=encoding)
        # convert data dict to DataFrame
        input_df = pd.DataFrame(data, index=[0])
        # concatenate both DataFrames
        f = pd.concat([f, input_df], axis=0)
        # export to CSV
        f.to_csv(filepath, sep=sep, encoding=encoding, index=False)
    except IOError as e:
        raise e
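A minimal usage sketch (file name and values are made up); the appended dictionary should share the columns of the existing file:

export_to_csv({"user": "WWU_Muenster", "followers": 12000}, "timeline.csv")
append_to_csv({"user": "WWU_Muenster", "followers": 12050}, "timeline.csv")
# timeline.csv now contains two rows sharing the same columns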
A function for CSV import was not designed, as this functionality is already provided by the csv Python package and the Pandas package. A comparable function therefore seemed redundant.
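For completeness, a minimal sketch of importing an exported CSV file with Pandas (file name assumed from the example above):

import pandas as pd

df = pd.read_csv("timeline.csv")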