Goal: To build a reusable Natural Language Processing (NLP) pipeline that cleans unstructured text, quantifies emotional valence, and discovers hidden thematic structures without human supervision.
Social media platforms generate massive amounts of unstructured text data. For researchers and businesses, manually reading thousands of comments to gauge public opinion is impractical at any useful scale.
In this project, I developed a computational framework to "read" and categorize disparate social media comments. The pipeline moves from preprocessing (cleaning slang, removing stopwords) to inferential modeling (determining sentiment and discovering topics). This methodology turns textual data from a purely qualitative artifact into a quantitative resource suitable for statistical analysis.
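The write-up does not name the specific models behind the inferential stage, so the sketch below shows one common way it could be wired up: NLTK's VADER analyzer for sentiment scoring and scikit-learn's LatentDirichletAllocation for unsupervised topic discovery. The function names, score thresholds, and parameters are illustrative assumptions, not the project's actual code.

    from nltk.sentiment import SentimentIntensityAnalyzer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Requires a one-time nltk.download('vader_lexicon')

    def score_sentiment(comments):
        # VADER's compound score falls in [-1, 1]; cutoffs are illustrative
        sia = SentimentIntensityAnalyzer()
        labels = []
        for comment in comments:
            compound = sia.polarity_scores(comment)['compound']
            if compound >= 0.05:
                labels.append('positive')
            elif compound <= -0.05:
                labels.append('negative')
            else:
                labels.append('neutral')
        return labels

    def discover_topics(cleaned_comments, n_topics=5, n_top_words=8):
        # Bag-of-words counts feed an unsupervised LDA model
        vectorizer = CountVectorizer(max_df=0.95, min_df=2)
        counts = vectorizer.fit_transform(cleaned_comments)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
        lda.fit(counts)
        vocab = vectorizer.get_feature_names_out()
        # Report the highest-weight words for each discovered topic
        return [[vocab[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
                for topic in lda.components_]

Because LDA is unsupervised, the topics emerge from word co-occurrence alone, which is what allows the pipeline to surface thematic structure without human labeling.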
Applying the pipeline to a sample dataset of 1,000 public comments revealed distinct patterns in user discourse.
The chart below visualizes the sentiment distribution across the corpus, highlighting the skew toward positive engagement in this specific dataset.
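The chart itself is not embedded here, but for reproducibility, here is a minimal matplotlib sketch of how such a distribution plot could be generated, assuming the label list produced by the hypothetical score_sentiment helper above.

    import matplotlib.pyplot as plt
    from collections import Counter

    def plot_sentiment_distribution(labels):
        # Tally label frequencies and render a simple bar chart
        counts = Counter(labels)
        categories = ['positive', 'neutral', 'negative']
        plt.bar(categories, [counts.get(c, 0) for c in categories])
        plt.title('Sentiment Distribution Across the Corpus')
        plt.ylabel('Number of comments')
        plt.tight_layout()
        plt.savefig('sentiment_distribution.png')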

Text data is notoriously noisy. Below is the custom function I wrote to normalize the text data for the machine learning models.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires one-time downloads via nltk.download: 'punkt', 'stopwords', 'wordnet'
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def refine_text(raw_text):
    # Convert to string and lowercase
    text = str(raw_text).lower()
    # Remove technical artifacts (URLs, usernames)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    # Tokenize, drop stopwords and short or non-alphabetic tokens,
    # then lemmatize (e.g., "running" -> "run")
    tokens = word_tokenize(text)
    cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokens
                      if word.isalpha() and word not in stop_words and len(word) > 2]
    return " ".join(cleaned_tokens)