Twitter Data Analysis Using Python

Amanuel Zewdu
4 min read · Aug 12, 2022

Twitter Data — Definition

Twitter data is a snapshot of your Twitter information. The account portion, for example, is visible when you are logged in to your Twitter account and includes your username, the email addresses or phone numbers associated with the account, and your account creation details.

Data collection

The raw Twitter data was provided to us, so we did not have to get a developer account and collect the data ourselves. We received two JSON data sets.

The first is around 140 MB of raw Twitter data dump in JSON format, collected using the following keywords: [‘chinaus’, ‘chinaTaiwan’, ‘chinaTaiwancrisis’, ‘taiwan’, ‘XiJinping’, ‘USCHINA’, ‘pelosi’, ‘TaiwanStraitsCrisis’, ‘WWIII’, ‘pelosivisittotaiwan’]. The second is around 130 MB in the same format, collected using the same keywords plus country-specific geocodes, e.g. ‘-28.479,26.128,400km’ for South Africa.
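The geocode value has the latitude,longitude,radius form that the Twitter search API accepts. As a small sketch, the string can be split into its numeric parts as follows. The helper name parse_geocode and the Geocode type are hypothetical and not part of the original project.

```python
from typing import NamedTuple


class Geocode(NamedTuple):
    latitude: float
    longitude: float
    radius_km: float


def parse_geocode(value: str) -> Geocode:
    """Split a 'lat,long,<radius>km' string into numbers (hypothetical helper)."""
    lat, lon, radius = value.split(",")
    if not radius.endswith("km"):
        raise ValueError(f"expected a radius in km, got {radius!r}")
    return Geocode(float(lat), float(lon), float(radius[:-2]))


# The South Africa example quoted in the text.
south_africa = parse_geocode("-28.479,26.128,400km")
print(south_africa)
```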

EXTRACTION OF DATA FROM RAW JSON

To load the data from JSON we first need to install the required libraries. We then load the Twitter data into a pandas data frame using a set of Python functions such as find_status_count(), find_hashtags(), and find_retweeted_text(). Each function extracts one field from a tweet; we append the values for every tweet to a list and, at the end, save the extracted data as a CSV file.
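A rough sketch of this step is shown below. It assumes the dump stores one tweet per line in the Twitter API v1.1 layout (user.statuses_count, entities.hashtags, retweeted_status). The function bodies are guesses at what helpers with these names might do, not the project's actual code, and the two sample tweets and the column names are made up. To stay dependency-free it writes the CSV with the standard csv module; the same rows list could instead be passed to pandas.DataFrame.

```python
import csv
import io
import json

# Two made-up tweets in the (assumed) v1.1 layout, one JSON object per line.
RAW_DUMP = "\n".join([
    json.dumps({
        "full_text": "RT @a: Tensions rise #Taiwan",
        "lang": "en",
        "user": {"screen_name": "b", "statuses_count": 120},
        "entities": {"hashtags": [{"text": "Taiwan"}]},
        "retweeted_status": {"text": "Tensions rise #Taiwan"},
    }),
    json.dumps({
        "full_text": "No hashtags here",
        "lang": "en",
        "user": {"screen_name": "c", "statuses_count": 7},
        "entities": {"hashtags": []},
    }),
])


def find_status_count(tweet: dict) -> int:
    return tweet["user"]["statuses_count"]


def find_hashtags(tweet: dict) -> list[str]:
    return [tag["text"] for tag in tweet["entities"]["hashtags"]]


def find_retweeted_text(tweet: dict) -> str:
    # Empty string when the tweet is not a retweet.
    return tweet.get("retweeted_status", {}).get("text", "")


# Append one flat row per tweet.
rows = []
for line in RAW_DUMP.splitlines():
    tweet = json.loads(line)
    rows.append({
        "original_text": tweet["full_text"],
        "lang": tweet["lang"],
        "author": tweet["user"]["screen_name"],
        "statuses_count": find_status_count(tweet),
        "hashtags": ",".join(find_hashtags(tweet)),
        "retweeted_text": find_retweeted_text(tweet),
    })

buffer = io.StringIO()  # stands in for the output CSV file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```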

DATA PREPARATION AND CLEANING

When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and algorithms are unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset.

In this project, we removed duplicate and incomplete records, converted polarity and subjectivity into numeric values, and removed non-English tweets from the data set that we extracted in the data extraction phase. Below is a snapshot of the end result.

Picture: Cleaned data
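The cleaning steps described above can be sketched in plain Python as follows. The column names and sample rows are assumptions made for illustration, and this is not the project's actual code. With pandas, drop_duplicates(), dropna() and pandas.to_numeric() cover the same steps.

```python
# Made-up rows standing in for the extracted CSV; every value starts as a string.
rows = [
    {"original_text": "Tensions rise", "lang": "en", "polarity": "0.25", "subjectivity": "0.5"},
    {"original_text": "Tensions rise", "lang": "en", "polarity": "0.25", "subjectivity": "0.5"},  # duplicate
    {"original_text": "Bonjour", "lang": "fr", "polarity": "0.0", "subjectivity": "0.0"},  # non-English
    {"original_text": "Half a row", "lang": "en", "polarity": "", "subjectivity": "0.1"},  # incomplete
]


def clean(rows: list[dict]) -> list[dict]:
    seen = set()
    cleaned = []
    for row in rows:
        if row["lang"] != "en":
            continue  # keep English tweets only
        if row["original_text"] in seen:
            continue  # skip duplicates
        try:
            polarity = float(row["polarity"])
            subjectivity = float(row["subjectivity"])
        except ValueError:
            continue  # skip rows whose scores are missing or not numeric
        seen.add(row["original_text"])
        cleaned.append({**row, "polarity": polarity, "subjectivity": subjectivity})
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```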

DATA SCIENCE: Preprocessing, Exploration, and Modeling (Sentiment and Topic Model Analysis)

1. Data preprocessing

The data preprocessing part was done by writing a code base whose main output is a CSV file with a clean text column: special symbols are stripped so that only the core message of the tweet, which is what we are interested in, remains. Before that, the code helps us understand a little more about the data by extracting information about some key rows and columns.

Picture: Cleaned and preprocessed data
Picture: Only English-language tweets
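One way to produce such a clean text column is with a few regular expressions. The sketch below is an assumption about which symbols to strip (links, the retweet marker, mentions, punctuation) and it also lowercases the text; the project's own rules may differ, and clean_text is a hypothetical name.

```python
import re


def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)    # drop links
    text = re.sub(r"^RT\s+", "", text)           # drop the retweet marker
    text = re.sub(r"@\w+:?", " ", text)          # drop mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop remaining special symbols
    return re.sub(r"\s+", " ", text).strip().lower()


print(clean_text("RT @user: Pelosi lands in #Taiwan!! https://t.co/abc123"))
# prints: pelosi lands in taiwan
```

Note that the hashtag word itself is kept and only the # symbol is removed, so the topic of the tweet survives in the clean text.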

2. Data exploration

This step is best illustrated by the images below. Here we took a thorough look at the features of the data that we assumed to be key.

Picture: Polarity inclination is towards neutral
Picture: Top five hashtags and authors
Picture: Pie and bar chart visualizations
Picture: Wordcloud
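Counts such as the top five hashtags and authors can be computed with collections.Counter. The snippet below is a minimal sketch on made-up rows; the field names are assumptions, and hashtags are lowercased so that variants like Taiwan and taiwan are counted together.

```python
from collections import Counter

# Made-up tweets; "author" and "hashtags" are assumed field names.
tweets = [
    {"author": "a", "hashtags": ["Taiwan", "pelosi"]},
    {"author": "b", "hashtags": ["Taiwan"]},
    {"author": "a", "hashtags": ["USCHINA", "taiwan"]},
    {"author": "c", "hashtags": []},
]

hashtag_counts = Counter(tag.lower() for t in tweets for tag in t["hashtags"])
author_counts = Counter(t["author"] for t in tweets)

top_hashtags = hashtag_counts.most_common(5)
top_authors = author_counts.most_common(5)
print(top_hashtags)
print(top_authors)
```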

3. Sentiment Analysis

Sentiment analysis can be defined as a process that automates the mining of attitudes, opinions, views, and emotions from text, speech, tweets, and database sources through Natural Language Processing (NLP). It involves classifying opinions in text into categories like “positive”, “negative” or “neutral”.

Sentiment analysis is built on two things: the feature, which is an attribute of an object with respect to which the evaluation is made, and the orientation of an opinion on that feature, which represents whether the opinion is positive, negative, or neutral. The first step is to prepare model-ready data that is suitable for the analysis.

Picture: Model-ready-data
Picture: Choose the best model
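As a sketch of how model-ready data might be labeled, a numeric polarity score can be mapped to one of the three categories. Splitting at zero is an assumption, the sample texts and scores are made up, and label_sentiment is a hypothetical name.

```python
def label_sentiment(polarity: float) -> str:
    """Map a polarity score to a sentiment class (assumed thresholds at zero)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# (clean text, polarity) pairs; the scores are invented for illustration.
samples = [
    ("great visit", 0.8),
    ("this is a crisis", -0.5),
    ("pelosi lands in taiwan", 0.0),
]
model_ready = [(text, label_sentiment(score)) for text, score in samples]
print(model_ready)
```

The resulting (text, label) pairs are the kind of input a text classifier can be trained and compared on.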

4. Topic modelling

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. The resulting topics can then be visualized, e.g. as a word cloud.

Picture: Topic visualization

Dashboard visualization with Streamlit

A data visualization dashboard allows digital marketers or researchers to track multiple data sources and visualize the data, ensuring a solid data set for decision-makers.

Picture: Dashboard showing word cloud on tweets from Ethiopia

Written by Amanuel Zewdu

Junior data engineer who builds scalable data pipelines using ETL tools; Airflow, Kafka and Dbt with data modeling dexterity; Python and SQL
