Twitter Data Analysis Using Python

Twitter Data — Definition
Twitter data provides a snapshot of your Twitter information. For example, under Account, if you are logged in to your Twitter account, you will see details such as your username, the email addresses or phone numbers associated with your account, and your account creation information.
Data collection
Raw Twitter data was provided to us, so we did not need to create a developer account and collect the data ourselves. Two types of JSON data were supplied.
The first is around 140 MB of raw Twitter data dumped in JSON format. This data was collected using the following keywords: [‘chinaus’, ‘chinaTaiwan’, ‘chinaTaiwancrisis’, ‘taiwan’, ‘XiJinping’, ‘USCHINA’, ‘pelosi’, ‘TaiwanStraitsCrisis’, ‘WWIII’, ‘pelosivisittotaiwan’]. The second is around 130 MB in the same format, but collected using the original keywords plus country-specific geocodes, e.g. ‘-28.479,26.128,400km’ for South Africa.
EXTRACTION OF DATA FROM RAW JSON
To load the data from JSON format, we first install the required libraries. We then load the Twitter data into a pandas DataFrame using several Python helper functions such as find_status_count(), find_hashtags(), and find_retweeted_text(). Using these functions, we append every tweet to a list and, at the end, export the extracted data as a CSV file.
DATA PREPARATION AND CLEANING
When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and algorithms are unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset.
In this project, we removed duplicate and incomplete records, converted polarity and subjectivity into numeric values, and dropped non-English tweets from the data set we extracted in the data extraction phase. Below is a snapshot of the end result.
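The cleaning steps listed above can be sketched in pandas as below. The column names (text, lang, polarity, subjectivity) are assumptions carried over from the extraction phase, not the project's exact schema.

```python
import pandas as pd

def clean_tweets(df):
    """Remove duplicates and incomplete rows, coerce sentiment scores
    to numeric, and keep only English tweets."""
    df = df.drop_duplicates(subset="text")   # drop duplicate tweets
    df = df.dropna(subset=["text"])          # drop incomplete rows
    df["polarity"] = pd.to_numeric(df["polarity"], errors="coerce")
    df["subjectivity"] = pd.to_numeric(df["subjectivity"], errors="coerce")
    df = df[df["lang"] == "en"]              # keep English tweets only
    return df.reset_index(drop=True)
```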

DATA SCIENCE: Preprocessing, Exploration, and Modeling (Sentiment and Topic Analysis)
1. Data preprocessing
The data preprocessing step was performed by writing a code base that produces a CSV file containing a clean-text column: special symbols are stripped away so that only the core message of each tweet, which is what we are interested in, remains. First, however, the code helps us understand the data a little better by extracting information about some key rows and columns.
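A minimal sketch of the clean-text step: stripping URLs, mentions, hashtags, and special symbols so only the core message remains. The exact rules are an assumption, not the project's precise code.

```python
import re

def clean_text(text):
    """Reduce a raw tweet to its core message."""
    text = re.sub(r"http\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[@#]\w+", "", text)           # remove mentions/hashtags
    text = re.sub(r"[^A-Za-z\s]", "", text)       # remove special symbols
    return re.sub(r"\s+", " ", text).strip().lower()

# clean_text("RT @user: China-US #taiwan http://t.co/x")  →  "rt chinaus"
```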


2. Data exploration
This step is best illustrated with the images presented below. Here we took a thorough look at the data’s assumed key features.
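A typical exploration pass over the extracted data might look like the sketch below: overall summaries plus the most frequent hashtags. Column names are assumptions carried over from the extraction phase.

```python
import pandas as pd

def top_hashtags(df, n=10):
    """Most frequent hashtags across all tweets (hashtags stored as lists)."""
    return df["hashtags"].explode().dropna().value_counts().head(n)

def explore(df):
    """Print quick summaries of the data's key features."""
    df.info()
    print(df.describe(include="all"))
    print(top_hashtags(df))
```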




3. Sentiment Analysis
Sentiment analysis can be defined as a process that automates the mining of attitudes, opinions, views, and emotions from text, speech, tweets, and database sources through Natural Language Processing (NLP). It involves classifying opinions in text into categories like “positive”, “negative”, or “neutral”.
Sentiment analysis is built on two things: the feature, an attribute of an object with respect to which an evaluation is made, and the orientation of the opinion on that feature, which indicates whether the opinion is positive, negative, or neutral. The first step is to prepare model-ready data suitable for the analysis.
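The classification rule itself is simple: a polarity score in [-1, 1] (such as the one produced by a library like TextBlob, here an assumption about the tooling) is mapped to the three categories.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1, 1] to positive/negative/neutral."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# e.g. with TextBlob (hypothetical usage):
# label_sentiment(TextBlob(tweet_text).sentiment.polarity)
```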


4. Topic modelling
Topic modeling is an unsupervised machine learning technique that is capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. The resulting word groups can be visualized, for example, as a word cloud.

Dashboard visualization with Streamlit
A data visualization dashboard allows digital marketers or researchers to track multiple data sources and visualize the data, giving decision-makers a solid, at-a-glance view of the results.
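A dashboard along these lines can be sketched with Streamlit as below (run with `streamlit run app.py`, with `main()` called at module level in the real app). The file name and column names are assumptions carried over from the earlier steps.

```python
import pandas as pd

def sentiment_counts(df):
    """Tweet counts per sentiment class, used to drive the bar chart."""
    return df["sentiment"].value_counts()

def main():
    import streamlit as st  # imported lazily so the helper stays testable
    st.title("Twitter Sentiment Dashboard")
    df = pd.read_csv("cleaned_tweets.csv")
    st.bar_chart(sentiment_counts(df))
    st.dataframe(df.head(20))
```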
