Abstract- With the evolution of web technology, a huge amount of data is available on the web for internet users. Users not only explore the resources present on the web, but also provide feedback, thus generating additional useful information. Sentiment analysis, also known as opinion mining, deals with automating the task of classifying a textual review expressed in natural language as either positive or negative. In general, supervised methods consist of two stages: extraction of information, followed by its classification. The proposed approach uses two major similarity measures, Jaccard and cosine, to implement a tool that extracts sentiment from a textual review and rates it accordingly. This paper discusses the shortcomings of the existing approach, along with future directions for research.
Keywords – Sentiment Analysis, Opinion mining, Jaccard, Cosine, Term Frequency (TF), Inverse Document Frequency (IDF).
Nowadays, people refer to reviews of a product, a movie, etc. before making a purchase. A product with negative reviews will be less preferred than a product with positive reviews. Reviews are expressed in natural language.
The main aim of opinion mining is to analyze people's views and use them in decision making. The World Wide Web is a huge repository of data that can be structured or unstructured [3,5]. Opinion mining, or sentiment analysis, involves building a system to explore users' opinions expressed in blog posts, comments, reviews or tweets about a product, policy or topic. It aims to determine the attitude of a user towards some topic. With the increasing popularity of opinion-based websites and other resources, new challenges have arisen in opinion mining. It is now becoming evident that the views expressed on the web can influence readers in forming their opinions on a topic. These reviews are also taken into consideration by vendors and policy makers, giving them scope to improve.
There are several challenges in the field of sentiment analysis. The most common ones are given here. Firstly, Word Sense Disambiguation (WSD), a classical NLP problem, is often encountered. For example, ‘an unpredictable plot in the movie’ is a positive phrase, while ‘an unpredictable steering wheel’ is a negative one. The opinion word unpredictable is used in different senses. Secondly, there is the problem of a sudden shift from positive to negative polarity, as in ‘The movie has a great cast, superb storyline and spectacular photography; the director has managed to make a mess of the whole thing’. Thirdly, negations, unless handled properly, can completely mislead. ‘Not only do I not approve Supernova 7200, but also hesitate to call it a phone’ contains the positive polarity word approve, but its effect is reversed by the negations.
II. RELATED WORK
Over time, many extraction systems have been developed. Most studies relating emotion and informatics are in the field of Human Computer Interaction (HCI). Experiments have shown that an emotional state has an influential effect on a person’s behaviour. As reported in the literature, there are three main levels of sentiment analysis, namely document level analysis, sentence level analysis and feature level analysis.
In document level analysis and sentence level analysis, one cannot identify the reviewer’s likes or dislikes regarding a specific feature of the object. Document level and sentence level classification are not sufficient to capture every detail of the sentiments expressed in a document, as sentiments may be expressed with respect to different features. In the feature level method, an algorithm using part-of-speech tags is used to improve accuracy on the benchmark dataset. It is a fine-grained analysis process which takes every feature of the object into consideration. Feature level methods include Support Vector Machines (SVM); however, SVM does not always provide accurate results, which motivates computing the sentiment with Jaccard and cosine similarity measures instead.
Many researchers have addressed the problem of sentiment classification on textual reviews. Several datasets are available and have been used by many researchers in order to compare results; the movie review dataset is the most popular benchmark dataset in the literature. The focus of our study is on the overall opinion (positive or negative) expressed in the review.
III. PROPOSED WORK
The proposed framework presents an approach based on similarity measures, which is intended to give better accuracy and efficiency than SVM.
Figure 1. System Architecture
The similarity measure can be one of three methods, namely Jaccard, Dice and Cosine, any of which can be used. It is based on the analysis of comments in a given review.
The proposed architecture for sentiment analysis is shown in Figure 1. The sentiment expressed in the textual review is analysed. The blocks represent the informative part. First, the comments are tokenised. Words which do not contribute to opinion extraction, such as it, the, is, was and so on, are removed. Stemming is then applied to the tokens, which returns stem words. The term frequency (TF) and inverse document frequency (IDF) are calculated, and the words are sorted on the basis of their scores.
Finally, the cosine similarity and Jaccard similarity equations are applied to the word scores to obtain an average score and produce the final review rating.
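The preprocessing steps described above (tokenisation, stop-word removal and stemming) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix list and the function names are assumptions made for illustration, and the suffix-stripping stemmer is a toy stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical stop-word list: the paper only names "it, the, is, was and so on".
STOP_WORDS = {"it", "the", "is", "was", "a", "an", "and", "of", "to"}

# Hypothetical suffix list for a toy stemmer (an assumption, not Porter's algorithm).
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that do not contribute to opinion extraction."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenise, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenise(text))]


if __name__ == "__main__":
    print(preprocess("The movie was amazing and the acting is great"))
```

The resulting stem lists would then be passed on to the TF-IDF stage.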
IV. ALGORITHMIC APPROACH:
1. Jaccard Similarity:
a. It starts with finding important keywords in documents and removing irrelevant words.
b. A TF-IDF approach is used initially.
c. Given a document collection D, a word w, and an individual document d ∈ D, we calculate
$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right)$
where f_{w,d} equals the number of times w appears in d, |D| is the size of the corpus (data collected from Twitter, blogs etc.), and f_{w,D} equals the number of documents in D in which w appears. Once we get the important terms in the documents, the similarity measure is applied.
d. First we apply the Jaccard similarity measure, a binary distinguisher defined as follows:
$JS(A, B) = \frac{|A \cap B|}{|A \cup B|}$
e. Finally we get a feature set depending on its similarity, i.e. positive and negative.
2. Cosine Similarity:
a. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
b. In Information Retrieval and text mining, each term is notionally assigned a different dimension and a document is characterized by a vector where the value of each dimension corresponds to the number of times that term appears in the document.
c. It gives a measure of how similar two documents are likely to be in terms of their subject matter. The cosine of two vectors can be derived by using the Euclidean dot product formula:
$a \cdot b = \|a\| \, \|b\| \cos\theta$
d. In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (TF-IDF weights) cannot be negative.
e. The angle between two term frequency vectors cannot be greater than 90°.
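The TF-IDF weight and the two similarity measures described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the proposed tool: documents are assumed to be lists of already preprocessed tokens, vectors are stored as sparse dictionaries, and the function names `tf_idf`, `jaccard` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Each weight is f_{w,d} * log(|D| / f_{w,D}), where f_{w,d} is the count
    of the term in the document and f_{w,D} is the number of documents
    containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return weights


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [["good", "plot"], ["bad", "plot"]]
    w = tf_idf(docs)
    print(w)
    print(jaccard(docs[0], docs[1]), cosine(w[0], w[1]))
```

Note that a term appearing in every document receives a weight of zero under this formula, so it contributes nothing to the cosine similarity.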
V. IMPLEMENTATION DETAILS:
The proposed work will include the following modules.
1) Collecting dataset.
3) Pre-processing and storing domain specific keywords (stemming).
4) Calculating TF-IDF.
5) Similarity measure.
6) Feature Extraction.
7) Classification and Analysis.
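As one possible illustration of how the similarity and classification modules could fit together, the sketch below labels a review by comparing the Jaccard similarity of its tokens against two keyword sets. Everything here is an assumption made for illustration: the seed keyword sets `POSITIVE` and `NEGATIVE`, the `classify` function, and the "neutral" label returned on ties are hypothetical and are not specified by the proposed work.

```python
# Hypothetical seed keyword sets; a real system would derive these from data.
POSITIVE = {"good", "great", "superb", "excellent"}
NEGATIVE = {"bad", "poor", "mess", "terrible"}


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(tokens):
    """Label a tokenised review by its closer keyword set.

    Returning "neutral" on a tie is an assumption of this sketch.
    """
    pos_score = jaccard(tokens, POSITIVE)
    neg_score = jaccard(tokens, NEGATIVE)
    if pos_score > neg_score:
        return "positive"
    if neg_score > pos_score:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify(["great", "cast", "superb", "storyline"]))
    print(classify(["terrible", "plot"]))
```

A set-based comparison like this ignores negation and word order, which are among the challenges discussed earlier in the paper.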
VI. PERFORMANCE MEASUREMENT:
The classification performance will be evaluated in terms of three measures, accuracy, recall and precision, as defined below. A confusion matrix is used to compute them.
$\text{Accuracy} = \frac{\text{True positive samples} + \text{True negative samples}}{\text{Total number of samples}}$
$\text{Recall} = \frac{\text{True positive samples}}{\text{True positive samples} + \text{False negative samples}}$
$\text{Precision} = \frac{\text{True positive samples}}{\text{True positive samples} + \text{False positive samples}}$
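The three measures defined above can be computed from the confusion matrix counts as in the following Python sketch. The function names and the label strings are assumptions chosen for illustration, and returning 0.0 when a denominator is zero is a convention of this sketch rather than part of the definitions.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """(TP + TN) / total number of samples."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    y_true = ["positive", "positive", "positive", "negative"]
    y_pred = ["positive", "positive", "negative", "positive"]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    print(accuracy(tp, fp, tn, fn), recall(tp, fn), precision(tp, fp))
```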
VII. FUTURE WORK AND CONCLUSION
Based on our study of the proposed work on sentiment analysis with respect to SVM and similarity measures, we conclude the following points:
1. Methodologies based on similarity measures are expected to lead to better accuracy and should be implementable with lower computational complexity than SVM.
2. The approach should remain applicable for the maximum number of features and samples.
In the near future, we plan to implement the project, and the results obtained will be used to demonstrate the benefits of the procedure.