Web Scraping in Machine Learning

Web Scraping

Web scraping is a form of data scraping used to extract data from websites.

Here we are going to scrape an HTML file and get the text of a particular tag. The file contains some laptop reviews, and we are going to extract them as plain text.

We will be using BeautifulSoup.

HTML file:

<!DOCTYPE html>
<html>
<head></head>
<body>

<div class="review">

<p>The laptop is best for students.</p>
<p>not happy with the delivery.</p>
<p>This is the best laptop.</p>
<p>Not very happy with the delivery.</p>
<p>It looks gorgeous but touch pad is not working.</p>
<p>product is not very efficient.</p>
<p>nice laptop.</p>

</div>
</body>
</html>

Python code:

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Read the HTML file
with open('index.html', 'r') as html_file:
    page = html_file.read()

# Create instance of BeautifulSoup to parse document
soup = BeautifulSoup(page, 'html.parser')

# Look for p tags
reviews = soup.find_all('p')

# Print each review
for p in reviews:
    print(p.get_text())

'''
Output :

The laptop is best for students.
not happy with the delivery.
This is the best laptop.
Not very happy with the delivery.
It looks gorgeous but touch pad is not working.
product is not very efficient.
nice laptop.
'''
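
Note that soup.find_all('p') returns every p tag in the document, so on a page with other paragraphs (banners, footers) it would pick those up too. Below is a minimal sketch of one way to restrict the search to the review block with a CSS selector. It assumes the reviews are p tags inside the div with class "review"; the inline HTML string and the name review_texts are illustrative only and are not part of the original file.

```python
from bs4 import BeautifulSoup

# Illustrative inline HTML (assumption: reviews are <p> tags inside div.review)
html = """
<html><body>
<p>Site banner text, not a review.</p>
<div class="review">
<p>This is the best laptop.</p>
<p>Not very happy with the delivery.</p>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; this matches only <p> tags inside div.review
review_texts = [p.get_text(strip=True) for p in soup.select("div.review p")]

for text in review_texts:
    print(text)
```

Passing strip=True to get_text() trims the whitespace around each review, which is convenient when the HTML is indented.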

We will use the reviews for sentiment analysis to gauge customers' reactions.

Sentiment Analysis

# Import TextBlob
from textblob import TextBlob

positive, negative = 0, 0

# Find the sentiment of each review
for p in reviews:
    text = p.get_text()
    # polarity ranges from -1.0 (negative) to 1.0 (positive)
    sentiment = TextBlob(text).sentiment.polarity
    # A polarity of exactly 0 (neutral) is counted as positive here
    if sentiment >= 0:
        positive += 1
    else:
        negative += 1

print("Positive review :", positive)
print("Negative review :", negative)

'''

Output :

Positive review : 4
Negative review : 3
'''
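
Because the check is sentiment >= 0, a review with a polarity of exactly 0 is counted as positive. Below is a sketch of a three-way tally that keeps neutral reviews separate. It assumes polarity scores in the range -1.0 to 1.0, as TextBlob returns; the function name tally_polarities and the sample scores are hypothetical and are not computed from the reviews above.

```python
def tally_polarities(polarities):
    """Count positive, neutral, and negative polarity scores."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for score in polarities:
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts


# Hypothetical scores, for illustration only
sample_scores = [0.8, 0.0, -0.4, 0.35, -0.1]
counts = tally_polarities(sample_scores)
print(counts)
```

With the scraped reviews, the input list would be built from TextBlob(p.get_text()).sentiment.polarity for each p in reviews.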

The same method can be applied to Twitter posts to gauge how people react to a particular topic.