
Working with data is key to any data science, data analytics, or machine learning project, and finding the right data for a specific problem is not easy. Despite the huge amount of data collected every day on the internet and other devices, data usually arrives in a messy state, so processing it is a crucial step before it can be used.
Web scraping helps you extract that information directly from the internet. In this article we will do web scraping with Python and Scrapy and apply it to a Contact Extractor, a bot that crawls a set of websites and collects emails and other contact information.
There are many free Python packages for web scraping, parsing, and crawling, for instance Selenium, Mechanize, lxml, Scrapy, Requests, and Beautiful Soup. It is worth having a look at a few of them, as that will help you decide which one best fits your project.
We will parse HTML within the Scrapy environment, a multi-purpose scraping and web crawling framework (Beautiful Soup is a popular alternative for the parsing step). The code below extracts emails from websites, but you can adapt it to extract other contact information, such as phone numbers.
With this article, you will be able to search Google for websites matching certain tags (search queries), parse each result to find emails, and finally store them in a data frame.
For example, if you want to collect 500 emails related to financial institutions, you can run the search with different tags and have the emails stored in a CSV file on your local computer. This is helpful for building a mailing list so you can send emails to many recipients at once.
Step 1: Extracting websites from Google using googlesearch
To extract URL links for a tag (a search query), you will use the search method from the googlesearch library. The method takes a query, the number of websites to look for, and a language, and returns the links found by the Google search.
First, you will need to import the modules that you will be using:
import logging
import os
import re
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
logging.getLogger('scrapy').propagate = False
The line logging.getLogger('scrapy').propagate = False is used to avoid getting too many logs and warnings while using the Scrapy module inside a Jupyter Notebook.
The next function builds and returns the list of URL strings:
def get_urls(tag, n, language):
    urls = [url for url in search(tag, stop=n, lang=language)][:n]
    return urls

get_urls('financial institutions', 5, 'en')
Great. This URL list, google_urls, will serve as the input for the Spider, which will read the source code of every web page and look for emails.
Step 2: Writing the regex expression for email extraction
Regular expressions (regex) are a standard tool for text handling. Mastering regex makes it easy to find patterns in strings and to extract or replace parts of the text based on character sequences. For example, to extract emails you can use the regex findall method as shown below:
mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
The regex expression \w+@\w+\.{1}\w+ matches any piece of the string that starts with one or more word characters, followed by the @ sign, followed by one or more word characters, exactly one dot, and finally one or more word characters.
For more tutorials on regex, check the Python documentation and the introduction by Sentdex.
Keep in mind that the regex expression above can be improved to avoid unwanted emails or errors during extraction.
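As a quick sanity check, here is a minimal sketch of how the pattern behaves (the sample text and addresses are made up purely for illustration):

import re

# Made-up sample text, used only to illustrate the pattern
sample_text = 'Contact us at info@example.com or sales@example.org for details.'
print(re.findall(r'\w+@\w+\.{1}\w+', sample_text))
# ['info@example.com', 'sales@example.org']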
Step 3: Website Scraping using Scrapy Spider
A simple version of a Spider consists of a name, a list of URLs to start the requests from, and one or more methods to parse the response. A complete Spider looks like the one below:
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        for word in self.reject:
            if word in str(response.url):
                return
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
The Spider above takes the list of URLs as input and reads their source code one by one. Besides looking for emails in each URL, it also follows the links found inside that page, because the majority of emails and other contact information live on a contact page rather than on the main page of a website.
Therefore, the first parse method uses a link extraction object (LxmlLinkExtractor) to collect the new URLs found inside a page's source. Those URLs are then passed to the parse_link method, where the regex findall call does its job of looking for emails.
The line below is what sends the links from one parse method to the other. This is accomplished by the callback argument, which defines which method the requested URL must be sent to.
yield scrapy.Request(url=link, callback=self.parse_link)
The parse_link method begins with a for loop over the reject variable, a list of words to be avoided in web addresses. For example, if you are searching for the tag 'financial institutions in Spain' and do not want to come across social pages such as facebook or twitter, you can pass those words to the Spider as bad words, as in the sketch below.
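For illustration, the inputs for the crawl might be defined like this (the query, file name, and reject words here are placeholder assumptions, not values from the original run):

# Placeholder inputs for the crawl, chosen only for illustration
google_urls = get_urls('financial institutions in Spain', 10, 'en')  # URLs collected in Step 1
path = 'contacts.csv'                                                # CSV file the Spider appends to
reject = ['facebook', 'twitter']                                     # words to avoid in web addresses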
process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
process.start()
The google_urls list is passed as an argument when you run the Spider, path defines where the CSV file will be saved, and the reject list works as described earlier.
The process above runs the Spider with Scrapy inside the Jupyter Notebook. One advantage of the Scrapy framework is that you can schedule several Spiders in the same process and run them at once, which speeds up the work.
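For instance, here is a minimal sketch of scheduling two Spiders in one process (SecondSpider is a hypothetical spider class used only for illustration):

# Sketch: crawls scheduled before start() run concurrently in the same process
process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
process.crawl(SecondSpider, start_urls=google_urls)  # hypothetical second spider
process.start()  # both Spiders run together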
Step 4: Saving the emails to a CSV file
Scrapy has its own mechanisms to store and export extracted data, but for now we will use the pandas to_csv method. For every website you scrape, the Spider builds a data frame with two columns, email and link, and appends it to the CSV file created beforehand.
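The append pattern itself is simple; here is a minimal sketch, where df stands for any data frame with those two columns and the file name is just an example:

# First write: create the file and keep the column names as a header
df.to_csv('contacts.csv', mode='w', header=True)

# Later writes: append new rows without repeating the header
df.to_csv('contacts.csv', mode='a', header=False)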
def ask_user(question):
    response = input(question + 'y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()
The code above defines two basic helper functions: create_file creates a new CSV file, and if the file already exists, ask_user asks whether you would like to overwrite it.
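For example, using the file name that appears later in this tutorial:

# Creates financial.csv, or asks before overwriting it if it already exists
create_file('financial.csv')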
Step 5: Compiling everything together
The final step entails building the main function where everything works together. The function first writes an empty data frame to a new CSV file, then collects the google_urls with the get_urls function, starts the crawling process by running the Spider, and finally cleans the collected emails by dropping duplicates.
def get_info(tag, n, language, path, reject=[]):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = get_urls(tag, n, language)

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df
You can now test the result using the code below:
bad_words = ['twitter', 'facebook']
df = get_info('financial institutions available', 300, 'pt', 'financial.csv', reject=bad_words)
The call above replaces the old CSV file, so a fresh financial.csv is created and the scraped data is stored in it. The call below then displays the first rows of the returned data frame with the scraped contact information.
df.head()
That is it! As simple as that. You will receive a list of emails; some may look odd, while others may be valuable data. The next step would be to find a way to filter the useless entries from the important ones, for example using machine learning.
Feel free to leave any comments, concerns, or ideas, and share the knowledge with your peers. If you have any question or comment, do not hesitate to ask.
Quote: The moon looks upon many night flowers; the night flowers see but one moon. – Jean Ingelow