There are times when you need to compile a list of email addresses from a URL. This article shows you how to do it with C#.
It’s amazing what you do for friends. I had a buddy call and ask how hard it would be to get email addresses from a list of Web sites. Of course, I wanted to make it sound easy so that I could sound like some sort of expert. But when he asked me if I would do it I realized that my bravado had painted me into a corner. I knew I would have to agree to do the project (for free, of course), and get him a list of emails that were scraped from a list of Web sites.
It’s okay, though, because now I can show you how I did it. And this fits nicely with another article I recently wrote for DevSource.com, in which I explained how to pull data from Internet URLs. In this article I’ll work through writing a program that scrapes email addresses from a Web site.
The Program Organization
The program and its flow are both fairly simple. It starts by getting a domain or URL from the user and pulls the data from that location. The page is parsed for links (anchor tags), and the links that are email addresses are moved to a separate email list. The following figure shows how the program is organized.
Using the ScrapeEmailAddresses Class
There’s a class named ScrapeEmailAddresses that does all of the work. In this section, we’ll discuss using this class.
The Set Up
There are four things we need to do to set things up. The first is to instantiate the class.
The second is to set a limit on the number of pages that will be examined. The limit exists because loading and parsing every page of a large Web site can take a long time, and you may not want the entire site examined. Most of the email addresses tend to turn up early in the process anyway. The demo program has a text box that allows users to enter the limit. If no limit is specified, the limit value defaults to 99999.
The third thing that must be done is to create a delegate with which the ScrapeEmailAddresses class will update the user interface. And finally, a Boolean value must be available which indicates whether the email scraping code will go beyond the domain or URL that’s specified. The demo program has a checkbox, and its Checked property is used. The following code shows how the demo program sets up the class for the scraping.
ScrapeEmailAddresses sea = new ScrapeEmailAddresses();
// Fall back to 99999 pages when the text box is empty
// or does not contain a valid integer.
int nLimit;
if (!int.TryParse(txtLimit.Text.Trim(), out nLimit))
{
    nLimit = 99999;
}
sea.ShowStatusDelegate = new ScrapeEmailAddresses.ShowStatusMethod(ShowStatus);
The DoIt Method
A single method call named DoIt invokes the email scraping process. It takes three arguments: a string containing the URL, a Boolean value indicating whether to go beyond the initial URL, and an integer value containing the limit value. The following code shows how the DoIt method is used in the demo program.
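A call along these lines would fit that description. This is a sketch rather than the demo program's actual code: ScraperStub is a hypothetical stand-in for ScrapeEmailAddresses, the delegate's single string parameter is an assumption, and the status messages are simply collected in a list instead of being shown in a user interface.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for ScrapeEmailAddresses, reduced to the members
// the calling code touches. The delegate signature is an assumption.
public class ScraperStub
{
    public delegate void ShowStatusMethod(string message);
    public ShowStatusMethod ShowStatusDelegate;

    public void DoIt(string strURL, bool bEntireDomain, int nLimit)
    {
        // The real method downloads and parses pages; the stub only reports.
        if (ShowStatusDelegate != null)
        {
            ShowStatusDelegate("Scraping " + strURL
                + (bEntireDomain ? " (entire domain)" : " (single page)")
                + ", limit " + nLimit);
        }
    }
}

public static class DoItUsage
{
    // Mirrors the set-up steps, then makes the single DoIt call.
    public static List<string> Run(string url, bool entireDomain, string limitText)
    {
        List<string> status = new List<string>();
        ScraperStub sea = new ScraperStub();

        int nLimit;
        if (!int.TryParse(limitText.Trim(), out nLimit))
        {
            nLimit = 99999; // default when no usable limit is entered
        }

        // Status updates are appended to the list via the delegate.
        sea.ShowStatusDelegate = new ScraperStub.ShowStatusMethod(status.Add);

        // Argument order: URL, go-beyond-the-first-page flag, page limit.
        sea.DoIt(url, entireDomain, nLimit);
        return status;
    }
}
```

In a Windows Forms program the three arguments would most likely come straight from the controls: a URL text box, the checkbox's Checked property, and the parsed limit.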
The ScrapeEmailAddresses Class
The ScrapeEmailAddresses class’s DoIt method does the work. It starts off by instantiating a LinkParser object. It then retrieves the page specified by the URL string that’s passed in. Each link that the LinkParser’s GetURLLinks method extracts from the page is added to a list that’s a member of the ScrapeEmailAddresses class. The AddLink method is used so that code specific to the process is encapsulated in a method. A simplified version (without any error trapping) of this process follows.
// Instantiate the specialized link parser
LinkParser lp = new LinkParser();
// Instantiate a WebClient object
WebClient wc = new WebClient();
// Retrieve the page data.
byte[] data = wc.DownloadData(LinkParser.PrependHTTP(strURL));
// Call the GetURLLinks method, which parses the page data.
lp.GetURLLinks(data, strURL);
// Add the found links to a local list by using the
// AddLink method.
for (int i = 0; i < lp.m_Links.Count; i++)
{
    AddLink(lp.m_Links[i], strDomain);
}
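The body of AddLink isn't shown above, so here is a rough sketch of the kind of bookkeeping such a method might do: route mailto: links to the email list, skip links outside the domain, and ignore duplicates. The LinkCollector class, the m_Emails list, and the StayInDomain flag are hypothetical names of mine, and the domain test is a deliberately crude substring check.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the class's link bookkeeping; not the article's code.
public class LinkCollector
{
    public List<string> m_Links = new List<string>();   // pages still to visit
    public List<string> m_Emails = new List<string>();  // harvested addresses
    public bool StayInDomain = true;                    // assumed option

    public void AddLink(string strLink, string strDomain)
    {
        const string MailTo = "mailto:";
        if (strLink.StartsWith(MailTo, StringComparison.OrdinalIgnoreCase))
        {
            // Strip the scheme and any "?subject=..." query string.
            string address = strLink.Substring(MailTo.Length);
            int query = address.IndexOf('?');
            if (query >= 0)
            {
                address = address.Substring(0, query);
            }
            address = address.Trim();
            if (address.Length > 0 && !ContainsIgnoreCase(m_Emails, address))
            {
                m_Emails.Add(address);
            }
            return;
        }

        // Crude domain test: keep the link only if it mentions the domain.
        if (StayInDomain
            && strLink.IndexOf(strDomain, StringComparison.OrdinalIgnoreCase) < 0)
        {
            return;
        }

        if (!ContainsIgnoreCase(m_Links, strLink))
        {
            m_Links.Add(strLink);
        }
    }

    private static bool ContainsIgnoreCase(List<string> list, string value)
    {
        foreach (string item in list)
        {
            if (string.Equals(item, value, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

Whatever the real method does, some form of duplicate check matters once the code starts following links, because pages on the same site tend to link to one another and the list of pages to visit would otherwise keep growing.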
If the user only wants a single page scraped (that is, the “Scrape Entire Domain” checkbox is left unchecked), then the DoIt method is done. The following figure shows the program after pulling a single page. If the user checked the checkbox, then the code needs to move through all of the pages in the domain. A simplified version (without any error trapping or user interface updates) of this follows.
nIndex = 0;
while (nIndex < m_Links.Count
    && m_Links.Count < nLimit)
{
    lp = new LinkParser();
    WebClient wc = new WebClient();
    byte[] data = wc.DownloadData(LinkParser.PrependHTTP(m_Links[nIndex]));
    lp.GetURLLinks(data, m_Links[nIndex]);
    for (int i = 0; i < lp.m_Links.Count; i++)
    {
        AddLink(lp.m_Links[i], strDomain);
    }
    nIndex++;
}
You can see the program being used for an entire domain in the following figure. The demo program can be downloaded here.
Extending the Code
That covers most of the working parts of the code, but I had to extend it to meet my friend’s needs. I used exactly the same code, calling the ScrapeEmailAddresses.DoIt method once for each entry in a list of Web addresses. I then wrote the results to a tab-delimited file that was imported into Excel. It wasn’t hard at all to use the code I’ve shown you to create the application that was called for.
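The export step can be sketched roughly as follows. This is an illustration and not the application I delivered: the TabDelimitedExport class, the two-column layout, and the dictionary of results keyed by site are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that writes scraped addresses as tab-delimited text.
public static class TabDelimitedExport
{
    // Writes a header row, then one "site<TAB>email" row per address.
    public static void Write(TextWriter writer,
        IDictionary<string, List<string>> emailsBySite)
    {
        writer.WriteLine("Site\tEmail");
        foreach (KeyValuePair<string, List<string>> entry in emailsBySite)
        {
            foreach (string email in entry.Value)
            {
                writer.WriteLine(Clean(entry.Key) + "\t" + Clean(email));
            }
        }
    }

    // Tabs and line breaks inside a value would break the column layout,
    // so they are replaced with spaces.
    private static string Clean(string value)
    {
        return value.Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
    }
}
```

Passing a StreamWriter opened on a file of your choosing produces something Excel can open directly, because it treats each tab as a column break.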
Conclusion
The .NET Framework makes it easy to scrape Web pages. And with a little thought, the String class makes it easy to pull out the links and save them. There are countless applications that you can develop based on these techniques.
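To give a feel for the String class doing that job, here is a minimal sketch that pulls href values out of a page using nothing but IndexOf and Substring. It is not the LinkParser class from the demo program. HrefExtractor is a hypothetical name, and the code only recognizes double-quoted href attributes, so single-quoted or unquoted values are missed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; a simplified illustration of string-based link extraction.
public static class HrefExtractor
{
    // Returns the value of every double-quoted href="..." attribute in the HTML.
    public static List<string> Extract(string html)
    {
        List<string> links = new List<string>();
        const string Marker = "href=\"";
        int pos = 0;
        while (true)
        {
            // Find the next href attribute, ignoring case (HREF, Href, ...).
            int start = html.IndexOf(Marker, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0)
            {
                break;
            }
            start += Marker.Length;

            // The value runs up to the closing double quote.
            int end = html.IndexOf('"', start);
            if (end < 0)
            {
                break;
            }
            links.Add(html.Substring(start, end - start));
            pos = end + 1;
        }
        return links;
    }
}
```

Because it matches any href attribute, this also picks up values from tags other than anchors, such as stylesheet links. A fuller parser would check the enclosing tag.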