Ziff-Davis Enterprise 
DevSource: Microsoft Developer Resource
Scraping Email Addresses from the Web
By Rick Leinecker

There are times when you need to compile a list of email addresses from a URL. This article shows you how to do it with C#.

It’s amazing what you do for friends. I had a buddy call and ask how hard it would be to get email addresses from a list of Web sites. Of course, I wanted to make it sound easy so that I could sound like some sort of expert. But when he asked me if I would do it I realized that my bravado had painted me into a corner. I knew I would have to agree to do the project (for free, of course), and get him a list of emails that were scraped from a list of Web sites.

It’s okay, though, because now I can show you how I did it. And this fits nicely with another article I recently wrote for DevSource.com, where I explained how to pull data from Internet URLs. In this article I’ll work through writing a program that scrapes email addresses from a Web site.

The Program Organization

The program’s flow is simple. It gets a domain or URL from the user and downloads the page at that address. It parses the page for links (anchor tags), then moves the links that are email addresses to a separate email list. The following figure shows how the program is organized.
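
The parse-and-separate step can be sketched with nothing more than String methods. This is a minimal illustration, not the code from the demo program: the LinkSketch class and its method names are hypothetical, and it assumes that links appear as double-quoted href attributes and that email links use the mailto: scheme.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the demo program.
public static class LinkSketch
{
    // Returns the value of every double-quoted href attribute in the page.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> links = new List<string>();
        const string marker = "href=\"";
        int pos = 0;
        while (true)
        {
            int start = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;
            start += marker.Length;
            int end = html.IndexOf('"', start);
            if (end < 0) break;
            links.Add(html.Substring(start, end - start));
            pos = end + 1;
        }
        return links;
    }

    // Removes mailto: links from 'links' and returns the bare addresses
    // in their original order (assumption: email links use mailto:).
    public static List<string> TakeEmails(List<string> links)
    {
        const string prefix = "mailto:";
        List<string> emails = new List<string>();
        for (int i = links.Count - 1; i >= 0; i--)
        {
            if (links[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                string address = links[i].Substring(prefix.Length);
                int query = address.IndexOf('?');   // drop "?subject=..." if present
                if (query >= 0) address = address.Substring(0, query);
                emails.Insert(0, address);
                links.RemoveAt(i);
            }
        }
        return emails;
    }
}
```

Calling ExtractHrefs on the page text and then TakeEmails on the result leaves the page links in one list and the email addresses in another.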

Using the ScrapeEmailAddresses Class

There’s a class named ScrapeEmailAddresses that does all of the work. In this section, we’ll discuss using this class.

The Set Up

There are four things we need to do to set things up. The first is to instantiate the class.

The second is to set a limit on the number of pages that the class will examine. The limit exists because loading and parsing every page of a large Web site can take a long time, and you may not want an entire large site examined. Most of the email addresses tend to turn up early in the process anyway. The demo program has a text box that allows users to enter the limit. If no limit is specified, the value defaults to 99999.

The third thing that must be done is to create a delegate with which the ScrapeEmailAddresses class will update the user interface. And finally, a Boolean value must be available which indicates whether the email scraping code will go beyond the domain or URL that’s specified. The demo program has a checkbox, and its Checked property is used. The following code shows how the demo program sets up the class for the scraping.

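// Step 1: instantiate the scraper class.
ScrapeEmailAddresses sea = new ScrapeEmailAddresses();

// Step 2: read the page limit from the text box. If the box is empty
//   or not a valid integer, fall back to the default of 99999.
int nLimit;
if (!int.TryParse(txtLimit.Text.Trim(), out nLimit))
{
     nLimit = 99999;
}

// Step 3: give the class a delegate it can call to update the UI.
sea.ShowStatusDelegate = new ScrapeEmailAddresses.ShowStatusMethod(ShowStatus);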

The DoIt Method

A single method call named DoIt invokes the email scraping process. It takes three arguments: a string containing the URL, a Boolean value indicating whether to go beyond the initial URL, and an integer value containing the limit value. The following code shows how the DoIt method is used in the demo program.
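
The demo program’s own listing for this call is not reproduced here. As a rough sketch of the shape such a call could take, the example below uses a stand-in class. Everything beyond the three argument types is an assumption: the ScrapeEmailAddressesStub name, the delegate’s string parameter, the void return type, and the DoItDemo wrapper are all hypothetical.

```csharp
using System;

// Stand-in for the article's class; only the shape of the API is shown.
public class ScrapeEmailAddressesStub
{
    // Assumed signature: the callback receives a status message.
    public delegate void ShowStatusMethod(string message);
    public ShowStatusMethod ShowStatusDelegate;

    // Assumed return type: the article does not say what DoIt returns.
    public void DoIt(string url, bool goBeyondUrl, int limit)
    {
        if (ShowStatusDelegate != null)
        {
            ShowStatusDelegate("Scraping " + url + " (beyond: " + goBeyondUrl
                + ", limit: " + limit + ")");
        }
    }
}

// Hypothetical caller; the string and bool parameters stand in for the
// values a form would read from its text boxes and checkbox.
public static class DoItDemo
{
    public static string LastStatus = "";

    static void ShowStatus(string message) { LastStatus = message; }

    public static void Run(string urlText, bool beyondChecked, string limitText)
    {
        int limit;
        if (!int.TryParse(limitText.Trim(), out limit)) limit = 99999;

        ScrapeEmailAddressesStub sea = new ScrapeEmailAddressesStub();
        sea.ShowStatusDelegate = new ScrapeEmailAddressesStub.ShowStatusMethod(ShowStatus);

        // URL, whether to go beyond the initial URL, and the limit.
        sea.DoIt(urlText.Trim(), beyondChecked, limit);
    }
}
```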

The ScrapeEmailAddresses Class

The ScrapeEmailAddresses class’s DoIt method does the work. It starts off by instantiating a LinkParser object. It then retrieves the page specified by the URL string that’s passed in. Each link that the LinkParser’s GetURLLinks method extracts from the page is added to a list that’s a member of the ScrapeEmailAddresses class. The AddLink method is used so that code specific to the process is encapsulated in a method. A simplified version (without any error trapping) of this process follows.

// Instantiate the specialized link parser
LinkParser lp = new LinkParser();
// Instantiate a WebClient object
WebClient wc = new WebClient();
// Retrieve the page data.
byte[] data = wc.DownloadData(LinkParser.PrependHTTP(strURL));
// Call GetURLLinks, which parses the page data.
lp.GetURLLinks(data, strURL);
// Add the found links to a local list by using the
//   AddLink method.
for (int i = 0; i < lp.m_Links.Count; i++)
{
     AddLink(lp.m_Links[i], strDomain);
}
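
The body of AddLink is not shown in this article. The sketch below is one way such a method might behave, not the demo program’s implementation: it assumes the method skips duplicates, routes mailto: links to a separate email list, and keeps only http or https links whose host matches the domain. The LinkStore class and its member names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container, not the article's class.
public class LinkStore
{
    public readonly List<string> Links = new List<string>();
    public readonly List<string> Emails = new List<string>();
    private readonly HashSet<string> seen =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // 'domain' is a host such as "example.com"; an empty string
    // means no host restriction (assumed convention).
    public void AddLink(string link, string domain)
    {
        if (!seen.Add(link)) return;   // already recorded

        if (link.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
        {
            Emails.Add(link.Substring("mailto:".Length));
            return;
        }

        // Keep only absolute http/https URLs.
        Uri uri;
        if (!Uri.TryCreate(link, UriKind.Absolute, out uri)) return;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return;

        if (domain.Length > 0 && !HostMatches(uri.Host, domain)) return;
        Links.Add(link);
    }

    // True when host equals the domain or is a subdomain of it.
    static bool HostMatches(string host, string domain)
    {
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }
}
```

Keeping these rules in one method means the single-page path and the whole-domain loop can share them.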

If the user wants only a single page scraped (the “Scrape Entire Domain” checkbox is unchecked), then the DoIt method is done. The following figure shows the program after pulling a single page. If the user checked the checkbox, then the code needs to move through all of the pages in the domain. A simplified version (without any error trapping or user interface updates) of this follows.

// Walk the list of links. AddLink adds newly found links to
//   m_Links, so the list grows while the loop runs.
nIndex = 0;
while (nIndex < m_Links.Count
     && m_Links.Count < nLimit)
{
     // lp, wc, and data are the variables declared in the previous listing.
     lp = new LinkParser();
     data = wc.DownloadData(LinkParser.PrependHTTP(m_Links[nIndex]));
     lp.GetURLLinks(data, m_Links[nIndex]);
     for (int i = 0; i < lp.m_Links.Count; i++)
     {
          AddLink(lp.m_Links[i], strDomain);
     }
     nIndex++;
}
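
The loop is a worklist: an index walks a list that keeps growing as pages are parsed. To see that pattern without touching the network, here is a small sketch in which a dictionary stands in for the download-and-parse step. The CrawlSketch class, the site dictionary, and the duplicate check are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the demo program.
public static class CrawlSketch
{
    // 'site' maps a page URL to the links found on that page.
    // Returns the pages in the order they were queued.
    public static List<string> Crawl(
        Dictionary<string, string[]> site, string startUrl, int limit)
    {
        List<string> queue = new List<string>();
        HashSet<string> seen = new HashSet<string>();
        queue.Add(startUrl);
        seen.Add(startUrl);

        int index = 0;
        // Same two conditions as the loop above: more entries to visit,
        // and the list is still smaller than the limit.
        while (index < queue.Count && queue.Count < limit)
        {
            string[] found;
            if (site.TryGetValue(queue[index], out found))
            {
                foreach (string link in found)
                {
                    if (seen.Add(link)) queue.Add(link);   // queue each link once
                }
            }
            index++;
        }
        return queue;
    }
}
```

Because the limit is compared with the size of the list between pages, the list can end up somewhat longer than the limit: every link on the page being processed is added before the condition is checked again.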

You can see the program being used for an entire domain in the following figure. The demo program can be downloaded here.

Extending the Code

That actually covers most of the working parts of the code. But I had to extend it to meet my friend’s needs. I used exactly the same code, but called the ScrapeEmailAddresses.DoIt method once for each entry in a list of web addresses. I also wrote the results to a tab-delimited file that was then imported into Excel. It wasn’t hard at all to use the code I’ve shown you to create the application that was called for.
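
Writing the tab-delimited file takes only a few lines. The sketch below is an illustration rather than the code I wrote for my friend: the TabExport class, the two-column layout, and the header row are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the demo program.
public static class TabExport
{
    // Writes a header row, then one "site<TAB>email" row per address.
    public static void Write(string path, Dictionary<string, List<string>> emailsBySite)
    {
        using (StreamWriter writer = new StreamWriter(path))
        {
            writer.WriteLine("Site\tEmail");
            foreach (KeyValuePair<string, List<string>> entry in emailsBySite)
            {
                foreach (string email in entry.Value)
                {
                    writer.WriteLine(entry.Key + "\t" + email);
                }
            }
        }
    }
}
```

Excel can import a file in this form as tab-delimited text, with each site and address in its own column.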

Conclusion

The .NET framework makes it easy to scrape web pages. And with a little thought, the String class makes it easy to pull out the links and save them. There are countless applications that you can develop based on these techniques.



