Data scraping and misuse of listing data that belongs to brokers are growing concerns. Many scrapers (or others who receive the data legitimately but use it in ways that violate its licenses) are actively reselling listing data for uses brokers never intended, such as statistical or financial reporting. I recently had the opportunity to sit down with Curt Beardsley, vice president of consumer and industry development at Move, Inc., a leader in online real estate and operator of realtor.com®. Beardsley shared his thoughts on the proliferation of data scraping, the grey market for data, and how brokers can protect their data from unintended uses.
Reva Nelson: How are scrapers using the listing data once they scrape it?
Curt Beardsley: The listing data’s first and foremost value is as an advertisement to get the property sold, and to promote the agent and broker. That value is clear and fairly simple. The second value is not as clearly defined: all the ancillary ways this data can be used, such as statistical reporting, valuation, marketing of relevant services, and targeted mailing lists. Those uses go beyond the original purpose of the listing, which is to present information to consumers and other agents to facilitate the sale of the property. Banks and other entities are eager to get hold of that data because it lets them know who will be up for a mortgage, who will be moving, and who will potentially need services. People who are selling homes offer a prime marketing opportunity since they may need movers, packing materials, storage, and so on. There is a vast grey market for this data.
RN: Given that the users of this data aren’t even in the real estate industry, why should brokers be concerned about this issue?
CB: If your license agreements aren’t clear, that whole other world gets fed and lives on this data. It is taking money out of the broker’s pocket, in two different ways. First, if a legitimate marketplace could be created, it would be a revenue stream; brokers could be making money off of this. Second, brokers lose control of the leaked data: control of their brand, and of the way the listing is being monetized and displayed. Consumers get agitated when they keep being connected to agents about homes that are not really for sale. A lot of that data is wrong. Agents are paying for leads on homes that went off the market months ago. As a broker, that’s a brand problem.
RN: How does the grey market actually get the data?
CB: There are two ways, and both are concerns for brokers. The first is scraping, which is rampant and aggressive in the real estate arena. In a matter of hours, a computer program, or bot, can run through a site and extract all the listing data from that site. When someone wants to scrape a site, all it takes is a bit of simple code. I’ve seen many requests like this on legitimate online marketplaces. In December, someone put a request out to bid on Elance.com (a marketplace for developers, mobile programmers, designers, and other freelancers to connect with those seeking programming or other services) for someone to scrape listing data off of real estate websites. The bid closed at $350, and 52 people from around the world bid on it. If you have a content site with unique content, it is invariably targeted to be scraped. The second area of exposure is data leaks. A whole flood of secondary data gets out when you send your data someplace and it flows right out the back door.
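To give a sense of how little effort that takes, here is a minimal sketch in Python of what such a bot boils down to. The URL, the page structure, and the field patterns are hypothetical, and the crude regular expressions stand in for whatever parsing a real scraper would use; the point is only that a handful of lines, looped over every page of a site, is enough to walk away with the entire inventory.

```python
# Minimal sketch of a listing scraper. The site, URL pattern, and HTML
# structure assumed here are hypothetical; this only illustrates how small
# such a bot can be.
import re
import urllib.request

LISTINGS_URL = "https://example-real-estate-site.test/listings?page={page}"

def scrape_page(page_number):
    """Fetch one results page and pull out anything that looks like an address and a price."""
    with urllib.request.urlopen(LISTINGS_URL.format(page=page_number)) as response:
        html = response.read().decode("utf-8", errors="replace")
    prices = re.findall(r"\$[\d,]+", html)
    addresses = re.findall(r'class="address">([^<]+)<', html)
    return list(zip(addresses, prices))

# A bot simply loops this over every results page, harvesting the whole
# inventory in a matter of hours:
# for page in range(1, 500):
#     records.extend(scrape_page(page))
```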
RN: How can brokers prevent this from happening?
CB: One way is to enforce the license rules. This data is important and has value for promoting the listing and creating derivative works. You need to make sure that all parties using it have a license for that data, and that they are treating the data with respect for that license. For example, all of the data we display on realtor.com is under a license agreement. We believe in that. In our case, it’s licensed for display and use on our website. Our other sites — ListHub (data syndication), Top Producer CRM (customer relationship management software), and TigerLead (lead generation/data analytics) — have very specific agreements around how we get our data and what we do with it. I think brokers need to take a stand against entities that actively take data without a valid license. As a broker, I’d refuse to do business with someone who doesn’t license the data from me or from my authorized party.
RN: You mentioned that this is taking money out of brokers’ pockets. Is there anything the industry can do to stop feeding this grey market?
CB: Part of the problem is that there isn’t a structured, easy, and legal way for entities to get that data if they want it. For example, banks may want to know when homes for which they carry the note come up for sale. Moving companies may want to know if someone’s selling a home since they’ll be in the market for movers. These may be perfectly legitimate uses for that data, but typically, these companies don’t have a way to get it. Grey or black markets exist when the normal market is really difficult to access. That’s the case here.
I think there is a way to create a legitimate marketplace for access to our data for other uses, but it would require some work, cooperation, and long-term vision: a place where entities that are not interested in advertising the listing to a consumer audience could license listing data for those other uses, at a cost appropriate to the nature of the transaction. Providing a legitimate, reasonably priced way to license the data would cut back on its illicit flow. If a bank wants nationwide data on sales trends, it is far better for that bank to get that data legitimately. At the moment, there isn’t an easy, legitimate way to license data with national coverage.
RN: How does a website like realtor.com protect its data from scrapers?
CB: Anti-scraping is an evolving art. First, we have a real-time snapshot that holds 20 minutes of live queries in memory to look for suspicious activity. If it sees something, it tries to decide whether the activity is machine-driven or human. Humans typically look around and click on various things; machines have an order to how they go through a site. If it’s machine-driven, we ask: is that machine friend or foe? (Friendly machines, such as the Google, Bing, or Yahoo search engines, index the site in order to display its contents in search results, which is, essentially, friendly scraping.) Once we determine the scraper to be a foe, we immediately block the user’s IP address. But scrapers get clever. One scraper realized we were looking at a 20-minute window, so instead of launching one bot to scrape 10,000 listings, they launched 10,000 bots to each scrape one listing every 20 minutes. Since realizing that, we also look back at the previous 24 hours and block any scrapers we’ve identified.
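As a rough sketch of the kind of sliding-window check Beardsley describes, the Python below keeps a short in-memory window and a longer look-back per IP address, exempts known friendly crawlers, and blocks traffic that looks machine-driven. All of the thresholds, the user-agent list, and the machine-versus-human heuristic are illustrative assumptions, not realtor.com’s actual rules.

```python
# Sketch of a sliding-window scraper detector. All thresholds and the
# "machine-like" heuristic are illustrative assumptions, not production rules.
import time
from collections import defaultdict, deque

SHORT_WINDOW = 20 * 60        # 20 minutes of live queries held in memory
LONG_WINDOW = 24 * 60 * 60    # 24-hour look-back for slow, distributed bots
MAX_SHORT_REQUESTS = 300      # hypothetical per-IP ceiling in the short window
MAX_LONG_REQUESTS = 2000      # hypothetical per-IP ceiling in the long window
FRIENDLY_AGENTS = ("googlebot", "bingbot", "yahoo")  # friendly indexers exempt

recent = defaultdict(deque)   # ip -> deque of (timestamp, url)
blocked = {}                  # ip -> time the block was applied

def record_request(ip, url, user_agent, now=None):
    """Log one request and return True if the IP should be blocked."""
    now = now if now is not None else time.time()
    if ip in blocked:
        return True
    if any(bot in user_agent.lower() for bot in FRIENDLY_AGENTS):
        return False  # friend, not foe: search-engine indexing is allowed

    history = recent[ip]
    history.append((now, url))
    # Drop anything older than the 24-hour look-back.
    while history and now - history[0][0] > LONG_WINDOW:
        history.popleft()

    short_hits = sum(1 for ts, _ in history if now - ts <= SHORT_WINDOW)
    distinct_urls = len({u for _, u in history})

    # Humans wander and revisit pages; machines march through many distinct
    # listings in order. This crude proxy stands in for real traffic profiling.
    looks_like_machine = distinct_urls > 0.9 * len(history)
    too_busy = short_hits > MAX_SHORT_REQUESTS or len(history) > MAX_LONG_REQUESTS
    if too_busy and looks_like_machine:
        blocked[ip] = now
        return True
    return False
```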
Once we’ve identified that there is a scraper on our site, we try to determine where the data is going. To do that, we manually seed the data. This involves physically changing the listing record, for example, taking an ampersand and making it “and” or writing out an acronym. Then we search for that string to find out if our modified version appears anywhere online. But, quite often scrapers aren’t putting this data online. The vast majority of data being scraped goes into statistical analysis and documents that are shared internally at financial institutions, hedge funds, banks and other interested parties.
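A seeding scheme along those lines could be sketched as follows. The particular substitutions, helper names, and example text are assumptions made for illustration; they show the idea of planting a searchable fingerprint in a record, not the actual changes Move makes.

```python
# Sketch of manual data seeding: plant a small, searchable change in a listing
# record, then check whether suspect text contains that fingerprint.
# The substitution rules below are illustrative, not the real ones.

SEED_SUBSTITUTIONS = [
    ("&", "and"),                        # "Oak & Main" becomes "Oak and Main"
    ("HOA", "homeowners association"),   # an acronym written out
]

def seed_listing(description):
    """Apply the seed substitutions and return (seeded_text, fingerprints)."""
    seeded = description
    for original, replacement in SEED_SUBSTITUTIONS:
        seeded = seeded.replace(original, replacement)
    # Any phrase that differs from the original copy works as a fingerprint;
    # here we keep the seeded sentences that changed.
    fingerprints = [s.strip() for s in seeded.split(".")
                    if s.strip() and s not in description]
    return seeded, fingerprints

def appears_scraped(suspect_text, fingerprints):
    """True if the suspect text contains any of the seeded fingerprints."""
    return any(fp in suspect_text for fp in fingerprints)

# Example: the seeded variant shows up verbatim in a third-party report.
seeded, marks = seed_listing("Corner lot at Oak & Main. Low HOA fees.")
print(appears_scraped("... Corner lot at Oak and Main ...", marks))  # True
```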
Finally, at the end of last year, we began working with an external security firm that provides anti-scraping services to content providers around the world. Their dedicated expertise, and their ability to compare our traffic with known profiles of scrapers they have caught on the other sites they monitor, have dramatically increased both the number of scrapers identified and the number of scrapers blocked on our sites.
RN: What recourse do you have when you catch a scraper?
CB: If we know who they are, we first block their IP address. If we can identify where the code is going, we can send a cease and desist letter, requesting that they take the data down from that site immediately. If they do not do so, we have the right to take appropriate legal action.
RN: What has been the outcome of this rigorous process?
CB: Our process has made a tremendous difference. When we started seriously cracking down on this two years ago, we identified 1.5 million scraping attempts per day. That has dropped dramatically over the last year. Interestingly, we saw a massive uptick in December and January of this year, and a decline again after that. There were over 59 million attempts on our site in December alone; we assume people were trying to pull year-end stats. We blocked almost all of them. We’re closing in on blocking 99 percent of scrapers.
Unfortunately, scrapers have started going to easier places to get the data. My problem has now become an MLS and broker problem.