By Julius Černiauskas; translated by Yun Tian
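Recently, the internet has been experiencing a phenomenon similar to the “gold rush” of the early 18th century, particularly in data extraction. Because of its enormous value, data has been called the “new oil” by some analysts. The data field remains open to players large and small, but this has also led to a number of unprofessional practices, with some even trying to obtain password-protected data.
Although many websites do include defensive measures such as IP bans, the invisible conflict between web scrapers and servers continues and is intensifying because of growing competition and various economic factors. Most people are happy to take advantage of the low prices on aggregator sites such as Expedia, Google Shopping, PriceGrabber and Skyscanner, yet they do not realise that this conflict is playing out between different e-commerce platforms.
Tools can be used for good or ill, and web scraping is no exception. A fairly common case is scraping personal data for marketing purposes. Hundreds of millions of users agree to make their data public through terms of service on e-commerce platforms, whether or not they are aware of doing so. The problem with leaked data, however, is that it is extracted by social media agencies but used by zombie websites. Such sites create profiles and list individuals’ details without the users’ permission.
As a result, negative news about web scraping is growing, which has raised public awareness of the value and privacy of personal data. There is nothing unethical about web scraping in itself, because it merely automates activities that people would normally carry out by hand. The main difference is that web scraping uses bots to crawl a large number of websites and extract vast amounts of information in a very short time, gathering information on a much larger scale.
Extracting publicly available data requires proxies. Put simply, a proxy is an intermediary between the web scraper and the server. Using proxies distributes data requests evenly to the server, which ensures that data is requested at a reasonable rate and keeps the requesting party anonymous.
The way unethical scraping extracts data may harm personal privacy and overload servers.
Although many websites try to prevent unethical scraping through IP bans, this is gradually becoming futile, because proxies are used and can simulate human behaviour to get around server issues. This may ultimately lead to server overload (costing online businesses money), reduced internet transparency and deeper public distrust over privacy.
Web scraping is highly beneficial, but that depends on a free and transparent internet being available. I am convinced that if we follow a few guidelines that make the situation fair for everyone, web scraping will benefit the entire technology sector: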
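1. Scrape only publicly available web pages
2. Study the target website’s legal documents to determine whether you are legally accepting its terms of service and, if so, make sure you will not breach them
3. Request data at a reasonable rate so that server function is not impaired (as it would be in a DDoS attack)
4. Respect the source website’s privacy protections for any data obtained
5. Use proxies obtained by ethical means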
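It is well known that some of the proxies operating today were not obtained ethically. Many are acquired through applications that people download to their personal devices. It is hard to tell whether these users realise that their devices are being used. What is certain is that if users agreed to misleading or confusing terms of service and thereby unwillingly turned their devices into participants in a residential proxy network, using such devices as proxies is unethical.
Some aspects of modern web scraping lack clarity, and a code of ethics is needed to bring order to the industry. If people in the industry can reach a consensus on a professional approach to web scraping, it will help maintain a fair, open and free internet that benefits both businesses and consumers. Our understanding of the full potential of data scraping across industries is still at an early stage, so let us seize this opportunity to drive innovation and growth in the most ethical way possible.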
The internet is currently undergoing a phenomenon similar to the gold rushes of the early eighteenth century, specifically when it comes to data extraction. With data now dubbed by some analysts as the “new oil” in terms of its value, the field is still open to small and large players alike, which has led to some unprofessional activities that extend all the way to the acquisition of password-protected data.
While many websites do contain defensive measures such as IP bans, the invisible conflicts between scrapers[1] and servers are ongoing and gaining in intensity, due to increased competition and economic factors. Most people don’t realise these are taking place between e-commerce stores, although they are happily taking advantage of the low prices found on aggregator websites[2] like Expedia, Google Shopping, PriceGrabber and Skyscanner.
[1] scraper: a program or script that automatically collects information from the World Wide Web according to set rules. “Scraping” and “crawling” below both refer to collecting data from the web.
[2] aggregator website: a website that gathers popular content from other websites, by manual or technical means, then sorts and groups the related links and content into its own pages.
Tools can be used for positive and negative purposes, and web scraping is no exception. A fairly common scenario is the scraping of personal data for marketing purposes. Hundreds of millions of users agree to release their data through terms of service agreements on e-commerce sites, whether they realise it or not. The issue with the exposed data, however, is that it has been extracted by social media agencies and used by now-defunct websites that create profiles and list personal details without user permission.
As a result, web scraping is increasingly being subjected to negative press that has resulted in increased awareness from the public with respect to the value and privacy of their data. There is nothing inherently unethical about web scraping as it automates activities that people often do on a manual basis. The main difference is that web scraping does it on a much bigger scale by using bots to crawl numerous websites and extract huge amounts of information in seconds.
Extracting publicly available data requires proxies[3]. In short, proxies act as intermediaries between the web scraper and the web server. Employing proxies allows data requests to be distributed evenly to the web server, ensuring that the data is requested at a fair rate, as well as providing anonymity to the requesting party.
[3] proxy: a special network service that lets a client connect to a server through it.
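As a rough illustration of what distributing requests evenly can look like, the sketch below pairs each URL with a proxy in round-robin order. The proxy addresses, the `shop.example` URLs and the `assign_proxies` helper are hypothetical names chosen for this example, and no network requests are made.

```python
from collections import Counter
from itertools import cycle

# Hypothetical proxy addresses, used only for illustration.
PROXIES = [
    "proxy-a.example:8080",
    "proxy-b.example:8080",
    "proxy-c.example:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, rotating through the proxies in order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Nine hypothetical product pages spread over three proxies.
urls = [f"https://shop.example/item/{i}" for i in range(9)]
plan = assign_proxies(urls, PROXIES)

# Count how many requests each proxy would carry.
load = Counter(proxy for _, proxy in plan)
```

With nine URLs and three proxies, each proxy carries three requests, so no single address sends a disproportionate share of the traffic.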
Unethical scraping uses data extraction in a way that may compromise[4] privacy and result in server overload.
[4] compromise: to endanger or damage.
While many websites try to prevent it through IP bans, this is becoming futile[5] due to the use of proxies and their function in circumventing[6] server issues by simulating human behaviour. The end results can be server overloads that cost online businesses money, reduced internet transparency and more distrust from the public with respect to privacy issues.
[5] futile: in vain.
[6] circumvent: to evade (a rule or restriction).
Web scraping has many benefits that depend upon the availability of a free and transparent internet. I believe it would benefit the entire tech space if we adopted a few guidelines in order to make the landscape fair for everyone:
1. Scrape publicly available web pages only
2. Study the target website’s legal documents to determine whether you would be legally accepting its terms of service and, if so, whether you can avoid breaching them
3. Make reasonable requests for data in order to ensure that server function is not compromised (as it would be in a DDoS attack[7])
4. Respect privacy concerns of source websites with regard to any data obtained
5. Make use of proxies procured in an ethical manner
[7] DDoS attack: distributed denial-of-service attack, a type of network attack that aims to exhaust the target machine’s network and system resources so that it can no longer serve legitimate requests.
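One way to approach the third guideline is to enforce a minimum delay between consecutive requests to the same server. The sketch below is a minimal example under stated assumptions: the `Throttle` class is a hypothetical helper written for this illustration, and what counts as a reasonable interval depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one server."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval is in seconds; clock and sleep can be swapped out in tests.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Sleep until min_interval has passed since the previous call.

        Returns the number of seconds slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper would create one `Throttle` per target host, for example `Throttle(2.0)` for an assumed two-second gap, and call `wait()` immediately before sending each request.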
It is commonly known that some proxies operating today are not ethically sourced, with many obtained through applications that people download to their devices. Whether these individuals are aware that their device is being used is difficult to ascertain. What is certain is that it is not ethical to use these devices as proxies in cases where their owners consented to misleading or confusing terms of service that turn their device into an unwilling participant in a residential proxy network.
Some aspects of modern web scraping activity lack clarity, and a code of ethics is needed to bring order to the industry. If those in the industry can come together in agreement over a professional approach to web scraping, it will help to maintain a fair, open and free internet that will benefit both businesses and consumers. We are still in the early stages of discovering the full potential of data scraping in different industries, so let’s take advantage of this golden opportunity to drive innovation and create growth in the most ethical way possible.