鄔峻
18世紀中葉以來,人類歷史上先后發(fā)生過3次工業(yè)革命。第一次工業(yè)革命開創(chuàng)了“蒸汽時代”(1760—1840年),標志著農(nóng)耕文明向工業(yè)文明的過渡,是人類發(fā)展史上的第一個偉大奇跡;第二次工業(yè)革命開啟了“電氣時代”(1840—1950年),使得電力、鋼鐵、鐵路、化工、汽車等重工業(yè)興起,石油成為新能源,促進了交通迅速發(fā)展以及世界各國之間更頻繁地交流,重塑了全球國際政治經(jīng)濟格局;兩次世界大戰(zhàn)之后開始的第三次工業(yè)革命,更是開創(chuàng)了“信息時代”(1950年至今),全球信息和資源交流變得更為便捷,大多數(shù)國家和地區(qū)都被卷入全球一體化和高度信息化進程中,人類文明達到空前發(fā)達的高度。
2016年1月20日,世界經(jīng)濟論壇在瑞士達沃斯召開以“第四次工業(yè)革命”為主題的年會,正式宣告了將徹底改變世界發(fā)展進程的第四次工業(yè)革命的到來。論壇創(chuàng)始人、執(zhí)行主席施瓦布(Schwab)教授在其著作《第四次工業(yè)革命》(The Fourth Industrial Revolution)中詳細闡述了可植入技術(shù)、基因排序、物聯(lián)網(wǎng)(IoT)、3D打印、無人駕駛、人工智能、機器人、量子計算(quantum computing)、區(qū)塊鏈、大數(shù)據(jù)、智慧城市等技術(shù)變革對“智能時代”人類社會的深刻影響。這次工業(yè)革命中大數(shù)據(jù)將逐步取代石油成為第一資源,其發(fā)展速度、范圍和程度將遠遠超過前3次工業(yè)革命,并將改寫人類命運以及沖擊幾乎所有傳統(tǒng)行業(yè)的發(fā)展[1],建筑、景觀與城市的發(fā)展也不例外。
第四次工業(yè)革命帶來的數(shù)據(jù)爆炸正在改變我們的生活和人類未來。過去十幾年,無論是數(shù)據(jù)的總量、種類、實時性還是變化速度都在呈現(xiàn)幾何級別的遞增[2]。截至2013年,全世界電子數(shù)據(jù)已經(jīng)達到460億兆字節(jié),相當于約400萬億份傳統(tǒng)印刷本報告,它們拼接后的長度可以從地球一直鋪設(shè)到冥王星。而僅僅在過去2年里,我們創(chuàng)造的數(shù)據(jù)量就占人類已創(chuàng)造數(shù)據(jù)總量的90%[3]。大數(shù)據(jù)已被視為21世紀國家創(chuàng)新競爭的重要戰(zhàn)略資源,并成為發(fā)達國家爭相搶占的下一輪科技創(chuàng)新的前沿陣地[4]。
雖然數(shù)據(jù)大井噴帶來了符合“新摩爾定律”的數(shù)據(jù)大爆炸,2020年全世界產(chǎn)生的數(shù)據(jù)總量將是2009年數(shù)據(jù)總量的44倍[5],但是由于缺乏應(yīng)對數(shù)據(jù)大爆炸的新型研究范式,世界上僅有不到1%的信息能夠被分析并轉(zhuǎn)化為新知識[6]。Anderson指出這種數(shù)據(jù)爆炸不僅是數(shù)量上的激增,更是在復雜度、類別、變化速度和準確度上的激增與相互混合。他將這種混合型爆炸定義為“數(shù)據(jù)洪水”(data deluge),認為在出現(xiàn)新的研究范式以前,“數(shù)據(jù)洪水”將成為制約現(xiàn)有所有學科領(lǐng)域科研發(fā)展的瓶頸[7]1。在城市和建筑設(shè)計領(lǐng)域,盡管我們早已經(jīng)生活在麻省理工學院米歇爾教授(William J. Mitchell)生前預(yù)言的“字節(jié)城市”(city of bits)里[8],但是卻遠遠沒有賦予城市研究與設(shè)計“字節(jié)的超能量”(power of bits)[9]。我們必須開發(fā)新型研究范式以應(yīng)對“數(shù)據(jù)洪水”的強烈沖擊。
常規(guī)研究方法與傳統(tǒng)范式越來越捉襟見肘,這迫使一些科學家探索適合于大數(shù)據(jù)和人工智能的新型研究范式。Bell、Hey和Szalay預(yù)警道,所有研究領(lǐng)域都必須面對越來越多的數(shù)據(jù)挑戰(zhàn),“數(shù)據(jù)洪水”的處理和分析對所有研究科學家至關(guān)重要且任務(wù)繁重。Bell、Hey和Szalay進而提出了應(yīng)對“數(shù)據(jù)洪水”的“第四范式”。他們一致認為:至少自17世紀牛頓運動定律出現(xiàn)以來,科學家們已經(jīng)認識到實驗和理論科學是理解自然的2種基本研究范式。近幾十年來,計算機模擬已成為必不可少的第三范式:一種科學家難以通過以往理論和實驗探索進行研究的新標準工具。而現(xiàn)在隨著模擬和實驗產(chǎn)生了越來越多的數(shù)據(jù),第四種范式正在出現(xiàn),這就是執(zhí)行數(shù)據(jù)密集型科學分析所需的最新AI技術(shù)[10]。
Halevy等指出,“數(shù)據(jù)洪水”表明傳統(tǒng)人工智能中的“知識瓶頸”,即如何最大化提取有限系統(tǒng)中的無限知識的問題,將可以通過在許多學科中引入第四范式得到解決。第四范式將運用大數(shù)據(jù)和新興機器學習的方法,而不再純粹依靠傳統(tǒng)理論研究的數(shù)學建模、經(jīng)驗觀察和復雜計算[11]。
美國硅谷科學家Gray、Hey、Tansley和Tolle等總結(jié)出數(shù)據(jù)密集型“第四范式”區(qū)別于以前科研范式的一些主要特征(圖1)[12]16-19:1)大數(shù)據(jù)的探索將整合現(xiàn)有理論、實驗和模擬;2)大數(shù)據(jù)可以由不同IoT設(shè)備捕捉或由模擬器產(chǎn)生;3)大數(shù)據(jù)由大型并行計算系統(tǒng)和復雜編程處理來發(fā)現(xiàn)隱藏在大數(shù)據(jù)中的寶貴信息(新知識);4)科學家通過數(shù)據(jù)管理和統(tǒng)計學來分析數(shù)據(jù)庫,并處理大批量研究文件,以獲取發(fā)現(xiàn)新知識的新途徑。
第四范式與傳統(tǒng)范式在研究目的和途徑上的差異主要表現(xiàn)在以下方面。傳統(tǒng)范式最初或多或少從“為什么”(why)和“如何”(how)之類的問題開始理論構(gòu)建,后來在“什么”(what)類問題的實驗觀測中得到驗證。但是,第四范式的作用相反,它僅從數(shù)據(jù)密集型“什么”類問題的數(shù)據(jù)調(diào)查開始,然后使用各種算法來發(fā)現(xiàn)大數(shù)據(jù)中隱藏的新知識和規(guī)律,反過來生成揭示“如何”和“為什么”類問題的新理論。安德森在其《理論的終結(jié):數(shù)據(jù)洪水使科學方法過時了?》(“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete?”)一文中指出,首先,第四范式并不急于從煩瑣的實驗和模擬,或嚴格的定義、推理和假設(shè)開始理論構(gòu)建;相反,它從大型復雜數(shù)據(jù)集的收集和分析開始[7]1。其次,隱藏在這些龐大、復雜和交織的數(shù)據(jù)集中的寶貴知識很難處理,通常無法使用傳統(tǒng)的科學研究范式完成知識發(fā)現(xiàn)[12]16-19。
在國內(nèi)外,使用第四范式進行城市和風景園林研究的探索仍處于起步階段,其方法和目標多種多樣。由于篇幅所限,本研究直接使用了一些開放數(shù)據(jù),并將其與荷蘭政府關(guān)于宜居性的調(diào)查結(jié)果相結(jié)合,重點是將“宜居性”(livability)作為機器學習的預(yù)測目標,以引入系統(tǒng)的數(shù)據(jù)密集型研究方法論證第四范式在城市和風景園林研究中的可行性和實用前景。
宜居性是衡量與評價城市與景觀環(huán)境可居性與舒適性的重要指標,近年來更是成為智慧城市開發(fā)的重要切入點,澳大利亞-新西蘭智慧城市委員會(The Council of Smart Cities of Australia/New Zealand)執(zhí)行主席Adam Beck將智慧城市定義為:“智慧城市將利用高科技和大數(shù)據(jù)加強城市宜居性、可操作性和可持續(xù)性?!盵13]
“Eudaimonia”(宜居性)在西方最初由亞里士多德提出,意味著生活和發(fā)展得很好。長期以來,關(guān)于宜居性并沒有統(tǒng)一的定義,它在不同城市發(fā)展階段、不同地區(qū)和不同學科領(lǐng)域有多樣化的含義與運用,這導致了宜居性概念上的混亂和可操作性上的難度。盡管宜居性缺乏統(tǒng)一的認知和可量化的度量系統(tǒng),多年來,經(jīng)典理論研究嘗試從經(jīng)濟、社會、政治、地理和環(huán)境等維度來探索宜居性的相關(guān)指標。
Balsas強調(diào)了經(jīng)濟因素對宜居性的決定性作用,他認為較高的就業(yè)率、人口中不同階層的購買力、經(jīng)濟發(fā)展、享受教育和就業(yè)的機會、生活水準是決定宜居性的基礎(chǔ)[14]103。Litman也指出人均GDP對宜居性的重要影響。同時,居民對于交通、教育、公共健康設(shè)施的可及性和經(jīng)濟購買力也應(yīng)被視為衡量宜居性的重要指標[15]。Veenhoven的研究也發(fā)現(xiàn)GDP發(fā)展程度、經(jīng)濟的社會需求和居民購買力對于評價宜居性起到重要作用[16]2-3。
Mankiw對以經(jīng)濟作為單一指標評價宜居性提出批評,他認為僅僅用人均GDP維度來衡量宜居性是不夠的[17]。Rojas建議其他維度也必須納入考慮范圍,例如政治、社會、地理、文化和環(huán)境的因素[18]。Veenhoven增加了政治自由度、文化氛圍、社會環(huán)境與安全性作為宜居性評價指標[16]7-8。
Van Vliet的研究表明,社會融合度、環(huán)境清潔度、安全性、就業(yè)率以及諸如教育和醫(yī)療保健等基礎(chǔ)設(shè)施的可及性對城市宜居性具有直接影響[19]。Balsas還承認,除了經(jīng)濟以外,諸如完備的基礎(chǔ)設(shè)施、充足的公園設(shè)施、社區(qū)感和公眾參與度等因素也對城市宜居性的提升發(fā)揮了積極作用[14]103。
盡管上述學者提出了評估宜居性的政治、社會、經(jīng)濟和環(huán)境因素,但他們并沒有給出評估城市宜居性的具體指標建議。Goertz試圖通過整體性的3層方法系統(tǒng)地評估宜居性,從而整合上述要素,并為每個子系統(tǒng)提出相關(guān)的評估指標。第一層界定了宜居類型,第二層構(gòu)建了因素框架,而第三層則定義了具體指標變量[20](表1)。
表1 根據(jù)Goertz三層結(jié)構(gòu)系統(tǒng)總結(jié)的宏觀宜居性因素框架與相關(guān)指標變量Tab. 1 The framework of macro livability factors and related measurement indexes based on Goertz’s three-tier system
通過對宜居性經(jīng)典研究方法進行總結(jié)和整合,Goertz在宏觀層面對宜居性進行了定性研究,其中一些因素為未來的定量研究指明了方向。但是,他未能在中觀層面上呈現(xiàn)不同城市系統(tǒng)中相應(yīng)的可控變量。作為回應(yīng),Sofeska提出了一個中觀層面的城市系統(tǒng)評價因素框架[21],包括安全和犯罪率、政治和經(jīng)濟穩(wěn)定性、公眾寬容度和商業(yè)條件;有效的政策、獲得商品和服務(wù)的機會、高國民收入和低個人風險;教育、保健和醫(yī)療水平、人口組成、壽命和出生率;環(huán)境和娛樂設(shè)施、氣候環(huán)境、自然區(qū)域的可及性;公共交通和國際化。Sofeska特別強調(diào)了建筑質(zhì)量、城市設(shè)計和基礎(chǔ)設(shè)施有效性對中觀層面城市系統(tǒng)宜居性的影響。
超越宏觀的政治、經(jīng)濟、社會、環(huán)境體系和中觀的城市體系,Giap等認為“宜居性”更應(yīng)該是一個微觀的地域性概念。在定性研究微觀的社區(qū)尺度的宜居性時,他更強調(diào)城市生活質(zhì)量與城市物質(zhì)環(huán)境質(zhì)量對宜居性的微觀影響,并將綠色基礎(chǔ)設(shè)施作為一個重要的指標[22]。因此,Giap等提出在社區(qū)單元尺度上定性研究宜居性的重要性。
但是無論Goertz、Sofeska還是Giap都未建立關(guān)于社區(qū)宜居性的定量評價指標體系和對應(yīng)的預(yù)測方法。目前對宜居性作定量分級調(diào)查主要來自兩套宏觀系統(tǒng):基于六大指標體系的經(jīng)濟學人智庫(Economist Intelligence Unit, EIU)宜居指標與基于十大指標體系的Mercer生活質(zhì)量調(diào)查體系(Mercer LLC)。不過這兩套體系僅提供了主要基于經(jīng)濟指標的不同城市之間宜居性的宏觀分級比較工具,在社區(qū)微觀層面進行量化研究和預(yù)測時并不具備可操作性[23]。
在荷蘭語大辭典中,“l(fā)eefbaarheid”(宜居性)的一般定義如下:“適合居住或與之共存”(荷蘭語:geschikt om erin of ermee te kunnen leven)。因此,荷蘭語境的宜居性實際是關(guān)于主體(有機體,個人或社區(qū))與客體環(huán)境之間適宜和互動關(guān)系的陳述[24]。
1969年,Groot將宜居性描述為對獲得合理收入和享受合理生活的客觀社會保障,充分滿足對商品和服務(wù)需求的社會主觀認識。在這個偏經(jīng)濟和社會目標的定義中,對客觀和主觀宜居性的劃分隱含其中??陀^保障涉及實際可記錄的客觀情況(勞動力市場、設(shè)施、住房質(zhì)量等);主觀意識涉及人們體驗實際情況的主觀方式[25]。在20世紀70年代,宜居性進入了地區(qū)政治的視角。人們意識到宜居性的中心不應(yīng)該是建筑物,而是人;物質(zhì)生活不僅在數(shù)量,更在質(zhì)量。當時的鹿特丹市議員Vermeulen將這一社會概念的轉(zhuǎn)換描述為:“你可以用你住房的磚塊數(shù)量算出你房子的大小,但卻無法知道它的宜居性?!?2002年,荷蘭社會和文化規(guī)劃辦公室(The Social and Cultural Planning Office of The Netherlands)對宜居性給出了以下描述:物質(zhì)空間、社會質(zhì)量、社區(qū)特征和環(huán)境安全性之間的相互作用。2005年Van Dorst總結(jié)了宜居性的3個視角:顯著的宜居性是人與環(huán)境的最佳匹配;以人為本體驗環(huán)境的宜居性;從可定義的生活環(huán)境去推斷宜居性[26]。
荷蘭政府自1998年以來設(shè)計并分發(fā)了大量調(diào)查問卷,定期對全國部分地區(qū)進行宜居性調(diào)查和統(tǒng)計。調(diào)查問卷的內(nèi)容主要包括:基于居住環(huán)境、綠色與文體設(shè)施、公共空間基礎(chǔ)設(shè)施、社會環(huán)境、安全性等方面的滿意度,對宜居性從極低到極高(1~9)打分。
同時, 荷蘭住建部(Ministerie van Volkshuisvesting, Ruimtelijke Ordening en Milieu, VROM)委托國立公共健康和環(huán)境研究院(Rijksinstituut voor Volksgezondheid en Milieu, RIVM)以及荷蘭建筑環(huán)境研究院(Het Research Instituut Gebouwde Omgeving, RIGO)依據(jù)調(diào)查問卷結(jié)果進行深入分析。他們首先對過去150年內(nèi)建筑、城市規(guī)劃、社會學、經(jīng)濟學角度關(guān)于宜居性的研究進行了相關(guān)文獻梳理,發(fā)現(xiàn)對于宜居性的定義和研究長期以來存在廣泛的不同定義甚至分歧。他們通過文獻研究認為:如果想在關(guān)于宜居性的研究領(lǐng)域取得突破,那么就必須建立一個超越當前文獻中學科差異的關(guān)于宜居性的多學科融合的理論框架。為此,他們提出了與客觀環(huán)境相對應(yīng)的宜居性、感知和行為的主觀評估系統(tǒng)。該系統(tǒng)包括:研究環(huán)境和人的各方面如何影響對生活環(huán)境宜居性的感知;縱向研究宜居性的交叉特征;對宜居性決定性因素進行跨文化比較,旨在根據(jù)時間、地點和文化確定普遍要素、基本需求和相對要素。
從調(diào)查問卷結(jié)果來看,該研究認為生活質(zhì)量是連接人的主觀評價和客觀環(huán)境的一個研究切入點。而在評價生活質(zhì)量時,社區(qū)尺度的環(huán)境、經(jīng)濟和社會質(zhì)量的因子選擇是至關(guān)重要的。他們列出50個因子作為評價荷蘭社區(qū)宜居性的指標,并將這些因子分為居住條件、公共空間、環(huán)境基礎(chǔ)設(shè)施、人口構(gòu)成、社會條件、安全性等幾個領(lǐng)域(圖2)。他們提出盡快利用大數(shù)據(jù)和開發(fā)高級AI預(yù)測工具支持城市建設(shè)、決策和制定規(guī)劃的緊迫性[27]。
為了響應(yīng)后工業(yè)社會的到來,Battey在 20世紀90年代率先提出了“智慧城市”的概念。由于當時的大數(shù)據(jù)還處于初期階段,Battey只強調(diào)了互聯(lián)網(wǎng)技術(shù)在增強信息交流和城市競爭力中的重要性[28]。鑒于智慧城市的內(nèi)涵太廣泛,并且涉及整個城市系統(tǒng),因此智慧城市很難獲得統(tǒng)一的認同。目前,一個由6個子系統(tǒng)構(gòu)成的智慧城市框架逐步被許多學者接受,其中智慧公民、智慧環(huán)境和智慧生活是3個重要的環(huán)節(jié)[29]。這符合荷蘭宜居性理論研究關(guān)鍵結(jié)論中關(guān)于社區(qū)居民、主觀宜居性和客觀環(huán)境質(zhì)量間互動關(guān)系的描述。
如RIVM和RIGO研究中心的結(jié)論所示,經(jīng)典分析研究結(jié)果和問卷調(diào)查結(jié)果基本吻合。但是他們認為所有使用傳統(tǒng)范式的研究都存在一定局限性,無論它們來自問卷、觀察、系統(tǒng)理論、數(shù)學模型還是統(tǒng)計方法,在方法論創(chuàng)新上都沒有太大的區(qū)別。因此,除了傳統(tǒng)研究之外,有必要探索利用大數(shù)據(jù)的新方法并開發(fā)先進的AI工具,這也將為宜居性評估創(chuàng)造新條件。第四范式帶來的上述大數(shù)據(jù)的挑戰(zhàn)和機遇恰好為這種轉(zhuǎn)變提供了機會。這項研究的目的就是開發(fā)一種基于機器學習的新型數(shù)據(jù)密集型AI工具箱,以監(jiān)測荷蘭人居環(huán)境中的宜居性。
新工具箱旨在最大限度地從所有開源數(shù)據(jù)中提取數(shù)據(jù)并開發(fā)相關(guān)變量,然后它將通過高級數(shù)據(jù)工程和數(shù)據(jù)庫技術(shù)完成轉(zhuǎn)換、集成和存儲數(shù)據(jù)。此后,宜居性問卷結(jié)果獲取的宜居性等級將作為機器學習中的預(yù)測目標。在數(shù)據(jù)倉庫中,將來自調(diào)查問卷的歷史宜居性等級與同一歷史時期內(nèi)最相關(guān)的變量進行集成,以建立機器學習的預(yù)測模型。先進的AI算法能夠根據(jù)最相關(guān)變量的新的數(shù)據(jù)輸入來預(yù)測未來的宜居性。同時,可以將其與傳統(tǒng)范式得出的結(jié)論進行比較,以確定通過第四范式發(fā)現(xiàn)新知識的有效性和優(yōu)勢。新的輸入可以支持機器學習的再訓練,以改善模型。圖3總結(jié)了基于大數(shù)據(jù)的關(guān)鍵研究框架。
基于大數(shù)據(jù)爆炸及相伴產(chǎn)生的第四范式,這種新的預(yù)測工具箱將不依賴于現(xiàn)有常用研究范式,而是首先搜尋可用的開源大數(shù)據(jù),再通過數(shù)據(jù)密集型的機器學習來研究這些數(shù)據(jù),并構(gòu)建算法和預(yù)測模型。此后,即可在提供相應(yīng)參數(shù)的情況下對任何社區(qū)的宜居性進行科學預(yù)測甚至提前干預(yù),為智慧城市的宜居性評價和規(guī)劃打下基礎(chǔ)。
社區(qū)宜居性等級的歷史記錄是通過RIVM問卷和RIGO研究獲得的,而這些社區(qū)的人口、經(jīng)濟、社會和環(huán)境領(lǐng)域的所有可用變量均來自同一時期荷蘭中央統(tǒng)計局(Het Centraal Bureau voor de Statistiek, CBS)和其他開源數(shù)據(jù)的數(shù)據(jù)集。這2個數(shù)據(jù)集可以通過它們的郵政編碼相互連接,派生出的數(shù)據(jù)用于形成可能的機器學習數(shù)據(jù)集。由于數(shù)據(jù)來源不同、格式不同、規(guī)模大且雜亂無章,并且具有不同的實時更新頻率,因此它們符合大數(shù)據(jù)的最典型特征,即“四個V”:數(shù)量(volume)、種類(variety)、準確性(veracity)和速度(velocity)[30]。首先必須執(zhí)行必要的數(shù)據(jù)工程學流程,以滿足機器學習對數(shù)據(jù)質(zhì)量的基本要求。
量子力學的先驅(qū),維爾納·海森堡指出,“必須記住,我們觀察到的不是自然本身,而是暴露在我們質(zhì)疑方法下的自然”。傳統(tǒng)范式研究帶來的主觀認知局限性是顯而易見的。而人工智能和大數(shù)據(jù)的涌現(xiàn),無疑提供了第四范式這樣一個更加客觀的認知方法論,來分析隱藏在繁雜多變的數(shù)據(jù)后面的神秘自然規(guī)律。因而,Wolkenhauer將數(shù)據(jù)工程總結(jié)為認知科學和系統(tǒng)科學的完美結(jié)合,并稱之為知識工程中數(shù)據(jù)和模型相互匹配的最佳實踐(圖4)[31]。
根據(jù)Cuesta的數(shù)據(jù)工程學工藝[32],在預(yù)處理繁復數(shù)據(jù)集合時,我們可以通過數(shù)據(jù)流程管理、數(shù)據(jù)庫設(shè)計、數(shù)據(jù)平臺架構(gòu)、數(shù)據(jù)管道構(gòu)建以及計算機語言腳本編程等關(guān)鍵流程來實現(xiàn)數(shù)據(jù)的獲取、轉(zhuǎn)換、清理、建模和存儲。這個復雜過程可以用最典型的ETL(Extract-Transform-Load)流程簡述(圖5)。
在通過上述復雜過程對所有不同來源的數(shù)據(jù)進行清理和規(guī)范化處理之后,應(yīng)為數(shù)據(jù)倉庫(data warehouse, DWH)的構(gòu)建設(shè)計合適的數(shù)據(jù)模型。數(shù)據(jù)應(yīng)定期存儲在數(shù)據(jù)倉庫中,以利于深入分析并及時進行機器學習?;诋斍皵?shù)據(jù)環(huán)境,并以宜居性為核心預(yù)測目標,設(shè)計了星形數(shù)據(jù)模型。在此關(guān)系數(shù)據(jù)模型中,在中心構(gòu)建記錄每個社區(qū)宜居性級別的事實表,并通過主鍵和外鍵將其與各個域的維度表相關(guān)聯(lián)。該數(shù)據(jù)模型的設(shè)計參考了上述針對宜居因子分類框架的經(jīng)典研究方法的相關(guān)結(jié)果(表1,圖2)。6個主要維度分別是人口維度、社會維度、經(jīng)濟維度、住房維度、基礎(chǔ)設(shè)施維度、土地利用和環(huán)境維度(圖6)。
實際上,通過前期數(shù)據(jù)工程收集和建模得到的源數(shù)據(jù)通常高度混亂,整體質(zhì)量低下,不適合直接用于機器學習,需要進行廣泛的數(shù)據(jù)清理,以符合機器學習對數(shù)據(jù)質(zhì)量的基本要求。盡管數(shù)據(jù)清理絕對不是機器學習最動人的部分,但它是每個專業(yè)數(shù)據(jù)科學家必須面對的流程之一。此外,數(shù)據(jù)清理是一項艱巨而煩瑣的任務(wù),需要占用數(shù)據(jù)科學家50%~80%的精力。眾所周知,“更好的數(shù)據(jù)集往往勝過更智能的算法”。換句話說,即使運用簡單的算法,正確清理過的數(shù)據(jù)集也能提供最深刻的見解;而當數(shù)據(jù)“燃料”中存在大量雜質(zhì)時,即使是最佳算法(即“機器”)也無濟于事。
鑒于數(shù)據(jù)清理工作如此重要,我們首先必須明白什么是合格的數(shù)據(jù)。一般而言,合格數(shù)據(jù)應(yīng)該至少具備以下質(zhì)量標準(圖7)[33]。1)有效性。數(shù)據(jù)必須滿足業(yè)務(wù)規(guī)則定義的有效約束或度量程度的有效范圍,包括滿足數(shù)據(jù)范圍、數(shù)據(jù)唯一性、有效值、跨字段驗證等約束。2)準確性。指與測量值或標準以及真實值、唯一性和不可重復性的符合程度。為了驗證準確性,有時必須通過訪問外部附加數(shù)據(jù)源來確認數(shù)值的真實性。3)完整性。數(shù)據(jù)的缺省或者缺失以及各范圍的數(shù)據(jù)值的完整分布對機器學習的結(jié)果將產(chǎn)生相應(yīng)的影響。如果系統(tǒng)要求某些字段不應(yīng)為空,則可以指定一個表示“未知”或“缺失”的值;但僅僅提供默認值并不意味著數(shù)據(jù)已具備完整性。4)一致性。指的是一套度量在整個系統(tǒng)中的等效程度。當數(shù)據(jù)集中的2個數(shù)據(jù)項相互矛盾時,就會發(fā)生不一致。數(shù)據(jù)的一致性包括內(nèi)容、格式、單位等。
顯然,不同類型的數(shù)據(jù)需要不同類型的清理算法。在評估了適用于機器學習的宜居性數(shù)據(jù)集的特征和初始質(zhì)量之后,應(yīng)使用以下方法清理所提議的數(shù)據(jù)。1)刪除不需要的或無關(guān)的觀察結(jié)果。2)修正結(jié)構(gòu)錯誤。在測量、數(shù)據(jù)傳輸或其他“不良內(nèi)部管理”過程中會出現(xiàn)結(jié)構(gòu)錯誤。3)檢查數(shù)據(jù)的標簽錯誤,即對具有相同含義的不同標簽進行統(tǒng)一處理。4)過濾不需要的離群值。離群值可能會在某些類型的模型中引起問題。如果有充分的理由刪除或替換異常值,則學習模型應(yīng)表現(xiàn)得更好,但這項工作應(yīng)謹慎進行。5)刪除重復數(shù)據(jù),以避免機器學習的過度擬合。6)處理缺失數(shù)據(jù)。這是機器學習中一個具有挑戰(zhàn)性的問題:由于大多數(shù)現(xiàn)有算法不接受缺失值,因此必須通過“數(shù)據(jù)插補”技術(shù)進行調(diào)整,例如刪除具有缺失值的行,或用“0”、平均值或中位數(shù)替換缺失的數(shù)值;缺失值也可以基于特殊算法,利用其他非缺失值的變量來估算。實際的具體處理應(yīng)根據(jù)值的實際含義和應(yīng)用場景確定。
通過以上專業(yè)操作,數(shù)據(jù)中重復的社區(qū)單元以及同一社區(qū)不同名稱的社區(qū)單元被合并,同一社區(qū)的不同數(shù)字標簽經(jīng)過自動對比與整合,錯誤的標簽和數(shù)據(jù)被校正,一些異常值經(jīng)再度確認后被刪除,重復數(shù)據(jù)被去重處理。缺失的數(shù)據(jù)按不同情況進行相關(guān)“數(shù)據(jù)插補”處理后,數(shù)據(jù)基本達到機器學習的要求。
作為數(shù)據(jù)預(yù)處理的重要組成部分,特征工程有助于構(gòu)建后續(xù)的機器學習模型,也是知識發(fā)現(xiàn)的關(guān)鍵窗口。特征工程是一種搜索相關(guān)特征的過程:利用AI算法和專業(yè)知識尋找能夠最大化機器學習效率的特征,并以此作為機器學習應(yīng)用程序的基礎(chǔ)。但是提取特征非常困難且耗時,并且該過程需要大量的專業(yè)知識。斯坦福大學教授Andrew Ng指出:“應(yīng)用型機器學習的主要任務(wù)是特征工程學?!盵34]
在這個宜居性機器學習模型中,首先分別研究了各維度表的局部特征,以獲得局部特征的排名;接下來對全域的維度表進行研究,以獲取全局特征的排名。通過局部特征排名能夠了解各維度特征對宜居性的局部影響,從而有助于發(fā)現(xiàn)隱藏在大數(shù)據(jù)中的新知識;全局特征則被用于為機器學習模型構(gòu)建特定的算法,以達到最高的預(yù)測精度。
通過機器學習發(fā)現(xiàn):人口維度、社會維度、經(jīng)濟維度、住房維度、基礎(chǔ)設(shè)施維度以及土地利用與環(huán)境維度的局部特征對宜居性的局部影響權(quán)重排序各不相同(圖8)。現(xiàn)有數(shù)據(jù)顯示,土地利用與環(huán)境因素對宜居性的影響是最不均衡的。
在第一組維度(人口維度)中,宜居性影響因子權(quán)重排名最靠前的依次是社區(qū)婚姻狀態(tài)、人口密度、25~44歲人口比率、帶小孩家庭數(shù)量等人口特征。這提示只有一定的人口密度才能形成宜居性,而青壯年人口、婚姻穩(wěn)定狀態(tài)、有孩子家庭數(shù)量對社區(qū)宜居性具有良性作用。
在第二組維度(社會維度)中,權(quán)重排名最靠前的宜居性影響因子顯示,獲得社會救濟的人口比例、來自摩洛哥、土耳其和蘇里南的非西方移民比例以及社區(qū)犯罪率對社區(qū)宜居性具有較大的負面影響。
在第三組維度(經(jīng)濟維度)中,宜居性影響因子權(quán)重排名最靠前的依次是家庭平均年收入、每家擁有汽車平均數(shù)、家庭購買力等方面的指標。
在第四組維度(住房維度)中,宜居性影響因子權(quán)重排名最靠前的依次是政府廉租房比率、小區(qū)內(nèi)買房自住家庭數(shù)、新建住宅數(shù)、房屋空置率等方面的指標。
在第五組維度(基礎(chǔ)設(shè)施維度)中,社區(qū)內(nèi)超市數(shù)量、學校托幼機構(gòu)數(shù)量、醫(yī)療健康機構(gòu)數(shù)量、健身設(shè)施數(shù)量、餐飲娛樂設(shè)施數(shù)量以及這些設(shè)施的可及性對社區(qū)宜居性影響較大。
在第六組維度(土地利用與環(huán)境維度)中,城市化程度、公園、綠道、水體、交通設(shè)施以及到不同土地功能區(qū)的可達性對社區(qū)宜居性影響較大。
全局特征組已被應(yīng)用來建立用于構(gòu)建機器學習模型的特定算法,以獲得最高的預(yù)測精度。在按全局變量影響因子排序的140個可收集變量中,只有第二組維度(社會維度)、第三組維度(經(jīng)濟維度)和第四組維度(住房維度)的變量出現(xiàn)在最具影響力因子前20名中。這表明,總體而言,社會、經(jīng)濟和住房方面對社區(qū)的宜居性具有更大的影響。通過對具體排名簡化整合,以下綜合因素出現(xiàn)在最具影響力因子前10名中,是影響社區(qū)宜居性的最具決定性因素:獲得社會救濟的人口比例、非西方移民比例、政府廉租住房比率、購買住房的平均市場價格、高收入人口比率、固定收入住戶數(shù)量、新住房數(shù)量、總體犯罪率、戶年均天然氣消耗量以及戶年均電力消耗量(圖9)。
完成上述必要的數(shù)據(jù)掃描和研究工作后,我們將進入機器學習的最核心階段:開發(fā)算法并優(yōu)化模型。這個階段的工作重點是獲得最佳的預(yù)測結(jié)果。
因為必須先標注數(shù)據(jù)以便指定預(yù)測目標從而進行學習,所以本機器學習實際上為監(jiān)督式機器學習(supervised machine learning)。在進行機器學習前通過必要的數(shù)據(jù)掃描和研究,得到一個初步的數(shù)據(jù)評估結(jié)論。根據(jù)上述數(shù)據(jù)清理工程原理進行大量的數(shù)據(jù)清理,得到滿足機器學習基本標準的數(shù)據(jù)。再通過上述數(shù)據(jù)特征工程得到綜合簡化后的10個最強影響因子作為預(yù)測因子參與后續(xù)機器學習,以便生成算法和建模。
然后將上述數(shù)據(jù)集劃分為訓練數(shù)據(jù)集和測試數(shù)據(jù)集,劃分比例為7∶3。將第一組數(shù)據(jù)集(訓練數(shù)據(jù))輸入機器學習算法得到訓練模型,對訓練模型進行打分;將第二組數(shù)據(jù)集(測試數(shù)據(jù))輸入訓練模型進行比較和評估。本機器學習目標是宜居性的不同等級,屬于多分類問題機器學習,擬采取兩組常用決策林算法進行比選優(yōu)化:多類決策叢林算法和多類決策森林算法。這2種通用算法的工作原理都是構(gòu)建多個決策樹,然后對最常見的輸出類進行投票。投票是一種聚合形式:分類決策林中的每個樹都輸出標簽的非規(guī)范化頻率直方圖,聚合過程對這些直方圖求和并對結(jié)果進行規(guī)范化處理,以獲取每個標簽的“概率”。預(yù)測置信度較高的樹在最終決策集成中具有更大的權(quán)重。通常,決策林是非參數(shù)模型,這意味著它們支持具有不同分布的數(shù)據(jù)。在每個樹中,為每個類運行一系列簡單測試,從而增加樹結(jié)構(gòu)的級別,直到到達能最佳滿足預(yù)測目標的葉節(jié)點(決策)。
該機器學習的決策森林分類器由決策樹的集合組成。通常,與單個決策樹相比,集成模型提供了更好的覆蓋范圍和準確性。通過后臺代碼將算法部署到云計算環(huán)境中運行,生成的該機器學習的具體工作流程如圖10所示。在部署到云端之后,仍然需要在后臺定期使用新輸入的數(shù)據(jù)來改進算法,換言之,通過重新訓練來更新和完善數(shù)據(jù)模型和算法。圖11顯示了機器學習的整個生命周期。
從機器學習的后臺中提取出兩組算法的混淆矩陣(圖12)。與多類決策叢林算法相比,多類決策森林算法具有更好的性能。多類決策叢林算法的主要錯誤是,易將宜居性1~2級高估為3~4級,且對5~9級的預(yù)測不夠準確。在多類決策森林算法中很少發(fā)生類似的錯誤,因此它具備更好的總體性能。另外,多類決策叢林算法總體預(yù)測準確率為76%,而多類決策森林算法總體預(yù)測準確率為96%,高于前者,所以決定在云端生產(chǎn)環(huán)境中部署多類決策森林算法(表2)。
表2 兩種不同機器學習算法的預(yù)測性能比較Tab. 2 A comparison of predictive performances of two different machine learning algorithms
對全荷蘭人居環(huán)境宜居性進行反復機器學習和預(yù)測后,可在全國地圖上進行可視化和監(jiān)測(圖13)。其中顏色越偏綠的區(qū)域宜居性越高,越偏紅的地方宜居性越低。該圖顯示預(yù)測宜居性高低分布全國相對均衡,高宜居性區(qū)域相對集中在鏈型城市帶(Randstad)和東部靠近德國邊境的地區(qū)。宜居性比較低的區(qū)域相對集中在近年圍海造地形成的新省份弗萊福蘭(Flevoland),可能是人口密度較低以及配套基礎(chǔ)設(shè)施比較滯后造成的。
此外,該預(yù)測工具還能夠?qū)χ杏^層面的城市群和微觀層面的社區(qū)進行深入的研究和預(yù)測。如大鹿特丹地區(qū)和海牙地區(qū)的宜居性預(yù)測結(jié)果表明(圖14),一些老城區(qū)的市區(qū)宜居性不高,而郊區(qū)的宜居性通常較高。特別是鹿特丹和代爾夫特交界處的北郊,以及海牙西北部的沿海地區(qū),相對宜居且人口密集。
第四范式荷蘭智慧城市宜居性預(yù)測研究的主要結(jié)果表明:基于可用大數(shù)據(jù)和必要的數(shù)據(jù)工程,由人工智能算法可直接推演得到最影響人居環(huán)境宜居性的十大主導要素,簡化后總結(jié)為:獲得社會救濟的人口比例、非西方移民比例、政府廉租住房比率、購買住房的平均市場價格、高收入人口比率、固定收入住戶數(shù)量、新住房數(shù)量、總體犯罪率、戶年均天然氣消耗量以及戶年均電力消耗量。此外,可以通過輸入最新數(shù)據(jù)集和機器再學習改進模型來更新變量,以執(zhí)行環(huán)境宜居性的實時預(yù)測。
這項研究的結(jié)果可以應(yīng)用于宜居性分析的4個不同階段(圖15):宜居性描述研究、宜居性診斷研究、宜居性預(yù)測研究、宜居性預(yù)視研究。因此,它可以根據(jù)實時更新的大數(shù)據(jù)和經(jīng)過重新訓練的算法,對人類住區(qū)中的宜居性進行監(jiān)測和早期干預(yù)。
將根據(jù)第四范式進行的本研究與前述傳統(tǒng)范式的研究進行比較,發(fā)現(xiàn)不需要依賴傳統(tǒng)人工智能系統(tǒng)的專家體系(expert system)或者專業(yè)研究人員的長期大量研究積累就能得到一些最有效的知識發(fā)現(xiàn)和高精度的預(yù)測模型。通過機器學習得到的宜居性研究結(jié)論,無論在主導要素、局域特征還是全局特征上,基本與RIVM與RIGO宜居性的相關(guān)定性研究結(jié)果相互吻合。此外,本研究能夠以定量的方式對預(yù)測中最具決定性的因素進行快速排名,從而使科學研究更加高效、快捷,并且實現(xiàn)實時數(shù)據(jù)更新和預(yù)測。
本研究可用數(shù)據(jù)集中的土地利用與環(huán)境簇群依然偏少,導致其錐型圖比較尖銳。這個不足之處需要在將來通過收集更多土地利用與環(huán)境相關(guān)變量執(zhí)行強化學習,進行更多的知識發(fā)現(xiàn),拓寬預(yù)測模型的觀察視野。
另外,在收集和處理實時大量數(shù)據(jù)時需要更強大的數(shù)據(jù)收集與處理能力、計算能力和更復雜的運算環(huán)境。通過最新的5G、物聯(lián)網(wǎng)和量子計算等新科技,將來研究可以收集更多、更復雜甚至非結(jié)構(gòu)性實時數(shù)據(jù)擴展當前研究,從而具備更廣闊的智慧城市運用前景。
圖表來源:
圖1引自參考文獻[12];圖2引自參考文獻[27];圖3、5、6、8~14由作者繪制;圖4引自參考文獻[31];圖7引自參考文獻[33];圖15由作者根據(jù)Gartner概念繪制;表1~2由作者繪制。
(編輯/王一蘭)
WU Jun
1 Research Background
1.1 The Fourth Industrial Revolution: Entering a New Age of Human-Orientated Artificial Intelligence
Since the mid-18th century, mankind has gone through three industrial revolutions. The first industrial revolution ushered in the “steam age” (1760–1840), which marked our transition from agriculture to industry and represented the first great miracle in the history of human development. The second industrial revolution launched the “electric age” (1840–1950), which led to the rise of heavy industries such as electricity, steel, railways, chemicals, and automobiles, with oil becoming a new energy source. It promoted the rapid development of transport and more frequent exchanges among countries around the world, and reshaped the global political and economic landscape. The third industrial revolution, which began after the two world wars, initiated the “information age” (1950–present). As global exchanges of information and resources have become easier, most countries and regions have been drawn into the process of globalization and informatization, and human civilization has reached an unprecedented level of development.
On 20 January 2016, the World Economic Forum held its annual meeting in Davos, Switzerland, under the theme “The Fourth Industrial Revolution”, officially heralding the arrival of the fourth industrial revolution that will radically change the course of global development. In his book The Fourth Industrial Revolution, Professor Schwab, founder and executive chairman of the World Economic Forum, elaborated on the profound impact of new technologies, such as implantable technology, genetic sequencing, the Internet of Things (IoT), 3D printing, autonomous vehicles, artificial intelligence, robotics, quantum computing, blockchain, big data, and smart cities, on human society in the “Age of Intelligence”. In this industrial revolution, big data will gradually replace oil as the first resource. The pace, scope, and extent of its development will far exceed those of the previous three industrial revolutions. It will rewrite the fate of mankind and have a great impact on the development of almost all traditional industries, including architecture, landscape, and urban development[1].
1.2 The Fourth Paradigm: The Urgency of Exploring the New Paradigm of “Data-Intensive” Research
Our future and our lives are being changed by the data explosion brought about by the fourth industrial revolution. Over the last decades, there has been exponential growth in the volume, variety, veracity, and velocity of data[2]. By 2013, the amount of electronic data generated worldwide had reached 46 billion terabytes, equivalent to about 400 trillion traditional printed reports, which, laid end to end, could stretch from Earth to Pluto. In the past two years alone, we have generated 90% of all the data ever created by mankind[3]. Regarded as an important strategic resource in nations that are competing for innovation in the 21st century, big data is at the forefront of the next round of scientific and technological innovation that developed countries are scrambling to seize[4].
Despite the data explosion conforming to the “New Moore’s Law”, with 44 times more data generated worldwide in 2020 than in 2009[5], less than 1% of the information in the world can be analyzed and translated into new knowledge[6], due to the lack of a new research paradigm to respond to the data explosion. Anderson pointed out that, apart from triggering a surge in quantity, the data explosion also represents a surge and mixture of data in terms of complexity, variety, velocity, and veracity. He defined this hybrid explosion as a “data deluge” and argued that the “data deluge” would become a bottleneck for scientific research in all existing disciplines before the emergence of a new research paradigm[7]1. In the field of urban and architectural design, although we are already living in the “city of bits” predicted by the late MIT professor William J. Mitchell[8], we are still a distance away from giving urban research and design the “power of bits”[9]. We must develop a new research paradigm to counter the strong impact of the data deluge.
The conventional research methods and traditional paradigms are becoming increasingly inadequate. This has compelled some scientists to explore new research paradigms that are suitable for big data and artificial intelligence. Bell, Hey, and Szalay warned that all fields of research will have to face increasing data challenges, and that the handling and analysis of the “data deluge” are becoming increasingly onerous and vital for all researchers and scientists. As such, Bell, Hey, and Szalay proposed the “fourth paradigm” to deal with the “data deluge”. They concurred that, at least since Newton’s laws of motion appeared in the 17th century, scientists have recognized experimental and theoretical science as the two basic research paradigms for understanding nature. In recent decades, computer simulation has become an essential third paradigm: a new standard tool for scientists to explore areas that are hard to reach through earlier theory and experimentation. Nowadays, as an increasing amount of data is generated by simulations and experiments, the fourth paradigm is emerging, namely the latest AI technology required to perform data-intensive scientific analysis[10].
Halevy et al. pointed out that the “data deluge” highlighted the “knowledge bottleneck” in traditional AI: the question of how to maximize the extraction of infinite knowledge from limited systems. This bottleneck can be resolved in many disciplines by introducing the fourth paradigm, which applies big data and emerging machine learning methods instead of relying solely on the mathematical modeling, empirical observation, and complex computation of traditional theoretical research[11].
Silicon Valley scientists Gray, Hey, Tansley, and Tolle summed up the key features that distinguish the data-intensive “fourth paradigm” from earlier scientific paradigms as follows (Fig. 1)[12]16-19: 1) the exploration of big data integrates existing theories, experiments, and simulations; 2) big data can be captured by different IoT devices or generated by simulators; 3) big data is processed by large parallel computing systems and complex programming to discover the valuable information (new knowledge) hidden within it; 4) scientists obtain new paths to knowledge discovery through data management and statistical analysis of databases and large volumes of research documents.
The differences between the fourth paradigm and the traditional paradigms in research purposes and approaches are mainly reflected in the following aspects. The traditional paradigms start, more or less, with theoretical construction from “why” or “how” questions, which is later verified through experimental observations of “what”. The fourth paradigm works in the opposite direction: it starts with data-intensive surveys of “what” questions, then uses various algorithms to discover the new knowledge and laws hidden in big data, which in turn generate new theories that reveal the “how” and “why”. In his article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete?”, Anderson pointed out that, firstly, the fourth paradigm is not in a hurry to start theoretical construction from tedious experiments and simulations, or from strict definitions, inferences, and assumptions. Instead, it starts with the collection and analysis of large and complex datasets[7]1. Secondly, the valuable knowledge hidden in these huge, complex, and intertwined datasets is hard to process, and usually cannot be discovered using the traditional scientific research paradigms[12]16-19.
The exploration of urban and landscape research using the fourth paradigm is still in its infancy, both at home and abroad, with a myriad of methods and objectives. Due to space limitations, this research directly uses some open data integrated with the results of the Dutch government’s livability survey, focusing on “livability” as the prediction target of machine learning, in order to introduce a systematic data-intensive research method and demonstrate the feasibility and prospects of the fourth paradigm in urban and landscape research.
2 Research Status and Objectives
2.1 Summary of Relevant International Research on Urban Livability with Traditional Paradigms
As an important indicator used to measure and assess the level of comfort and habitability in urban and landscape environments, livability has become a key entry point for the development of smart cities in recent years. Adam Beck, executive director of the Smart Cities Council of Australia/New Zealand, defines a smart city as follows: “The smart city is one that uses technology, data and intelligent design to enhance a city’s livability, workability and sustainability.”[13]
“Eudaimonia” (livability) was first proposed in the West by Aristotle, who defined it as “doing and living well”. For a long time, there has been no unified definition of livability. Instead, it has carried different implications and applications at different stages of urban development, in different regions, and in different disciplines. This has led to confusion and difficulty in implementing the concept of livability. Although livability lacks a unified and quantifiable measurement system, classical theoretical studies have attempted to explore relevant indicators of livability from the economic, social, political, geographical, and environmental dimensions over the last decades.
Balsas emphasized the decisive role of economic factors in livability. He argued that the foundation of livability is made up of factors such as high employment rates, the purchasing power of different social strata, economic development, living standards, and access to education and employment[14]103. Litman also pointed out the significant impact of per capita GDP on livability. In addition, the accessibility of transport, education, and public health facilities to residents, as well as their economic power to afford such services, should be recognized as important indicators of livability[15]. Veenhoven’s research also found that the level of GDP development, the social needs of the economy, and the purchasing power of residents were vital to the assessment of livability[16]2-3.
Mankiw criticized the use of the economy as the sole indicator of livability and argued that it is inadequate to measure livability using per capita GDP alone[17]. Rojas suggested that other dimensions, such as political, social, geographical, cultural, and environmental factors, should also be taken into account[18]. Furthermore, Veenhoven recommended political freedom, cultural atmosphere, social climate, and safety as additional indicators for the assessment of livability[16]7-8.
Van Vliet’s research showed that social integration, environmental cleanliness, safety, employment rate, and the accessibility of infrastructure such as education and medical care had a direct impact on urban livability[19]. Balsas also conceded that factors other than the economy, such as complete infrastructure, adequate park facilities, community spirit, and public participation, also played a positive role in urban livability[14]103.
Although the aforementioned scholars proposed political, social, economic, and environmental foundations for evaluating livability, they did not recommend concrete indexes for the appraisal of urban livability. Goertz tried to integrate the above elements by systematically evaluating livability with a holistic three-tier approach and proposing relevant evaluation indicators for each sub-system. The first tier defines the types of livability, the second tier builds the framework of factors, and the third tier defines the specific indicator variables[20] (Tab. 1).
Tab. 1 The framework of macro livability factors and related measurement indexes based on Goertz’s three-tier system
Through his summary and integration of classical livability research, Goertz carried out qualitative research on livability at the macro level, and some of the factors indicated directions for future quantitative research. However, he failed to present the corresponding controllable variables in different urban systems at the meso level. In response, Sofeska proposed a framework of evaluation factors for urban systems at the meso level[21], including security and crime rates, political and economic stability, public tolerance, and business conditions; effective policies, access to goods and services, high national income, and low personal risks; education, health and medical levels, demographic composition, longevity, and birth rates; environment and recreation facilities, climatic environment, and accessibility of natural areas; public transport and internationalization. In particular, Sofeska emphasized the impact of building quality, urban design, and the effectiveness of infrastructure on the livability of urban systems at the meso level.
Beyond the political, economic, social, and environmental systems at the macro level, as well as urban systems at the meso level, Giap et al. believed that “livability” should be a community concept at the micro level. In carrying out qualitative research on livability at the micro community scale, he emphasized the micro impacts of the quality of urban life and of the city’s physical environment on livability, and regarded green infrastructure as a significant indicator[22]. Therefore, Giap et al. proposed the importance of qualitative research on livability at the community scale.
However, neither Goertz, Sofeska, nor Giap established a quantitative evaluation index system or a corresponding method to predict community livability. At present, the quantitative classification survey of livability is mainly derived from two macro systems: the Economist Intelligence Unit (EIU) Livability Index, which is based on six major indicators, and Mercer’s Quality of Living Survey, which is based on ten indicators. Nevertheless, these two systems provide only macro-level tools to compare livability among different cities, mainly based on economic indicators, and hence are not operable for quantitative research and prediction at the micro level of communities[23].
2.2 Summary of Research on Urban Livability in The Netherlands
In the Dutch dictionary, “leefbaarheid” (livability) is generally defined as “suitable for living or coexistence” (“geschikt om erin of ermee te kunnen leven”). Therefore, livability in the Dutch context is actually a statement on the appropriate and interactive relationship between the subjects (living beings, individuals, or communities) and the environment as the object[24].
In 1969, Groot described livability as the objective social security that provides the means to obtain a reasonable income and enjoy a reasonable life, thus fulfilling the subjective social understanding of demands for goods and services. This definition, which favors economic and social objectives, implies the distinction between objective and subjective livability. The former involves tangible, recordable conditions (labor market, facilities, housing quality, etc.), while the latter relates to the subjective ways in which people experience their actual situation[25]. Livability was brought into regional politics in the 1970s, when people realized that livability should be centered not on buildings but on people, and that what mattered about material life was not simply its quantity but its quality. Vermeulen, then a city councilor of Rotterdam, described this shift in the social concept as follows: “You can figure out the size of your house with the number of bricks, but you can’t know its livability.” In 2002, the Social and Cultural Planning Office of The Netherlands described livability as the interactions between physical space, social quality, community characteristics, and environmental security. In 2005, Van Dorst summarized three perspectives on livability: remarkable livability is the optimal match between human and environment; the livability of the environment should be experienced in a people-oriented approach; and livability should be inferred from a definable living environment[26].
Since 1998, the Dutch government has designed and distributed a large number of questionnaires to conduct livability surveys and gather statistics in different parts of the country on a regular basis. The questionnaires mainly comprise livability scores from very low to very high (1–9), which reflect people’s satisfaction with the living environment, green spaces, cultural and sports facilities, public space infrastructure, social environment, security, etc.
Simultaneously, the Dutch Ministry of Housing, Spatial Planning and the Environment (Ministerie van Volkshuisvesting, Ruimtelijke Ordening en Milieu, VROM) commissioned the Dutch Institute for Public Health and the Environment (Rijksinstituut voor Volksgezondheid en Milieu, RIVM) and the RIGO Research Centre (Het Research Instituut Gebouwde Omgeving, RIGO) to conduct in-depth analysis based on the results of the questionnaires. After reviewing the literature on livability over the past 150 years from the perspectives of architecture, urban planning, sociology, and economics, RIVM and RIGO found that a wide range of differences and even divergences in the definition and research of livability had long been present. Having studied the relevant literature, the institutes believed that it was necessary to establish a theoretical framework for the multi-disciplinary integration of livability beyond the disciplinary differences in the current literature, so as to achieve breakthroughs in livability research. To this end, they proposed a system of subjective assessments of livability, perception, and behavior corresponding to the objective environment. The system comprises studies of how aspects of the environment and of people influence the perception of the livability of the living environment, longitudinal studies of the cross-cutting features of livability, and cross-cultural comparisons of the decisive factors of livability, aiming to identify the universal elements, basic needs, and relative elements according to time, location, and culture.
Based on the survey results, the research identified quality of life as an entry point bridging subjective evaluation and the objective environment. In evaluating quality of life, the selection of environmental, economic, and social quality factors at the community scale is crucial. A total of 50 factors were listed as indicators for evaluating the livability of Dutch communities. These factors were divided into several clusters, such as living conditions, public spaces, environmental infrastructure, population composition, social conditions, and security (Fig. 2). The institutes also stressed the urgency of using big data and developing advanced AI forecasting tools to support urban construction, decision-making, and planning[27].
2.3 Research Objective and Framework
In response to the advent of the post-industrial society, Battey was the first to propose the concept of the “smart city” in the 1990s. As big data was in its early stages at that time, Battey only stressed the importance of Internet technology in enhancing information exchange and the competitiveness of cities[28]. Because the connotations of the smart city are so extensive and involve the entire urban system, it has been difficult for the smart city to gain a unified definition. At present, a smart city framework with six sub-systems has gradually been accepted by many scholars, in which the smart citizen, smart environment, and smart life represent three important elements[29]. This conforms to the interactive relations among community residents, subjective livability, and objective environmental quality described in the key conclusions of the theoretical research on livability in The Netherlands.
As illustrated in the conclusions of the RIVM and the RIGO Research Centre, the results of classical analytical research are basically consistent with the questionnaire results. However, there are limitations in all studies using the traditional paradigms, and in terms of methodological innovation it makes little difference whether they come from questionnaires, observation, systematic theory, mathematical models, or statistical methods. Therefore, beyond the traditional studies, it is necessary to explore new methodologies using big data and to develop advanced AI tools, which will also create new conditions for livability evaluation. The big data challenges and opportunities brought by the fourth paradigm happen to provide an opportunity for this transition. The objective of this research is to develop such a novel data-intensive toolbox based on machine learning to monitor and even predict livability in the Dutch settlement environment.
The new toolbox aims to extract data from all open-source data to the greatest extent possible and develop relevant variables; it then transforms, integrates, and stores the data through advanced data engineering and database technologies. Thereafter, the livability grades obtained from the livability questionnaires serve as the prediction target in machine learning. In the data warehouse, the historical livability grades from the questionnaires are integrated with the most relevant variables of the same historical period to build predictive models through machine learning. The developed AI algorithms are able to predict future livability based on new inputs of the most relevant variables. Meanwhile, the results can be compared with the conclusions drawn from the traditional paradigms to verify the effectiveness and advantages of knowledge discovery through the fourth paradigm. The new inputs can also support retraining in machine learning to improve the model. The key big-data-based research framework is summarized in Fig. 3.
3 Research Method and Process
Due to the big data explosion and the resulting fourth paradigm, the new prediction toolbox no longer relies on the existing and widely used research paradigms. Instead, it first searches for available open-source big data, then investigates the data via data-intensive machine learning and builds algorithms and prediction models. Thereafter, it can predict and even intervene in the livability of any community with the corresponding parameters in a scientific manner, thus laying a solid foundation for the evaluation and planning of livability in the smart city.
3.1 Data Engineering and Preliminary Data Modeling
The historical grades of community livability were obtained from the RIVM and RIGO questionnaires, while all available variables on the demographic, economic, social, and environmental conditions of these communities were obtained from the Dutch Central Bureau of Statistics (Het Centraal Bureau voor de Statistiek, CBS) and other open-source data of the same period. These two datasets were inner-joined with each other by their postcodes, and the derived data were used to form possible machine learning datasets. As the data came from different sources and in different formats, were large and disorganized, and had varying frequencies of real-time updates, they match the most typical features of big data, namely the “four V’s”: volume, variety, veracity, and velocity[30]. The necessary data engineering process must be carried out to meet the basic data quality requirements of machine learning.
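As a minimal illustration of this joining step, the following Python sketch inner-joins the survey grades with the CBS variables by postcode. The file and column names are hypothetical, as the research does not publish its scripts.

```python
import pandas as pd

# Hypothetical inputs: survey grades per community and CBS open data,
# both keyed by postcode (file and column names are assumptions).
grades = pd.read_csv("livability_survey.csv")    # columns: postcode, livability_grade
variables = pd.read_csv("cbs_open_data.csv")     # columns: postcode + ~140 variables

# Inner join keeps only communities present in both sources,
# mirroring the postcode-based linkage described above.
dataset = grades.merge(variables, on="postcode", how="inner")
print(dataset.shape)
```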
Werner Karl Heisenberg, a pioneer of quantum mechanics, pointed out: “It must be remembered that what we observe is not nature itself, but nature exposed to our questioning methods.” The subjective cognitive limitation of traditional paradigm research is thus evident. The emergence of AI and big data undoubtedly lends a more objective cognitive methodology, namely the fourth paradigm, to the analysis of the mysterious natural laws hidden behind dynamic and complex data. Wolkenhauer accordingly summed up data engineering as the perfect combination of cognitive science and systems science, and called it the best practice for matching data and models in knowledge engineering (Fig. 4)[31].
According to Cuesta’s data engineering process[32], data acquisition, conversion, cleansing, modeling, and storage can be achieved through key steps such as data flow management, database design, data platform architecture, data pipeline construction, and script programming in computer languages during the preprocessing of complex datasets. This complex process can be described by the most typical ETL (Extract-Transform-Load) process (Fig. 5).
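The ETL flow can be sketched as three small functions. The paths, column names, and staging table below are placeholders standing in for the actual pipeline, which is not published with the paper.

```python
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    """Extract: read one raw open-data file (CSV assumed here)."""
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: normalize column names and key types before loading."""
    df = df.rename(columns=str.lower)
    df["postcode"] = df["postcode"].astype(str).str.strip()
    return df


def load(df: pd.DataFrame, table: str, db: str = "livability.db") -> None:
    """Load: write the cleaned frame into a staging database."""
    with sqlite3.connect(db) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)


load(transform(extract("cbs_open_data.csv")), table="staging_cbs")
```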
After data from all the different sources had been cleansed and normalized through the above complex processes, a suitable data model should be designed for the construction of the data warehouse (DWH). Data should be stored in the data warehouse regularly to facilitate in-depth analysis and timely machine learning. A star-schema data model was designed based on the current data circumstances, with livability as the core prediction target. In this relational data model, a fact table recording the livability grade of each community sits at the center and is associated with the dimension tables of the various domains through primary and foreign keys. This data model was designed with reference to the results of the aforesaid classical research on frameworks for classifying livability factors (Tab. 1, Fig. 2). The six major dimensions are the demographic dimension, social dimension, economic dimension, housing dimension, service facility dimension, and land use and environment dimension (Fig. 6).
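Such a star schema might be declared as follows. The table and column names are illustrative only, chosen to echo the six dimensions named above; the actual warehouse holds far more attributes per dimension.

```python
import sqlite3

conn = sqlite3.connect("livability_dwh.db")
conn.executescript("""
-- One dimension table per domain (the other five follow the same pattern).
CREATE TABLE IF NOT EXISTS dim_population (
    population_id      INTEGER PRIMARY KEY,
    population_density REAL,
    ratio_age_25_44    REAL
);

-- Central fact table: one row per community, keyed to each dimension.
CREATE TABLE IF NOT EXISTS fact_livability (
    community_postcode TEXT PRIMARY KEY,
    livability_grade   INTEGER,  -- survey grade 1-9, the prediction target
    population_id      INTEGER REFERENCES dim_population(population_id)
    -- ... plus foreign keys to the social, economic, housing,
    -- service facility, and land use / environment dimensions
);
""")
conn.close()
```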
3.2 Process of Data Cleansing
In practice, source data collected and modeled through preliminary data engineering are often highly disorganized, low in overall quality, and unsuitable for direct machine learning. Extensive data cleansing is required to meet the data quality requirements of machine learning. While it is definitely not the “sexiest” part of machine learning, it is one of the tasks every professional data scientist must face. Furthermore, data cleansing is a tough and exhausting task that takes up 50%–80% of a data scientist’s energy. It is well known that “better datasets tend to outperform smarter algorithms”. In other words, a properly cleansed dataset will deliver the deepest insights even with simple algorithms, whereas when a large amount of impurities is present in the data “fuel”, even the best algorithm, i.e. the “machine”, is of no help.
Given the importance of data cleansing, it is vital to first understand the criteria of qualified data. In general, qualified data should meet at least the following quality standards (Fig. 7)[33]. 1) Validity: the data must conform to the valid constraints defined by business rules or to the effective range of measures, including data range constraints, data uniqueness constraints, valid value constraints, and cross-field validation. 2) Accuracy: the degree of conformity to the measurements or standard and to the true value, uniqueness, and non-repetition. To verify accuracy, it is sometimes necessary to confirm values by accessing external data sources that contain reference values. 3) Integrity: defaulted or missing data and the completeness of the distribution of data values across their ranges have a corresponding impact on the results of machine learning. If the system requires that certain fields not be empty, a value representing “unknown” or “missing” can be specified; merely providing a default value does not mean that the data are complete. 4) Consistency: the equivalence of a set of measurements throughout a system. Inconsistency occurs when two data items in a dataset contradict each other. Data consistency includes consistency in content, format, and units.
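These four standards translate naturally into automated screening. A minimal sketch, assuming a pandas DataFrame with the hypothetical columns postcode, livability_grade, and ratio-type fields:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Screen a dataset against the four quality standards above."""
    return {
        # Validity: survey grades must fall in the 1-9 range.
        "invalid_grades": int((~df["livability_grade"].between(1, 9)).sum()),
        # Accuracy / uniqueness: each community should appear exactly once.
        "duplicate_postcodes": int(df["postcode"].duplicated().sum()),
        # Integrity: missing values per column.
        "missing_values": df.isna().sum().to_dict(),
        # Consistency: ratios expressed as fractions must not exceed 1.
        "inconsistent_ratios": int((df.filter(like="ratio_") > 1).any(axis=1).sum()),
    }
```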
It is apparent that different types of data require different types of cleansing algorithms. After evaluating the characteristics and initial quality of this livability dataset for machine learning, the data were cleansed using the following methods (see the sketch after this list): 1) delete unwanted or unrelated observations; 2) fix structural errors, which emerge during measurement, data transmission, or other “poor internal management” processes; 3) check label errors in the data, i.e. unify different labels with the same meanings; 4) filter unwanted outliers, which may cause problems in certain types of models; the learning model should perform better if outliers are deleted or replaced for good reason, but this should be done with caution; 5) deduplicate the data to avoid overfitting in machine learning; 6) handle missing data, a challenging issue in machine learning: as most existing algorithms do not accept missing values, they have to be adjusted through “data imputation” techniques, such as deleting the rows with missing values or replacing a missing numeric value with 0, the mean, or the median; missing values can also be estimated from other non-missing variables using special algorithms; the specific treatment should be determined according to the actual meanings and application scenarios of the values.
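Steps 4) to 6) of this list might look as follows in pandas. The interquartile-range rule and median imputation are common defaults rather than the exact choices made in this research.

```python
import pandas as pd

def cleanse(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # 5) Remove exact duplicates to avoid overfitting.
    df = df.drop_duplicates()

    # 4) Filter outliers with the interquartile-range rule,
    #    keeping missing rows for imputation in the next step.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[in_range | df[col].isna()].copy()

    # 6) Impute remaining missing numeric values with the median.
    df[col] = df[col].fillna(df[col].median())
    return df
```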
Through the aforementioned professional operations, duplicate community units, and units referring to the same community under different names, were merged; different digital labels of the same community were automatically compared and integrated; erroneous labels and data were corrected; some outliers were deleted after reconfirmation; and repeated data were deduplicated. After missing data had been processed through the appropriate “data imputation” techniques, the data basically met the requirements of machine learning.
3.3 Feature Engineering
As an important part of data preprocessing, feature engineering helps to build the subsequent machine learning model and serves as a key window for knowledge discovery. Feature engineering is a process of searching for relevant features that maximize the effectiveness of machine learning, using AI algorithms and expertise, and it also serves as the basis for machine learning applications. However, extracting features is very difficult and time-consuming, and the process requires a great deal of expertise. Stanford professor Andrew Ng pointed out: “‘Applied machine learning’ is basically feature engineering.”[34]
In this machine learning model for livability, the features of each dimension table’s domain were first investigated to obtain the ranking of local features. This was followed by an investigation of the dimension tables across all domains to acquire the ranking of global features. The local feature ranking reveals the impact of the features of each dimension on livability, thus facilitating the discovery of new knowledge hidden in big data. The global features are applied to construct the specific algorithms for the machine learning model, striving for the highest prediction accuracy.
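One common way to obtain such local and global rankings is the impurity-based feature importance of a tree ensemble. The sketch below uses scikit-learn as an open-source stand-in for the cloud back-end actually used in the study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit a forest and return the features sorted by importance.
    Call it on one dimension table for a local ranking, or on all
    columns at once for the global ranking."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    return pd.Series(model.feature_importances_,
                     index=X.columns).sort_values(ascending=False)
```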
Machine learning revealed that the local features of the demographic dimension, social dimension, economic dimension, housing dimension, service facility dimension, and land use and environment dimension carry very different weights in their local influence on livability (Fig. 8). According to the available data, the impact of the land use and environment dimension is the most uneven.
In the first set of dimensions (demographic dimension), the most heavily weighted livability factors were, in order, the community marital status, the population density, the ratio of people aged 25–44, and the number of families with children. This suggests that only a certain population density can give rise to livability, while a young and middle-aged population, stable marital status, and the number of families with children have a positive effect on the livability of the community.
In the second set of dimensions (social dimension), the most heavily weighted livability factors suggested that the proportion of the population receiving social relief, the ratio of non-western immigration from Morocco, Turkey, and Suriname, and community crime rates have greater negative effects on the livability of the community.
In the third set of dimensions (economic dimension), the most heavily weighted livability factors were, in order, the average annual household income, the average number of cars per household, and household purchasing power.
In the fourth set of dimensions (housing dimension), the most heavily weighted livability factors were, in order, the ratio of governmental social housing, the number of owner-occupied households, the number of new houses, and the housing vacancy rate.
In the fifth set of dimensions (service facility dimension), factors such as the number of supermarkets in the community, the number of schools and childcare institutions, the number of health care institutions, the number of fitness facilities, the number of catering and entertainment facilities, and the accessibility of these facilities have a greater impact on the livability of the community.
In the sixth set of dimensions (land use and environment dimension), the level of urbanization, parks, green corridors, water bodies, transportation facilities, and the accessibility of different land-use functional zones have a greater impact on the livability of the community.
The global feature group was applied to establish the specific algorithms for constructing the machine learning model, striving for the highest prediction accuracy. Among the 140 collectible variables ranked by global impact factor, only variables from the second set of dimensions (social dimension), the third set of dimensions (economic dimension), and the fourth set of dimensions (housing dimension) appeared among the top 20 most influential factors. This suggests that, as a whole, the social, economic, and housing dimensions have a greater impact on the livability of the community. After simplification and integration of the specific rankings, the following composite factors appeared among the top 10 most influential factors and represent the most decisive factors affecting the livability of the community: the ratio of the population receiving social relief, the ratio of non-western immigration, the ratio of government low-rent housing, the average market price of purchased owner-occupied housing, the ratio of high-income people, the number of households with fixed incomes, the number of new houses, the overall crime rate, the average annual household consumption of natural gas, and the average annual household consumption of electricity (Fig. 9).
After completing the necessary data scanning and research work described above, we move on to the core stage of machine learning: developing the algorithms and optimizing the models, with a focus on obtaining the best prediction results.
4 Research Results
4.1 Generation of Key Process of Machine Learning
As it is necessary to label the data to specify the prediction target before learning, this research used supervised machine learning. Prior to machine learning, a preliminary data evaluation was obtained through the necessary data scanning and research. According to the aforementioned principles of data cleansing, large-scale data cleansing was carried out to obtain data that met the basic standards of machine learning. Following the aforementioned feature engineering, the 10 strongest influence factors after simplification were used as predictors in the subsequent machine learning to generate algorithms and build models.
Subsequently, the dataset was split into a training dataset and a test dataset at a ratio of 7:3. The first dataset (training data) was fed into the machine learning algorithm to obtain the training model, which was then scored. The second dataset (test data) was fed into the training model for comparison and evaluation. As the goal of this machine learning is to predict the different grades of livability, it is a multi-class classification problem, and two groups of commonly used decision forest algorithms were selected for comparison and optimization: Multiclass Decision Jungle and Multiclass Decision Forest. Both generic algorithms work by building multiple decision trees and then voting on the most common output class. Voting is a form of aggregation in which each tree in the classification decision forest outputs a non-normalized frequency histogram of labels. The aggregation process sums these histograms and normalizes the results to obtain the “probability” of each label. Trees with higher prediction confidence have greater weight in the final decision ensemble. In general, decision forests are non-parametric models, which means that they support data with different distributions. Within each tree, a series of simple tests is run for each class, increasing the levels of the tree structure until a leaf node (decision) is reached that best meets the prediction target.
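In scikit-learn terms, with a random forest as the closest open-source analogue of the Multiclass Decision Forest used here, the 7:3 split and training step might read as follows; dataset and top10 refer to the hypothetical names from the earlier sketches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# `dataset` from the join sketch above; `top10` is the hypothetical list
# of the strongest predictors obtained from feature engineering.
X, y = dataset[top10], dataset["livability_grade"]

# 7:3 split into training and test data, stratified over the nine grades.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```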
The decision forest classifier in this machine learning consists of an ensemble of decision trees. In general, ensemble models provide better coverage and accuracy than a single decision tree. The specific workflow of this machine learning, generated by deploying the algorithm through back-end code running in the cloud, is shown in Fig. 10. Following deployment to the cloud, the algorithm still needs to be improved regularly with newly input data at the back end; in other words, the data model and algorithm are updated and refined through retraining. The entire life cycle of the machine learning is shown in Fig. 11.
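The retraining loop amounts to refitting the serialized production model on the latest labeled data and persisting the updated version. A schematic sketch with hypothetical file names:

```python
import joblib
import pandas as pd

def retrain(model_path: str, new_data: pd.DataFrame):
    """Periodic back-end retraining: refit the deployed model on the
    latest labeled snapshot and persist the updated version."""
    model = joblib.load(model_path)
    X_new = new_data.drop(columns=["livability_grade"])
    y_new = new_data["livability_grade"]
    model.fit(X_new, y_new)          # refit on the newly collected inputs
    joblib.dump(model, model_path)   # replace the production model
    return model
```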
4.2 Primary Results of Machine Learning
The confusion matrices of the two sets of algorithms were extracted from the back end of the machine learning (Fig. 12). The Multiclass Decision Forest algorithm performed better than the Multiclass Decision Jungle algorithm. The main errors of the Multiclass Decision Jungle algorithm were that it tended to overrate livability grades 1–2 as grades 3–4, and its predictions for grades 5–9 were not sufficiently accurate. Similar errors seldom occurred with the Multiclass Decision Forest, which thus provided better overall performance. In addition, the results showed that the overall prediction accuracy of the Multiclass Decision Jungle was 76%, while that of the Multiclass Decision Forest was 96%. Given that the latter is higher, the decision was made to deploy the Multiclass Decision Forest algorithm in the production environment in the cloud (Tab. 2).
Tab. 2 A comparison of predictive performances of two different machine learning algorithms
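The confusion matrices and the accuracy comparison of Tab. 2 correspond to an evaluation step like the following, again with scikit-learn standing in for the cloud back-end and reusing the forest and test split from the earlier sketch.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = forest.predict(X_test)
# Rows: true grade; columns: predicted grade. Off-diagonal cells reveal
# systematic confusions such as grades 1-2 being overrated as 3-4.
print(confusion_matrix(y_test, y_pred))
print("overall accuracy:", accuracy_score(y_test, y_pred))
```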
After repeated machine learning and prediction of livability in human settlements throughout The Netherlands, livability can be visualized and monitored on the national map (Fig. 13). The greener areas are more livable, while the redder areas are less livable. The figure shows a relatively balanced distribution of predicted livability across the country, with high-livability regions relatively concentrated in the Randstad megalopolis and in the eastern regions near the German border. The regions of relatively low livability are concentrated in the new province of Flevoland, which was formed by land reclamation in recent years; this may be a result of low population density and lagging supporting infrastructure.
In addition, this prediction tool is able to perform in-depth research and prediction on urban clusters at the meso level and communities at the micro level. The prediction of livability in the Greater Rotterdam Area and the Greater Hague Area (Fig. 14), for example, indicates that livability in the downtown areas of some old cities is not high, while that of the suburbs is generally higher. In particular, the northern suburbs at the junction of Rotterdam and Delft, as well as the coastal areas of the northwest Hague, are relatively livable and densely populated.
5 Conclusion and Prospect
The primary results of this research on the prediction of the livability of human settlements by machine learning were generated through the fourth paradigm. They show that, based on available data sources and the necessary data engineering, the AI algorithm was able to directly deduce the top 10 factors affecting the livability of human settlements. These 10 factors were simplified as: the ratio of the population receiving social relief, the ratio of non-western immigration, the ratio of government low-rent housing, the average market price of purchased owner-occupied housing, the ratio of high-income people, the number of households with fixed incomes, the number of new houses, the overall crime rate, the average annual household consumption of natural gas, and the average annual household consumption of electricity. Furthermore, the variables can be updated with the latest datasets, and the model improved through retraining, to perform live predictions of environmental livability.
The results of this research can be applied to four stages of livability analysis (Fig. 15): descriptive, diagnostic, predictive, and prescriptive research on livability. It thereby enables the monitoring of, and early intervention in, livability in human settlements based on timely updated big data and retrained algorithms.
Comparing this research, based on the fourth paradigm, with the aforementioned research under traditional paradigms, we found that effective knowledge discovery and high-accuracy prediction models could be obtained without relying on the expert systems of traditional AI or the long-term accumulation of studies by professional researchers. The livability conclusions obtained through machine learning were basically consistent with the relevant qualitative research of RIVM and RIGO, whether in terms of dominant factors, local features, or global features. Furthermore, the research was able to quickly rank the most decisive factors for prediction in a quantitative manner, making scientific research more efficient and faster, and enabling real-time data updating and prediction.
In this research, the land use and environment cluster in the available datasets was still relatively small, resulting in a rather sharp cone diagram. This shortcoming should be overcome in future research by collecting more land-use- and environment-related variables, carrying out enhanced learning, discovering more knowledge, and broadening the observational horizons of the prediction model.
In addition, greater data collection and processing capabilities, greater computing power, and a more complex computing environment are needed to collect and process large amounts of real-time data. With new technologies such as 5G, IoT, and quantum computing, future research will be able to gather more numerous, more complex, and even unstructured real-time data to expand the current research, providing broader prospects for smart city applications.
Sources of Figures and Tables:
Fig. 1 ? reference [12]; Fig. 2 ? reference [27]; Fig. 3, 5, 6, 8-14 ? Wu Jun; Fig. 4 ? reference [31]; Fig. 7 ? reference [33]; Fig. 15 was drawn by Wu Jun according to Gartner concepts; Tab. 1-2 ? Wu Jun.