Article title: How two scientists are using the New York Times archives to predict the future
Author: Laura Hazard Owen
Source: http://gigaom.com/2013/02/01/how ... predict-the-future/
Summary:
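[Scientists use data mining to predict the future] Two scientists are using software they built to analyze 22 years of New York Times coverage, Wikipedia and about 90 other web resources to predict future events: disease outbreaks, riots and deaths. Such software has several advantages over human analysts: learning ability (it learns from large volumes of varied data, monitors many information sources, and performs real-time monitoring, prediction and alerting); tireless research; lack of bias; and access to vast amounts of information. One limitation is that past New York Times coverage is skewed, for example by under-reporting major events in African countries. Real-time records from new media may change this.
The researchers note that comparable studies are few in number, rely on heuristic assessments, and are mostly retrospective rather than predictive or aimed at guiding near-term action.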
Researchers at Microsoft and the Technion-Israel Institute of Technology are creating software that analyzes 22 years of New York Times archives, Wikipedia and about 90 other web resources to predict future disease outbreaks, riots and deaths, and hopefully prevent them.
The new research is the latest in a number of similar initiatives that seek to mine web data to predict all kinds of events. Recorded Future, for instance, analyzes news, blogs and social media to “help identify predictive signals” for a variety of industries, including financial services and defense. Researchers are also using Twitter and Google to track flu outbreaks.
from “Mining the Web to Predict Future Events,” Horvitz and Radinsky, http://research.microsoft.com/en ... uture_news_wsdm.pdf
Eric Horvitz of Microsoft Research and Kira Radinsky of the Technion-Israel Institute of Technology describe their work in a newly released paper, “Mining the Web to Predict Future Events” (PDF). For example, they examined the way that news about natural disasters like storms and droughts could be used to predict cholera outbreaks in Angola. Following those weather events, “alerts about a downstream risk of cholera could have been issued nearly a year in advance,” they write.
Horvitz and Radinsky acknowledge that epidemiologists look at some of the same relationships, but “such studies are typically few in number, employ heuristic assessments, and are frequently retrospective analyses, rather than aimed at generating predictions for guiding near-term action.” They outline the advantages that software has over humans in this area:
Learning: Software “has the ability to learn patterns from large amounts of data, can monitor numerous information sources, can learn new probabilistic associations over time, and can continue to do real-time monitoring, prediction, and alerting on increases in the likelihoods of forthcoming concerning events.”
Tireless researching: Software, with its “long tentacles into historical corpora and real-time feeds,” can dig up data that humans might never find because they’re too focused on “knowledge that is easily discovered in studies or available from experts.”
Lack of bias: Software can assist “when inferences from data run counter to expert expectations,” or when “there is a significantly lower likelihood of an event than expected by experts based on the large set of observations and feeds being considered in an automated manner.”
Greater access to news: “A system monitoring likelihoods of concerning future events typically will have faster and more comprehensive access to news stories that may seem less important on the surface (e.g., a story about a funeral published in a local newspaper that does not reach the main headlines), but that might provide valuable evidence in the evolution of larger, more important stories (e.g., massive riots).”
One problem the researchers faced in developing their model is that tragic events in poor African countries are often not widely reported, so any single country offers few historical cases to learn from. So they taught the software to generalize somewhat: “Instead of considering only ‘Rwanda cholera outbreak,’ an event with a small number of historical cases, we consider more general events of the form: ‘[Country in Africa] cholera outbreak.’ We turn to world knowledge available on the Web…[that] maps Rwanda to the following concepts: Republics, African countries, Landlocked countries, Bantu countries, etc.”
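The generalization step can be pictured with a small Python sketch. It is only an illustration of the idea of pooling sparse events under broader concepts, not the researchers' implementation: the `CONCEPTS` table, the `generalize` and `count_events` helpers, and the three-event history are all hypothetical and hand-written here, whereas the paper derives its mappings from web knowledge sources.

```python
from collections import defaultdict

# Hypothetical, hand-written concept map (an assumption for illustration).
CONCEPTS = {
    "Rwanda": ["African country", "Landlocked country", "Republic"],
    "Angola": ["African country", "Republic"],
    "Bangladesh": ["Asian country", "Republic"],
}


def generalize(event):
    """Return the specific event plus one abstracted event per concept."""
    place, kind = event
    return [(place, kind)] + [(concept, kind) for concept in CONCEPTS.get(place, [])]


def count_events(history):
    """Count each event under its own key and under every generalized key."""
    counts = defaultdict(int)
    for event in history:
        for key in generalize(event):
            counts[key] += 1
    return counts


# Made-up toy history, not data from the paper.
history = [
    ("Angola", "cholera outbreak"),
    ("Rwanda", "cholera outbreak"),
    ("Bangladesh", "cholera outbreak"),
]
counts = count_events(history)
```

In this toy history, the Rwanda-specific key has a single case, while the generalized "African country" key pools two cases and "Republic" pools three, which is the kind of extra evidence the abstraction is meant to provide.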
Horvitz and Radinsky also taught the software what to ignore: It “was able to recognize that the drought experienced in New York City on March 1989, published in the NYT under the title: ‘Emergency is declared over drought’ would not be associated with a disease outbreak…The system estimates that, for droughts to cause cholera with high probability, the drought needs to happen in dense populations (such as the refugee camps in Angola and Bangladesh) located in underdeveloped countries that are proximal to bodies of water.”
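One way to read that passage is as a conditional frequency: how often did cholera follow a drought, given the attributes of the place where the drought happened? The sketch below shows that calculation in Python under loud assumptions. The `outbreak_rate` helper, the attribute names, and the four records are invented for illustration and are not taken from the paper, whose model is considerably richer than a frequency count.

```python
def outbreak_rate(records, **conditions):
    """Fraction of matching drought records that were followed by cholera.

    Returns None when no record satisfies the conditions.
    """
    matching = [r for r in records if all(r[k] == v for k, v in conditions.items())]
    if not matching:
        return None
    return sum(r["cholera_followed"] for r in matching) / len(matching)


# Made-up toy records (assumption): one dict per historical drought.
records = [
    {"dense": True, "underdeveloped": True, "near_water": True, "cholera_followed": True},
    {"dense": True, "underdeveloped": True, "near_water": True, "cholera_followed": True},
    {"dense": True, "underdeveloped": False, "near_water": True, "cholera_followed": False},
    {"dense": False, "underdeveloped": True, "near_water": False, "cholera_followed": False},
]

overall = outbreak_rate(records)
high_risk = outbreak_rate(records, dense=True, underdeveloped=True, near_water=True)
developed = outbreak_rate(records, underdeveloped=False)
```

On these toy records the overall rate is 0.5, the rate for dense, underdeveloped places near water is 1.0, and the rate for developed places is 0.0, mirroring how a New York City drought would be treated as unrelated to disease outbreaks.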
“I truly view this as a foreshadowing of what’s to come,” Horvitz told the MIT Technology Review. “Eventually this kind of work will start to have an influence on how things go for people.” He said Microsoft isn’t commercializing the research yet, but that it will continue, and he wants to get more “data further back in time.”